Extracting data pieces from records in a file
by TwoSteppinSRP from LinuxQuestions.org on (#6F5AF)
I'm trying to process a table dump file containing roughly 50 million records. I need to create a new file containing the first five fields of each record and the 6th & 7th fields from the end of the record. The length of the data for most of the fields is variable. Character fields are surrounded by single quotes, but there can also be single quotes inside a field and those will be escaped with a backslash. Fields are separated by commas, but there can also be commas inside the field values. To complicate it more, the word NULL (not surrounded by quotes) will be in place of any missing data, but the word NULL can also be found inside the quoted character fields.
Getting the first five fields is easy enough because there will never be any embedded commas or quotes in that data. I can use cut or awk to delimit by comma and get fields 1-5. However, getting the 6th & 7th fields from the end is more problematic. Coming from the end of the record initially looked easier, until I discovered that the 5th field from the end is a 'comments' field that can also contain embedded quotes and commas. And trying to get to those fields from the beginning of the record, means getting past multiple character fields that can contain the embedded commas and quotes.
While I do not know sed or awk very well, my preference would still be to do this using one or both of those commands if at all possible (or anything other than a read file loop). My attempts using sed & awk, making multiple passes and creating multiple interim files still ran much faster than a simple read/write script that was only pulling the first five fields.
Here are a few records from the file, showing some of the field contents that have given me problems. The bolded data is what I need to pull from each record. Any help on this would be greatly appreciated.
214,453,202,'QMS1',NULL,'-96.830322','32.980656','16775',NULL,NULL,'SOMETHING',NULL,NULL,'CITY','TX',NULL,'NULL,NULL,NULL,NULL ','AAAAAAA','C','0','2013-05-02','23:44:54','COMMENTS',255,'VIP','2013-05-02','17:52:08.906'
613,938,5575,'TWCBL','100000129675','-94.933388','38.793148','18516',NULL,NULL,'DATA',NULL,NULL,'CITY','KS','US',NULL,'INSIDE \' QUOTE AND, COMMA','C','3','2023-05-29','22:27:58',NULL,104,'VIP','2023-05-29','16:27:59.090'
813,938,4588,'ATTMO','100000129207','-94.961128','38.817787','17109',NULL,NULL,'STUFF',NULL,NULL,'CITY','KS','US',NULL,'DDDD EEEE','V','3','2010-08-20','01:46:45','COMMENTS \' AND , MORE',104,'VIP',NULL,NULL
922,111,5005,'ILAB',NULL,'-119.072044','34.350845','131',NULL,'N','STREET',NULL,NULL,'CITY','CA',NULL,NULL,'TEST MEC','V','3','2014-06-09','17:56:15',NULL,204,'VIP','2014-06-09','21:00:58.835'
Thanks much !!
Getting the first five fields is easy enough because there will never be any embedded commas or quotes in that data. I can use cut or awk to delimit by comma and get fields 1-5. However, getting the 6th & 7th fields from the end is more problematic. Coming from the end of the record initially looked easier, until I discovered that the 5th field from the end is a 'comments' field that can also contain embedded quotes and commas. And trying to get to those fields from the beginning of the record, means getting past multiple character fields that can contain the embedded commas and quotes.
While I do not know sed or awk very well, my preference would still be to do this using one or both of those commands if at all possible (or anything other than a read file loop). My attempts using sed & awk, making multiple passes and creating multiple interim files still ran much faster than a simple read/write script that was only pulling the first five fields.
Here are a few records from the file, showing some of the field contents that have given me problems. The bolded data is what I need to pull from each record. Any help on this would be greatly appreciated.
214,453,202,'QMS1',NULL,'-96.830322','32.980656','16775',NULL,NULL,'SOMETHING',NULL,NULL,'CITY','TX',NULL,'NULL,NULL,NULL,NULL ','AAAAAAA','C','0','2013-05-02','23:44:54','COMMENTS',255,'VIP','2013-05-02','17:52:08.906'
613,938,5575,'TWCBL','100000129675','-94.933388','38.793148','18516',NULL,NULL,'DATA',NULL,NULL,'CITY','KS','US',NULL,'INSIDE \' QUOTE AND, COMMA','C','3','2023-05-29','22:27:58',NULL,104,'VIP','2023-05-29','16:27:59.090'
813,938,4588,'ATTMO','100000129207','-94.961128','38.817787','17109',NULL,NULL,'STUFF',NULL,NULL,'CITY','KS','US',NULL,'DDDD EEEE','V','3','2010-08-20','01:46:45','COMMENTS \' AND , MORE',104,'VIP',NULL,NULL
922,111,5005,'ILAB',NULL,'-119.072044','34.350845','131',NULL,'N','STREET',NULL,NULL,'CITY','CA',NULL,NULL,'TEST MEC','V','3','2014-06-09','17:56:15',NULL,204,'VIP','2014-06-09','21:00:58.835'
Thanks much !!