I want to have a regular expression that identifies the items in a line from a comma-separate values (CSV) file.
Imagine one style of CSV, in which such items are all quoted (Format 1):
"First Item","String","0","Yes","Yes","No","The contents of the string in the first item" "Authority","ID","0","Yes","No","No","ID of the person ""responsible"" for the item, if known"
In CSV, double-quotes permit embedding commas (and spaces?) in record fields. Double-quotes in such fields are escaped by doubling the character.
The regex that matches the full text of the item is fairly straightforward: "((""|[^"]*)*)" # In quotes, a run of double-quotes and anything else not a quote; make $1 hold the unquoted string However, a field may be empty (represented by no characters between the commas). This a special case of the less-paranoid (and arguably more standard) way of writing the file (Format 2):
"First Item",String,0,,Yes,No,"The contents of the string in the first item" Authority,ID,0,Yes,No,No,"ID of the person ""responsible"" for the item, if known"
The something-between-quotes regex doesn't pick up the nonquoted fields (obviously).
So make the regex fancier, to make the quotes optional and recognize the field separator (which does not exist at the end of the record): ("?((""|[^"]*)*)"?),? This still works for Format 1, but in Format 2 it matches the whole of any run of records that aren't quoted (String,0,Yes,Yes,No,").
Start from the other end, and try a regex that matches fields not quoted: ([^,[:cntrl:]]*),? # any run of characters, including blanks, that aren't controls or commas, and may end in comma The exclusion of control characters prevents the matching of: "The contents of the string in the first item" Authority
If the next field is a quoted string with a comma in the middle, this pattern stops at the embedded comma.
So maybe a pattern that combines the two patterns would work: (("?((""|[^"]*)*)"?)|([^,[:cntrl:]]*)),? # match quoted fields if you can, unquoted fields if you must. No: This pattern matches String,0,,Yes,No," in the first line of the Format 2 example. It's the same behavior as the quoted-only pattern (matches runs of nonquoted strings). Reversing it: (([^,[:cntrl:]]*)|("?((""|[^"]*)*)"?)),? behaves the same as the nonquoted pattern (matching stops at commas within quoted strings).
I'm out of ideas. Does anybody have a suggestion?
— F