Regular expression conundrum (CSV) - TextMate

22 Feb 2007


I want to have a regular expression that identifies the items in a  
line from a comma-separated values (CSV) file.
Imagine one style of CSV, in which such items are all quoted (Format 1):
    "First Item","String","0","Yes","Yes","No","The contents of the string in the first item"
    "Authority","ID","0","Yes","No","No","ID of the person ""responsible"" for the item, if known"
In CSV, double-quotes permit embedding commas (and spaces?) in record  
fields. Double-quotes in such fields are escaped by doubling the  
character.
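For reference, the quoting rule can be checked against an existing CSV parser. This is only a sketch using Python's standard csv module (my choice of tool, not anything TextMate-specific), fed the second Format 1 record joined back onto one line:

```python
import csv
import io

# Second Format 1 record from the examples above, rejoined onto one line.
record = ('"Authority","ID","0","Yes","No","No",'
          '"ID of the person ""responsible"" for the item, if known"\n')

# csv.reader applies the quoting rules: a comma inside quotes stays in the
# field, and a doubled quote inside a quoted field becomes one literal quote.
fields = next(csv.reader(io.StringIO(record)))

print(len(fields))   # 7
print(fields[-1])    # ID of the person "responsible" for the item, if known
```

The embedded comma and the doubled quotes both survive as part of the last field, which is the behavior the regex has to reproduce.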
The regex that matches the full text of the item is fairly  
straightforward:
    "((""|[^"]*)*)"  # In quotes, a run of double-quotes and anything else not a quote; make $1 hold the unquoted string
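To show what this pattern captures, here is a small sketch in Python's re module (an assumption on my part; TextMate's own regex engine may differ in details). The pattern is copied unchanged; the names QUOTED, raw and fields are just illustrative:

```python
import re

# Quoted-field pattern, copied as written above.
QUOTED = re.compile(r'"((""|[^"]*)*)"')

line = ('"Authority","ID","0","Yes","No","No",'
        '"ID of the person ""responsible"" for the item, if known"')

# Group 1 is the text between the outer quotes; escaped quotes are still doubled.
raw = [m.group(1) for m in QUOTED.finditer(line)]

# Undoing the CSV escaping is a separate step.
fields = [r.replace('""', '"') for r in raw]

print(raw[-1])     # ID of the person ""responsible"" for the item, if known
print(fields[-1])  # ID of the person "responsible" for the item, if known
```

Note that $1 still carries the doubled quotes, so the "unquoted string" needs one more substitution before it is the real field value.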
However, a field may be empty (represented by no characters between  
the commas). This is a special case of the less-paranoid (and arguably  
more standard) way of writing the file (Format 2):
    "First Item",String,0,,Yes,No,"The contents of the string in the first item"
    Authority,ID,0,Yes,No,No,"ID of the person ""responsible"" for the item, if known"
The something-between-quotes regex doesn't pick up the nonquoted  
fields (obviously).
So make the regex fancier, to make the quotes optional and recognize  
the field separator (which does not exist at the end of the record):
    ("?((""|[^"]*)*)"?),?
This still works for Format 1, but in Format 2 it matches the whole  
of any run of fields that aren't quoted (String,0,,Yes,No,").
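The overrun is easy to reproduce. A sketch in Python's re (again an assumption about tooling; the pattern itself is copied unchanged), matching twice in a row on the first Format 2 record:

```python
import re

# Optional-quote pattern, copied as written above.
OPTIONAL = re.compile(r'("?((""|[^"]*)*)"?),?')

line = ('"First Item",String,0,,Yes,No,'
        '"The contents of the string in the first item"')

# Match at the start of the line, then again where the first match ended.
first = OPTIONAL.match(line)
second = OPTIONAL.match(line, first.end())

print(first.group(0))   # "First Item",
print(second.group(0))  # String,0,,Yes,No,"
```

With the opening quote optional, nothing stops [^"]* from running through the commas, and the optional closing quote then swallows the opening quote of the next quoted field.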
Start from the other end, and try a regex that matches fields not  
quoted:
    ([^,[:cntrl:]]*),?  # any run of characters, including blanks, that aren't controls or commas, and may end in comma
The exclusion of control characters keeps a match from running across  
the line break, so it cannot match:
    "The contents of the string in the first item"
    Authority
If the next field is a quoted string with a comma in the middle, this  
pattern stops at the embedded comma.
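Both effects can be seen in a short Python sketch. Python's re module has no [:cntrl:] class, so the explicit range \x00-\x1f\x7f is substituted here as an approximation; that substitution is mine, not part of the pattern above:

```python
import re

# Unquoted-field pattern, with [:cntrl:] replaced by an explicit ASCII
# control-character range because Python's re lacks POSIX classes.
UNQUOTED = re.compile(r'([^,\x00-\x1f\x7f]*),?')

# A quoted field with an embedded comma: the match stops at that comma.
quoted_field = '"ID of the person ""responsible"" for the item, if known"'
stopped = UNQUOTED.match(quoted_field).group(1)
print(stopped)  # "ID of the person ""responsible"" for the item

# The newline is a control character, so a match cannot cross records.
two_lines = 'first item"\nAuthority,ID'
one_line_only = UNQUOTED.match(two_lines).group(1)
print(one_line_only)  # first item"
```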
So maybe a pattern that combines the two patterns would work:
    (("?((""|[^"]*)*)"?)|([^,[:cntrl:]]*)),?  # match quoted fields if you can, unquoted fields if you must.
No: This pattern matches
    String,0,,Yes,No,"
in the first line of the Format 2 example. It's the same behavior as  
the quoted-only pattern (matches runs of nonquoted strings).  
Reversing it:
    (([^,[:cntrl:]]*)|("?((""|[^"]*)*)"?)),?
behaves the same as the nonquoted pattern (matching stops at commas  
within quoted strings).
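The two orderings can be compared side by side. In a backtracking engine, alternation takes the first branch that lets the overall match succeed, and since every piece of either branch is optional, the first branch always wins. A Python sketch, with [:cntrl:] again replaced by an explicit range (my substitution) and illustrative variable names:

```python
import re

# Combined patterns from above; [:cntrl:] is approximated by \x00-\x1f\x7f.
QUOTED_FIRST = re.compile(r'(("?((""|[^"]*)*)"?)|([^,\x00-\x1f\x7f]*)),?')
UNQUOTED_FIRST = re.compile(r'(([^,\x00-\x1f\x7f]*)|("?((""|[^"]*)*)"?)),?')

line1 = ('"First Item",String,0,,Yes,No,'
         '"The contents of the string in the first item"')
line2 = ('Authority,ID,0,Yes,No,No,'
         '"ID of the person ""responsible"" for the item, if known"')

# Quoted branch first: it can match unquoted text too, so it overruns.
overrun = QUOTED_FIRST.match(line1, line1.index('String')).group(0)
print(overrun)  # String,0,,Yes,No,"

# Unquoted branch first: it accepts the opening quote as an ordinary
# character and stops at the comma inside the quoted field.
cut_short = UNQUOTED_FIRST.match(line2, line2.index('"ID of')).group(0)
print(cut_short)  # "ID of the person ""responsible"" for the item,
```

Neither branch ever fails outright, so the second alternative is never tried in either ordering.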
I'm out of ideas. Does anybody have a suggestion?
— F