[TxMt] Regular expression conundrum (CSV)
Fritz Anderson
fritza at manoverboard.org
Thu Feb 22 17:18:32 UTC 2007
I want to have a regular expression that identifies the items in a
line from a comma-separate values (CSV) file.
Imagine one style of CSV, in which such items are all quoted (Format 1):
"First Item","String","0","Yes","Yes","No","The contents of the
string in the first item"
"Authority","ID","0","Yes","No","No","ID of the person
""responsible"" for the item, if known"
In CSV, double-quotes permit embedding commas (and spaces?) in record
fields. Double-quotes in such fields are escaped by doubling the
character.
The regex that matches the full text of the item is fairly
straightforward:
"((""|[^"]*)*)" # In quotes, a run of double-quotes and anything
else not a quote; make $1 hold the unquoted string
However, a field may be empty (represented by no characters between
the commas). This a special case of the less-paranoid (and arguably
more standard) way of writing the file (Format 2):
"First Item",String,0,,Yes,No,"The contents of the string in the
first item"
Authority,ID,0,Yes,No,No,"ID of the person ""responsible"" for the
item, if known"
The something-between-quotes regex doesn't pick up the nonquoted
fields (obviously).
So make the regex fancier, to make the quotes optional and recognize
the field separator (which does not exist at the end of the record):
("?((""|[^"]*)*)"?),?
This still works for Format 1, but in Format 2 it matches the whole
of any run of records that aren't quoted (String,0,Yes,Yes,No,").
Start from the other end, and try a regex that matches fields not
quoted:
([^,[:cntrl:]]*),? # any run of characters, including blanks, that
aren't controls or commas, and may end in comma
The exclusion of control characters prevents the matching of:
"The contents of the string in the first item"
Authority
If the next field is a quoted string with a comma in the middle, this
pattern stops at the embedded comma.
So maybe a pattern that combines the two patterns would work:
(("?((""|[^"]*)*)"?)|([^,[:cntrl:]]*)),? # match quoted fields if
you can, unquoted fields if you must.
No: This pattern matches
String,0,,Yes,No,"
in the first line of the Format 2 example. It's the same behavior as
the quoted-only pattern (matches runs of nonquoted strings).
Reversing it:
(([^,[:cntrl:]]*)|("?((""|[^"]*)*)"?)),?
behaves the same as the nonquoted pattern (matching stops at commas
within quoted strings).
I'm out of ideas. Does anybody have a suggestion?
— F
More information about the textmate
mailing list