I want to have a regular expression that identifies the items in a line from a comma-separate values (CSV) file.
Imagine one style of CSV, in which such items are all quoted (Format 1):
"First Item","String","0","Yes","Yes","No","The contents of the string in the first item" "Authority","ID","0","Yes","No","No","ID of the person ""responsible"" for the item, if known"
In CSV, double-quotes permit embedding commas (and spaces?) in record fields. Double-quotes in such fields are escaped by doubling the character.
The regex that matches the full text of the item is fairly straightforward: "((""|[^"]*)*)" # In quotes, a run of double-quotes and anything else not a quote; make $1 hold the unquoted string However, a field may be empty (represented by no characters between the commas). This a special case of the less-paranoid (and arguably more standard) way of writing the file (Format 2):
"First Item",String,0,,Yes,No,"The contents of the string in the first item" Authority,ID,0,Yes,No,No,"ID of the person ""responsible"" for the item, if known"
The something-between-quotes regex doesn't pick up the nonquoted fields (obviously).
So make the regex fancier, to make the quotes optional and recognize the field separator (which does not exist at the end of the record): ("?((""|[^"]*)*)"?),? This still works for Format 1, but in Format 2 it matches the whole of any run of records that aren't quoted (String,0,Yes,Yes,No,").
Start from the other end, and try a regex that matches fields not quoted: ([^,[:cntrl:]]*),? # any run of characters, including blanks, that aren't controls or commas, and may end in comma The exclusion of control characters prevents the matching of: "The contents of the string in the first item" Authority
If the next field is a quoted string with a comma in the middle, this pattern stops at the embedded comma.
So maybe a pattern that combines the two patterns would work: (("?((""|[^"]*)*)"?)|([^,[:cntrl:]]*)),? # match quoted fields if you can, unquoted fields if you must. No: This pattern matches String,0,,Yes,No," in the first line of the Format 2 example. It's the same behavior as the quoted-only pattern (matches runs of nonquoted strings). Reversing it: (([^,[:cntrl:]]*)|("?((""|[^"]*)*)"?)),? behaves the same as the nonquoted pattern (matching stops at commas within quoted strings).
I'm out of ideas. Does anybody have a suggestion?
— F
On Feb 22, 2007, at 11:18 AM, Fritz Anderson wrote:
I'm out of ideas. Does anybody have a suggestion?
I can tell you from experience that it's impossible to correctly parse all valid CSV with just one regular expression. The second edition of Mastering Regular Expressions gives an example pattern that gets very close, but I've found edge cases it fails on.
You can do it with as little as two regular expressions though. If you want to see how, you can examine the source of my FasterCSV Ruby library:
http://rubyforge.org/projects/fastercsv
It is a regular expression based parser.
James Edward Gray II
James Edward Gray II wrote:
I can tell you from experience that it's impossible to correctly parse all valid CSV with just one regular expression. The second edition of Mastering Regular Expressions gives an example pattern that gets very close, but I've found edge cases it fails on.
Well, that's not quite accurate. There is no single 'csv' spec, so there is no bright line between 'valid' and 'invalid' csv here. Most people try to conform to what MS Excel does (excel being the 800 pound gorilla in the spreadsheet market), but many other apps/people make csv that is close to, but not quite, the same as the excel format.
-Jacob
On Feb 25, 2007, at 6:17 PM, Jacob Rus wrote:
There is no single 'csv' spec, so there is no bright line between 'valid' and 'invalid' csv here.
http://www.ietf.org/rfc/rfc4180.txt
James Edward Gray II
James Edward Gray II wrote:
On Feb 25, 2007, at 6:17 PM, Jacob Rus wrote:
There is no single 'csv' spec, so there is no bright line between 'valid' and 'invalid' csv here.
http://www.ietf.org/rfc/rfc4180.txt
James Edward Gray II
Okay, but notice that's from 2005, which means that lots of the 'csv' documents from decades of history before that don't conform. It doesn't look to me like that spec even covers the full range of MS Excel generated/accepted csv files.