Dear list members,
I have the following problem:
I received a UTF-8 encoded plain text file that was written on a Windows PC. I could open and edit it without any problems. After my modifications I saved it as UTF-8 with LF line endings. So far, so good.
Then I tried to import the content of that file into a database. This didn't work, because the database couldn't parse the first line. I opened the file in a hex editor and saw that the first line begins with EF BB BF. Then I remembered that these bytes are the BOM (Byte Order Mark) for UTF-8, and that Windows programs often write it when saving UTF-8 text files.
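For readers without a hex editor at hand, the same check can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example and are not part of TextMate or jEdit.

```python
import codecs

# codecs.BOM_UTF8 is the byte sequence EF BB BF.

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)

def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file and check them."""
    with open(path, "rb") as f:
        return has_utf8_bom(f.read(len(codecs.BOM_UTF8)))
```

Calling `file_has_utf8_bom("data.txt")` on a hypothetical file `data.txt` would report whether it begins with EF BB BF.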
My problem now is that I couldn't find a way to save my text file as UTF-8 without a BOM.
I had to use jEdit for that, because in jEdit you can select either the encoding utf-8 or utf-8y (meaning with BOM).
Is there any chance of implementing this in TextMate? Or, maybe better, could TextMate save all UTF-8 files without a BOM? I think this marker is irrelevant in UTF-8; it only makes sense in UTF-16/32.
All the best,
Hans
In the meantime I found at least one workaround myself.
I open a UTF-8 (BOM) file, select all, copy, open a new document, paste, and save it as UTF-8. After that, the new document has no BOM at all.
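Outside the editor, the same cleanup can be done with a small script. The following is a minimal sketch in Python, assuming the file fits in memory; the function names are hypothetical and chosen only for this example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def rewrite_without_bom(path: str) -> bool:
    """Rewrite the file in place without its BOM.

    Returns True if a BOM was removed, False if the file had none.
    """
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Only the first three bytes are affected; line endings and the rest of the content are written back unchanged. When the goal is just to read such a file as text, Python's `utf-8-sig` codec also skips a leading BOM while decoding.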
But I don't know whether it would be too difficult to implement this within TextMate. I could imagine that some users don't know about the BOM issue.
Best,
Hans
On 28/9/2006, at 13:44, Hans-Joerg Bibiko wrote:
[...] But I don't know whether it would be too difficult to implement this within TextMate. I could imagine that some users don't know about the BOM issue.
I do indeed think that a BOM in UTF-8 files is misguided at best [1] (which is why you can’t enable it in TM).
The only reason TM preserves an existing BOM is that some users actually do rely on them [2]. Preserving it seemed like the best compromise: I don't want to endorse BOMs, or even acknowledge that they have a valid role in UTF-8 files, but I also don't want to screw up files where the user went out of his way to place a BOM there.
Though I realize there is a problem for users who get their hands on BOM-infested files and want an easy way to cleanse them. I am leaning toward bringing up a “warning” when encountering UTF-8 files with a BOM, which would then let the user choose to get rid of it.
[1] UTF-8 exists so that files can use Unicode but still be backwards compatible. A BOM destroys some of this backward compatibility: e.g. the shell won’t pick up the shebang line of a script if it has a BOM, and grep’ing through multiple files with BOMs will produce “bad” output if something on the first line of one of the files is matched.
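To make the point in [1] concrete, here is a small Python sketch. It assumes that a script is recognized by its first two bytes being `#!`; the sample script and the helper names are invented for this illustration.

```python
import codecs

def has_shebang(data: bytes) -> bool:
    """A shebang is only recognized if the very first two bytes are '#!'."""
    return data[:2] == b"#!"

def first_line(data: bytes) -> str:
    """Decode as plain UTF-8 (which keeps a BOM) and return the first line."""
    return data.decode("utf-8").splitlines()[0]

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(has_shebang(script))           # the file starts with '#!'
print(has_shebang(script_with_bom))  # the file starts with EF BB BF instead
print(repr(first_line(script_with_bom)))  # the BOM shows up as U+FEFF in the text
```

The last line illustrates the grep case as well: a tool that treats the file as plain bytes or plain UTF-8 hands the BOM through as part of the first line.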
[2] The users I have spoken with who rely on them do so only because they do not send the proper encoding in the HTTP headers, and then assume that the user agent will still treat the received file as UTF-8 on sight of the BOM.
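As an illustration of the alternative mentioned in [2], the encoding can be declared in the `Content-Type` header instead of relying on a BOM. The following is a minimal sketch of a WSGI application in Python; the application name and the response text are assumptions made up for this example, not anything from the setups discussed above.

```python
def app(environ, start_response):
    """Tiny WSGI app that declares UTF-8 in the HTTP header, with no BOM in the body."""
    body = "Gr\u00fc\u00dfe aus Leipzig\n".encode("utf-8")  # encoded without a BOM
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),  # tells the user agent the encoding
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

It could be served for a quick local test with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library.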