Hello,
I've been seeing a weird issue with non-ASCII characters appearing messed up when QuickLooking a utf-8 file saved from TextMate. Here is a screenshot: http://vnoel.files.wordpress.com/2008/06/ql-bug.png The same problem appears when opening the file with TextEdit -- non-ASCII characters are messed up. Unix tools (file, cat) show the file correctly. After digging around I found a thread [1] on the vim mailing list that says it is related to extended attributes that are set by TextEdit and that need to be present for QL to correctly recognize the utf8 encoding.
If you set the extended attribute manually on the utf8 file, non-ASCII characters appear fine in QL. This looks like a bug in QL which, given its lack of publicity, is rarely triggered, and probably due to something specific to my setup. I'm not sure what to do next, apart from applying 'xattr' to every text file I save with TextMate :-) Does anyone have advice?
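(For the curious, the manual fix looks roughly like this. It is only a sketch: the attribute name com.apple.TextEncoding and its "IANA name;CFStringEncoding" value format are what the vim thread reports TextEdit writing, and 134217984 is, I believe, the CFStringEncoding constant for UTF-8, so double-check before relying on it. The helper name is made up.)

    #include <string.h>
    #include <sys/xattr.h>

    // Tag a file the way TextEdit reportedly does, so Quick Look picks it up.
    // Shell equivalent: xattr -w com.apple.TextEncoding "utf-8;134217984" file.txt
    int tag_as_utf8(const char *path)
    {
        const char *value = "utf-8;134217984"; // "IANA name;CFStringEncoding" (assumed format)
        return setxattr(path, "com.apple.TextEncoding", value, strlen(value), 0, 0);
    }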
I'm sorry if this issue has already been discussed previously on this list or elsewhere, but I couldn't find any other reference using Google.
Thanks!
[1] http://www.nabble.com/MacVim-file-encoding-and-Quicklook-td17289501.html
Cheers, Vincent Noel
Hello, I'm ashamed to be replying to myself, but I thought maybe this had slipped through the cracks.
Am I the only one with this issue, or am I doing something really stupid and obvious?
The inability to properly display utf8 files saved from TextMate using the OS's standard tools seems like a reasonable concern to me...
Thanks! Cheers, Vincent Noel
On Wed, Jun 18, 2008 at 20:10, Vincent Noel vincent.noel@gmail.com wrote:
[...]
On 29 Jun 2008, at 21:50, Vincent Noel wrote:
I'm ashamed to be replying to myself, but I thought maybe this had slipped through the cracks.
Am I the only one with this issue, or am I doing something really stupid and obvious?
Sadly I find that some parts of the OS still assume stuff to be in MacRoman¹ if not explicitly told otherwise (and some stuff can’t even safely be told otherwise, like pbcopy/pbpaste).
I’d encourage you to file an enhancement report with http://bugreport.apple.com/ -- Leopard has moved several things to UTF-8 (like osascript), but some stuff is still lacking. Explicitly tagging text files with extended attributes to tell the system that they are UTF-8 is IMO very wrong, especially given that UTF-8 can be recognized with 99.999999% certainty (so even if one disagrees about making it the standard encoding, it can still be at least detected safely without mistakenly treating e.g. a MacRoman file as UTF-8).
¹ Really the “system encoding” which for US/Western systems will be MacRoman, a thing that comes from Classic.
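To illustrate why the detection is so safe, here is a minimal validator sketch; it checks only the structural lead-byte/continuation-byte constraints (a production validator would also reject overlong forms and surrogates, which this deliberately skips):

    #include <stddef.h>

    // Returns 1 if buf[0..len) is structurally valid UTF-8, else 0.
    // Random non-UTF-8 data essentially never passes, because every byte
    // with the high bit set must fit a strict lead/continuation pattern.
    static int looks_like_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char c = buf[i];
            size_t follow;
            if (c < 0x80)      { i++; continue; }  // plain ASCII
            else if (c < 0xC2) return 0;           // stray continuation or overlong lead
            else if (c < 0xE0) follow = 1;         // 2-byte sequence
            else if (c < 0xF0) follow = 2;         // 3-byte sequence
            else if (c < 0xF5) follow = 3;         // 4-byte sequence
            else               return 0;           // 0xF5..0xFF never occur in UTF-8
            if (i + follow >= len) return 0;       // truncated at end of buffer
            for (size_t j = 1; j <= follow; j++)
                if ((buf[i + j] & 0xC0) != 0x80) return 0; // must be 10xxxxxx
            i += follow + 1;
        }
        return 1;
    }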
On Jun 29, 2008, at 3:07 PM, Allan Odgaard wrote:
Explicitly tagging text files with extended attributes to tell the system that they are UTF-8 is IMO very wrong, especially given that UTF-8 can be recognized with 99.999999% certainty (so even if one disagrees about making it the standard encoding, it can still be at least detected safely without mistakenly treating e.g. a MacRoman file as UTF-8).
According to the Foundation release notes, 10.5 now tries UTF-8 as a fallback if -[NSString initWithContentsOfURL:usedEncoding:error:] is used (presumably if xattrs and BOM are absent). Does that answer your suggestion? Personally, I like the idea of using xattrs for this, but we were using them for that purpose before Leopard.
MacRoman is a good fallback encoding if UTF-8 fails, but QL is evidently using it too early. In fact, it looks like NSAttributedString doesn't try UTF-8 and goes right to MacRoman, which probably explains the OP's problem with TextEdit. Very unfortunate.
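For concreteness, the call in question looks like this (a sketch with minimal error handling, using 2008-era manual reference counting; the wrapper name is made up):

    #import <Foundation/Foundation.h>

    // Sketch: let Foundation detect the encoding; per the release notes,
    // on 10.5 it tries UTF-8 before giving up entirely.
    NSString *ReadTextFile(NSURL *url)
    {
        NSStringEncoding usedEncoding;
        NSError *error = nil;
        NSString *text = [[NSString alloc] initWithContentsOfURL:url
                                                    usedEncoding:&usedEncoding
                                                           error:&error];
        if (text == nil)
            NSLog(@"could not determine encoding: %@", error);
        return [text autorelease];
    }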
On 30 Jun 2008, at 00:57, Adam R. Maxwell wrote:
[...] According to the Foundation release notes, 10.5 now tries UTF-8 as a fallback if -[NSString initWithContentsOfURL:usedEncoding:error:] is used (presumably if xattrs and BOM are absent). Does that answer your suggestion?
Answer my suggestion? Not sure what you mean by that.
Personally, I like the idea of using xattrs for this, but we were using them for that purpose before Leopard.
Using it to make UTF-8 work is bad! As I said, UTF-8 can be safely detected without such an attribute, so the attribute is 100% redundant.
Also, UTF-8 is the only sane encoding to use for new stuff, so text files should be UTF-8 today. Having to do special tricks to make the system detect them as such sucks, especially since the majority of text files are not generated in CoreFoundation-using applications -- even on a Mac they might come from shell redirects, be generated by scripts, or similar. So the majority of files will not have this attribute, and will never get such an attribute.
http://x.nest.jp/mac/080109_0217.htm
It looks like this person made a QuickLook plugin that registers for the public.plain-text UTI.
I don't really know the details, but please check the source, because it may help.
Takaaki
On Jun 29, 2008, at 10:39 PM, Allan Odgaard wrote:
On 30 Jun 2008, at 00:57, Adam R. Maxwell wrote:
[...] According to the Foundation release notes, 10.5 now tries UTF-8 as a fallback if -[NSString initWithContentsOfURL:usedEncoding:error:] is used (presumably if xattrs and BOM are absent). Does that answer your suggestion?
Answer my suggestion? Not sure what you mean by that.
UTF-8 is now tried as a fallback before giving up completely. Unfortunately, it's limited to two methods on NSString AFAIK.
Personally, I like the idea of using xattrs for this, but we were using them for that purpose before Leopard.
Using it to make UTF-8 work is bad! As I said, UTF-8 can be safely detected without such an attribute, so the attribute is 100% redundant.
Yes, it's redundant for files with a BOM and for UTF-8. I'm curious why you think its entire purpose is to make UTF-8 work. I would suggest that its purpose is to make Latin-1, Shift-JIS, and all the other weird encodings work. Using UTF-8 is certainly preferable, but not everyone can do that!
Also, UTF-8 is the only sane encoding to use for new stuff, so text files should be UTF-8 today. Having to do special tricks to make the system detect them as such sucks, especially since the majority of text files are not generated in CoreFoundation-using applications -- even on a Mac they might come from shell redirects, be generated by scripts, or similar. So the majority of files will not have this attribute, and will never get such an attribute.
I agree; UTF-8 /should/ be used. If files are loaded into a Cocoa app using the method I noted, it'll try UTF-8 before giving up, regardless of whether the attribute is present. I have to deal with other encodings fairly regularly, unfortunately.
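When I do have to deal with those encodings, the pattern is roughly the following sketch. The candidate list, its order, and the function name are my own assumptions; the one firm rule is that Latin-1 must come last, since it accepts any byte sequence and would otherwise shadow everything else:

    #import <Foundation/Foundation.h>

    // Sketch: try candidate encodings in order. UTF-8 goes first because it
    // fails cleanly on non-UTF-8 input; Latin-1 goes last because it never fails.
    NSString *StringByGuessingEncoding(NSData *data)
    {
        NSStringEncoding candidates[] = {
            NSUTF8StringEncoding,
            NSShiftJISStringEncoding,
            NSISOLatin1StringEncoding,
        };
        for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
            NSString *s = [[[NSString alloc] initWithData:data
                                                 encoding:candidates[i]] autorelease];
            if (s != nil)
                return s;
        }
        return nil;
    }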
On Mon, Jun 30, 2008 at 00:07, Allan Odgaard mailinglist@textmate.org wrote:
Sadly I find that some parts of the OS still assume stuff to be in MacRoman if not explicitly told otherwise (and some stuff can't even safely be told otherwise, like pbcopy/pbpaste).
I'd encourage you to file an enhancement report with http://bugreport.apple.com/ -- Leopard has moved several things to UTF-8 (like osascript), but some stuff is still lacking. Explicitly tagging text files with extended attributes to tell the system that they are UTF-8 is IMO very wrong, especially given that UTF-8 can be recognized with 99.999999% certainty (so even if one disagrees about making it the standard encoding, it can still be at least detected safely without mistakenly treating e.g. a MacRoman file as UTF-8).
I will file a bug. Thanks for the advice.
Still, it seems like a pretty big screwup on Apple's part, especially in user-visible 'new' features like Quicklook. It's especially weird that somebody apparently noticed the problem and decided to fix it using extended attributes, when more standard tools (e.g. the 'file' command) are perfectly able to identify utf8 without non-standard trickery...
Cheers V
On 30 Jun 2008, at 12:33, Vincent Noel wrote:
It's especially weird that somebody apparently noticed the problem and decided to fix it using extended attributes, when more standard tools (e.g. the 'file' command) are perfectly able to identify utf8 without non-standard trickery...
Using 'file' is a good idea, BUT it only looks at the first (I don't know how many) bytes of a file. I.e. if you have a rather large UTF-8 file containing 'normal' ASCII and the last character is e.g. a ü, 'file' will output "test.txt: ASCII text, with very long lines" or similar.
The only definite way to get the encoding is to parse the ENTIRE file, or to parse up to the first byte sequence that determines the used encoding unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark) at the beginning of a file.
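This is easy to reproduce, by the way; here is a sketch (how many bytes 'file' inspects varies by version, and the output path is just an example):

    #import <Foundation/Foundation.h>

    // Sketch: lots of plain ASCII with one multi-byte UTF-8 character at the
    // very end. 'file' will typically report this as plain ASCII text, since
    // it only inspects the beginning of the file.
    int main(void)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        NSMutableString *s = [NSMutableString string];
        for (int i = 0; i < 100000; i++)
            [s appendString:@"perfectly ordinary ASCII text\n"];
        [s appendString:@"ü\n"]; // the lone non-ASCII character
        [s writeToFile:@"/tmp/test.txt" atomically:YES
              encoding:NSUTF8StringEncoding error:NULL];
        [pool drain];
        return 0;
    }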
--Hans
On Mon, Jun 30, 2008 at 12:49, Hans-Joerg Bibiko bibiko@eva.mpg.de wrote:
The only definite way to get the encoding is to parse the ENTIRE file, or to parse up to the first byte sequence that determines the used encoding unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark) at the beginning of a file.
Ok... So I guess the real bug is that Quicklook and other utilities decide to fall back on MacRoman instead of utf8.
Thanks for the clarification :-)
Cheers, V
On 30.06.2008, at 13:04, Vincent Noel wrote:
On Mon, Jun 30, 2008 at 12:49, Hans-Joerg Bibiko bibiko@eva.mpg.de wrote:
The only definite way to get the encoding is to parse the ENTIRE file, or to parse up to the first byte sequence that determines the used encoding unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark) at the beginning of a file.
Ok... So I guess the real bug is that Quicklook and other utilities decide to fall back on MacRoman instead of utf8.
This would be one possibility. But the whole issue is much more complicated. E.g. it is not possible to distinguish the text encoding if the text is stored in one of the ISO-8859-x encodings: each byte sequence would be valid, but each byte represents a different glyph according to its encoding. Even for UTF-8 it is very complex. For instance, the UTF-8 byte sequence C3 A4 (ä) could also be ISO-8859-1 (Ã¤) [and it could be that the file really was meant to be read that way].
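You can see the ambiguity directly; a sketch:

    #import <Foundation/Foundation.h>

    // Sketch: the same two bytes decode successfully under both encodings,
    // just to different characters -- so byte validity alone cannot settle it.
    NSData *data = [NSData dataWithBytes:"\xC3\xA4" length:2];
    NSString *asUTF8 = [[[NSString alloc] initWithData:data
                                              encoding:NSUTF8StringEncoding] autorelease];        // ä
    NSString *asLatin1 = [[[NSString alloc] initWithData:data
                                                encoding:NSISOLatin1StringEncoding] autorelease]; // Ã¤
    NSLog(@"as UTF-8: %@  as Latin-1: %@", asUTF8, asLatin1);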
My general suggestion to Apple would be to introduce a unique 'encoding' attribute. That way, each application could store a file's correct text encoding in that attribute.
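The read side of such a scheme would look roughly like this sketch. It uses com.apple.TextEncoding, which some applications already honour; the "IANA name;CFStringEncoding" value format is an assumption based on what TextEdit writes, and the function name is made up:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    // Sketch: read the declared encoding back from the file, if present.
    int print_encoding_hint(const char *path)
    {
        char value[128];
        ssize_t n = getxattr(path, "com.apple.TextEncoding",
                             value, sizeof(value) - 1, 0, 0);
        if (n < 0)
            return -1;  // no attribute: caller must sniff the bytes or fall back
        value[n] = '\0';
        printf("declared encoding: %s\n", value); // e.g. "utf-8;134217984"
        return 0;
    }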
--Hans