On 30.06.2008, at 13:04, Vincent Noel wrote:
On Mon, Jun 30, 2008 at 12:49, Hans-Joerg Bibiko bibiko@eva.mpg.de wrote:
The only definite way to determine the encoding is to parse the ENTIRE file, or at least to parse up to the first byte sequence that identifies the encoding unambiguously. Or one relies on the (discouraged) UTF-8 BOM (byte order mark at the beginning of a file).
Ok... So I guess the real bug is that Quick Look and other utilities decide to fall back to MacRoman instead of UTF-8.
This would be one possibility, but the whole issue is much more complicated. E.g. it is not possible to distinguish between the encodings of the ISO-8859 family if the text is stored in one of them: every byte sequence is valid in each of them, but the same byte represents a different glyph depending on the encoding. Even for UTF-8 it is very complex. For instance, the UTF-8 byte sequence C3 A4 (ä) is also valid ISO-8859-1, where it decodes as "Ã¤" [and it is conceivable that this is what the author actually meant].
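The ambiguity can be reproduced in a few lines. This is just an illustrative sketch (Python, chosen for brevity, is not part of the original thread):

```python
# The two bytes C3 A4 are the UTF-8 encoding of "ä" (U+00E4),
# but they are also a perfectly valid ISO-8859-1 byte sequence.
data = bytes([0xC3, 0xA4])

print(data.decode("utf-8"))       # one character: ä
print(data.decode("iso-8859-1"))  # two characters: Ã¤ -- equally "valid"

# The reverse is not symmetric: a lone Latin-1 "ä" (byte E4) is an
# incomplete multi-byte sequence in UTF-8 and fails to decode.
try:
    bytes([0xE4]).decode("utf-8")
except UnicodeDecodeError:
    print("E4 alone is not valid UTF-8")
```

This is why a heuristic detector can prove a file is *not* UTF-8, but can never prove that a byte stream *is* ISO-8859-1 rather than one of its siblings.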
My general suggestion to Apple would be to introduce a dedicated file attribute, e.g. 'encoding'. Each application could then store the correct text encoding in that file attribute.
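Something very close to this already exists in the form of extended attributes. A minimal sketch of the idea, assuming a Linux filesystem that supports the `user.` xattr namespace (on macOS the analogous mechanism would be an `xattr` such as `com.apple.TextEncoding`; the attribute name `user.encoding` here is purely hypothetical):

```python
import errno
import os

def set_encoding_attr(path: str, encoding: str) -> None:
    # Record the text encoding in an extended attribute alongside the file.
    # os.setxattr is Linux-only; macOS would use the xattr tool/module instead.
    os.setxattr(path, b"user.encoding", encoding.encode("ascii"))

def get_encoding_attr(path: str):
    # Return the stored encoding, or None if no attribute was set
    # (or the filesystem does not support extended attributes).
    try:
        return os.getxattr(path, b"user.encoding").decode("ascii")
    except OSError as e:
        if e.errno in (errno.ENODATA, errno.ENOTSUP):
            return None
        raise
```

With such an attribute in place, Quick Look and friends would not need to guess: they could read the declared encoding first and only fall back to heuristics when it is absent.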
--Hans