On 30 Jun 2008, at 12:33, Vincent Noel wrote:
It's especially weird that it seems somebody noticed the problem, and decided to fix it using extended attributes when more standard tools (e.g. the 'file' command) are perfectly able to identify UTF-8 without non-standard trickery...
Using 'file' is a good idea, BUT it only looks at the first (I don't know how many) bytes of a file. I.e. if you have a rather large UTF-8 file containing 'normal' ASCII and the last character is e.g. a ü, 'file' will output: "test.txt: ASCII text, with very long lines" or similar.
The only definite way to get the encoding is to parse the ENTIRE file, or to parse up to the first byte sequence that determines the encoding unambiguously. Alternatively, one can rely on the UTF-8 BOM (byte order mark at the beginning of a file), although its use is not recommended for UTF-8.
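
To illustrate the "parse the ENTIRE file" approach, here is a minimal Python sketch. The function names `classify_encoding` and `classify_file` and the returned labels are made up for this example. Note that it only distinguishes pure ASCII, valid UTF-8, and "something else"; a byte stream that happens to be valid UTF-8 could in principle still have been written in another encoding.

```python
import codecs


def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', 'utf-8-bom' or 'unknown'.

    Hypothetical helper: it inspects every byte, so a single non-ASCII
    character at the very end of a large file is still noticed.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # Strict decoding fails on the first invalid UTF-8 byte sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    if has_bom:
        return "utf-8-bom"
    # All bytes below 0x80 means the content is plain 7-bit ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    return "utf-8"


def classify_file(path: str) -> str:
    """Read the whole file in binary mode and classify its contents."""
    with open(path, "rb") as handle:
        return classify_encoding(handle.read())
```

For example, `classify_encoding(b"a" * 100000 + "ü".encode("utf-8"))` reports "utf-8", because the trailing ü is seen even though everything before it is ASCII.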
--Hans