On 26/3/2006, at 13:53, Yvon Thoraval wrote:
You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1, or iso-8859-15.
I thought there are char codes in MacRoman not used in cp1252 and the others ...
Cp-1252 assigns printable characters to the 0x80-0x9F range, which iso-8859-x leaves to control codes; MacRoman fills that range with printable characters too, just different ones. But that's about it, so e.g. text that contains accented characters in one of these 8-bit encodings will decode without error in any of the other 8-bit legacy encodings.
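To make that concrete (Python here purely for illustration, not something from the original mail, and 'café' is just an arbitrary sample word):

    data = 'café'.encode('iso-8859-1')   # b'caf\xe9'

    for enc in ('iso-8859-1', 'iso-8859-15', 'cp1252', 'mac_roman'):
        # every decode succeeds; only the resulting text differs
        print(enc, '->', data.decode(enc))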
That’s one of the nice properties of UTF-8: text that wasn’t encoded as UTF-8 will, in practically all cases, not be a valid UTF-8 byte sequence.
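So a first-pass check can simply be "does it decode as UTF-8?". A minimal sketch, assuming Python and a function name of my own choosing:

    def looks_like_utf8(data: bytes) -> bool:
        """True if the bytes form a valid UTF-8 sequence, so very likely UTF-8."""
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8('café'.encode('utf-8')))       # True
    print(looks_like_utf8('café'.encode('iso-8859-1')))  # False: a lone 0xE9 byte is not valid UTF-8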
The approach I use is frequency analysis, but that relies on my assumptions about which characters will be the most frequently used ones.
Then it is based upon the language rather than the string itself?
I don’t think I understand this. But the detection heuristic will be based on both the actual content and the content you expect to find.
The problem with this is that it isn’t very universal. For example, an English text that contains 8-bit characters most likely does so because of extra typographic characters like em dashes, curly quotes, and the like, whereas a French text is likely to use the 8-bit characters for accented letters.
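Here is a rough sketch of what I mean by frequency analysis (Python again; the EXPECTED set and the function name are my own guesses aimed at French text, not a universal answer):

    # Score each candidate decoding by how many of its non-ASCII characters
    # fall in a set we expect to see often in the target language.
    EXPECTED = set('àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŒÆ«»’')

    def guess_encoding(data, candidates=('utf-8', 'iso-8859-1', 'iso-8859-15',
                                         'cp1252', 'mac_roman')):
        best, best_score = None, -1.0
        for enc in candidates:
            try:
                text = data.decode(enc)
            except UnicodeDecodeError:
                continue                  # normally only UTF-8 fails here
            high = [ch for ch in text if ord(ch) > 0x7f]
            if not high:
                return enc                # plain ASCII: any candidate will do
            score = sum(ch in EXPECTED for ch in high) / len(high)
            if score > best_score:
                best, best_score = enc, score
        return best

    print(guess_encoding('déjà vu'.encode('mac_roman')))   # should favour mac_roman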
If the text contains a 0xC9 byte, then decoded as latin-1 it will be an accented uppercase E (‘É’), and decoded as MacRoman it will be an ellipsis (‘…’). There is no simple way to tell from the 0xC9 alone what the original encoding of the text was.
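In other words (Python once more, just to show the two readings of that single byte):

    raw = b'\xc9'
    print(raw.decode('iso-8859-1'))   # 'É' (latin-1 reads 0xC9 as E acute)
    print(raw.decode('mac_roman'))    # '…' (MacRoman reads the same byte as an ellipsis)
    # both calls succeed, so the byte alone cannot tell us which was intended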