[TxMt] encoding detection
Allan Odgaard
throw-away-1 at macromates.com
Sun Mar 26 19:57:29 UTC 2006
On 26/3/2006, at 13:53, Yvon Thoraval wrote:
>> You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1,
>> or iso-8859-15.
> I thought there are char codes in Mac Roman not used in cp1252 and
> elsewhere ...
Cp-1252 uses the 0x80-0x9F range for printable characters, where
iso-8859-x only has (unused) control codes; MacRoman also assigns
printable characters to that range, just different ones. But that’s
about it, so e.g. text which contains accented characters in one of
the 8 bit encodings will decode without error in any of the other
8 bit legacy encodings.
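A quick way to see this is a minimal sketch using Python’s codecs
(the codec names are Python’s own, nothing TextMate-specific):

    import unicodedata

    raw = b"\x93"  # a byte from the contested 0x80-0x9F range
    for enc in ("cp1252", "mac_roman", "latin-1"):
        ch = raw.decode(enc)
        print(enc, unicodedata.name(ch, "<unnamed control code>"))
    # cp1252     LEFT DOUBLE QUOTATION MARK
    # mac_roman  LATIN SMALL LETTER I WITH GRAVE
    # latin-1    <unnamed control code>

All three decodes succeed; note that Python’s latin-1 codec maps
0x80-0x9F to C1 control codes rather than rejecting them, so even
there the byte is “valid”, just not printable.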
That’s one of the nice properties of UTF-8: text that wasn’t encoded
as UTF-8 will (in all practical cases) not be a valid UTF-8 byte
sequence.
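A minimal sketch of such a check, in Python for brevity (this is
just an illustration of the property, not TextMate’s code):

    def looks_like_utf8(data):
        # Strict UTF-8 decoding rejects any byte sequence that
        # breaks the encoding's rules, e.g. a stray high byte
        # without the required continuation bytes after it.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("é".encode("utf-8")))  # True  (0xC3 0xA9)
    print(looks_like_utf8(b"caf\xe9"))           # False (latin-1 'café')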
>> The approach I use is frequency analysis, but that’s based on my
>> assumption of which characters would be the more frequently used
>> characters.
> then it is based upon the language rather than the string itself?
I don’t think I understand this. But the detection heuristic will be
based on the actual content and on the expected content.
The problem with this is that it’s not very universal. For example,
an English text containing 8 bit characters likely does so because
of extra typographic characters like em-dash, curly quotes, and
similar, whereas a French text is likely to use the 8 bit characters
for the accented letters.
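As an illustration only (none of this is TextMate’s actual
heuristic), a crude frequency-based detector might decode the bytes
under each candidate encoding and count how many of the resulting
non-ASCII characters are ones we expect for the language at hand.
The candidate list and the expected-character set below are
assumptions made up for the sketch:

    # Crude sketch of frequency-based detection. Candidates and the
    # EXPECTED set are illustrative assumptions, not TextMate's tables.
    EXPECTED = set("éèêëàâçîïôûù«»")  # characters common in French text

    def guess_encoding(data, candidates=("utf-8", "mac_roman",
                                         "cp1252", "iso-8859-15")):
        best, best_score = None, -1
        for name in candidates:
            try:
                text = data.decode(name)
            except UnicodeDecodeError:
                continue  # not even a valid byte sequence here
            # Count non-ASCII characters that look plausible.
            score = sum(1 for ch in text
                        if ord(ch) > 0x7F and ch in EXPECTED)
            if score > best_score:
                best, best_score = name, score
        return best

    print(guess_encoding(b"r\xe9sum\xe9"))  # 'cp1252' ('résumé')

Ties between cp1252 and iso-8859-15 fall to candidate order here,
which mirrors the point above: for most accented text the two are
simply indistinguishable.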
If the text contains a 0xC9 byte, then decoded as latin-1 it will be
an accented uppercase E (‘É’), while decoded as MacRoman it will be
an ellipsis (‘…’). There is no simple way to tell from the 0xC9 byte
alone what the original encoding of the text was.
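This is easy to reproduce, again using Python’s codec names as a
stand-in:

    raw = b"\xc9"
    print(raw.decode("latin-1"))    # 'É' (accented uppercase E)
    print(raw.decode("mac_roman"))  # '…' (horizontal ellipsis)
    # Both decodes succeed, so the byte value alone tells us nothing
    # about which encoding the author actually used.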