[TxMt] encoding detection

Allan Odgaard throw-away-1 at macromates.com
Sun Mar 26 19:57:29 UTC 2006


On 26/3/2006, at 13:53, Yvon Thoraval wrote:

>> You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1,  
>> or iso-8859-15.
> I thought there are character codes in Mac Roman that are not used  
> in cp-1252 and the others ...

Cp-1252 assigns printable characters to the 0x80-0x9F range, which  
iso-8859-x reserves for (non-printable) control codes; MacRoman also  
puts printable characters in that range, just with a different  
mapping. But that’s about it: since each of these encodings maps  
(nearly) every byte value to some character, text which contains  
accented characters in one of the 8 bit legacy encodings will decode  
without error in any of the others.

That’s one of the nice properties of UTF-8: text that was not encoded  
as UTF-8 will, in practically all cases, not form a valid UTF-8 byte  
sequence, so UTF-8 itself can be detected reliably.
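
A minimal sketch of that check in Python (the helper name is mine,  
and the codec names are Python’s standard ones):

    def looks_like_utf8(data: bytes) -> bool:
        """True if the bytes form a valid UTF-8 sequence.

        Legacy 8-bit text containing accented characters almost
        never happens to be valid UTF-8, so a clean decode is
        strong evidence that the file really is UTF-8.
        """
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("É".encode("utf-8")))    # True:  b'\xc3\x89'
    print(looks_like_utf8("É".encode("latin-1")))  # False: a lone 0xC9
                                                   # is an incomplete
                                                   # UTF-8 sequence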

>> The approach I use is frequency analysis, but that’s based on my  
>> assumption of which characters would be the more frequently used  
>> characters.
> then it is based upon the language rather than the string itself?

I don’t think I fully understand this, but yes: the detection  
heuristic is based both on the actual content and on what content is  
expected, which in turn depends on the language.

The problem with this is that it’s not very universal. For example an  
English text containing 8 bit characters likely contains them for  
extra typographic characters like em-dashes, curly quotes, and  
similar, whereas a French text is likely to use the 8 bit characters  
for the accented letters.
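
To illustrate, here is a rough sketch of such a frequency-based guess  
in Python. The per-language “expected” character sets and the scoring  
rule are illustrative assumptions on my part, not the actual  
heuristic TextMate uses:

    # Per-language sets of "expected" 8-bit characters (illustrative,
    # not exhaustive).
    EXPECTED = {
        # English: typographic extras (dashes, curly quotes, ellipsis).
        "en": set("\u2013\u2014\u2018\u2019\u201c\u201d\u2026"),
        # French: accented letters.
        "fr": set("àâäçéèêëîïôöùûüœÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŒ"),
    }

    CANDIDATES = ("utf-8", "mac_roman", "cp1252", "iso-8859-1", "iso-8859-15")

    def guess_encoding(data: bytes, lang: str = "en") -> str:
        """Pick the candidate under which the decoded 8-bit
        characters best match what we expect for the language."""
        best, best_score = CANDIDATES[0], -1.0
        for enc in CANDIDATES:
            try:
                text = data.decode(enc)
            except UnicodeDecodeError:
                continue  # legacy 8-bit data almost never decodes as UTF-8
            high = [c for c in text if ord(c) > 0x7F]
            if not high:
                return enc  # plain ASCII: all candidates agree anyway
            score = sum(c in EXPECTED[lang] for c in high) / len(high)
            if score > best_score:
                best, best_score = enc, score
        return best

UTF-8 is tried first and ties are not replaced (strict “>”), so a  
byte sequence that happens to be valid UTF-8 keeps that guess unless  
a legacy decoding matches the expected characters strictly better.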

If the text contains a 0xC9 byte, then decoded as latin-1 it will be  
an accented uppercase E (‘É’) and decoded as Mac Roman it will be an  
ellipsis (‘…’). There is no simple way to tell just from the 0xC9  
byte what the original encoding of the text was.
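
The ambiguity is easy to demonstrate (again using Python’s codec  
names):

    b = bytes([0xC9])
    print(b.decode("latin-1"))    # 'É'  (U+00C9)
    print(b.decode("mac_roman"))  # '…'  (U+2026)

Both decodes succeed, and nothing in the byte itself says which  
reading was intended.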
