On 26/3/2006, at 13:53, Yvon Thoraval wrote:
You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1, or iso-8859-15.
I thought there are char codes in MacRoman not used in cp1252 and the others ...
Cp-1252 assigns printable characters to the 0x80-0x9F range, which iso-8859-x leaves to control codes; MacRoman fills that range with printable characters too, just different ones. But that's about it, so e.g. text that contains accented characters in one of these 8-bit encodings will decode without error in any of the other 8-bit legacy encodings.
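To make that concrete (Python here purely for illustration, not something from the original mail, and 'café' is just an arbitrary sample word):

    data = 'café'.encode('iso-8859-1')   # b'caf\xe9'

    for enc in ('iso-8859-1', 'iso-8859-15', 'cp1252', 'mac_roman'):
        # every decode succeeds; only the resulting text differs
        print(enc, '->', data.decode(enc))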
That’s one of the nice properties of UTF-8: text that wasn’t encoded as UTF-8 will, in practically all cases, not be a valid UTF-8 byte sequence.
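So a first-pass check can simply be "does it decode as UTF-8?". A minimal sketch, assuming Python and a function name of my own choosing:

    def looks_like_utf8(data: bytes) -> bool:
        """True if the bytes form a valid UTF-8 sequence, so very likely UTF-8."""
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8('café'.encode('utf-8')))       # True
    print(looks_like_utf8('café'.encode('iso-8859-1')))  # False: a lone 0xE9 byte is not valid UTF-8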
The approach I use is frequency analysis, but that relies on my assumptions about which characters will be the most frequently used ones.
Then it is based upon the language rather than the string itself?
I don’t think I understand this. But the detection heuristic will be based on both the actual content and the content you expect to find.
The problem with this is that it isn’t very universal. For example, an English text that contains 8-bit characters most likely does so because of extra typographic characters like em dashes, curly quotes, and the like, whereas a French text is likely to use the 8-bit characters for accented letters.
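Here is a rough sketch of what I mean by frequency analysis (Python again; the EXPECTED set and the function name are my own guesses aimed at French text, not a universal answer):

    # Score each candidate decoding by how many of its non-ASCII characters
    # fall in a set we expect to see often in the target language.
    EXPECTED = set('àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŒÆ«»’')

    def guess_encoding(data, candidates=('utf-8', 'iso-8859-1', 'iso-8859-15',
                                         'cp1252', 'mac_roman')):
        best, best_score = None, -1.0
        for enc in candidates:
            try:
                text = data.decode(enc)
            except UnicodeDecodeError:
                continue                  # normally only UTF-8 fails here
            high = [ch for ch in text if ord(ch) > 0x7f]
            if not high:
                return enc                # plain ASCII: any candidate will do
            score = sum(ch in EXPECTED for ch in high) / len(high)
            if score > best_score:
                best, best_score = enc, score
        return best

    print(guess_encoding('déjà vu'.encode('mac_roman')))   # should favour mac_roman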
If the text contains a 0xC9 byte, then decoded as latin-1 it will be an accented uppercase E (‘É’), and decoded as MacRoman it will be an ellipsis (‘…’). There is no simple way to tell from the 0xC9 alone what the original encoding of the text was.
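In other words (Python once more, just to show the two readings of that single byte):

    raw = b'\xc9'
    print(raw.decode('iso-8859-1'))   # 'É' (latin-1 reads 0xC9 as E acute)
    print(raw.decode('mac_roman'))    # '…' (MacRoman reads the same byte as an ellipsis)
    # both calls succeed, so the byte alone cannot tell us which was intended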