Hey all,
I wonder if there is encoding detection in TextMate that is usable in another app.
I'd like to discriminate between:
iso-8859-1, iso-8859-15, cp-1252 and Mac Roman.
For the time being, using a Ruby regex, I'm only able to detect US-ASCII and UTF-8...
In case you can shed some light on that...
Yvon
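[Editor's note: the US-ASCII/UTF-8 part of the check can be sketched without a hand-written regex in modern Ruby. This is an illustration, not the code from the original mail; the helper name is hypothetical, and Ruby 1.8 of 2006 lacked these String methods.]

```ruby
# Hypothetical helper: classify a byte string as pure US-ASCII,
# valid UTF-8, or some unknown 8-bit legacy encoding.
def detect_ascii_or_utf8(bytes)
  s = bytes.b.force_encoding(Encoding::UTF_8)
  return :ascii if s.valid_encoding? && s.ascii_only?
  return :utf8  if s.valid_encoding?
  :legacy_8bit
end
```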
On 26/3/2006, at 11:51, Yvon Thoraval wrote:
I wonder if there is encoding detection in TextMate that is usable in another app. [...]
You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1, or iso-8859-15.
The approach I use is frequency analysis, but that's based on my assumptions about which characters would be the most frequently used.
On 26/3/2006, at 13:02, Allan Odgaard wrote:
You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1, or iso-8859-15.
I thought there are character codes in Mac Roman not used in cp-1252 and the others...
The approach I use is frequency analysis, but that's based on my assumptions about which characters would be the most frequently used.
Then it is based upon the language rather than the string itself?
best,
Yvon
On 26/3/2006, at 13:53, Yvon Thoraval wrote:
You can’t tell whether a file is in MacRoman, cp-1252/iso-8859-1, or iso-8859-15.
I thought there are character codes in Mac Roman not used in cp-1252 and the others...
Cp-1252 uses the 0x80-0x9F range for printable characters; in the iso-8859-x encodings that range is reserved for control codes (Mac Roman, for what it's worth, assigns printable characters there as well). But that's it, so e.g. text which contains accented characters in one of the 8-bit encodings will typically be valid in any of the 8-bit legacy encodings.
That's one of the nice properties of UTF-8: text that was not actually encoded as UTF-8 will, in all practical cases, not form a valid UTF-8 byte sequence.
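[Editor's note: the two observations above combine into a small predicate. This is my sketch, not TextMate's code; the helper name is made up. A byte in 0x80-0x9F argues against iso-8859-1/-15, since that range holds only control codes there, leaving cp-1252 (or Mac Roman, which also assigns printable characters in that range) as the likely candidates.]

```ruby
# Hypothetical helper: true when the bytes are not valid UTF-8 and
# contain a byte in 0x80-0x9F, i.e. when an iso-8859 encoding is
# unlikely and cp-1252 / Mac Roman become the main suspects.
def rules_out_iso8859?(bytes)
  b = bytes.b
  return false if b.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  b.each_byte.any? { |c| (0x80..0x9F).cover?(c) }
end
```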
The approach I use is frequency analysis, but that's based on my assumptions about which characters would be the most frequently used.
Then it is based upon the language rather than the string itself?
I don’t think I understand this. But the detection heuristic will be based on the content and expected content.
The problem with this is that it's not very universal. For example, an English text containing 8-bit characters likely does so because of extra typographic characters such as em-dashes and curly quotes, whereas a French text is likely to use the 8-bit characters for accented letters.
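[Editor's note: a toy version of such a frequency heuristic, purely my own illustration rather than Allan's actual code. The PLAUSIBLE character set is an assumption and would need tuning per language; ties go to the first candidate in the list.]

```ruby
# Decode the bytes under each candidate encoding and prefer the one
# whose high-bit characters look like plausible letters and
# typographic punctuation.
PLAUSIBLE  = /[[:alpha:]’‘“”–…]/
CANDIDATES = [Encoding::ISO_8859_1,
              Encoding::Windows_1252,
              Encoding.find("macRoman")]

def guess_encoding(bytes)
  CANDIDATES.max_by do |enc|
    decoded = bytes.b.force_encoding(enc)
                   .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
    decoded.each_char.count { |c| c.ord > 127 && c =~ PLAUSIBLE }
  end
end
```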
If the text contains a 0xC9 byte, then decoded as latin-1 it will be an accented uppercase E (‘É’) and decoded as Mac Roman it will be an ellipsis (‘…’). There is no simple way to tell, just from the 0xC9, what the original encoding of the text was.
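[Editor's note: the 0xC9 example can be reproduced directly with Ruby's transcoding API; a small illustration, with Mac Roman looked up by name since its constant spelling has varied between Ruby versions.]

```ruby
# The same byte decoded under two of the legacy encodings:
byte     = "\xC9".b
latin1   = byte.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
macroman = byte.dup.force_encoding(Encoding.find("macRoman")).encode(Encoding::UTF_8)
puts latin1    # accented uppercase E
puts macroman  # horizontal ellipsis
```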
Yvon Thoraval wrote:
I thought there are character codes in Mac Roman not used in cp-1252 and the others...
cp-1252 uses close to the entire 8-bit range (except the lower control area, which none of these charsets use for printable characters), so it's really impossible to single it out. I don't know about Mac Roman, but latin-1 and latin-9 (8859-15) are likely well over 99% identical.
So I guess you can: if an illegal latin-1 character is used and the text is not UTF-8 (which can be detected reliably), then it's one of Mac Roman, latin-9, or cp-1252 :p. Not exactly very useful.
-- Sune.
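[Editor's note: a sketch of this elimination chain, with hypothetical names; it is my reading of the mail, not code from the thread. The mail keeps latin-9 in the second set, but since latin-9 also reserves 0x80-0x9F for control codes, it is dropped from that branch here. As the mail says, the result is rarely a single encoding.]

```ruby
# Narrow the candidate set: valid UTF-8 is conclusive; otherwise a
# byte in 0x80-0x9F argues against the iso-8859 family, leaving
# cp-1252 and Mac Roman; failing that, all four remain possible.
def candidate_encodings(bytes)
  b = bytes.b
  return [:utf8] if b.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  if b.each_byte.any? { |c| (0x80..0x9F).cover?(c) }
    [:cp1252, :mac_roman]
  else
    [:latin1, :latin9, :cp1252, :mac_roman]
  end
end
```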