Unicode encodings (was Re: [TxMt] Tidy and XML)

Erwan David erwan at rail.eu.org
Fri Dec 2 14:51:38 UTC 2005


On Fri 2/12/2005, Chris Thomas wrote:
> If you want additional detail about Unicode encoding geekery:
> 
> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
> ...
> >UTF-16 is probably what most people thought most programmers would  
> >use for Unicode; this is reflected in the fact that the native  
> >character type in both Java and C# is a sixteen-bit quantity. Of  
> >course, it doesn't really represent a Unicode character, exactly  
> >(although it does most times), it represents a UTF-16 codepoint.
> >
> >UTF-16 is about the most efficient way possible of representing  
> >Asian character strings, each character nestling snugly into two  
> >bytes of storage. For ASCII characters, of course, you end up using  
> >two bytes to represent what would actually fit into one.
> ...

Except that UTF-16 comes in two byte orders (big-endian and
little-endian), and a lot of preexisting software treats a 0 byte as the
end of a string. Every ASCII character encoded as UTF-16 contains one...
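
To make both points concrete, here is a minimal Python sketch. The sample
string "Hi" and the helper c_string_view are assumptions made up for
illustration; the helper only imitates what code that stops at the first
0 byte would see, it is not taken from any real library.

```python
# Illustrative sketch only: the sample text is an assumption, not from the thread.
text = "Hi"

# The same two characters give different bytes depending on byte order.
big = text.encode("utf-16-be")     # high byte first
little = text.encode("utf-16-le")  # low byte first
print(big)     # b'\x00H\x00i'
print(little)  # b'H\x00i\x00'


def c_string_view(data: bytes) -> bytes:
    """Hypothetical helper: return what a NUL-terminated reader would see,
    i.e. everything before the first 0 byte."""
    return data.split(b"\x00", 1)[0]


# Software that treats a 0 byte as end of string truncates UTF-16 data.
print(c_string_view(big))                   # b''
print(c_string_view(little))                # b'H'
# UTF-8 has no 0 bytes for this text, so it passes through intact.
print(c_string_view(text.encode("utf-8")))  # b'Hi'
```

For ASCII text, UTF-8 avoids both problems: it has no byte-order variants
and never produces a 0 byte except for U+0000 itself.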

-- 
Erwan David


