[TxMt] Tidy and XML

Allan Odgaard throw-away-1 at macromates.com
Fri Dec 2 06:31:39 UTC 2005


On 30/11/2005, at 3:05, Nicholas Orr wrote:

> [...] Thanks for taking the time to reply, but you've lost me.   
> Unicode is one thing I haven't kept track of at all.

Don't know if it helps (or is of interest), but here's my quick summary
of the Unicode encoding situation (factually unchecked):

Unicode code points (which I'll just call characters for simplicity)
were originally 16 bits each. This means that to write a text in Unicode,
each character would require 16 bits.

UTF-8 is a representation (encoding) of Unicode which was devised to
provide a way to transition to Unicode while retaining backwards
compatibility with ASCII (which among other things means that all our
unix tools work gracefully with the files).
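
To make the backwards compatibility concrete, here is a small sketch in Python (my choice of language for illustration, and the sample strings are arbitrary). It only shows that ASCII text is unchanged by UTF-8 and that the bytes of a non-ASCII character all have the high bit set:

```python
# Plain ASCII text is byte-for-byte identical when stored as UTF-8.
ascii_text = "Tidy and XML"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# A non-ASCII character becomes a multi-byte sequence in which every byte
# has its high bit set, so none of them can be mistaken for an ASCII byte.
encoded = "é".encode("utf-8")
high_bit_only = all(byte >= 0x80 for byte in encoded)

print(same_bytes)      # True
print(encoded)         # b'\xc3\xa9'
print(high_bit_only)   # True
```

That second property is why tools which only care about ASCII bytes (newlines, slashes, quotes) tend to keep working on UTF-8 files.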

From a user's POV UTF-8 has (IMHO) all the advantages, most notably
the ASCII compatibility. From a developer's POV UTF-8 is a little more
difficult to work with because each character can be represented with
multiple bytes, so for example it's not (always) legal to just cut a
string in half, and to access a given index in a string, one needs to
iterate through the string, rather than just do string[index].
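
A rough Python sketch of both problems, working on raw UTF-8 bytes (Python's own str type hides this, so bytes stand in for what a C string would look like). The helper name char_at is made up for this example:

```python
def char_at(data: bytes, index: int) -> str:
    """Return the character at `index` (counted in characters, not bytes)
    by walking the UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue the previous character;
        # any other byte starts a new one.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")


data = "naïve".encode("utf-8")   # 5 characters, 6 bytes

# Cutting the byte string in half lands in the middle of the 2-byte "ï".
half = data[:len(data) // 2]
try:
    half.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(cut_is_valid)       # False
print(char_at(data, 2))   # ï
print(char_at(data, 3))   # v
```

Note that data[2] on its own would just give the first byte of "ï", which is why the lookup has to scan from the beginning.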

For this reason, it might seem [1] more attractive to just work with  
the 16 bit characters, at least in memory.

But time passed, and it turned out that 16 bit characters were
optimistic, so today a Unicode character needs up to 21 bits and is
generally stored as 32 bits (UCS-4), and a 16 bit Unicode text is said
to be in UTF-16, which is now just an encoding for those larger
characters. So all the “problems” of UTF-8 now exist for UTF-16, plus
the lack of backwards compatibility, the generally double file size
(for mostly-ASCII text), and the endian issues (i.e. are the 16 bit
characters in big or little endian format; and despite the
byte-order-marker, the source that reads UTF-16 needs to know the
endianness of the architecture on which it's compiled).
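
Again as a Python sketch (the sample characters are arbitrary picks), this is roughly what those UTF-16 issues look like in practice: one character taking two 16 bit units, ASCII text doubling in size, and the same bytes meaning something else when read with the wrong byte order:

```python
clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the 16 bit range

# One character, but two 16 bit units (a surrogate pair) in UTF-16.
units = len(clef.encode("utf-16-be")) // 2

# ASCII text takes twice the space compared to UTF-8.
text = "hello"
size_utf8 = len(text.encode("utf-8"))
size_utf16 = len(text.encode("utf-16-le"))

# The same character in the two byte orders.
big = "A".encode("utf-16-be")
little = "A".encode("utf-16-le")

# Reading big endian data as little endian silently gives another character.
misread = big.decode("utf-16-le")

print(units)                   # 2
print(size_utf8, size_utf16)   # 5 10
print(big, little)             # b'\x00A' b'A\x00'
print(hex(ord(misread)))       # 0x4100
```

Python's plain "utf-16" codec writes a byte-order-marker and honours it when decoding, which papers over the issue for whole files, but code that reads the 16 bit units directly still has to deal with the byte order itself.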

So all in all, today UTF-16 is a pretty crappy encoding! :)

> In the end I did #3, and it worked fine.  It worked so well I went  
> and bought a copy.

Thanks!


[1] I say “seem” because in my experience this is mostly a theoretical
disadvantage: the need to split/index a string based on an index is
rare, and when it happens, one is generally already iterating the string
character by character (to find where to split it etc.).
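
As an illustration of that last point, a minimal Python sketch (split_at_first is a hypothetical helper, and the sample text is arbitrary): when the split position comes from searching the string, the byte offset found is already a valid character boundary in UTF-8, so no character index is ever needed.

```python
def split_at_first(data: bytes, delimiter: str):
    """Split UTF-8 bytes at the first occurrence of `delimiter`.

    A match of a complete encoded character always starts on a character
    boundary in UTF-8, so both halves remain valid UTF-8.
    """
    needle = delimiter.encode("utf-8")
    pos = data.find(needle)
    if pos == -1:
        return data, b""
    return data[:pos], data[pos + len(needle):]


head, tail = split_at_first("smørrebrød og øl".encode("utf-8"), " ")
print(head.decode("utf-8"))   # smørrebrød
print(tail.decode("utf-8"))   # og øl
```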



