On 30/11/2005, at 3:05, Nicholas Orr wrote:
[...] Thanks for taking the time to reply, but you've lost me. Unicode is one thing I haven't kept track of at all.
Don't know if it helps (or is of interest), but here's my quick summary of the unicode encoding situation (factually unchecked):
Unicode code points (which I'll just call characters for simplicity) were originally 16 bits each. This means to write a text in unicode, each character would require 16 bits.
UTF-8 is a representation (encoding) of unicode which was devised as a way to transition to unicode while still retaining backwards compatibility (which, amongst other things, means that all our unix tools work gracefully with the files).
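To make the backwards compatibility concrete, here is a small Python sketch (the sample strings are just made up for illustration). It shows that pure ASCII text has exactly the same bytes in UTF-8, and that a non-ASCII character only uses bytes of 0x80 and above, so it can never be mistaken for an ASCII byte such as '/' or a newline:

```python
# Illustrative strings only; any ASCII text behaves the same way.
command = "grep -n main *.c"

# ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = command.encode("ascii") == command.encode("utf-8")
print(same_bytes)  # True

# A non-ASCII character becomes several bytes, all with the high bit set,
# so tools scanning for ASCII bytes never match inside a character.
i_diaeresis = "\u00ef".encode("utf-8")       # the character "ï"
print(i_diaeresis.hex())                     # c3af
high_bit_only = all(b >= 0x80 for b in i_diaeresis)
print(high_bit_only)                         # True
```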
From a user's POV UTF-8 has (IMHO) all the advantages, most notably the ASCII compatibility. From a developer's POV UTF-8 is a little more difficult to work with because each character can be represented with multiple bytes, so for example it's not (always) legal to just cut a string in half, and to access the character at a given index, one needs to iterate through the string rather than just do string[index].
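A rough sketch of both developer-side issues, in Python working on raw bytes. The helper names utf8_offset and is_valid_utf8 are hypothetical, made up for this example, and the sample word is arbitrary:

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Return the byte offset where character number `index` starts."""
    count = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character;
        # every other byte starts a new one.
        if byte & 0xC0 != 0x80:
            if count == index:
                return offset
            count += 1
    if count == index:
        return len(data)  # one past the last character
    raise IndexError("character index out of range")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "blåbærgrød"
data = word.encode("utf-8")       # 10 characters, 13 bytes

# Cutting the byte string in half lands between the two bytes of "æ".
half = data[: len(data) // 2]
print(is_valid_utf8(half))        # False

# Finding character 5 means walking the bytes from the start.
offset = utf8_offset(data, 5)
print(offset, data[:offset].decode("utf-8"))  # 7 blåbæ
```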
For this reason, it might seem [1] more attractive to just work with the 16 bit characters, at least in memory.
But time passed, and it turned out that 16 bits per character were not enough, so today a unicode character can need up to 21 bits (in practice stored as 32), and a 16 bit unicode text is said to be in UTF-16, which is now just an encoding for 32 bit unicode (UCS-4). So all the “problems” of UTF-8 now exist for UTF-16, plus the lack of backwards compatibility, the generally double file size (for mostly ASCII text), and the endian issues (i.e. are the 16 bit characters in big or little endian format? Even with the byte-order marker, the code that reads UTF-16 needs to know the endianness of the architecture on which it's compiled).
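The UTF-16 points above can be checked with a few lines of Python (a sketch only; the G clef character and the "hello" string are arbitrary picks). A character above U+FFFF takes two 16 bit units, ASCII text doubles in size, and the same character has two possible byte orders:

```python
# A code point above U+FFFF does not fit in one 16 bit unit.
clef = "\U0001D11E"                       # MUSICAL SYMBOL G CLEF
clef_utf16 = clef.encode("utf-16-be")
print(clef_utf16.hex())                   # d834dd1e (a surrogate pair)
print(len(clef_utf16) // 2)               # 2 units for 1 character

# Mostly-ASCII text takes twice the space in UTF-16.
greeting = "hello"
size_utf8 = len(greeting.encode("utf-8"))
size_utf16 = len(greeting.encode("utf-16-le"))
print(size_utf8, size_utf16)              # 5 10

# The same character has two byte orders.
little = "A".encode("utf-16-le")          # b'A\x00'
big = "A".encode("utf-16-be")             # b'\x00A'
print(little != big)                      # True

# With a byte-order marker in front, the "utf-16" codec picks the order.
from_le = b"\xff\xfe" + little
from_be = b"\xfe\xff" + big
print(from_le.decode("utf-16") == from_be.decode("utf-16") == "A")  # True
```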
So all in all, today UTF-16 is a pretty crappy encoding! :)
In the end I did #3, and it worked fine. It worked so well I went and bought a copy.
Thanks!
[1] I say “seem” because in my experience this is mostly a theoretical disadvantage: the need to split or index a string by position is rare, and when it happens, one is generally already iterating over the string character by character (to find where to split it etc.).
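As a small illustration of the footnote (the sample line and the ';' separator are invented for the example): when the split position comes from scanning for a separator, the scan already yields a byte offset, and because ASCII bytes never occur inside a multi-byte UTF-8 character, both halves stay valid:

```python
line = "søren;grød;æble".encode("utf-8")

# Scanning for the separator gives a byte offset directly,
# so no character-index lookup is needed.
cut = line.find(b";")
head, tail = line[:cut], line[cut + 1:]

print(head.decode("utf-8"))   # søren
print(tail.decode("utf-8"))   # grød;æble
```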