On 30/11/2005, at 3:05, Nicholas Orr wrote:
[...] Thanks for taking the time to reply, but
you've lost me.
Unicode is one thing I haven't kept track of at all.
Don't know if it helps (or has interest), but here's my quick summary
of the unicode encoding situation (factually unchecked):
Unicode code points (which I'll just call characters for simplicity)
were originally 16 bits each. This means to write a text in unicode,
each character would require 16 bits.
UTF-8 is a representation (encoding) of unicode which was devised to
ease the transition to unicode while retaining backwards
compatibility with ASCII (which among other things means that all our
unix tools work gracefully with the files).
From a user's POV UTF-8 has (IMHO) all the advantages, most notably
the ASCII compatibility. From a developer's POV UTF-8 is a little more
difficult to work with because each character can be represented with
multiple bytes, so for example it's not (always) legal to just cut a
string in half, and to access a given index in a string, one needs to
iterate through the string, rather than just do string[index].
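
To make the cutting/indexing point concrete, here is a rough Python sketch. The sample string and the helper names cut_is_safe and char_at are made up for illustration; the only thing it relies on is that UTF-8 continuation bytes always look like 10xxxxxx.

```python
# "naive cafe" with two accented letters, written with escapes to avoid ambiguity
text = "na\u00efve caf\u00e9"   # 10 characters
data = text.encode("utf-8")     # 12 bytes: the two accented letters take 2 bytes each


def cut_is_safe(data: bytes, pos: int) -> bool:
    """True if cutting before byte `pos` does not split a character."""
    # A continuation byte (10xxxxxx) belongs to the character before it.
    return pos == len(data) or data[pos] & 0xC0 != 0x80


def char_at(data: bytes, index: int) -> str:
    """Return character number `index` by walking the bytes from the start."""
    count = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # a new character starts here
            count += 1
            if count == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1             # swallow its continuation bytes
                return data[pos:end].decode("utf-8")
    raise IndexError(index)
```

With this, cut_is_safe(data, 3) is False (byte 3 is the second half of the "\u00ef"), and char_at(data, 9) has to step over every earlier byte before it can return the final "\u00e9".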
For this reason, it might seem more attractive to just work with
the 16 bit characters, at least in memory.
But time passed, and it turned out that 16 bits per character were
too few, so today a unicode character is stored as 32 bits (code
points run up to U+10FFFF), and a 16 bit
unicode text is said to be in UTF-16, which is now just an encoding
for 32 bit unicode (ucs-4). So all the “problems” of UTF-8 now exist
for UTF-16, plus the lack of backwards compatibility, the generally
doubled file size, and the endian issues (i.e. are the 16 bit
characters in big or little endian format, and despite the byte-order
marker, the source that reads UTF-16 needs to know the endianness of
the architecture on which it's compiled).
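
A small Python sketch of those UTF-16 issues, using Python's standard codecs. The variable names and the choice of sample character (U+1D11E, which lies outside the first 65536 code points) are just mine for illustration.

```python
clef = "\U0001D11E"               # one character, beyond the 16 bit range

be = clef.encode("utf-16-be")     # big endian, no byte-order marker
le = clef.encode("utf-16-le")     # little endian, no byte-order marker

# One character, but two 16 bit units (a surrogate pair), so 4 bytes,
# and the byte order inside each unit depends on the endianness.
print(be.hex())                   # d834dd1e
print(le.hex())                   # 34d81edd

# With a byte-order marker in front, the generic "utf-16" codec can tell
# which of the two layouts it is looking at.
from_be = (b"\xfe\xff" + be).decode("utf-16")
from_le = (b"\xff\xfe" + le).decode("utf-16")

# Plain ASCII text doubles in size compared to UTF-8.
ascii_utf8 = "hello".encode("utf-8")       # 5 bytes
ascii_utf16 = "hello".encode("utf-16-le")  # 10 bytes
```

So string[index] on the 16 bit units is no safer than on UTF-8 bytes: index 0 of this text would be half a character.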
So all in all, today UTF-16 is a pretty crappy encoding! :)
In the end I did #3, and it worked fine. It worked so well I went
and bought a copy.
I say "seem" because in my experience this is mostly a theoretical
disadvantage: the need to split or index a string by position is
rare, and when it does happen, one is generally already iterating the
string character by character (to find where to split it etc.).