On 26/07/2005, at 6.26, Patrice Neff wrote:
[...] while with English and most European languages you will save a lot of space using UTF-8 compared to UTF-16. And the latter was IMHO one of the main reasons for developing UTF-8.
Well, at best you'll save 50%, whereas enabling gzip as transfer compression will likely save you >75% :)
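A rough sketch of that arithmetic in Python. The sample sentence, the `saving` helper and the choice of `utf-16-le` (to leave out the byte order mark) are assumptions made for illustration. The sample is deliberately repetitive, so gzip looks better here than it would on typical text.

```python
import gzip

# Hypothetical sample: pure ASCII text, repeated so gzip has something to find.
sample = "The quick brown fox jumps over the lazy dog. " * 50

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")  # little-endian, no byte order mark


def saving(before: bytes, after: bytes) -> float:
    """Fraction of bytes saved when going from `before` to `after`."""
    return 1 - len(after) / len(before)


# Pure ASCII is the best case for UTF-8: one byte per character instead of two.
print(f"UTF-16 -> UTF-8:        {saving(utf16, utf8):.0%} saved")

# Compressing the UTF-16 bytes directly, without changing the encoding.
print(f"UTF-16 -> gzip(UTF-16): {saving(utf16, gzip.compress(utf16)):.0%} saved")
```

For text that is not pure ASCII the first figure drops below 50%, since those characters take two or more bytes in UTF-8.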
The motivation for UTF-8 is that ASCII characters are encoded exactly as they would be in a plain ASCII document.
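To make that concrete, here is a minimal Python sketch. The helper names `utf8_matches_ascii` and `all_bytes_high` are made up for this example.

```python
def utf8_matches_ascii(text: str) -> bool:
    """True if the UTF-8 bytes of `text` are identical to its ASCII bytes."""
    return text.encode("utf-8") == text.encode("ascii")


def all_bytes_high(text: str) -> bool:
    """True if every UTF-8 byte of `text` has the high bit set (>= 0x80)."""
    return all(b >= 0x80 for b in text.encode("utf-8"))


# An ASCII-only string is byte-for-byte the same in both encodings.
print(utf8_matches_ascii("Hello, world"))

# A non-ASCII character becomes a multi-byte sequence, and none of its
# bytes can be mistaken for an ASCII character.
print("é".encode("utf-8"), all_bytes_high("é"))

# UTF-16 does not have this property: even plain ASCII gets a zero byte.
print("A".encode("utf-16-le"))
```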
This means that a lot of existing software doesn't need to be updated to actually handle UTF-8 (as long as it is 8-bit clean). For example, I use UTF-8 for my source code even though my compiler isn't UTF-8 aware, which means I can use non-ASCII characters in strings and comments. Some compilers/interpreters (e.g. PHP) will also allow user-defined variable names to be in UTF-8 (while still only knowing about the ASCII tokens).
So UTF-8 exists because a lot of software is made to work with 8-bit sequences (not 16-bit ones, as UTF-16 would have called for), and some software will look for tokens encoded as ASCII in these 8-bit sequences.
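A small Python sketch of such software: a scanner that only knows the ASCII double quote (byte 0x22) and treats everything else as opaque bytes. The function `find_string_literals` and the sample source line are hypothetical and do not come from any real compiler.

```python
def find_string_literals(source: bytes) -> list[bytes]:
    """Byte-oriented scanner: splits on the ASCII quote byte 0x22 only.

    It knows nothing about UTF-8 and never decodes the input.
    """
    literals = []
    start = None
    for i, b in enumerate(source):
        if b == 0x22:  # ASCII '"'
            if start is None:
                start = i + 1  # opening quote
            else:
                literals.append(source[start:i])  # closing quote
                start = None
    return literals


# Hypothetical source line with non-ASCII text inside the string literals.
source = 'print("blåbærgrød"); print("naïve")'.encode("utf-8")
literals = find_string_literals(source)
print([lit.decode("utf-8") for lit in literals])

# In UTF-8 no multi-byte sequence contains the byte 0x22, so the scanner
# cannot be fooled. In UTF-16 it can: U+2200 is encoded as 0x00 0x22.
print(b'"' in "\u2200".encode("utf-8"), b'"' in "\u2200".encode("utf-16-le"))
```

The scanner works unchanged because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never collide with an ASCII token.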
UTF-8 is a brilliant way to give this software access to the full Unicode range.