James Edward Gray II james at grayproductions.net
Tue Jul 26 14:13:39 UTC 2005

On Jul 26, 2005, at 7:01 AM, Patrice Neff wrote:

>>> UTF-8 sucks for Japanese and Chinese texts mainly due to space
>>> reasons. If anything makes sense, then it is UTF-16, which
>>> Textmate also supports.
>> Could you explain what you mean by "space reasons"?
> Due to the way UTF-8 works, it uses 1 byte for US-ASCII characters,
> but up to four bytes depending on the Unicode number. Many alphabets
> can be encoded with two bytes (especially the European ones, but also
> Hebrew or Arabic). Chinese and Japanese characters will require three
> or four bytes.

This size issue is largely a myth.

Remember, we are discussing "pictograph" languages. In English, the
word "forest" requires six characters. In Kanji (Chinese/Japanese
pictographs), it requires one. Most common Kanji take three bytes in
UTF-8, but even if we need four bytes to encode that one character,
it will still be smaller than the six bytes of the English encoding.
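
For anyone who wants to check the arithmetic, here is a rough sketch
in Python. The helper name encoded_sizes is made up for this example,
and I am assuming U+68EE is the Kanji for "forest" and using U+20000
as a stand-in for a rare character outside the Basic Multilingual
Plane (the four-byte case):

```python
# Byte counts for the same idea written in English and in Kanji.
# encoded_sizes is a hypothetical helper, not part of any library.

def encoded_sizes(text):
    """Return the encoded length in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

samples = {
    "English 'forest'": "forest",
    "Kanji U+68EE": "\u68ee",          # assumed: the character for "forest"
    "Rare CJK U+20000": "\U00020000",  # outside the BMP, the four-byte case
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Under those assumptions the single Kanji is three bytes in UTF-8 and
two in UTF-16, against six and twelve for the English word.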

It's true that some encodings squeeze Kanji into a smaller space,
but compared with other languages, it's not really accurate to say
that files balloon in size without them.
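
As a small illustration of those tighter encodings, the sketch below
compares UTF-8 with two legacy Japanese encodings that ship with
Python (shift_jis and euc_jp). The function name size_by_encoding is
hypothetical, and U+68EE is again my assumed "forest" character:

```python
# Compare how many bytes one Kanji takes in a few encodings.
# size_by_encoding is a hypothetical helper for illustration only.

def size_by_encoding(text, encodings=("utf-8", "shift_jis", "euc_jp")):
    """Map each encoding name to the encoded length of text in bytes."""
    return {name: len(text.encode(name)) for name in encodings}

print(size_by_encoding("\u68ee"))  # assumed: the Kanji for "forest"
```

The legacy encodings save one byte per character here, which is a
real saving but hardly a ballooning file.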

James Edward Gray II

