On Jul 26, 2005, at 7:01 AM, Patrice Neff wrote:
UTF-8 sucks for Japanese and Chinese texts, mainly for space reasons. If anything makes sense, it is UTF-16, which TextMate also supports.
Could you explain what you mean by "space reasons"?
Due to the way UTF-8 works, it uses one byte for US-ASCII characters but up to four bytes, depending on the Unicode code point. Many alphabets can be encoded with two bytes per character (especially the European ones, but also Hebrew and Arabic). Chinese and Japanese characters require three or four bytes.
This size issue is largely a myth.
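The per-character byte counts above are easy to verify; here's a quick sketch in Python (the sample characters are my own picks, one per script mentioned above):

# UTF-8 byte lengths for one character from each script mentioned above
samples = {
    "A": "US-ASCII letter",
    "é": "accented Latin letter",
    "א": "Hebrew letter",
    "森": "Kanji (forest)",
}
for char, description in samples.items():
    print(char, len(char.encode("utf-8")), "byte(s) -", description)
# => 1, 2, 2, and 3 bytes respectively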
Remember, we are discussing "pictograph" languages. In English, the word "forest" requires six characters. In Kanji (Chinese/Japanese pictographs), it requires one. Even if we need four bytes to encode that one character, that is still smaller than the six bytes for the English word.
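To put numbers on that, using the same 森 from above as the Kanji for forest:

# UTF-8 size of the English word vs. the single Kanji
english = "forest"
kanji = "森"
print(len(english.encode("utf-8")))  # => 6 bytes
print(len(kanji.encode("utf-8")))    # => 3 bytes (well under the four-byte worst case)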
It's true that some encodings squeeze Kanji into less space, but it's not really accurate to say that files balloon in size without them, at least not in comparison with other languages.
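One example: Shift_JIS packs most Kanji into two bytes where UTF-8 needs three, so the savings is real but modest. A quick check:

# UTF-8 vs. Shift_JIS byte counts for the same character
kanji = "森"
print(len(kanji.encode("utf-8")))      # => 3 bytes
print(len(kanji.encode("shift_jis")))  # => 2 bytes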
James Edward Gray II