[TxMt] Re: textmate Digest, Vol 10, Issue 33
James Edward Gray II
james at grayproductions.net
Tue Jul 26 14:13:39 UTC 2005
On Jul 26, 2005, at 7:01 AM, Patrice Neff wrote:
>>> UTF-8 sucks for Japanese and Chinese texts mainly due to space
>>> reasons. If anything makes sense, then it is UTF-16, which
>>> Textmate also supports.
>>>
>>
>> Could you explain what you mean by "space reasons"?
>>
>
> Due to the way UTF-8 works, it uses 1 byte for US-ASCII characters,
> but up to four bytes depending on the Unicode number. Many alphabets
> can be encoded with two bytes (especially the European ones, but also
> Hebrew or Arabic). Chinese and Japanese characters will require three
> or four bytes.
This size issue is largely a myth.
Remember, we are discussing "pictograph" languages. In English, the
word "forest" requires six characters. In Kanji (Chinese/Japanese
pictographs), it requires one. Even if we need four bytes to encode
that one character, it will still be smaller than the resulting
English encoding.
It's true that some encodings squeeze Kanji into a smaller space, but
compared with other languages, it's not really accurate to say that
the files balloon in size without them.
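To put rough numbers on the "forest" comparison, here is a minimal Python sketch. The helper name encoded_sizes is my own invention for illustration, and I'm assuming the single Kanji in question is U+68EE; UTF-16 is measured without a byte order mark.

```python
# Hypothetical helper for illustration only; not part of TextMate or any library.
def encoded_sizes(text):
    """Return the encoded length of text, in bytes, for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" omits the 2-byte BOM that plain "utf-16" would prepend.
        "utf-16": len(text.encode("utf-16-le")),
    }

english = "forest"
kanji = "\u68ee"  # assumed example: the single character for "forest"

for label, text in (("English", english), ("Kanji", kanji)):
    print(label, len(text), "character(s)", encoded_sizes(text))
```

For these two strings, the English word takes 6 bytes in UTF-8 and 12 in UTF-16, while the Kanji takes 3 bytes in UTF-8 and 2 in UTF-16. That's one word, of course, so treat it as an illustration rather than a measurement of real documents.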
James Edward Gray II