NEED Japanese Text Encoding! (pretty please?)


Sean Schertell

25 Jul 2005

12:19 p.m.

Hi Guys,

This is my first post. I just wanna say: Wow! This thing looks truly awesome! But... for those of us doing non-western and non-UTF-8 encoding web pages, it's unusable.

Oh please please PLEASE add support for Shift-JIS and EUC-JP! (japanese)

Cheers, Sean

:::: DataFly.Net :::: Complete Web Services http://www.datafly.net


Sune Foldager

25 Jul

5:07 p.m.

On 25/07/2005, at 14.19, Sean Schertell wrote:

...

This is my first post. I just wanna say: Wow! This thing looks truly awesome! But... for those of us doing non-western and non-UTF-8 encoding web pages, it's unusable. Oh please please PLEASE add support for Shift-JIS and EUC-JP! (japanese)

I am almost positive it won't happen; you can use external tools such as iconv or similar to convert to/from instead. The best would be to use UTF-8 on those web pages of course.

-- Sune.
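
A minimal sketch of the conversion step suggested above, written with Python's built-in codecs instead of the iconv tool itself. The function name `convert_encoding` is hypothetical and only for illustration; `shift_jis`, `euc_jp` and `utf-8` are standard Python codec names.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# Start from a short Japanese string stored as Shift-JIS bytes.
text = "日本語"
sjis = text.encode("shift_jis")

# Convert Shift-JIS to UTF-8 for editing, then to EUC-JP for a legacy page.
utf8 = convert_encoding(sjis, "shift_jis", "utf-8")
eucjp = convert_encoding(utf8, "utf-8", "euc_jp")

print(utf8.decode("utf-8"))
```

The same round trip could be run before editing and again before upload, which is the extra workflow step the thread goes on to discuss.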

Sean Schertell

12:14 a.m.

...

...
This is my first post. I just wanna say: Wow! This thing looks truly awesome! But... for those of us doing non-western and non-UTF-8 encoding web pages, it's unusable. Oh please please PLEASE add support for Shift-JIS and EUC-JP! (japanese)

I am almost positive it won't happen; you can use external tools such as iconv or similar to convert to/from instead. The best would be to use UTF-8 on those web pages of course.

Unfortunately, UTF-8 doesn't work for Japanese in some cases. For example Mac IE (still used by all those poor souls still stuck on OS 9), shows lots of wacky characters. And trying to code a page full of nonsense characters, then use iconv every time you upload -- not exactly a smooth workflow :-(

Why do you say it won't happen? Are you one of the developers? Remember, Japan is a huge (possibly #2) market for the web so this isn't some obscure out in left field feature ;-)

Cheers, Sean

:::: DataFly.Net :::: Complete Web Services http://www.datafly.net

Allan Odgaard

26 Jul

3:55 a.m.

On 25/07/2005, at 2.14, Sean Schertell wrote:

...

Unfortunately, UTF-8 doesn't work for Japanese in some cases. For example Mac IE (still used by all those poor souls still stuck on OS 9), shows lots of wacky characters.

Sorry to sound skeptical, but is that just a subset of the japanese characters, which it shows as wacky?

I tried launching Internet Explorer 5.2 (which is on my Panther partition) and went to http://en.wikipedia.org/wiki/Japanese_language -- this page is in UTF-8 and contains several Japanese characters, and although I don't read Japanese, it does look alright to me.

...

Remember, Japan is a huge (possibly #2) market for the web so this isn't some obscure out in left field feature ;-)

I'm going to add better/actual international support to TM eventually. Though currently it looks a little like it'll be a 1.4 thing (but 1.2 and 1.3 should take less time than the current 1.1).

Sean Schertell

25 Jul

3:53 a.m.

...

...
Unfortunately, UTF-8 doesn't work for Japanese in some cases. For example Mac IE (still used by all those poor souls still stuck on OS 9), shows lots of wacky characters.

...

...
Sorry to sound skeptical, but is that just a subset of the japanese characters, which it shows as wacky?

It seemed to only affect form elements -- I'm not entirely sure my meta tags or Apache config weren't the problem. Interesting sidenote however: If you go to google.co.jp using Safari (or almost any other browser), it serves up a tasty helping of UTF-8 encoded Japanese. But if you go to the same URL using Mac IE, it uses Shift-JIS. Weird, huh?

...

I tried launching Internet Explorer 5.2 (which is on my Panther partition) and went to http://en.wikipedia.org/wiki/Japanese_language -- this page is in UTF-8 and contains several Japanese characters, and although I don't read Japanese, it does look alright to me.

Interesting. It looks okay to me too. Even the form buttons.

...

...
Remember, Japan is a huge (possibly #2) market for the web so this isn't some obscure out in left field feature ;-)

I'm going to add better/actual international support to TM eventually. Though currently it looks a little like it'll be a 1.4 thing (but 1.2 and 1.3 should take less time than the current 1.1).

That would be great! Even if I'm able to make a clean break and start all new projects as UTF-8, I'll still need my code editor not to barf on the tons and tons of existing Shift-JIS and EUC-JP docs that I may need to work on someday.

Thanks! Sean

:::: DataFly.Net :::: Complete Web Services http://www.datafly.net

Robert Deaton

11:22 p.m.

I personally would doubt it would happen either, considering the fact that UTF-8 handles japanese input, and the reason unicode exists is to make other character encodings obsolete in favor of one encoding that handles everything.

On 7/25/05, Sune Foldager cryo@cyanite.org wrote:

...

On 25/07/2005, at 14.19, Sean Schertell wrote:

...
This is my first post. I just wanna say: Wow! This thing looks truly awesome! But... for those of us doing non-western and non-UTF-8 encoding web pages, it's unusable. Oh please please PLEASE add support for Shift-JIS and EUC-JP! (japanese)

I am almost positive it won't happen; you can use external tools such as iconv or similar to convert to/from instead. The best would be to use UTF-8 on those web pages of course.

-- Sune.

For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate

-- --Robert Deaton http://somethingunpredictable.com

Patrice Neff

26 Jul

3:37 a.m.

Am 25.07.2005 um 19:07 schrieb Sune Foldager:

...

I am almost positive it won't happen; you can use external tools such as iconv or similar to convert to/from instead. The best would be to use UTF-8 on those web pages of course.

UTF-8 sucks for Japanese and Chinese texts mainly due to space reasons. If anything makes sense, then it is UTF-16, which Textmate also supports.

Apart from that, it seems that Unicode is not actually able to handle 100% of Chinese (and maybe also Japanese) script. But I'm not a Unicode expert.

Patrice

Sean Schertell

25 Jul

3:32 a.m.

...

...
I am almost positive it won't happen; you can use external tools such as iconv or similar to convert to/from instead. The best would be to use UTF-8 on those web pages of course.

UTF-8 sucks for Japanese and Chinese texts mainly due to space reasons. If anything makes sense, then it is UTF-16, which Textmate also supports.

Apart from that, it seems that Unicode is not actually able to handle 100% of Chinese (and maybe also Japanese) script. But I'm not a Unicode expert.

Could you explain what you mean by "space reasons"?

Also, does anyone else have any experience using UTF-8 for Japanese web pages? It seems that none of the major Japanese sites use it (they mostly use Shift-JIS). I tried to use it once but discovered that Mac IE 5 rendered form elements such as pull down lists and buttons as "mojibake" (garbage characters). Maybe there's a workaround for that?

If anyone has success stories with Japanese UTF-8, I'd love to hear them. I certainly don't have any personal qualms about leaving Shift-JIS behind -- as long as I know UTF-8 is viable for commercial Japanese web sites.

Sean

:::: DataFly.Net :::: Complete Web Services http://www.datafly.net

Patrice Neff

26 Jul

4:26 a.m.

Am 25.07.2005 um 05:32 schrieb Sean Schertell:

...

...
UTF-8 sucks for Japanese and Chinese texts mainly due to space reasons. If anything makes sense, then it is UTF-16, which Textmate also supports.

Could you explain what you mean by "space reasons"?

Due to the way UTF-8 works, it uses 1 byte for US-ASCII characters, but up to four bytes depending on the Unicode code point. Many alphabets can be encoded with two bytes (especially the European ones, but also Hebrew or Arabic). Chinese and Japanese characters will require three or four bytes.

UTF-16, on the other hand, encodes nearly all of these characters in two bytes (only characters outside the Basic Multilingual Plane need four). So that's why for Chinese or Japanese texts you will waste space when using UTF-8 compared to UTF-16, while with English and most European languages you will save a lot of space using UTF-8 compared to UTF-16. And the latter was IMHO one of the main reasons for developing UTF-8.

You can read some more about it at http://en.wikipedia.org/wiki/UTF-8.

Patrice
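
The byte counts described above can be checked with a short sketch. The helper name `encoded_sizes` and the sample strings are assumptions chosen for illustration; `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "English": "hello",
    "German": "größe",
    "Japanese": "日本語",
    "Outside the BMP": "\U00020BB7",  # a single rare kanji, U+20BB7
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

Common Japanese characters take three bytes each in UTF-8 and two in UTF-16, while ASCII text doubles in size under UTF-16. A character outside the Basic Multilingual Plane takes four bytes in both encodings.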

Allan Odgaard

5:11 a.m.

On 26/07/2005, at 6.26, Patrice Neff wrote:

...

[...] while with English and most European languages you will save a lot of space using UTF-8 compared to UTF-16. And the latter was IMHO one of the main reasons for developing UTF-8.

Well, at best you'll save 50%, where enabling gzip as transfer compression will likely save you >75% :)

The motivation for UTF-8 is that ASCII characters are encoded as they would have been, had it been a plain ASCII document.

This means that a lot of existing software doesn't need to be updated to actually handle UTF-8 (as long as they are 8 bit clean). For example I use UTF-8 for my source code, even though my compiler isn't UTF-8 aware, this means I can use non-ASCII in strings and comments -- some compilers/interpreters (e.g. PHP) will also allow user defined variables to be in UTF-8 (while still only knowing about the ASCII tokens).

So UTF-8 exists because a lot of software is made to work with 8-bit sequences (not 16 bit, as UTF-16 would have called for), and some software will look for tokens encoded as ASCII in these 8-bit sequences.

UTF-8 is a brilliant way to give this software access to the full unicode range.
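
A small sketch of that point: code that only knows about bytes and ASCII tokens can still locate a string literal in UTF-8 source. The sample line and variable names are made up for illustration, and the byte search stands in for a tokenizer that is not UTF-8 aware.

```python
# ASCII text produces exactly the same bytes in UTF-8 as in plain ASCII.
assert "abc".encode("utf-8") == "abc".encode("ascii")

# A line of source code with a Japanese string literal, stored as UTF-8.
source_line = 'puts("こんにちは");'.encode("utf-8")

# Search for the ASCII quote byte without decoding anything, the way
# 8-bit clean software that knows nothing about UTF-8 would.
start = source_line.index(b'"') + 1
end = source_line.index(b'"', start)
literal = source_line[start:end]

# The bytes between the quotes are still valid UTF-8.
print(literal.decode("utf-8"))
```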

Sune Foldager

9:34 a.m.

On 26/07/2005, at 7.11, Allan Odgaard wrote:

...

This means that a lot of existing software doesn't need to be updated to actually handle UTF-8 (as long as they are 8 bit clean). For example I use UTF-8 for my source code, even though my compiler isn't UTF-8 aware

Just a thought; what happens when/if a unicode character is coded as <something><the code for "> in a string? I suppose C will fail in that case?

-- Sune.

Vincent Isambart

9:49 a.m.

Hi,

...

...
This means that a lot of existing software doesn't need to be updated to actually handle UTF-8 (as long as they are 8 bit clean). For example I use UTF-8 for my source code, even though my compiler isn't UTF-8 aware

Just a thought; what happens when/if a unicode character is coded as <something><the code for "> in a string? I suppose C will fail in that case?

It's not possible. When you look at a byte that is part of a UTF-8 sequence, you can tell its position in the sequence by checking the value of a few bits. An ASCII character (its value is less than 128) can only stand alone.

Cheers, Vincent ISAMBART
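
The bit check described above can be sketched as follows. The function name `byte_role` and its labels are hypothetical; the bit patterns are the ones UTF-8 defines for lead and continuation bytes. The last lines contrast this with Shift-JIS, where the second byte of a character can fall in the ASCII range.

```python
def byte_role(b: int) -> str:
    """Classify one byte of UTF-8 data by its high bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: always a complete character
    if b & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx: never starts a character
    if b & 0xE0 == 0xC0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b & 0xF0 == 0xE0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    if b & 0xF8 == 0xF0:
        return "lead-4"         # 11110xxx: starts a four-byte sequence
    return "invalid"


for ch in ['"', "é", "語"]:
    print(ch, [byte_role(b) for b in ch.encode("utf-8")])

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so none of them can be mistaken for the ASCII quote (0x22).
assert all(b >= 0x80 for b in "日本語".encode("utf-8"))

# In Shift-JIS the katakana "ソ" ends in 0x5C, the ASCII backslash byte.
print("ソ".encode("shift_jis"))
```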
