[TxMt] New "Unicode" bundle in the Review trunk

Walter Dörwald walter at livinglogic.de
Mon Jun 2 13:58:01 UTC 2008


Hans-Jörg Bibiko wrote:

> On 02.06.2008, at 00:04, Walter Dörwald wrote:
>> Here's another patch (against the current version). It shows both the 
>> codepoint and the name.
>>
>> BTW, you don't have to use a regular expression to split a string into 
>> characters, simply iterating through it does the trick:
>>
>> Index: Commands/Show Unicode Names.tmCommand
>> -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), 
>> "UTF-8")):
>> -     if (len(a)==1) and (a != '\n'):
>> -          res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
>> +for a in unicode(sys.stdin.read(), "UTF-8"):
>> +     if a != '\n':
>> +          res = u"%s : U+%04X" % (a, ord(a))
>> +          name = unicodedata.name(a, None)
>> +          if name:
>> +              res += u" : %s" % name
>>            print res.encode("UTF-8")</string>
>>      <key>fallbackInput</key>
>>      <string>character</string>
> Thanks! Just committed to the trunk.
> 
>> >> Furthermore it would be great if this script could display all
>> >> information there is in the Python Unicode database, i.e. stuff like
>> >>
>> >>    unicodedata.category()
>> >>    unicodedata.bidirectional()
>> >>    unicodedata.decimal()
>> > Yes. I have such a script in Perl which also shows up info about
>> > Unicode code points etc.
> Just added to the bundle a prototype of 'Show Unicode Properties'
> 
> 
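For reference, the properties listed above could be collected roughly like this. This is only a sketch, not what the bundle command actually does; the helper name describe() and the dict layout are made up for illustration. The unicodedata calls themselves are the standard ones, and the None defaults avoid a ValueError for characters without a name or decimal value.

```python
import unicodedata

def describe(ch):
    """Return a dict of unicodedata properties for one character.

    describe() is a hypothetical helper, not part of the bundle.
    """
    return {
        "codepoint": u"U+%04X" % ord(ch),
        # name() and decimal() raise ValueError unless a default is given
        "name": unicodedata.name(ch, None),
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "decimal": unicodedata.decimal(ch, None),
    }
```
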
>> Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert 
>> To Lowercase" command.
> Yes. This was a bad key combo. I changed it temporarily to CTRL+OPT+APPLE+U
> 
> BTW: Can Python handle Unicode codepoints which are specified in Unicode 
> planes beyond the BMP, meaning greater than U+FFFF? I tried it out. I 
> found out that Python uses UTF-16 internally.

At least the Python that ships with the OS uses 2-byte Unicode characters 
with partial UTF-16 support:

Python 2.5.2 (r252:60911, Apr  8 2008, 18:54:00)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import sys
 >>> sys.maxunicode
65535

The size of a Unicode character is specified at compile time with the 
--enable-unicode option, so you *could* compile a wide Python with:
./configure --enable-unicode=ucs4

> But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 .
> I can print that character to TM but unicodedata fails because it 
> expects one character but not two (?)

There are some spots in the Python code base where in narrow builds 
surrogate pairs are interpreted properly as characters outside the BMP, 
but unicodedata isn't one of them (so it's not actually real UTF-16 
throughout). There's an open issue on the Python bugtracker about that:

http://bugs.python.org/issue1706460

So there are two options:

1) Apple starts compiling its Python with --enable-unicode=ucs4
2) Python gets fixed so that surrogate pairs can be passed to 
unicodedata functions.

I think I might give 2) a try.
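
Until then a script can at least pair up the surrogates itself, so that the U+XXXX column shows one code point instead of two halves. The following is a rough sketch under that assumption; to_surrogates() and code_points() are names I made up, not anything in Python or the bundle. On a narrow build unicodedata still can't be asked about the result, so only the code point label gets fixed this way.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def code_points(text):
    """Yield integer code points, joining surrogate pairs where present."""
    i = 0
    while i < len(text):
        cp = ord(text[i])
        i += 1
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= cp <= 0xDBFF and i < len(text):
            lo = ord(text[i])
            if 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)
                i += 1
        yield cp
```

With the example from above, to_surrogates(0x20000) gives the pair D840 DC00, and code_points() turns that pair back into 0x20000. On a wide build the loop simply passes every character through unchanged.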

Servus,
    Walter


