Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name.
BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:
Index: Commands/Show Unicode Names.tmCommand
-for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
-    if (len(a)==1) and (a != '\n'):
-        res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
+    if a != '\n':
+        res = u"%s : U+%04X" % (a, ord(a))
+        name = unicodedata.name(a, None)
+        if name:
+            res += u" : %s" % name
         print res.encode("UTF-8")
 </string>
 <key>fallbackInput</key>
 <string>character</string>
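For readers who want to try the idea outside the bundle, here is a minimal sketch of the same loop as a standalone function. It assumes Python 3 (where `str` is already Unicode, so the `unicode(...)` decode step and the `u""` prefixes go away); the function name `describe_chars` is hypothetical and not part of the bundle.

```python
import unicodedata

def describe_chars(text):
    """Return one 'char : U+XXXX : NAME' line per character, skipping newlines."""
    lines = []
    for ch in text:  # iterating a string yields its characters; no regex needed
        if ch == "\n":
            continue
        res = "%s : U+%04X" % (ch, ord(ch))
        name = unicodedata.name(ch, None)  # None when the character has no name
        if name:
            res += " : %s" % name
        lines.append(res)
    return lines
```

Characters without a name in the database (control characters, for instance) simply get the code point and no name, which matches the behaviour of the patch.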
Thanks! Just committed to the trunk.
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category(), unicodedata.bidirectional(), unicodedata.decimal()
Yes. I have such a script in Perl which also shows info about Unicode code points etc.
Just added to the bundle a prototype of 'Show Unicode Properties'
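As a rough illustration of what such a properties display could collect, here is a sketch that gathers the three `unicodedata` lookups mentioned above for a single character. It assumes Python 3, and the function name `unicode_properties` and the dictionary layout are hypothetical; this is not the code of the bundle's prototype.

```python
import unicodedata

def unicode_properties(ch):
    """Collect a few Unicode database properties for one character."""
    return {
        "codepoint": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, None),            # None if unnamed
        "category": unicodedata.category(ch),          # e.g. 'Lu', 'Nd'
        "bidirectional": unicodedata.bidirectional(ch),  # e.g. 'L', 'EN'
        "decimal": unicodedata.decimal(ch, None),      # None if not a decimal digit
    }
```

Passing a default to `name()` and `decimal()` avoids the `ValueError` they raise for characters that have no such property.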
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.
Yes. This was a bad key combo. I changed it temporarily to CTRL+OPT+APPLE+U
BTW: Can Python handle Unicode code points in the supplementary planes, meaning greater than U+FFFF? I tried it out and found that Python uses UTF-16 internally.
At least the Python that ships with the OS uses 2-byte Unicode characters with partial UTF-16 support:
Python 2.5.2 (r252:60911, Apr 8 2008, 18:54:00)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
The size of a Unicode character is specified at compile time with the --enable-unicode option, so you *could* compile a wide Python with:
./configure --enable-unicode=ucs4
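A script can make the same check at runtime. The sketch below wraps the `sys.maxunicode` test in a small helper; the name `is_narrow_build` is hypothetical. Note that on Python 3.3 and later the distinction no longer exists and `sys.maxunicode` is always 0x10FFFF, so the check is only informative on older interpreters.

```python
import sys

def is_narrow_build(maxunicode=sys.maxunicode):
    """True if the interpreter stores Unicode characters as 16-bit units."""
    # Narrow builds report 0xFFFF (65535); wide builds report 0x10FFFF.
    return maxunicode == 0xFFFF
```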
But take e.g. U+20000, which is D840 DC00 in UTF-16. I can print that character to TM, but unicodedata fails because it expects one character, not two (?)
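The D840 DC00 pair follows from the standard UTF-16 surrogate arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A small sketch (the function name `to_surrogate_pair` is hypothetical):

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low
```

For 0x20000 the offset is 0x10000, giving 0xD840 and 0xDC00, which are the two code units a narrow build sees as separate "characters".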
There are some spots in the Python code base where in narrow builds surrogate pairs are interpreted properly as characters outside the BMP, but unicodedata isn't one of them (so it's not actually real UTF-16 throughout). There's an open issue on the Python bugtracker about that:
http://bugs.python.org/issue1706460
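To illustrate what "interpreting surrogate pairs properly" means, here is a sketch that walks a string of 16-bit units and joins each high/low surrogate pair into one code point. It is not the patch from the bug tracker, only the pairing logic such a fix would need; the function name `code_points` is hypothetical, and the example is written for Python 3, where a narrow-build string can be imitated with lone surrogates such as "\ud840\udc00".

```python
def code_points(text):
    """Return the code points of text, joining UTF-16 surrogate pairs."""
    result = []
    i = 0
    while i < len(text):
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
        # BMP characters and unpaired surrogates are passed through unchanged.
        result.append(unit)
        i += 1
    return result
```

On a narrow build this only recovers the code point number for display; looking up the name would still need unicodedata to accept the pair, which is what the issue above is about.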
So there are two options:
1) Apple starts compiling its Python with --enable-unicode=ucs4
2) Python gets fixed so that surrogate pairs can be passed to unicodedata functions.
I think I might give 2) a try.
Servus, Walter