Dear all,
there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place where we can gather scripts related to specific languages like Japanese, Chinese, Greek etc. This bundle is the first stage. How do we separate this bundle is a future task.
Thus, if there is someone who already has such scripts or is willing to support, please let us/me know.
Up to now there are the following stuff in:
- Normalize according canonical (de)composition of accented characters - Delete Diacritics: façadë έ だ => facade ε た - Convert to a similar Unicode Character: type the letter 'c' to get a list of "cçćĉċčƈ¢ɕʗḉ⒞ⓒc¢" - Convert to Greek Character: type 'n' to get "ν" - Show Unicode Name: select some letters to get a list of the Unicode names like LATIN SMALL LETTER A
I have many other scripts, but I need some time to polish them up.
To get this bundle, simply use the Subversion Bundle's checkout
http://macromates.com/svn/Bundles/trunk/Review/Bundles/Unicode.tmbundle
save this to the Desktop or whatever.
I know, to deal with non-ASCII scripts in TM 1.x is a bit tricky, but TM 2.0 will come ;)
Cheers,
--Hans
Hans-Joerg Bibiko wrote:
Dear all,
there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place where we can gather scripts related to specific languages like Japanese, Chinese, Greek etc. This bundle is the first stage. How do we separate this bundle is a future task.
Thus, if there is someone who already has such scripts or is willing to support, please let us/me know.
Up to now there are the following stuff in:
- Normalize according canonical (de)composition of accented characters
- Delete Diacritics: façadë έ だ => facade ε た
- Convert to a similar Unicode Character: type the letter 'c' to get a
list of "cçćĉċčƈ¢ɕʗḉ⒞ⓒc¢"
- Convert to Greek Character: type 'n' to get "ν"
- Show Unicode Name: select some letters to get a list of the Unicode
names like LATIN SMALL LETTER A
I have many other scripts, but I need some time to polish them up.
To get this bundle, simply use the Subversion Bundle's checkout
http://macromates.com/svn/Bundles/trunk/Review/Bundles/Unicode.tmbundle
save this to the Desktop or whatever.
I know, to deal with non-ASCII scripts in TM 1.x is a bit tricky, but TM 2.0 will come ;)
One small note:
In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace
res = a + " : " + unicodedata.name(a)
with
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
etc.
Servus, Walter
On 30.05.2008, at 17:32, Walter Dörwald wrote:
Hans-Joerg Bibiko wrote:
Dear all, there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place ...
One small note:
In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace
res = a + " : " + unicodedata.name(a)
with
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
Thanks for the hint! These are more or less the first scripts which I wrote in python ;) Caused by the issue that python has installed some Unicode data per default.
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
Yes. I have such a script in Perl which also shows up info about Unicode code points etc.
Servus,
--Hans
Hans-Jörg Bibiko wrote:
On 30.05.2008, at 17:32, Walter Dörwald wrote:
Hans-Joerg Bibiko wrote:
Dear all, there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place ...
One small note:
In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace
res = a + " : " + unicodedata.name(a)
with
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
Thanks for the hint! These are more or less the first scripts which I wrote in python ;) Caused by the issue that python has installed some Unicode data per default.
Here's another patch (against the current version). It shows both the codepoint and the name.
BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:
Index: Commands/Show Unicode Names.tmCommand =================================================================== --- Commands/Show Unicode Names.tmCommand (revision 9813) +++ Commands/Show Unicode Names.tmCommand (working copy) @@ -8,11 +8,13 @@ <string>#!/usr/bin/python import unicodedata import sys -import re
-for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")): - if (len(a)==1) and (a != '\n'): - res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a)) +for a in unicode(sys.stdin.read(), "UTF-8"): + if a != '\n': + res = u"%s : U+%04X" % (a, ord(a)) + name = unicodedata.name(a, None) + if name: + res += u" : %s" % name print res.encode("UTF-8")</string> <key>fallbackInput</key> <string>character</string>
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
Yes. I have such a script in Perl which also shows up info about Unicode code points etc.
OK, now I see that the script displays information about every character in the selection. Adding more info might be a space problem.
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.
Servus, Walter
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name.
BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:
Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
res = u"%s : U+%04X" % (a, ord(a))
name = unicodedata.name(a, None)
if name:
<key>fallbackInput</key> <string>character</string>res += u" : %s" % name print res.encode("UTF-8")</string>
Thanks! Just committed to the trunk.
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff
like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
Yes. I have such a script in Perl which also shows up info about
Unicode
code points etc.
Just added to the bundle a prototype of 'Show Unicode Properties'
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.
Yes. This was a bad key combo. I changed it temporally to CTRL+OPT +APPLE+U
BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally. But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)
Servus,
--der Hans
Sorry but do i miss something? I have that error ------------------------- Traceback (most recent call last): File "/tmp/temp_textmate.0WYiu4", line 50, in <module> result=dialog.menu([re.sub(r"(?=[^a-zA-Z0-9_ ./-\x7F-\xFF\n])", r'\', a) + "\t" + unicodedata.name(a, "U+%04X" % ord(a)) for a in suggestions]) File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/dialog.py", line 51, in menu plist = to_plist(menu) UnboundLocalError: local variable 'menu' referenced before assignment ------------------------- when try to "Convert to Greek..." or "Convert to Similar..."
Alexey Blinov
On Mon, Jun 2, 2008 at 3:09 AM, Hans-Jörg Bibiko bibiko@eva.mpg.de wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name.
BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:
Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
res = u"%s : U+%04X" % (a, ord(a))
name = unicodedata.name(a, None)
if name:
res += u" : %s" % name print res.encode("UTF-8")</string> <key>fallbackInput</key> <string>character</string>
Thanks! Just committed to the trunk.
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
Yes. I have such a script in Perl which also shows up info about Unicode code points etc.
Just added to the bundle a prototype of 'Show Unicode Properties'
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.
Yes. This was a bad key combo. I changed it temporally to CTRL+OPT+APPLE+U
BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally. But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)
Servus,
--der Hans ______________________________________________________________________ For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate
On 2 Jun 2008, at 15:26, Alexey Blinov wrote:
Sorry but do i miss something? I have that error
Traceback (most recent call last): File "/tmp/temp_textmate.0WYiu4", line 50, in <module> result=dialog.menu([re.sub(r"(?=[^a-zA-Z0-9_ ./-\x7F-\xFF\n])", r'\', a) + "\t" + unicodedata.name(a, "U+%04X" % ord(a)) for a in suggestions]) File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/ dialog.py", line 51, in menu plist = to_plist(menu) UnboundLocalError: local variable 'menu' referenced before assignment
when try to "Convert to Greek..." or "Convert to Similar..."
You have to upgrade dialog.py in /Applications/TextMate.app/Contents/ SharedSupport/Support/lib
The old version didn't support UTF-8.
Cheers,
Hans
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name.
BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:
Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
res = u"%s : U+%04X" % (a, ord(a))
name = unicodedata.name(a, None)
if name:
<key>fallbackInput</key> <string>character</string>res += u" : %s" % name print res.encode("UTF-8")</string>
Thanks! Just committed to the trunk.
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()
Yes. I have such a script in Perl which also shows up info about
Unicode
code points etc.
Just added to the bundle a prototype of 'Show Unicode Properties'
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.
Yes. This was a bad key combo. I changed it temporally to CTRL+OPT+APPLE+U
BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally.
At least the Python that ships with the OS uses 2 byte Unicode character with partial UTF-16 support:
Python 2.5.2 (r252:60911, Apr 8 2008, 18:54:00) [GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
import sys sys.maxunicode
65535
The size of a Unicode character is specified at compile time with the --enable-unicode option, so you *could* compile a wide Python with: ./configure --enable-unicode=ucs4
But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)
There are some spots in the Python code base where in narrow builds surrogate pairs are interpreted properly as characters outside the BMP, but unicodedata isn't one of them (so it's not actually real UTF-16 throughout). There's an open issue on the Python bugtracker about that:
http://bugs.python.org/issue1706460
So there are two options:
1) Apple starts compiling its Python with --enable-unicode=ucs4 2) Python gets fixed so that surrogate pairs can be passed to unicodedata functions.
I think I might give 2) a try.
Servus, Walter
Walter Dörwald wrote:
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name. [...]
Here's another suggestions on the current Bundle version:
To get the UTF-8 bytes of a character, you're doing the following:
print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper()
This only works for characters with a codepoint >= 128. The following code should work better:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)
Furthermore the code:
decomp = unicodedata.decomposition(char).lstrip(' ').rstrip(' ')
can be simplyfied to:
decomp = unicodedata.decomposition(char).strip()
(strip() strips from both ends and stripping all whitespace is the default when no argument is given.)
Hope that helps.
Servus, Walter
Walter Dörwald wrote:
Walter Dörwald wrote:
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name. [...]
Here's another suggestions on the current Bundle version:
To get the UTF-8 bytes of a character, you're doing the following:
print " UTF-8 : " + "
".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper()
This only works for characters with a codepoint >= 128. The following code should work better:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for
c in char)
Oops, that was of course supposed to be:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char.encode("utf-8"))
Servus, Walter
On 3 Jun 2008, at 17:29, Walter Dörwald wrote:
Walter Dörwald wrote:
Walter Dörwald wrote:
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name. [...]
Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)
Oops, that was of course supposed to be:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper()
for c in char.encode("utf-8"))
Once again, thanks a lot for teaching me Python ;) The code changes are in the SVN trunk.
--Hans
On 03.06.2008, at 17:29, Walter Dörwald wrote:
Walter Dörwald wrote:
Walter Dörwald wrote:
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name. [...]
Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode ("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper () for c in char)
Oops, that was of course supposed to be:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper
() for c in char.encode("utf-8"))
Could it be that this isn't allowed in Python for Tiger? I get an error for invalid syntax referring to 'for' Mac OSX 10.4.11 ppc; Python 2.4.2
On my 10.5.3 Mac it works(?)
--Hans
Hans-Jörg Bibiko wrote:
On 03.06.2008, at 17:29, Walter Dörwald wrote:
Walter Dörwald wrote:
Walter Dörwald wrote:
Hans-Jörg Bibiko wrote:
On 02.06.2008, at 00:04, Walter Dörwald wrote:
Here's another patch (against the current version). It shows both the codepoint and the name. [...]
Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)
Oops, that was of course supposed to be:
print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper()
for c in char.encode("utf-8"))
Could it be that this isn't allowed in Python for Tiger? I get an error for invalid syntax referring to 'for' Mac OSX 10.4.11 ppc; Python 2.4.2
On my 10.5.3 Mac it works(?)
AFAICR Tiger has Python 2.3, which didn't support generator expressions.
The following should work:
print " UTF-8 : %s" % " ".join([hex(ord(c))[2:].upper() for c in char.encode("utf-8")])
(i.e. replace the generator expression with a list comprehension by adding [] around the join argument.)
Hope that helps!
Servus, Walter
Hi,
there are some more commands available:
Furthermore I wrote some basic syntax highlighting stuff to display 'no ASCII', 'no Latin', 'all combining diacritics' characters.
Most of the commands also support Unicode higher than U+FFFF. [But be careful! Up to now TM 1 can display (more or less) these characters, but each char are in TM 1 two chars! If you place the caret in between and invoke a command TM will crash immediately! But TM 2 supports these chars ;) ]
I wrote a new Chinese Traditional <> Simplified Converter. It also converts characters > U+FFFF (Apple's not ;) ), and it show up all those characters which have more than one counterpart as snippets {A=B|C}. There's a command which shows a menu displaying B and C etc. to disambiguate (Apple does not do that).
All Unicode data are coming from the latest Unicode 5.1 and it's easy to upgrade.
Show Unicode Properties also shows all known information about Chinese/Japanese/Korean ideographs, like Radical, readings, Wubi Xing codes, etc. All these data are coming from Apple's Character Palette internals ;) But I think about to integrate Unicode's UniHan database. This zip file (6MB) won't be part of that bundle. Anyone who wants to use it can download it (I will provide a command for that).
Last but not least I want to say thank you to Walter Dörwald who helped me a lot with the Python scripts.
Cheers,
--Hans