New "Unicode" bundle in the Review trunk

List overview All Threads
Download

newer

older

Concatenating multiple Files

syntax highlighting in perl bundle

Hans-Joerg Bibiko

30 May 2008 30 May '08

4:11 p.m.

Dear all,

there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place where we can gather scripts related to specific languages like Japanese, Chinese, Greek etc. This bundle is the first stage. How do we separate this bundle is a future task.

Thus, if there is someone who already has such scripts or is willing to support, please let us/me know.

Up to now there are the following stuff in:

- Normalize according canonical (de)composition of accented characters - Delete Diacritics: façadë έ だ => facade ε た - Convert to a similar Unicode Character: type the letter 'c' to get a list of "cçćĉċčƈ¢ɕʗḉ⒞ⓒｃ￠" - Convert to Greek Character: type 'n' to get "ν" - Show Unicode Name: select some letters to get a list of the Unicode names like LATIN SMALL LETTER A

I have many other scripts, but I need some time to polish them up.

To get this bundle, simply use the Subversion Bundle's checkout

http://macromates.com/svn/Bundles/trunk/Review/Bundles/Unicode.tmbundle

save this to the Desktop or whatever.

I know, to deal with non-ASCII scripts in TM 1.x is a bit tricky, but TM 2.0 will come ;)

Cheers,

--Hans

Show replies by date

Walter Dörwald

30 May 30 May

5:32 p.m.

Hans-Joerg Bibiko wrote:

...

Dear all,

there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place where we can gather scripts related to specific languages like Japanese, Chinese, Greek etc. This bundle is the first stage. How do we separate this bundle is a future task.

Thus, if there is someone who already has such scripts or is willing to support, please let us/me know.

Up to now there are the following stuff in:

Normalize according canonical (de)composition of accented characters

Delete Diacritics: façadë έ だ => facade ε た

Convert to a similar Unicode Character: type the letter 'c' to get a

list of "cçćĉċčƈ¢ɕʗḉ⒞ⓒｃ￠"

Convert to Greek Character: type 'n' to get "ν"

Show Unicode Name: select some letters to get a list of the Unicode

names like LATIN SMALL LETTER A

I have many other scripts, but I need some time to polish them up.

To get this bundle, simply use the Subversion Bundle's checkout

http://macromates.com/svn/Bundles/trunk/Review/Bundles/Unicode.tmbundle

save this to the Desktop or whatever.

I know, to deal with non-ASCII scripts in TM 1.x is a bit tricky, but TM 2.0 will come ;)

One small note:

In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace

res = a + " : " + unicodedata.name(a)

with

res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))

Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like

unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

etc.

Servus, Walter

Hans-Jörg Bibiko

5:43 p.m.

On 30.05.2008, at 17:32, Walter Dörwald wrote:

...

Hans-Joerg Bibiko wrote:

...
Dear all, there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place ...

One small note:

In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace
res = a + " : " + unicodedata.name(a)
with
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))

Thanks for the hint! These are more or less the first scripts which I wrote in python ;) Caused by the issue that python has installed some Unicode data per default.

...

Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like

unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

Yes. I have such a script in Perl which also shows up info about Unicode code points etc.

Servus,

--Hans

Walter Dörwald

2 Jun 2 Jun

12:04 a.m.

Hans-Jörg Bibiko wrote:

...

On 30.05.2008, at 17:32, Walter Dörwald wrote:

...
Hans-Joerg Bibiko wrote:

...
Dear all, there's a new bundle called "Unicode" in the review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. which are related to general Unicode issue, meaning non-ASCII. This should also a place ...

One small note:

In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace
res = a + " : " + unicodedata.name(a)
with
res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
Thanks for the hint! These are more or less the first scripts which I wrote in python ;) Caused by the issue that python has installed some Unicode data per default.

Here's another patch (against the current version). It shows both the codepoint and the name.

BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:

Index: Commands/Show Unicode Names.tmCommand =================================================================== --- Commands/Show Unicode Names.tmCommand (revision 9813) +++ Commands/Show Unicode Names.tmCommand (working copy) @@ -8,11 +8,13 @@ <string>#!/usr/bin/python import unicodedata import sys -import re

-for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")): - if (len(a)==1) and (a != '\n'): - res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a)) +for a in unicode(sys.stdin.read(), "UTF-8"): + if a != '\n': + res = u"%s : U+%04X" % (a, ord(a)) + name = unicodedata.name(a, None) + if name: + res += u" : %s" % name print res.encode("UTF-8")</string> <key>fallbackInput</key> <string>character</string>

...

...
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like

unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

Yes. I have such a script in Perl which also shows up info about Unicode code points etc.

OK, now I see that the script displays information about every character in the selection. Adding more info might be a space problem.

Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.

Servus, Walter

Hans-Jörg Bibiko

1:09 a.m.

On 02.06.2008, at 00:04, Walter Dörwald wrote:

...

Here's another patch (against the current version). It shows both the codepoint and the name.

BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:

Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
     res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
     res = u"%s : U+%04X" % (a, ord(a))
     name = unicodedata.name(a, None)
     if name:
         res += u" : %s" % name
     print res.encode("UTF-8")</string>
<key>fallbackInput</key> <string>character</string>

Thanks! Just committed to the trunk.

...

...
...
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff

like

...
...
unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

Yes. I have such a script in Perl which also shows up info about

Unicode

...
code points etc.

Just added to the bundle a prototype of 'Show Unicode Properties'

...

Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.

Yes. This was a bad key combo. I changed it temporally to CTRL+OPT +APPLE+U

BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally. But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)

Servus,

--der Hans

Alexey Blinov

3:26 p.m.

Sorry but do i miss something? I have that error ------------------------- Traceback (most recent call last): File "/tmp/temp_textmate.0WYiu4", line 50, in <module> result=dialog.menu([re.sub(r"(?=[^a-zA-Z0-9_ ./-\x7F-\xFF\n])", r'\', a) + "\t" + unicodedata.name(a, "U+%04X" % ord(a)) for a in suggestions]) File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/dialog.py", line 51, in menu plist = to_plist(menu) UnboundLocalError: local variable 'menu' referenced before assignment ------------------------- when try to "Convert to Greek..." or "Convert to Similar..."

Alexey Blinov

On Mon, Jun 2, 2008 at 3:09 AM, Hans-Jörg Bibiko bibiko@eva.mpg.de wrote:

...

On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name.

BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:

Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
     res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
     res = u"%s : U+%04X" % (a, ord(a))
     name = unicodedata.name(a, None)
     if name:
         res += u" : %s" % name
    print res.encode("UTF-8")</string>
 <key>fallbackInput</key>
 <string>character</string>
Thanks! Just committed to the trunk.

...
...
...
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like

unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

Yes. I have such a script in Perl which also shows up info about Unicode code points etc.

Just added to the bundle a prototype of 'Show Unicode Properties'

...
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.

Yes. This was a bad key combo. I changed it temporally to CTRL+OPT+APPLE+U

BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally. But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)

Servus,

--der Hans ______________________________________________________________________ For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate

Hans-Joerg Bibiko

3:31 p.m.

On 2 Jun 2008, at 15:26, Alexey Blinov wrote:

...

Sorry but do i miss something? I have that error

Traceback (most recent call last): File "/tmp/temp_textmate.0WYiu4", line 50, in <module> result=dialog.menu([re.sub(r"(?=[^a-zA-Z0-9_ ./-\x7F-\xFF\n])", r'\', a) + "\t" + unicodedata.name(a, "U+%04X" % ord(a)) for a in suggestions]) File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/ dialog.py", line 51, in menu plist = to_plist(menu) UnboundLocalError: local variable 'menu' referenced before assignment

when try to "Convert to Greek..." or "Convert to Similar..."

You have to upgrade dialog.py in /Applications/TextMate.app/Contents/ SharedSupport/Support/lib

The old version didn't support UTF-8.

Cheers,

Hans

Walter Dörwald

3:58 p.m.

Hans-Jörg Bibiko wrote:

...

On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name.

BTW, you don't have to use a regular expression to split a string into characters, simply iterating through it does the trick:

Index: Commands/Show Unicode Names.tmCommand -for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
if (len(a)==1) and (a != '\n'):
     res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
if a != '\n':
     res = u"%s : U+%04X" % (a, ord(a))
     name = unicodedata.name(a, None)
     if name:
         res += u" : %s" % name
     print res.encode("UTF-8")</string>
<key>fallbackInput</key> <string>character</string>
Thanks! Just committed to the trunk.

...
...
...
Furthermore it would be great if this script could display all information there is in the Python Unicode database, i.e. stuff like

unicodedata.category() unicodedata.bidrectional() unicodedata.decimal()

Yes. I have such a script in Perl which also shows up info about

Unicode

...
code points etc.

Just added to the bundle a prototype of 'Show Unicode Properties'

...
Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.

Yes. This was a bad key combo. I changed it temporally to CTRL+OPT+APPLE+U

BTW: Can Python handle Unicode codepoints which are specified in Unicode pane B, meaning greater U+FFFF? I tried it out. I found out that Python uses UTF-16 internally.

At least the Python that ships with the OS uses 2 byte Unicode character with partial UTF-16 support:

Python 2.5.2 (r252:60911, Apr 8 2008, 18:54:00) [GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2 Type "help", "copyright", "credits" or "license" for more information.

...

...
...
import sys sys.maxunicode

65535

The size of a Unicode character is specified at compile time with the --enable-unicode option, so you *could* compile a wide Python with: ./configure --enable-unicode=ucs4

...

But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 . I can print that character to TM but unicodedata fails because it expects one character but not two (?)

There are some spots in the Python code base where in narrow builds surrogate pairs are interpreted properly as characters outside the BMP, but unicodedata isn't one of them (so it's not actually real UTF-16 throughout). There's an open issue on the Python bugtracker about that:

http://bugs.python.org/issue1706460

So there are two options:

1) Apple starts compiling its Python with --enable-unicode=ucs4 2) Python gets fixed so that surrogate pairs can be passed to unicodedata functions.

I think I might give 2) a try.

Servus, Walter

Walter Dörwald

3 Jun 3 Jun

5:28 p.m.

Walter Dörwald wrote:

...

Hans-Jörg Bibiko wrote:

...
On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name. [...]

Here's another suggestions on the current Bundle version:

To get the UTF-8 bytes of a character, you're doing the following:

print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper()

This only works for characters with a codepoint >= 128. The following code should work better:

print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)

Furthermore the code:

decomp = unicodedata.decomposition(char).lstrip(' ').rstrip(' ')

can be simplyfied to:

decomp = unicodedata.decomposition(char).strip()

(strip() strips from both ends and stripping all whitespace is the default when no argument is given.)

Hope that helps.

Servus, Walter

Walter Dörwald

5:29 p.m.

Walter Dörwald wrote:

...

Walter Dörwald wrote:

...
Hans-Jörg Bibiko wrote:

...
On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name. [...]

Here's another suggestions on the current Bundle version:

To get the UTF-8 bytes of a character, you're doing the following:
print "  UTF-8         : " + " 
".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper()

This only works for characters with a codepoint >= 128. The following code should work better:
print "  UTF-8         : %s" % " ".join(hex(ord(c))[2:].upper() for 
c in char)

Oops, that was of course supposed to be:

print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char.encode("utf-8"))

Servus, Walter

Hans-Joerg Bibiko

5:36 p.m.

On 3 Jun 2008, at 17:29, Walter Dörwald wrote:

...

Walter Dörwald wrote:

...
Walter Dörwald wrote:

...
Hans-Jörg Bibiko wrote:

...
On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name. [...]

Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)

Oops, that was of course supposed to be:
print "  UTF-8         : %s" % " ".join(hex(ord(c))[2:].upper()  
for c in char.encode("utf-8"))

Once again, thanks a lot for teaching me Python ;) The code changes are in the SVN trunk.

--Hans

Hans-Jörg Bibiko

8:21 p.m.

On 03.06.2008, at 17:29, Walter Dörwald wrote:

...

Walter Dörwald wrote:

...
Walter Dörwald wrote:

...
Hans-Jörg Bibiko wrote:

...
On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name. [...]

Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode ("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper () for c in char)

Oops, that was of course supposed to be:
 print "  UTF-8         : %s" % " ".join(hex(ord(c))[2:].upper 
() for c in char.encode("utf-8"))

Could it be that this isn't allowed in Python for Tiger? I get an error for invalid syntax referring to 'for' Mac OSX 10.4.11 ppc; Python 2.4.2

On my 10.5.3 Mac it works(?)

--Hans

Walter Dörwald

10:44 p.m.

Hans-Jörg Bibiko wrote:

...

On 03.06.2008, at 17:29, Walter Dörwald wrote:

...
Walter Dörwald wrote:

...
Walter Dörwald wrote:

...
Hans-Jörg Bibiko wrote:

...
On 02.06.2008, at 00:04, Walter Dörwald wrote:

...
Here's another patch (against the current version). It shows both the codepoint and the name. [...]

Here's another suggestions on the current Bundle version: To get the UTF-8 bytes of a character, you're doing the following: print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\x')).lstrip("' ").rstrip("'").upper() This only works for characters with a codepoint >= 128. The following code should work better: print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)

Oops, that was of course supposed to be:
 print "  UTF-8         : %s" % " ".join(hex(ord(c))[2:].upper() 
for c in char.encode("utf-8"))
Could it be that this isn't allowed in Python for Tiger? I get an error for invalid syntax referring to 'for' Mac OSX 10.4.11 ppc; Python 2.4.2

On my 10.5.3 Mac it works(?)

AFAICR Tiger has Python 2.3, which didn't support generator expressions.

The following should work:

print " UTF-8 : %s" % " ".join([hex(ord(c))[2:].upper() for c in char.encode("utf-8")])

(i.e. replace the generator expression with a list comprehension by adding [] around the join argument.)

Hope that helps!

Servus, Walter

Hans-Jörg Bibiko

11:38 p.m.

On 03.06.2008, at 22:44, Walter Dörwald wrote:

...

print " UTF-8 : %s" % " ".join([hex(ord(c))[2:].upper() for c in char.encode("utf-8")])

Thanks. This did the trick. I thought that I did try out this [ ]-notation, but anyway ... the main thing is that it works :)

--Hans

Hans-Jörg Bibiko

5 Jun 5 Jun

11:24 p.m.

Hi,

there are some more commands available:

Furthermore I wrote some basic syntax highlighting stuff to display 'no ASCII', 'no Latin', 'all combining diacritics' characters.

Most of the commands also support Unicode higher than U+FFFF. [But be careful! Up to now TM 1 can display (more or less) these characters, but each char are in TM 1 two chars! If you place the caret in between and invoke a command TM will crash immediately! But TM 2 supports these chars ;) ]

I wrote a new Chinese Traditional <> Simplified Converter. It also converts characters > U+FFFF (Apple's not ;) ), and it show up all those characters which have more than one counterpart as snippets {A=B|C}. There's a command which shows a menu displaying B and C etc. to disambiguate (Apple does not do that).

All Unicode data are coming from the latest Unicode 5.1 and it's easy to upgrade.

Show Unicode Properties also shows all known information about Chinese/Japanese/Korean ideographs, like Radical, readings, Wubi Xing codes, etc. All these data are coming from Apple's Character Palette internals ;) But I think about to integrate Unicode's UniHan database. This zip file (6MB) won't be part of that bundle. Anyone who wants to use it can download it (I will provide a command for that).

Last but not least I want to say thank you to Walter Dörwald who helped me a lot with the Python scripts.

Cheers,

--Hans

6261

days inactive

6267

days old

textmate@lists.macromates.com

14 comments

participants

tags (0)

participants (4)

Alexey Blinov
Hans-Joerg Bibiko
Hans-Jörg Bibiko
Walter Dörwald