Dear everyone,
I have added a couple of commands to the Invisibles.tmbundle. One turns the 'smart' quotes that look awful in HTML, etc., into plain quotes. The other deletes all 'non-ASCII' characters.
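As a rough illustration of what the quote command does, here is a minimal Python sketch. It is not the bundle's actual command: the function name `straighten_quotes` and the exact set of characters it handles are assumptions made for this example.

```python
# Map common typographic quotes to plain ASCII equivalents.
# Which characters the bundle command really covers is not stated in
# this thread; this table is an illustrative assumption.
SMART_TO_PLAIN = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
}

def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by plain ones."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))

print(straighten_quotes("\u201CIt\u2019s fine,\u201D she said."))
```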
Now your favorite non-English language might have some characters that Invisibles thinks are gremlins and which get colored red and deleted in the Zap. You can change the pattern yourself to better suit your needs, in the Invisibles.plist and the Zap Command. You may want to refer to an ASCII chart (just google it).
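To make the idea of adjusting the pattern concrete, here is a small Python sketch of a zap with a user-supplied list of characters to spare. The function `zap_non_ascii` and its `keep` parameter are hypothetical names for this example; the real bundle uses a pattern in Invisibles.plist and the Zap command, not this code.

```python
def zap_non_ascii(text: str, keep: str = "") -> str:
    """Drop every character outside printable ASCII, tab and newline,
    except those explicitly listed in `keep`."""
    allowed = set(keep)
    return "".join(
        ch for ch in text
        if ch in "\t\n" or " " <= ch <= "~" or ch in allowed
    )

sample = "na\u00efve caf\u00e9\x00"
print(zap_non_ascii(sample))                        # accented letters are zapped
print(zap_non_ascii(sample, keep="\u00e9\u00ef"))   # accented letters survive
```

Listing your favorite characters in `keep` plays the same role as widening the character class in the bundle's pattern.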
You can get the latest bundle from http://math.sfsu.edu/hsu/textmate/ but most of the fun stuff is happening at the TextMate SVN site: http://anon:anon@macromates.com/svn/Bundles/trunk
Perhaps soon there will be a cron script that zips up the latest Bundles automatically...
best, Eric
On 20.01.2005, at 10:36, Eric Hsu wrote:
> You may want to refer to an ASCII chart (just google it).
Or type "man ascii" into Terminal.app (without the quotes, of course).
Cheers, -Ralph.
At 10:47 AM +0100 1/20/05, Ralph Pöllath wrote:
>> You may want to refer to an ASCII chart (just google it).
> Or type "man ascii" into Terminal.app (without the quotes, of course).
Wow, I didn't know that was there! However, the man page is American-centric and stops at x7F. The accented letters live in the lower part of x80-xFF. Hence, a chart like http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm might be more helpful. On the other hand, extended ASCII does depend on encoding, and I'm not sure how standard x80-xFF are.
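A quick way to see how much the upper half depends on the encoding is to decode the same byte two ways. This Python sketch uses only codecs from the standard library; the choice of byte 0x8E is just an example.

```python
raw = bytes([0x8E])  # one high-bit byte

# Both codec names are part of Python's standard library.
print(repr(raw.decode("mac_roman")))   # an accented letter in Mac Roman
print(repr(raw.decode("iso-8859-1")))  # a C1 control code in ISO-8859-1
```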
best, Eric
On Jan 20, 2005, at 18:57, Eric Hsu wrote:
> On the other hand, extended ASCII does depend on encoding, and I'm not sure how standard x80-xFF are.
Since TM uses UTF-8 to talk with external commands, you don't have to worry about encodings. The non-printable high-bit characters are 0x80-0x9F, but in UTF-8 each of them is encoded as two bytes matching this pattern: "\xC2[\x80-\x9F]" (obtained using: "printf '\x80\x9F' | iconv -f iso-8859-1 -t utf-8 | xxd"; the single quotes keep the shell from eating the backslashes).
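The same check can be done for the whole range rather than just its endpoints. This is a small Python sketch, not part of the bundle; it re-encodes every byte from 0x80 to 0x9F and confirms the two-byte form.

```python
# Decode each byte 0x80-0x9F as ISO-8859-1, then re-encode it as UTF-8.
encoded = [bytes([b]).decode("iso-8859-1").encode("utf-8")
           for b in range(0x80, 0xA0)]

# First and last encodings of the range, as hex.
print(encoded[0].hex(), encoded[-1].hex())

# Every one is exactly two bytes: 0xC2 followed by 0x80-0x9F.
print(all(len(e) == 2 and e[0] == 0xC2 and 0x80 <= e[1] <= 0x9F
          for e in encoded))
```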
So my candidate for a UTF-8-friendly Zap Gremlins becomes: perl -pe 's/[^\t\n\x20-\xFF]|\xC2[\x80-\x9F]//g'
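For anyone who would rather experiment outside the shell, the same pattern can be applied to raw UTF-8 bytes in Python. This is only a sketch for trying things out; the names `GREMLINS` and `zap_gremlins` are made up for the example.

```python
import re

# Same pattern as the perl one-liner, applied to raw UTF-8 bytes:
# drop control bytes other than tab and newline, and drop the UTF-8
# encodings of U+0080-U+009F.
GREMLINS = re.compile(rb"[^\t\n\x20-\xFF]|\xC2[\x80-\x9F]")

def zap_gremlins(data: bytes) -> bytes:
    """Return data with the matched gremlin bytes removed."""
    return GREMLINS.sub(b"", data)

# A NUL byte and U+0085 are removed; the accented letter is kept.
text = "caf\u00e9\x00 \u0085ok\n".encode("utf-8")
print(zap_gremlins(text).decode("utf-8"), end="")
```

Because 0xC2 only ever appears as a lead byte in valid UTF-8, the second alternative cannot match in the middle of another character.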
Does anyone actually have a document with 'gremlins' to test this stuff? ;)
Sorry for going slightly off-topic here, but is "show invisibles" something that we'll see "natively" in TextMate someday? Or does the bundle in question offer the same functionality? (I'm blatantly admitting here that I haven't tried it.)
-- johan
At 7:32 PM +0100 1/20/05, Allan Odgaard wrote:
> Does anyone actually have a document with 'gremlins' to test this stuff? ;)
Sadly, my actual from-the-wild gremlin document was in English and only had \x00's, a couple of high-bit characters, and the annoying 'smart' quotes and hyphens. (Preview copy/paste from a PDF from someone's Word document.)
- Eric
On Jan 20, 2005, at 19:32, Allan Odgaard wrote:
> Does anyone actually have a document with 'gremlins' to test this stuff? ;)
Problem fixed:
for (( i = 0; i < 256; i++ )); do printf \\x$(printf "obase=16\n$i\n"|bc); done|iconv -f iso-8859-1 -t utf-8|perl -pe 's/[^\t\n\x20-\xFF]|\xC2[\x80-\x9F]//g'|iconv -f utf-8 -t iso-8859-1|xxd
Outputs:
0000000: 090a 2021 2223 2425 2627 2829 2a2b 2c2d  .. !"#$%&'()*+,-
0000010: 2e2f 3031 3233 3435 3637 3839 3a3b 3c3d  ./0123456789:;<=
0000020: 3e3f 4041 4243 4445 4647 4849 4a4b 4c4d  >?@ABCDEFGHIJKLM
0000030: 4e4f 5051 5253 5455 5657 5859 5a5b 5c5d  NOPQRSTUVWXYZ[\]
0000040: 5e5f 6061 6263 6465 6667 6869 6a6b 6c6d  ^_`abcdefghijklm
0000050: 6e6f 7071 7273 7475 7677 7879 7a7b 7c7d  nopqrstuvwxyz{|}
0000060: 7e7f a0a1 a2a3 a4a5 a6a7 a8a9 aaab acad  ~...............
0000070: aeaf b0b1 b2b3 b4b5 b6b7 b8b9 babb bcbd  ................
0000080: bebf c0c1 c2c3 c4c5 c6c7 c8c9 cacb cccd  ................
0000090: cecf d0d1 d2d3 d4d5 d6d7 d8d9 dadb dcdd  ................
00000a0: dedf e0e1 e2e3 e4e5 e6e7 e8e9 eaeb eced  ................
00000b0: eeef f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd  ................
00000c0: feff                                     ..
Probably 0x7F should also be stripped... also, I didn't check if everything between 0xA0-0xFF should actually be preserved -- I'll check UCD later...
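The round trip above can also be reproduced in Python, which makes it easy to compare the candidate pattern with a variant that strips 0x7F as well. This is a sketch under assumptions: the helper `surviving_bytes` is a made-up name, and the `with_del` pattern is one possible way to exclude 0x7F, not something settled in this thread.

```python
import re

def surviving_bytes(pattern: bytes) -> bytes:
    """Run all 256 ISO-8859-1 byte values through a UTF-8 zap pattern
    and return the survivors, converted back to ISO-8859-1."""
    utf8 = bytes(range(256)).decode("iso-8859-1").encode("utf-8")
    cleaned = re.sub(pattern, b"", utf8)
    return cleaned.decode("utf-8").encode("iso-8859-1")

original = rb"[^\t\n\x20-\xFF]|\xC2[\x80-\x9F]"
# Assumed variant: the gap between \x7E and \x80 also strips 0x7F (DEL).
with_del = rb"[^\t\n\x20-\x7E\x80-\xFF]|\xC2[\x80-\x9F]"

# Number of byte values that survive the original pattern.
print(len(surviving_bytes(original)))
# Whether DEL survives each pattern.
print(0x7F in surviving_bytes(original), 0x7F in surviving_bytes(with_del))
```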
On 20. jan 2005, at 18:57, Eric Hsu wrote:
> At 10:47 AM +0100 1/20/05, Ralph Pöllath wrote:
>>> You may want to refer to an ASCII chart (just google it).
>> Or type "man ascii" into Terminal.app (without the quotes, of course).
> Wow, I didn't know that was there! However, the man page is American-centric and stops at x7F. The accented letters live in the lower part of x80-xFF.
Yes, but ASCII is a 7-bit code, hence only contains characters 0x00-0x7F. Common 8-bit codes are ISO-8859-1 and Mac Roman etc. ... so I guess it's just a matter of terms, sorry for me being pedantic here ;-).
At 9:50 PM +0100 1/20/05, Sune Foldager wrote:
> On 20. jan 2005, at 18:57, Eric Hsu wrote:
>> At 10:47 AM +0100 1/20/05, Ralph Pöllath wrote:
>>>> You may want to refer to an ASCII chart (just google it).
>>> Or type "man ascii" into Terminal.app (without the quotes, of course).
>> Wow, I didn't know that was there! However, the man page is American-centric and stops at x7F. The accented letters live in the lower part of x80-xFF.
> Yes, but ASCII is a 7-bit code, hence only contains characters 0x00-0x7F. Common 8-bit codes are ISO-8859-1 and Mac Roman etc. ... so I guess it's just a matter of terms, sorry for me being pedantic here
You are completely correct. If one can't be pedantic in computer science, where can one? :)
The original sentence meant to convey the idea that some people don't like having accented characters treated as gremlins. To avoid that, they can look at the 8-bit extensions to ASCII, pick out their favorite characters, and adapt Zap non-ASCII to preserve them.
For this purpose "man ascii" isn't helpful, and they'll need either to roll their own chart for their favorite encoding or google one.
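Rolling your own chart takes only a few lines. Here is a Python sketch that prints bytes 0x80-0xFF as decoded by a given encoding; the function name `high_byte_chart` and its layout are invented for this example, and the codec names are the ones Python's standard library provides.

```python
def high_byte_chart(encoding: str, per_row: int = 16) -> str:
    """Return a text chart of bytes 0x80-0xFF decoded with `encoding`."""
    rows = []
    for start in range(0x80, 0x100, per_row):
        cells = []
        for b in range(start, start + per_row):
            # Undefined bytes decode to U+FFFD; those and any
            # non-printable characters are shown as '.'
            ch = bytes([b]).decode(encoding, errors="replace")
            cells.append(ch if ch.isprintable() and ch != "\ufffd" else ".")
        rows.append(f"{start:02X}: " + " ".join(cells))
    return "\n".join(rows)

print(high_byte_chart("mac_roman"))
```

Swapping in another codec name, such as "iso-8859-1", shows at a glance which byte values hold the characters you want to keep.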
best, Eric