Hi.
I'm having some problems with TextMate switching the encoding. I have, e.g., a website file which is saved in ISO-8859-1 and correctly opened in that format. I then copy some text from a Word file. This text is in Danish, so it has some of the Danish letters (æøå). When I paste this into the aforementioned document in TextMate, the whole document changes to UTF-8. Because of this, when I save it and upload it, all the Danish letters become garbage text (because it tries to open it as ISO-8859-1, as I have specified). The same thing happens in TextMate if I try to reopen it as ISO-8859-1.
But is there any reason for TextMate to change the encoding of the document just because I paste some text, and can I avoid it?
Regards Danny Krøger
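For illustration, the garbage described above can be reproduced on the command line. This is only a minimal sketch, assuming an `iconv` binary is on the PATH; the function name `as_seen_by_latin1_reader` is made up for this example and is not part of TextMate or iconv.

```shell
#!/bin/sh
# Hypothetical helper: show how UTF-8 bytes look to a reader that
# assumes ISO-8859-1. Every input byte is treated as one Latin 1
# character and the result is written out as UTF-8 for display.
as_seen_by_latin1_reader() {
    iconv -f ISO-8859-1 -t UTF-8
}

# "æøå" given as explicit UTF-8 bytes (octal escapes), so the demo
# does not depend on the encoding this script is saved in.
printf '\303\246\303\270\303\245\n' | as_seen_by_latin1_reader
```

Each Danish letter is two bytes in UTF-8, so a reader expecting ISO-8859-1 shows two unrelated characters per letter, which matches the garbage text seen after uploading.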
Hi,
On 30.03.2007, at 13:59, Danny Krøger wrote:
When I paste this into the aforementioned document in TextMate, the whole document changes to UTF-8. Because of this, when I save it and upload it, all the Danish letters become garbage text (because it tries to open it as ISO-8859-1, as I have specified). The same thing happens in TextMate if I try to reopen it as ISO-8859-1.
I'm seeing exactly the same problem here. It's as if TextMate were not able to encode some of the characters into Latin 1, probably because even though they look like their Latin 1 counterparts, they are some special Unicode representations.
TextMate should warn the user though if it cannot encode the document into the currently selected character encoding.
Philip
According to Danny Krøger:
But is there any reason for Textmate to change the encoding of the document just because I paste some text, and can I avoid it ?
TextMate will change the encoding if it detects that some of the characters you're pasting don't exist in latin1 (which is the right thing to do, as it can't possibly know which of the various latinN it should use).
Allan, being a die hard utf-8 fan, won't support anything other than latin1 (which is a bit of a pity but I understand him).
That's easy for me to say but converting everything over to utf-8 would be better in the long run.
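Whether a given piece of text would trigger that switch can be checked outside TextMate. The following is a rough sketch, not TextMate's actual detection logic: it assumes `iconv` is available and relies on it exiting with an error when a character has no ISO-8859-1 equivalent. The function name `fits_latin1` is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if all of stdin (assumed to be
# UTF-8) can be represented in ISO-8859-1. Without -c, iconv stops
# with a non-zero exit status at the first unconvertible character.
fits_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1
}

# The Danish letters, as explicit UTF-8 bytes: these exist in Latin 1.
if printf '\303\246\303\270\303\245' | fits_latin1; then
    echo "Danish letters: fit in Latin 1"
fi

# Curly double quotes (U+201C, U+201D), typical of Word: these do not.
if ! printf '\342\200\234hi\342\200\235' | fits_latin1; then
    echo "curly quotes: do not fit in Latin 1"
fi
```

On a Mac, something like `pbpaste | fits_latin1` could be used on the clipboard, provided pbpaste emits UTF-8.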
I'm also a supporter of UTF-8, but as a web developer I also have to take IE6's lousy support into account and go the Latin 1 way. I also maintain some websites which are already coded in Latin 1.
But I think it's bad behaviour that it doesn't warn you that it's changing the encoding. It messes up the whole document unless you undo it. If you try to revert to Latin 1 (by reopening with that encoding) after it has switched, the old text is still messed up.
It would be nice to have an option to paste text in the current encoding and drop characters that are not available. That is a better option than destroying a document (when you are forced to keep it in Latin 1). It costs so much time to fix all the garbled text by hand afterwards.
On 30/03/2007, at 16.34, Ollivier Robert wrote:
Allan, being a die hard utf-8 fan, won't support anything other than latin1 (which is a bit of a pity but I understand him).
That's easy for me to say but converting everything over to utf-8 would be better in the long run.
On Mar 30, 2007, at 10:58 AM, Danny Krøger wrote:
It would be nice to have an option to paste text in the current encoding and drop characters that are not available. That is a better option than destroying a document (when you are forced to keep it in Latin 1). It costs so much time to fix all the garbled text by hand afterwards.
Bundle Editor -> New Command
Input: None
Output: Replace Selected Text
Key Equivalent: <your choice>
Command(s):
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1
Then use that command for pasting instead of cmd-v.
Caution: that will silently discard characters which cannot be converted. If you want to do better (say, by replacing them with '?') you'll need to install GNU iconv (MacPorts, Fink, etc).
j.
Thank you SO MUCH!. this is the best tip of the year so far. It will save me so much time you won't believe it!
May the almighty editor god be with you :-)
Regards Danny Krøger
On 30/03/2007, at 19.09, Jay Soffian wrote:
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1
On 30. Mar 2007, at 19:09, Jay Soffian wrote:
On Mar 30, 2007, at 10:58 AM, Danny Krøger wrote:
It would be nice to have an option to paste text in the current encoding and drop characters that are not available. That is a better option than destroying a document (when you are forced to keep it in Latin 1). It costs so much time to fix all the garbled text by hand afterwards.
Bundle Editor -> New Command
Input: None
Output: Replace Selected Text
Key Equivalent: <your choice>
Command(s):
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1
Then use that command for pasting instead of cmd-v.
That is indeed clever :) One addition though, you need to convert back to utf-8, since TM expects the command result to be in utf-8 (but we got the non-latin 1 superset pruned, so it will still work).
One can also add //TRANSLIT to the target encoding, that will make iconv try to “downgrade” the characters which could not be converted. For example curly quotes become straight quotes, ellipsis becomes three dots, etc.
So the command could read:
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste \
    | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1//TRANSLIT \
    | /usr/bin/iconv -f ISO-8859-1 -t UTF-8
Answering a few other things from this thread:
1) IE6 (and IE4 + IE5 for that matter) supports utf-8 just fine, as long as you send the proper charset-encoding header.
2) I am a diehard utf-8 fan and I do want you all to switch to utf-8 if you haven’t already!!! but 2.0 will also have better encoding support in general, like presenting errors/warnings at the proper times, making it more explicit when there are problems with encodings (like loading non-utf 8 files with 8 bit characters), etc.
3) If you do insist on using latin-1 for whatever project you are working on, be sure to switch to ISO-8859-1 in Preferences → Advanced → Saving. By default it is utf-8, and I think that is why it switches to utf-8 when you paste æøå from Word. If you set it to ISO-8859-1, then it should pick latin-1 instead.
Finally a question: If your web-site is all in latin-1, how do you deal with user input, if any? I.e. if I can post comments or in some other way submit arbitrary plain text to your site, you just pray I restrain myself to latin-1, and that the browser sends my text as latin-1? ;)
Allan Odgaard wrote:
Finally a question: If your web-site is all in latin-1, how do you deal with user input, if any? I.e. if I can post comments or in some other way submit arbitrary plain text to your site, you just pray I restrain myself to latin-1, and that the browser sends my text as latin-1? ;)
You can specify the accept-charset attribute for a form; otherwise it will default to the document's encoding, with varying behaviour for unknown sequences: http://www.intertwingly.net/blog/1761.html
On 31. Mar 2007, at 17:28, Henrik Nyh wrote:
Allan Odgaard wrote:
Finally a question: If your web-site is all in latin-1, how do you deal with user input, if any? I.e. if I can post comments or in some other way submit arbitrary plain text to your site, you just pray I restrain myself to latin-1, and that the browser sends my text as latin-1? ;)
You can specify the accept-charset attribute for a form; otherwise it will default to the document's encoding, with varying behaviour for unknown sequences: http://www.intertwingly.net/blog/1761.html
That it defaults to the document's encoding is the suggested behavior, not required.
However, my question was meant for someone who actually sends pages as latin-1 (as I would guess the OP does).
If you accept user content and display it on your site (like his login name), you sort of have to set the accept encoding to utf-8 and then entity-encode that accepted text (so that it can be displayed on the pages you serve).
Hi,
I have a tiny problem with that. Maybe caused by another version of iconv?
On 31.03.2007, at 17:10, Allan Odgaard wrote:
On 30. Mar 2007, at 19:09, Jay Soffian wrote:
On Mar 30, 2007, at 10:58 AM, Danny Krøger wrote:
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1
Then use that command for pasting instead of cmd-v. Caution: that will silently discard characters which cannot be converted. If you want to do better (say, by replacing them with '?') you'll need to install GNU iconv (MacPorts, Fink, etc).
... So the command could read:
__CFUSERTEXT_ENCODING=0x1F5:0x8000100:0x8000100 /usr/bin/pbpaste \
    | /usr/bin/iconv -c -s -f UTF-8 -t ISO-8859-1//TRANSLIT \
    | /usr/bin/iconv -f ISO-8859-1 -t UTF-8
The point here is that even characters which could be converted, like 'ö', are discarded. If you copy, let's say, 'ör ‚tås’' from Word, this command will only insert 'r ts', although 'ö' and 'å' are defined in Latin 1 and the quotes in ‚tås’ should be replaced by '. On the other hand, for non-Latin 1 characters it will show a '?'. Fine, but if you want to find/replace all '?' in a document, well, this could be very tricky.
Best,
Hans
On 31. Mar 2007, at 17:46, Hans-Jörg Bibiko wrote:
I have a tiny problem with that. Maybe caused by an other version of iconv??
Ah, there were missing underscores in the first line, the command should read:
__CF_USER_TEXT_ENCODING=$UID:0x8000100:0x8000100 /usr/bin/pbpaste \
    | /usr/bin/iconv -cs -f UTF-8 -t ISO-8859-1 \
    | /usr/bin/iconv -f ISO-8859-1 -t UTF-8
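The conversion half of that command can be tried on its own, without pbpaste or TextMate, by wrapping the two iconv steps in a function that reads stdin. This is a sketch under assumptions: `iconv` is on the PATH and the input is UTF-8. The function name `latin1_safe` is invented for this example, and the `|| true` is an addition not present in the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two iconv steps from the thread:
# 1) UTF-8 -> ISO-8859-1, with -c dropping characters that have no
#    Latin 1 equivalent and -s suppressing warnings;
# 2) ISO-8859-1 -> UTF-8, because TextMate expects UTF-8 output.
# Some iconv builds may exit non-zero after dropping characters, so
# that status is ignored here (an assumption, not from the thread).
latin1_safe() {
    { iconv -c -s -f UTF-8 -t ISO-8859-1 || true; } \
        | iconv -f ISO-8859-1 -t UTF-8
}

# Input: o-slash, "r", a space, then "tås" wrapped in curly double
# quotes, all as explicit UTF-8 bytes. The Danish letters should
# survive and the curly quotes should be dropped.
printf '\303\270r \342\200\234t\303\245s\342\200\235\n' | latin1_safe
```

Inside a TextMate command the input would come from pbpaste with the environment variable set as shown above, e.g. `... /usr/bin/pbpaste | latin1_safe`.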
Hi Allan.
It is set to Latin 1 in the preference pane. Fortunately, on the website I administer where there is user input, the stupidities that Latin 1 introduces are taken care of. So no problems there. But as I stated earlier, I'm also a supporter of UTF-8 myself; I'm just not always in the position to decide what to use :-(
Looking forward to the encoding changes in v2. I'm sure I still need to work with Latin 1 documents for some time.
And once again thanks for the workarounds Jay & Allan.
Regards Danny Krøger
On 31/03/2007, at 17.10, Allan Odgaard wrote:
- If you do insist on using latin-1 for whatever project you are working on, be sure to switch to ISO-8859-1 in Preferences → Advanced → Saving. By default it is utf-8, and I think that is why it switches to utf-8 when you paste æøå from Word. If you set it to ISO-8859-1, then it should pick latin-1 instead.
Finally a question: If your web-site is all in latin-1, how do you deal with user input, if any? I.e. if I can post comments or in some other way submit arbitrary plain text to your site, you just pray I restrain myself to latin-1, and that the browser sends my text as latin-1? ;)