Text encoding problems render scripts useless

Stefan

11 Mar 2007

5:34 p.m.

Hi list,

Generally, I use TextMate to write source code for various languages and systems. Everything works nicely.

TextMate is set to use ISO Latin-1.

From time to time, I need to copy text fragments from Word or PDF documents into PHP scripts. As soon as I save the file with the copied text, the encoding of the TextMate document suddenly changes. All diacritical characters get converted to funny characters. All in all, the resulting PHP is no longer usable.

It took me some time to figure out the exact cause, but I failed to solve it.

Does anyone have an idea regarding this?

BTW: If I try this using Eclipse, the IDE reports a problem saving the file because the encoding changed. I have to start Windows and copy the Word text fragments there, since Eclipse on Windows works just fine.

Any ideas?

Kind regards,

Stefan

Stefan

11 Mar

8:01 p.m.

no one?

For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate

Charley Tiggs

8:23 p.m.

Would the "Transliterate Word to ASCII" command in the Text bundle be what you're looking for? I think it takes text that comes from Word and converts the necessary characters to their ASCII equivalents (curly quotes to straight, etc.).

Charley
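As an illustration of the kind of transliteration Charley describes, here is a minimal Python sketch. It is not the Text bundle's actual implementation; the mapping table `WORD_TO_ASCII` and the function name are assumptions made for this example, and the real command may cover a different set of characters.

```python
# Hypothetical mapping of common Word "smart" punctuation to ASCII.
# The real bundle command may handle more (or different) characters.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def transliterate_word_to_ascii(text):
    """Replace Word-style punctuation with ASCII equivalents.

    Characters not listed in WORD_TO_ASCII (for example umlauts)
    are left untouched.
    """
    return text.translate(str.maketrans(WORD_TO_ASCII))


# A pasted fragment with curly quotes, an en dash and an ellipsis.
sample = "\u201cGr\u00fc\u00dfe\u201d \u2013 it\u2019s fine\u2026"
cleaned = transliterate_word_to_ascii(sample)
print(cleaned)

# After transliteration the text fits into Latin-1 again.
latin1_bytes = cleaned.encode("iso-8859-1")
```

Note that the umlauts survive unchanged: they exist in Latin-1, so only the Word punctuation needs replacing.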

Stefan

8:45 p.m.

Thx, Charley! I'll try it soon.

Xavier Noria

8:40 p.m.

On Mar 11, 2007, at 9:01 PM, Stefan wrote:

> no one?

Why do you think pasting non-ASCII text from Word should work out of the box?

-- fxn

Stefan

8:45 p.m.

On 11.03.2007 at 21:40, Xavier Noria wrote:

> Why do you think pasting non-ASCII text from Word should work out of the box?

Because if I type the same non-ASCII characters on the keyboard, everything works fine.

Xavier Noria

9:03 p.m.

On Mar 11, 2007, at 9:45 PM, Stefan wrote:

> Because if I type the same non-ASCII characters on the keyboard, everything works fine.

Sure, but the flow is different.

When you type, the computer maps keystrokes to characters. That mapping is the one you configure via the Input Menu. Then, when the text is saved, the encoding configured in the editor is used to map those characters to bytes. So the chain is well-defined.
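The second half of that chain, from characters to bytes, can be illustrated with a short Python sketch. The sample word is an assumption chosen for this example; it only shows that the same characters produce different bytes depending on the encoding used when saving.

```python
# The text in memory is a sequence of characters, independent of
# any byte encoding. "\u00fc" is u-umlaut, "\u00df" is sharp s.
text = "Gr\u00fc\u00dfe"

# Saving means choosing an encoding that maps characters to bytes.
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

print(latin1_bytes)  # b'Gr\xfc\xdfe'
print(utf8_bytes)    # b'Gr\xc3\xbc\xc3\x9fe'

# Latin-1 uses one byte per character; UTF-8 uses two bytes for
# each of the two non-ASCII characters here.
print(len(latin1_bytes), len(utf8_bytes))  # 5 7
```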

On the other hand, the clipboard is a different story. I am not a Cocoa programmer, but reading pages like this one:

http://www.bekkoame.ne.jp/~n-iyanag/researchTools/clip_utils_osx.html

it looks like the flow is not clear enough to robustly handle pastes from any source, such as Word. Can anyone confirm or elaborate on this point?

-- fxn

Stefan

9:54 p.m.

OK, I see. Thx!

I'll prepare a small utility that saves the clipboard text using different encodings. Shouldn't be a problem. Thx!

Technically speaking, I would expect TextMate to read the clipboard using the clipboard's current encoding [likely to be UTF*], then convert it to the encoding of the current document and paste it into the current document.

Since TextMate obviously does a transformation, I wonder why it converts this way, which somehow breaks the current document.
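The conversion proposed here has a catch: it is only possible when every pasted character exists in the target encoding. The following Python sketch shows one way to check that. The helper name `unencodable_chars` and the sample text are assumptions for this example and do not reflect how TextMate works internally.

```python
def unencodable_chars(text, encoding):
    """Return the distinct characters of `text`, in order of first
    appearance, that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


# Pasted text with curly double quotes around a word containing
# an i with diaeresis (which Latin-1 does contain).
pasted = "\u201cna\u00efve\u201d"

problems_latin1 = unencodable_chars(pasted, "iso-8859-1")
problems_utf8 = unencodable_chars(pasted, "utf-8")

print(problems_latin1)  # the two curly quote characters
print(problems_utf8)    # empty list: UTF-8 can represent them all
```

A paste-time conversion to Latin-1 would therefore have to drop, replace, or reject the curly quotes, whereas a conversion to UTF-8 never loses anything.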

Allan Odgaard

12 Mar

8:02 a.m.

On 11. Mar 2007, at 22:54, Stefan wrote:

> Technically speaking, I would expect TextMate to read the clipboard using the clipboard's current encoding [likely to be UTF*], then convert it to the encoding of the current document and paste it into the current document.

It does -- and the current document is Unicode.

> Since TextMate obviously does a transformation, I wonder why it converts this way, which somehow breaks the current document.

No. What happens is that while you type, you stay within the characters defined in Latin-1.

When you paste from Word, you get (presumably) curly quotes inserted into your document. These characters do not exist in Latin-1, and so, the next time you save your document, TextMate will “upgrade” it to UTF-8.
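The behaviour described above can be imitated in a few lines of Python. This is only a sketch of the described effect, not TextMate's code; the function `save_bytes` and the sample strings are assumptions for this example. It also shows where the "funny characters" come from: the file is now UTF-8, but something still reads it as Latin-1.

```python
def save_bytes(text, preferred="iso-8859-1"):
    """Encode `text` with the preferred encoding, falling back to
    UTF-8 when a character is not representable.

    Returns a tuple (encoded bytes, name of the encoding used).
    """
    try:
        return text.encode(preferred), preferred
    except UnicodeEncodeError:
        return text.encode("utf-8"), "utf-8"


typed = "Gr\u00fc\u00dfe"               # only Latin-1 characters
pasted = "\u201cGr\u00fc\u00dfe\u201d"  # same word in curly quotes

_, typed_encoding = save_bytes(typed)
data, pasted_encoding = save_bytes(pasted)

print(typed_encoding)   # iso-8859-1
print(pasted_encoding)  # utf-8

# A program that still assumes Latin-1 turns each two-byte UTF-8
# sequence into two unrelated characters.
misread = data.decode("iso-8859-1")
print(repr(misread))
```

So the umlauts themselves were never the problem; the curly quotes forced the switch to UTF-8, and the umlauts then look broken wherever the file is still interpreted as Latin-1.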

My advice, whenever people bring up non-UTF-8 encodings, is to stop resisting, and go with UTF-8 :) UTF-8 can represent all the characters you can type or paste into your document, it is identified with 99.9999% certainty when the file is loaded from disk, it is an ASCII superset and thus compatible with basically all programs that just expect ASCII (compilers, script interpreters), it is compact, it was recommended in 1998 by the IETF for all future internet protocols, etc.
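Two of these properties, ASCII compatibility and reliable identification, can be checked with a small Python sketch. The function `looks_like_utf8` is a simple heuristic assumed for this example (a strict decode attempt); it is not how TextMate detects encodings, and the certainty figure above is Allan's, not something this code measures.

```python
def looks_like_utf8(data):
    """Heuristic: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII source decodes identically as ASCII and as UTF-8,
# which is what makes UTF-8 an ASCII superset.
ascii_source = b"<?php echo 'hello'; ?>"
same_text = ascii_source.decode("ascii") == ascii_source.decode("utf-8")
print(same_text)  # True

# Non-ASCII text saved as Latin-1 is almost never valid UTF-8,
# so the two encodings are easy to tell apart on load.
word = "Gr\u00fc\u00dfe"
utf8_ok = looks_like_utf8(word.encode("utf-8"))
latin1_ok = looks_like_utf8(word.encode("iso-8859-1"))
print(utf8_ok)    # True
print(latin1_ok)  # False
```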

participants (4)

Allan Odgaard
Charley Tiggs
Stefan
Xavier Noria