On 29/11/2005, at 21:45, Nicholas Orr wrote:
<?xml version="1.0" encoding="UTF-16"?>
That's the problem. This will make xmllint (used by the Tidy command) return the result as utf-16, but TextMate expects it to be utf-8, and so it'll show up wrong.
There are a few options:
1) convert the files from a script instead
2) change to "UTF-8" (remember to convert the files to utf-8 as well, if they're currently in utf-16 format)
3) change the “xmllint --format -” line to “xmllint --format - | iconv -f ucs-2 -t utf-8” in the Tidy command (Bundle Editor -> Show Bundle Editor -> XML -> Tidy).
Since it sounds like you have a lot of files to convert, option 1 also has the advantage that you don't have to manually run Tidy in TM on each file.
Personally I'd use a script, but also let the script convert the files to utf-8 (utf-8 is generally a better encoding than utf-16, and while my reference for this is vague, I think some in the Unicode consortium see utf-16 as legacy, especially now that it no longer has a 1:1 mapping to unicode code points (because of ucs-4)).
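(For illustration, a minimal sketch of option 1, assuming Python 3; the folder name and glob pattern are placeholders, so point them at your own files and keep backups before running anything like this.)

    #!/usr/bin/env python3
    # Sketch: batch-convert UTF-16 XML files to UTF-8 and update the
    # XML declaration so xmllint/TextMate see UTF-8 from then on.
    import glob

    for path in glob.glob("xml-files/*.xml"):      # hypothetical location
        with open(path, encoding="utf-16") as f:   # honours the BOM
            text = f.read()
        text = text.replace('encoding="UTF-16"', 'encoding="UTF-8"', 1)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)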
On 30/11/2005, at 8:19 AM, Allan Odgaard wrote:
[...]
Ok, you go along running a computer business, thinking you know a fair bit, and then something just goes straight over your head. Thanks for taking the time to reply, but you've lost me. Unicode is one thing I haven't kept track of at all.
In the end I did #3, and it worked fine. It worked so well I went and bought a copy.
Thanks for your help.
Nick
On 30/11/2005, at 3:05, Nicholas Orr wrote:
[...] Thanks for taking the time to reply, but you've lost me. Unicode is one thing I haven't kept track of at all.
Don't know if it helps (or is of interest), but here's my quick summary of the unicode encoding situation (factually unchecked):
Unicode code points (which I'll just call characters for simplicity) were originally 16 bits each. This means to write a text in unicode, each character would require 16 bits.
UTF-8 is a representation (encoding) of unicode which was devised to provide a way to transition to unicode while still retaining backwards compatibility with ASCII (which amongst other things means that all our unix tools work gracefully with the files).
From a user's POV UTF-8 has (IMHO) all the advantages, most notably the ASCII compatibility. From a developer's POV UTF-8 is a little more difficult to work with because each character can be represented with multiple bytes, so for example it's not (always) legal to just cut a string in half, and to access a given index in a string, one needs to iterate through the string, rather than just do string[index].
For this reason, it might seem [1] more attractive to just work with the 16 bit characters, at least in memory.
But time passed, and it turned out that 16 bits per character were not enough, so today a unicode character is 32 bits, and a 16 bit unicode text is said to be in UTF-16, which is now just an encoding for 32 bit unicode (ucs-4). So all the “problems” of UTF-8 now exist for UTF-16, plus the lack of backwards compatibility, the generally doubled file size, and the endian issues (i.e. are the 16 bit units in big or little endian format, and despite the byte-order marker, code that reads UTF-16 needs to know the endianness of the architecture on which it's compiled).
So all in all, today UTF-16 is a pretty crappy encoding! :)
In the end I did #3, and it worked fine. It worked so well I went and bought a copy.
Thanks!
[1] I say “seem” because in my experience this is mostly a theoretical disadvantage, as the need to split/index a string by a numeric index is rare, and when it happens, one is generally already iterating the string character by character (to find where to split it, etc.).
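(The points above are easy to check from Python 3; the sample characters below are arbitrary.)

    s = "aé€𝄞"          # U+0061, U+00E9, U+20AC, U+1D11E

    # UTF-8: 1-4 bytes per code point, and ASCII stays ASCII.
    print([len(c.encode("utf-8")) for c in s])      # [1, 2, 3, 4]

    # UTF-16: U+1D11E is above U+FFFF, so it needs a surrogate pair,
    # i.e. UTF-16 is no longer one 16-bit unit per code point either.
    print([len(c.encode("utf-16-be")) for c in s])  # [2, 2, 2, 4]

    # Endianness: plain "utf-16" prepends a byte-order mark, and the
    # same text gives different bytes in big- vs little-endian form.
    print("A".encode("utf-16-le"))   # b'A\x00'
    print("A".encode("utf-16-be"))   # b'\x00A'
    print("A".encode("utf-16"))      # b'\xff\xfeA\x00' (BOM + little-endian on this machine)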
If you want additional detail about Unicode encoding geekery:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF ...
UTF-16 is probably what most people thought most programmers would use for Unicode; this is reflected in the fact that the native character type in both Java and C# is a sixteen-bit quantity. Of course, it doesn't really represent a Unicode character, exactly (although it does most times), it represents a UTF-16 codepoint.
UTF-16 is about the most efficient way possible of representing Asian character strings, each character nestling snugly into two bytes of storage. For ASCII characters, of course, you end up using two bytes to represent what would actually fit into one.
...
Chris
One more and I'm done. In reference to strings written in East Asian languages (which require three bytes for each character when encoded in UTF-8), there's a good point made in passing here:
http://jroller.com/page/bloritsch?entry=obsessed_with_speed
Not to mention UTF16 is vastly simpler to work with when you have to manipulate the strings than UTF8.
Chris
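(A quick check of the per-character sizes mentioned in the quotes, assuming Python 3; the sample strings are arbitrary.)

    ascii_text = "hello world"
    cjk_text   = "日本語のテキスト"      # Japanese, all BMP characters

    for label, text in [("ascii", ascii_text), ("cjk", cjk_text)]:
        print(label,
              "utf-8:",  len(text.encode("utf-8")),
              "utf-16:", len(text.encode("utf-16-be")))
    # ascii: 11 bytes in UTF-8 vs 22 in UTF-16
    # cjk:   24 bytes in UTF-8 vs 16 in UTF-16 (3 vs 2 bytes per character)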
On Fri 2/12/2005, Chris Thomas wrote:
[...] UTF-16 is about the most efficient way possible of representing Asian character strings, each character nestling snugly into two bytes of storage. [...]
except that there are 2 different UTF-16s (big and little endian), and that a bunch of preexisting software treats the 0 byte as end of string...
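(To make that concrete, assuming Python 3 and an arbitrary sample string: the two byte orders produce different byte sequences, and both are full of 0x00 bytes, which C-style string handling treats as a terminator.)

    text = "Hi"
    print(text.encode("utf-16-be"))   # b'\x00H\x00i'
    print(text.encode("utf-16-le"))   # b'H\x00i\x00'
    print(text.encode("utf-8"))       # b'Hi' -- no embedded NULs for ASCII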
On 2/12/2005, at 15:42, Chris Thomas wrote:
UTF-16 is about the most efficient way possible of representing Asian character strings [...]
I would have put my money on bzip2 for “the most efficient way possible” if the parameter was size :p
UTF-16 is about the most efficient way possible of representing Asian character strings [...]
I would have put my money on bzip2 for “the most efficient way possible” if the parameter was size :p
google for "criscione 100 bytes". I place my bets on this method ^_^
Gentlemen,
Please read the entire article before sniping, if you're so inclined. I quoted a very small section of it. :-)
Please note that I'm _not_ taking a stand on UTF-16, although I could see why you might think that from the quotes I provided. I'm just trying to point out additional information.
Allan:
I would have put my money on bzip2 for “the most efficient way possible” if the parameter was size :p
Given that bzip2 requires decompression before you can address individual characters, I don't think it can lay claim to being "most efficient." :)
Chris