[TxMt] Character-set/encoding in HTML documents.
Sune Foldager
cryo at cyanite.org
Wed Mar 2 15:15:58 UTC 2005
There has been some talk here and on the bundle dev list about
character sets and entities in HTML documents. Here is some information
that might be useful.
First of all, there is no need to use entities in HTML documents for
such things as é è ê ë etc., as it only makes them harder to read, to
edit and to parse. When using utf-8, any Unicode character CAN be used
directly. As far as I know, the only entities needed are:
&amp; &gt; &lt;
in order not to confuse the parser. Now, all this only works if the
document is sent by the server as UTF-8. Someone wrote on the dev-list
that he makes sure this is the case by putting a meta-tag in the
document (using http-equiv).
Unfortunately, most popular web-servers (well, at least Apache ;-))
don't look at the document to see if you included such a tag, and
will always add a char-set header to the HTTP response. This defaults
to iso-8859-1 and _overrides_ the one specified in the document
meta-tag. As far as I can tell this is in line with the HTML 4.01
specification, which gives the HTTP header priority over the meta-tag,
and it is certainly what happens in practice.
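To make the precedence concrete, here is a minimal Python sketch of how a client could pick the charset. The function names (header_charset, meta_charset, effective_charset) are made up for illustration, and the logic is a simplification: real browsers also look at byte-order marks and do other sniffing. It only shows the rule described above, that the HTTP header wins and the meta-tag is a fallback.

```python
import re
from email.message import Message

# Hypothetical helper: extract the charset parameter from a Content-Type value.
def header_charset(content_type):
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased charset, or None if absent

# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

# Hypothetical helper: extract the charset declared in the document itself.
def meta_charset(body):
    match = META_RE.search(body)
    return match.group(1).decode("ascii").lower() if match else None

# The HTTP header takes priority; the meta-tag is only consulted without one.
def effective_charset(content_type, body):
    return header_charset(content_type) or meta_charset(body)

# A UTF-8 document that declares itself as UTF-8 in a meta-tag.
body = ('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '<p>café</p>').encode("utf-8")

# Server adds its own default charset: the meta-tag is ignored.
print(effective_charset("text/html; charset=ISO-8859-1", body))
# Server sends no charset: the meta-tag is used.
print(effective_charset("text/html", body))
# What the reader sees when the UTF-8 bytes are decoded as iso-8859-1.
print(body.decode("iso-8859-1"))
```

In the first case the UTF-8 bytes for é end up decoded as two iso-8859-1 characters, which is the familiar garbage you get when the header and the document disagree.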
So we need to make the server send the content as utf-8 instead. With
Apache we have several alternatives:
1) Enable MultiViews using (in .htaccess) Options +MultiViews and
rename the document to name.html.utf8 or name.php.utf8 etc. MultiViews
also allows for content-type and language negotiation so you can refer
to a picture with 'name' and have several versions on disk 'name.jpeg',
'name.png' etc. The same goes for omitting the .php and .html
extensions. Note that some web-servers may be set up to prevent you from
enabling MultiViews like this.
2) Set the default charset for all text/html and text/plain content.
This will of course also include php, ruby and cgi in general. You can
do this by putting: AddDefaultCharset utf-8 in your .htaccess file.
This is probably the easiest way.
3) Set the charset to utf-8 for some file-extensions only. This hardly
seems useful; the directive is probably of more use for setting the
charset to something OTHER than utf-8 for certain files, e.g. name.txt or
similar. The syntax is (in the .htaccess file): AddCharset <charset>
<extensions...>, e.g. AddCharset iso-8859-1 .txt .text.
Note that there doesn't seem to be any way to add charsets to specific
mime-types, but only extensions. Of the methods above I recommend 2,
although enabling MultiViews is generally a Good Thing™, in my opinion.
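Whichever method you pick, it is worth checking what the server actually sends. The following is a small Python sketch for that, not a definitive tool: charset_from_headers and served_charset are names I made up, and the URL in the usage note is a placeholder. It sends a HEAD request and reports the charset parameter of the Content-Type header.

```python
from urllib.request import Request, urlopen

# Hypothetical helper: read the charset from a parsed header object.
# Works with any email.message.Message, which is what urllib hands back.
def charset_from_headers(headers):
    return headers.get_content_charset()  # lower-cased charset, or None

# Hypothetical helper: ask the server for the headers of `url` only (HEAD)
# and return the charset it announces. Needs network access when called.
def served_charset(url):
    request = Request(url, method="HEAD")
    with urlopen(request) as response:
        return charset_from_headers(response.headers)
```

Calling served_charset("http://example.org/name.html") against your own site should give 'utf-8' once the configuration has taken effect, or None if the server sends no charset at all.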
--
Sune :: http://cyanite.org/
"And now there is merely silence, silence,
silence, saying all we did not know."