[TxMt] Character-set/encoding in HTML documents.

Sune Foldager cryo at cyanite.org
Wed Mar 2 15:15:58 UTC 2005


There has been some talk here and on the bundle dev list about 
character sets and entities in HTML documents. Here is some information 
that might be useful.

First of all, there is no need to use entities in HTML documents for
such things as é è ê ë etc., as it only makes them harder to read, to
edit and to parse. When using utf-8, the entire Unicode range CAN be
used directly. As far as I know, the only entities needed are:

&amp;  &gt;  &lt;

These are needed in order not to confuse the parser. Now, all this only
works if the document is sent by the server as UTF-8. Someone wrote on
the dev-list that he makes sure this is the case by putting a meta-tag
in the document (using http-equiv).
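
To illustrate the point about entities, here is a minimal Python sketch
that escapes only the three characters the parser cares about and leaves
everything else as raw UTF-8. The function name to_html_bytes is made up
for this example; html.escape is from the Python standard library.

```python
import html

def to_html_bytes(text):
    """Escape only &, < and >, then encode the result as UTF-8."""
    # quote=False leaves " and ' alone, so exactly three characters change.
    # Accented letters such as é are not touched by html.escape at all.
    return html.escape(text, quote=False).encode("utf-8")

page_fragment = to_html_bytes("crème brûlée & café < 5")
```

The accented letters pass through unchanged and simply become multi-byte
UTF-8 sequences, which is fine as long as the server labels the response
as utf-8.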

Unfortunately, most popular web-servers (well, at least Apache ;-))
don't look at the document to see if you included such a tag, and, as
commonly configured, will add a charset parameter to the Content-Type
header of the HTTP response. This is typically iso-8859-1 and
_overrides_ the one specified in the document meta-tag. As far as I can
tell, HTML 4.01 gives the HTTP header priority over the meta-tag, so
this is what browsers are supposed to do, and it is what happens in
practice.
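
The precedence described above can be sketched in a few lines of Python.
This is only an illustration of the rule "HTTP header first, meta-tag
second", not how any particular browser implements it; the function
names and the simplified regular expression are my own assumptions.

```python
import re
from email.message import Message

# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def effective_charset(content_type, body):
    """Charset a client would use: the HTTP header wins over the meta-tag."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header
    match = META_CHARSET.search(body)
    return match.group(1).decode("ascii").lower() if match else None
```

With a header of "text/html; charset=ISO-8859-1" the meta-tag is never
consulted, which is exactly the problem: the document says utf-8 but is
decoded as iso-8859-1.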

So we need to make the server send the content as utf-8 instead. With 
Apache we have several alternatives:

1) Enable MultiViews by putting Options +MultiViews in .htaccess and
rename the document to name.html.utf8 or name.php.utf8 etc. (this
assumes the server maps the .utf8 extension to utf-8, which I believe
the stock Apache configuration does). MultiViews also allows for
content-type and language negotiation, so you can refer to a picture as
'name' and have several versions on disk: 'name.jpeg', 'name.png' etc.
In the same way you can omit the .php and .html extensions. Note that
some web-servers may be set up to prevent you from enabling MultiViews
like this.

2) Set the default charset for all text/html and text/plain content. 
This will of course also include php, ruby and cgi in general. You can 
do this by putting AddDefaultCharset utf-8 in your .htaccess file.
This is probably the easiest way.

3) Set the charset to utf-8 for some file-extensions only. This hardly
seems useful; it is probably more useful for setting the charset to
something OTHER than utf-8 for certain files, e.g. name.txt or similar.
The syntax is (in the .htaccess file): AddCharset <charset>
<extensions...>, e.g. AddCharset iso-8859-1 .txt .text.

Note that there doesn't seem to be any way to add charsets to specific
mime-types, only to extensions. Of the methods above I recommend 2,
although enabling MultiViews is generally a Good Thing™, in my opinion.
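
To make the interplay of the three options a bit more concrete, here is
a toy model in Python of how the charset for a file could be chosen. It
is NOT Apache's actual code, just my reading of the behaviour: an
extension listed with AddCharset wins (the rightmost one if several
match), and AddDefaultCharset only applies to text/html and text/plain.
The table contents and the name charset_for are assumptions made for the
example.

```python
# Hypothetical equivalents of:
#   AddCharset utf-8 .utf8
#   AddCharset iso-8859-1 .txt .text
#   AddDefaultCharset utf-8
ADD_CHARSET = {".utf8": "utf-8", ".txt": "iso-8859-1", ".text": "iso-8859-1"}
DEFAULT_CHARSET = "utf-8"
DEFAULT_APPLIES_TO = {"text/html", "text/plain"}

def charset_for(filename, content_type):
    """Pick a charset for a file in this simplified model, or None."""
    chosen = None
    # Every extension counts, so name.html.utf8 picks up ".utf8".
    for part in filename.lower().split(".")[1:]:
        ext = "." + part
        if ext in ADD_CHARSET:
            chosen = ADD_CHARSET[ext]  # a later (rightmost) match overrides
    if chosen is not None:
        return chosen
    # No per-extension charset: fall back to the default for text types only.
    if content_type in DEFAULT_APPLIES_TO:
        return DEFAULT_CHARSET
    return None
```

In this model name.html.utf8 and a plain index.html both come out as
utf-8, notes.txt comes out as iso-8859-1, and an image gets no charset
at all.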


-- 
Sune    ::    http://cyanite.org/
     "And now there is merely silence, silence,
         silence, saying all we did not know."


