There has been some talk here and on the bundle dev list about character sets and entities in HTML documents. Here is some information that might be useful.
First of all, there is no need to use entities in HTML documents for such things as é è ê ë etc., as it only makes them harder to read, to edit and to parse. When using utf-8, the entire utf-8 range CAN be used directly. As far as I know, the only entities needed are:
&amp; &gt; &lt;
in order not to confuse the parser. Now, all this only works if the document is sent by the server as UTF-8. Someone wrote on the dev-list that he makes sure this is the case by putting a meta-tag in the document (using http-equiv).
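To see why the declared charset matters, here is a minimal Python sketch of what a client effectively does when a UTF-8 document is labelled as iso-8859-1. The function name misread_as_latin1 is made up for this illustration and is not from any library.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1.

    This imitates a client that receives UTF-8 bytes but is told
    (by the HTTP header) that they are iso-8859-1.
    """
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode("iso-8859-1")


if __name__ == "__main__":
    # 'é' is two bytes in UTF-8 (0xC3 0xA9); read as iso-8859-1
    # each byte becomes its own character.
    print(misread_as_latin1("é"))
```

Plain ASCII text survives this unchanged, which is why the problem often goes unnoticed until the first accented character shows up.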
Unfortunately, most popular web-servers (well, at least Apache ;-)) don't look at the document to see if you included such a tag, and will always add a charset to the Content-Type header of the HTTP response. This defaults to iso-8859-1 and _overrides_ the one specified in the document meta-tag. Maybe this is not in line with the standard, but it is nevertheless what happens in practice.
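The precedence described above can be sketched in Python. This is only an illustration of the rule "header charset wins over the meta-tag", not a full implementation of how browsers detect encodings. The helper names charset_from_content_type and effective_charset are invented for this example; the parsing uses the standard library's email.message.Message.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type
    header value, or None if the header carries no charset."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def effective_charset(header_value, meta_charset=None):
    """Charset a client would use under the rule described in the text:
    the HTTP header wins, the meta-tag is only a fallback."""
    from_header = charset_from_content_type(header_value)
    if from_header is not None:
        return from_header
    return meta_charset.lower() if meta_charset else None


if __name__ == "__main__":
    # The server adds its default, so the utf-8 meta-tag is ignored.
    print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))
    # No charset in the header: the meta-tag gets a chance.
    print(effective_charset("text/html", "utf-8"))
```

To check a live server, the headers object returned by urllib.request.urlopen(url) offers the same get_content_charset() method, so you can see what the server actually sends for one of your own pages.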
So we need to make the server send the content as utf-8 instead. With Apache we have several alternatives:
1) Enable MultiViews using (in .htaccess) Options +MultiViews and rename the document to name.html.utf8 or name.php.utf8 etc. MultiViews also allows for content-type and language negotiation, so you can refer to a picture as 'name' and have several versions on disk: 'name.jpeg', 'name.png' etc. The same goes for omitting the .php and .html extensions. Note that some web-servers may be set up to prevent you from enabling MultiViews like this.
2) Set the default charset for all text/html and text/plain content. This will of course also include php, ruby and cgi in general. You can do this by putting: AddDefaultCharset utf-8 in your .htaccess file. This is probably the easiest way.
3) Set the charset to utf-8 for some file-extensions only. This hardly seems useful; it is probably of more use for setting the charset to something ELSE than utf-8 for certain files, e.g. name.txt or similar. The syntax is (in the .htaccess file): AddCharset <charset> <extensions...>, e.g. AddCharset iso-8859-1 .txt .text.
Note that there doesn't seem to be any way to add charsets to specific mime-types, only to extensions. Of the methods above I recommend 2, although enabling MultiViews is generally a Good Thing™, in my opinion.
On 02.03.2005, at 16:15, Sune Foldager wrote:
First of all, there is no need to use entities in HTML documents for such things as é è ê ë etc., as it only makes them harder to read, to edit and to parse. When using utf-8, the entire utf-8 range CAN be used directly. As far as I know, the only entities needed are:
&amp; &gt; &lt;
It is also sometimes necessary to use &quot; for quotation marks, as in this example:
<input value="&quot;Right,&quot; said Fred." />
It can also sometimes be helpful to use &apos; for single-quote marks.
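For pages that are generated rather than written by hand, the escaping discussed in this thread can be sketched in Python with the standard library's html.escape. The helper input_tag is a made-up name for this illustration. Note that html.escape emits the numeric reference &#x27; for the single quote rather than a named entity.

```python
from html import escape


def input_tag(value):
    """Build an <input> tag whose value attribute is safely escaped.

    quote=True also escapes double and single quotes, which is what
    an attribute value needs.
    """
    return '<input value="{}" />'.format(escape(value, quote=True))


if __name__ == "__main__":
    print(input_tag('"Right," said Fred.'))
    # Accented characters are left alone; only the markup characters
    # are replaced.
    print(escape("é & è", quote=False))
```

The accented characters pass through untouched, which matches the point made at the start of the thread: with utf-8 only the markup characters need entities.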
On 4. mar 2005, at 12:23, Ryan Schmidt wrote:
On 02.03.2005, at 16:15, Sune Foldager wrote:
First of all, there is no need to use entities in HTML documents for such things as é è ê ë etc., as it only makes them harder to read, to edit and to parse. When using utf-8, the entire utf-8 range CAN be used directly. As far as I know, the only entities needed are:
&amp; &gt; &lt;
It is also sometimes necessary to use &quot; for quotation marks, as in this example:
<input value="&quot;Right,&quot; said Fred." />
It can also sometimes be helpful to use &apos; for single-quote marks.
Ah, yes indeed :-). I was a little in doubt if I had included all the strictly necessary ones :-p.