Re: [xml-dev] Transforming ™ to ™


From: "andrew welch" <andrew.j.welch@gmail.com>
To: "Chris Burdess" <d09@hush.ai>
Date: Thu, 20 Jul 2006 09:59:12 +0100

On 7/20/06, Chris Burdess <d09@hush.ai> wrote:

> Sanjay Goel wrote:
> > ... if I put &#x2122; or if I define a entity, the output in html
> > is �. So this html gets displayed differently on different
> > browsers. I need &trade; or &#x2122; in the final html so that the
> > browsers read it correctly.
>
> This may be because you specified "xml" as the XSL output method but
> serve the result as text/html. If you specify "html" as the output
> method the transformer should include a content type with a charset
> parameter in an http-equiv instruction in the generated HTML.
>
> Ensure that you are serving the result correctly, with a charset
> parameter the same as the charset you serialised the XSL result to.
> So if you serialised to UTF-8 and you are serving as text/html you
> should include the header
>
>    Content-Type: text/html; charset=UTF-8
>
> See http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4.1
> for why you need to do this. The default for HTTP, without a charset
> parameter, is ISO-8859-1, but this encoding does not contain the
> trademark symbol and will therefore not work for you.
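
A minimal Python sketch of the point about ISO-8859-1, using only the
standard library codecs. The names TRADE_MARK and can_encode are made up
for illustration, and the last line only approximates what a reader sees
when UTF-8 bytes are interpreted with the wrong charset:

```python
# U+2122 TRADE MARK SIGN, the character under discussion.
TRADE_MARK = "\u2122"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 can represent the character (as three bytes) ...
utf8_bytes = TRADE_MARK.encode("utf-8")

# ... but ISO-8859-1 has no code point for it at all.
latin1_ok = can_encode(TRADE_MARK, "iso-8859-1")

# Reading the UTF-8 bytes back with a single-byte Western encoding
# produces three unrelated characters instead of one trademark sign.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(), latin1_ok, misread)
```

If the serialised bytes are UTF-8, the charset parameter in the
Content-Type header has to say so, otherwise the receiver is free to
fall back to a default that cannot describe them.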


One thing to be aware of here is that browsers auto-switch between
ISO-8859-1 and Windows-1252.  Although ISO-8859-1 doesn't contain the
TM character, Windows-1252 does, at #153 (x99), a position that falls
inside the C1 control range of ISO-8859-1.
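
The same byte read under the two encodings can be compared directly. This
is a small sketch using Python's built-in codecs ("cp1252" is Python's
name for Windows-1252); the variable names are only illustrative:

```python
import unicodedata

# The single byte at position #153 (x99).
raw = bytes([0x99])

# Windows-1252 assigns a printable character to this position.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps every byte to the code point with the same number,
# so x99 becomes U+0099, which is a C1 control character.
as_latin1 = raw.decode("iso-8859-1")

print(unicodedata.name(as_cp1252))       # name of the Windows-1252 reading
print(hex(ord(as_latin1)))               # code point of the ISO-8859-1 reading
print(unicodedata.category(as_latin1))   # "Cc" marks a control character
```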

If a browser (HTML parser) is given a page that is apparently encoded
using ISO-8859-1 but that contains characters in the C1 control range
(such as x99), it will auto-switch the read encoding to Windows-1252 and
automagically display the characters.  This ability to be "sloppy"
with the encoding and have the browser detect the one you really meant
doesn't carry over to XML parsers, where the policy has rightly shifted
towards being strict.

So what does this mean?  Given the following page, where the meta
states the encoding is ISO-8859-1 but a C1 control character has been
used (#153):

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
     <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
     <title>Encoding example</title>
  </head>
  <body>somebrand&#153;</body>
</html>

When this is served to an HTML parser, the read encoding is
auto-switched to Windows-1252 and the TM character is displayed:

somebrand™

When the same file is served to an XML parser (which is what will
happen in an XHTML browser) the file is read using ISO-8859-1 and the
non-displayed C1 control character "Single Graphic Character
Introducer" is output (it's there, you just can't see it):

somebrand
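
The two outcomes can be reproduced outside a browser. The sketch below is
only an approximation: it assumes Python's html.unescape is an acceptable
stand-in for how an HTML parser resolves the reference &#153;, and that
xml.etree.ElementTree stands in for the XML parser of an XHTML browser.
The names html_text and xml_text are made up for illustration:

```python
import html
import xml.etree.ElementTree as ET

# The body of the example page, with the reference to #153.
fragment = "<body>somebrand&#153;</body>"

# HTML-style handling: references in the x80-x9F range are remapped
# as if they were Windows-1252 positions, so #153 becomes the TM sign.
html_text = html.unescape("somebrand&#153;")

# XML handling: the reference is taken literally as code point U+0099,
# the invisible C1 control character.
xml_text = ET.fromstring(fragment).text

print(repr(html_text))
print(repr(xml_text))
```

Printing with repr() makes the invisible character show up as an escape,
which is a quick way to check a test file without trusting what the
browser chooses to display.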

I'm highlighting this here because it caught me out: creating test files
and opening them in the browser only compounded the issue, because the
silent auto-switching gave the impression everything was ok.  A real
pain.

cheers
andrew

