Re: [xml-dev] C1 characters in XML 1.0 and HTML 4

From: Michael Kay <mike@saxonica.com>
To: xml-dev@lists.xml.org
Date: Sat, 12 Mar 2011 23:22:18 +0000

On 12/03/2011 23:02, Waters, Michael, Springer US wrote:

I found some related material in the list archives, but I wanted to check my understanding of the use of C1 characters in XML 1.0 and in HTML 4.

We have a UTF-8 encoded XML document that has gone through a number of conversions and import/export routines into/out of a CMS. At all times, the XML document was valid against the DTD, and in Oxygen everything seems fine. No errors were reported in the workflow until a late stage, where in rendering to HTML Saxon reported:

net.sf.saxon.trans.DynamicError: Illegal HTML character: decimal 146

I traced the error to an article title, where there was an embedded hex character reference:

Language rights versus speakers rights

Unicode character U+0092 is a C1 control character (its Unicode name is PRIVATE USE TWO), not a printable character. I can’t see our vendor or any workflow step (un)intentionally adding that character. About the only thing that makes sense to me is that at some point (probably the source document), Windows-1252 encoding was used, where decimal 146 is, I think, a right single quote. (Whether that’s the appropriate character in this case is another matter.)

So, in all the XML processes, character U+0092 was passed through as legal, but in outputting to HTML it is illegal? I’m missing something here, surely.

Your analysis is quite correct. Occasionally the internationalization working group in W3C decides to flex its muscles, and one instance of this was their insistence that XSLT should not generate HTML that contains characters which HTML defines to be illegal. It's probably a mistake that XML allowed these C1 characters, because they are nearly always miscoded CP1252 characters. XML 1.1 tried to fix this problem but we all know what happened to that. In the meantime, the result is that you feed a bad character into the start of your processing pipeline and you discover the problem at the final stage when HTML emerges.
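
As a rough illustration of catching this at the start of the pipeline rather than the end, here is a minimal Python sketch. It assumes any C1 character in the text is a miscoded Windows-1252 byte, which is the likely case here but not guaranteed. The function names are hypothetical and have nothing to do with Saxon, and the position of U+0092 in the sample title is a guess.

```python
import re

# The C1 control range, U+0080 through U+009F.
C1_RE = re.compile("[\u0080-\u009f]")

def find_c1(text):
    """Return (offset, code point) for each C1 control character in text."""
    return [(m.start(), ord(m.group())) for m in C1_RE.finditer(text)]

def repair_cp1252(text):
    """Reinterpret each C1 character as the Windows-1252 byte of the same value.

    Byte values that Python's cp1252 codec treats as undefined
    (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left unchanged.
    """
    def fix(match):
        try:
            return bytes([ord(match.group())]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group()
    return C1_RE.sub(fix, text)

# Sample title; the placement of U+0092 is an assumption for illustration.
title = "Language rights versus speakers\u0092 rights"
print(find_c1(title))
print(repair_cp1252(title))
```

Running a check like find_c1 when content is first imported would flag the character long before the HTML serialization step. Whether to repair automatically or just report is a judgment call, since the right single quote may not be the character the author intended.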

Curiously, in my readings, HTML 5 seems to be special-casing Windows-1252 encoding, along with UTF-8, in that it must be supported:

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

Best regards,

Mike Waters

WHATWG adopts the principle that the browser accepts any input. That doesn't quite mean that all input is legal, but it amounts to much the same thing. The reasoning, of course, is that the end user shouldn't pay the price for the content provider's carelessness. This is very different from the culture in W3C, which tries to improve data quality by insisting that software should reject bad data.
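
The difference in attitude can be seen with nothing but the Python standard library. This is only a sketch: it uses xml.etree.ElementTree as a stand-in for an XML 1.0 parser and html.unescape as a stand-in for HTML5 character-reference handling, and neither is what Saxon or a browser actually runs. The sample markup is made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# XML 1.0: the reference &#x92; is legal, and the parser hands back
# the C1 control character U+0092 unchanged.
xml_char = ET.fromstring("<t>&#x92;</t>").text

# HTML5 rules: numeric references in the 0x80-0x9F range are remapped
# to the corresponding Windows-1252 characters.
html_char = html.unescape("&#x92;")

print(hex(ord(xml_char)), hex(ord(html_char)))
```

The XML side preserves the character as written, while the HTML5 side quietly substitutes the right single quotation mark that the content provider most likely meant.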

Michael Kay
Saxonica



Copyright 1993-2007 XML.org. This site is hosted by OASIS