I found some related material in the list archives, but I wanted to check my
understanding of the use of C1 characters in XML 1.0 and in HTML 4. We have
a UTF-8 encoded XML document that has gone through a number of conversions and
import/export routines into/out of a CMS. At all times, the XML document was
valid against the DTD, and in Oxygen everything seems fine. No errors were
reported in the workflow until a late stage, where, in rendering to HTML, Saxon
reported:

net.sf.saxon.trans.DynamicError: Illegal HTML character: decimal 146

I traced the error to an article title, where there was an embedded hex character
reference:

Language rights versus speakers’ rights

Unicode character U+0092 is listed as a C1 control character (alias PRIVATE
USE TWO), which is not the same thing as a character in a private use area. I can’t
see our vendor or any workflow step (un)intentionally adding that character. About
the only thing that makes sense to me is that at some point (probably the
source document), Windows-1252 encoding was used, where decimal 146 is, I
think, a right single quote. (Whether that’s the appropriate character in
this case is another matter.)

So, in all the XML processes, character U+0092 was passed through as legal, but in
outputting to HTML it is illegal? I’m missing something here, surely.

Curiously, in my readings, HTML 5 seems to be special-casing Windows-1252 encoding,
along with UTF-8, in that it must be supported:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

Best regards,
Mike Waters
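
P.S. To make the suspected mix-up concrete, here is a minimal sketch in Python (chosen only for illustration; nothing in our workflow uses it). It contrasts byte 146 read as Windows-1252 with the number 146 taken directly as a Unicode code point, and checks U+0092 against the Char production of XML 1.0. The helper names are my own, and the C1 range check reflects my reading that HTML 4 treats 128 to 159 as unused, which would be consistent with the Saxon error but is an assumption on my part.

```python
import unicodedata

# Byte 0x92 (decimal 146) interpreted as Windows-1252.
as_cp1252 = bytes([0x92]).decode("cp1252")

# The same number taken directly as a code point, as a numeric
# character reference for 146 would be resolved in XML.
as_code_point = chr(0x92)


def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_c1_control(cp: int) -> bool:
    """Hypothetical helper: the C1 control range U+0080 to U+009F."""
    return 0x80 <= cp <= 0x9F


print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
# U+2019 RIGHT SINGLE QUOTATION MARK
print(f"U+{ord(as_code_point):04X}", unicodedata.category(as_code_point))
# U+0092 Cc
print(is_xml10_char(0x92), is_c1_control(0x92))
# True True
```

If that reading is right, the character is legal throughout the XML steps and only becomes a problem at HTML serialization, and replacing it with U+2019 in the source would presumably be the fix.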