Tedd wrote
> Now, to draw this thread back to on-topic, I know how code-points are
> used in url's, html, and such, but I would like to see how xml
> incorporates/uses Unicode code-points. Anyone? Please enlighten me.
XML character handling is best thought of as having two parts:
1) A parsing function: it has Unicode characters in text entities
as its input, and a document containing data and markup as its
output. This is XML proper.
So:
* A numeric character reference (e.g., &#xABCD;) in XML is
always in terms of Unicode code points (not the bytes or code
units of UTF-8, ISO 8859-1, UTF-16, etc.)
* XML does not constrain the actual coded character set that
an implementation uses internally.
* Being defined in terms of Unicode, XML is text, not
binary. In other words, the character € means exactly
what ISO 10646 or the Unicode Consortium says it means.
If I provide an XML document in CP1252, as used in the US
and here in Australia, and I encode the byte 0x80 as binary data,
that byte will be mapped to U+20AC, the Euro character, before
the XML parser sees it (assuming that the XML processor accepts CP1252.)
In that case, I will need to know which encoding
was actually used for my data in order for my application to
map the data back to the original code. Consequently, attempting
to overload XML characters as binary code points is probably
unworkable except for tightly coupled processes or where
UTF-16 is used. Even using UTF-16 for binary overloaded transmission
is probably not reliable for Japanese systems (see Japanese
XML Profile at W3C technical report site). And XML 1.0 does not
allow every Unicode code point, notably U+0000. (The lack of
U+0000 is often portrayed, typically by Microsoft users, as
an antique carbuncle that should be removed; however, I see
that protecting \00 is alive in the Java JNI interface, see
the second-last paragraph of
http://www.dil.univ-mrs.fr/docs/j2sdk/1.5/guide/jni/spec/types.html)
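
To make the points in part 1 concrete, here is a small sketch using
Python's standard xml.etree.ElementTree module (my choice for
illustration; any conforming XML processor should behave the same way
for the reference and for U+0000). The helper names decode_byte and
nul_is_rejected are made up for this example.

```python
import xml.etree.ElementTree as ET

# A numeric character reference names a Unicode code point, whatever
# encoding the document itself is stored in. ISO-8859-1 has no Euro
# sign, yet the reference still resolves to U+20AC.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>&#x20AC;</a>'
euro = ET.fromstring(latin1_doc).text


def decode_byte(byte_value: int, encoding: str) -> str:
    """Transcode a single byte to a Unicode string (hypothetical helper)."""
    return bytes([byte_value]).decode(encoding)


# The same byte becomes a different character depending on the encoding
# used to transcode it, so the original byte cannot be recovered from
# the character alone.
as_cp1252 = decode_byte(0x80, "cp1252")    # U+20AC, the Euro sign
as_latin1 = decode_byte(0x80, "latin-1")   # U+0080, a C1 control


def nul_is_rejected() -> bool:
    """Return True if the parser refuses a reference to U+0000."""
    try:
        ET.fromstring("<a>&#0;</a>")
    except ET.ParseError:
        return True
    return False
```

An application that wants the byte 0x80 back has to know whether
CP1252 or ISO 8859-1 was used on the way in, which is the coupling
problem described above.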
2) An algorithm that will typically be used to select the transcoding
function: it has bytes from a Web resource as its input, and Unicode
characters in an entity as its output. This is the auto-detection
algorithm of Appendix F.
So:
* The auto-detection algorithm is never required when your XML
implementation already has the entity available as Unicode. For example,
an XML document held in a single Java String does not require
(indeed, the parser should ignore) any encoding information in the
XML header. The same holds when a text resource is accessed over the
web, and all the different protocols' encoding defaults match, and the
server and intermediate caches etc. are configured correctly, and your
client converts the bytes correctly into a form your system can treat as
Unicode. (The unworkability of this parallel chain of metadata
is what makes sending XML as application/xml using the XML headers
and auto-detection more prudent than sending the XML as text/xml and
relying on the Web infrastructure and protocols to get it right.)
* When people discover that the MIME infrastructure is rather
broken (for reliable transmission of non-ASCII text across
different locales, or when using a character encoding different
from the locale default, notably UTF-8), the typical reaction is
not to pull together and make sure everything is configured
correctly, but to hack together something that seems to
work. Invocations of the incompetency of people who make standards
are never particularly convincing when made by people who deliberately
break them.
* The nail in the coffin for external transmission of character
encoding (rather than using auto-detection) is that standard APIs
for writing strings to files do not provide a built-in mechanism
for transmitting the encoding. (This would require something like
my XText format, which is basically generalizing Appendix F for
use in almost any textual data format.)
* XML does not constrain the actual coded character set that
an implementation accepts externally, except that all implementations
must accept UTF-8 and UTF-16.
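
The detection step in part 2 can be sketched roughly as follows. This
is a simplified reading of Appendix F, not a complete implementation:
it reports only the encoding family for UTF-16/UTF-32/EBCDIC input
rather than going on to read the declaration in that family, and the
function name sniff_encoding is my own invention. The last few lines
illustrate the first bullet: as far as I know, Python's ElementTree
disregards the encoding declaration when it is handed a Unicode string
rather than bytes.

```python
import codecs
import re
import xml.etree.ElementTree as ET

# Byte order marks, UTF-32 first so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Without a BOM: what the first four bytes of "<?xml" look like in each
# encoding family.
_PATTERNS = [
    (bytes.fromhex("0000003C"), "UTF-32BE"),
    (bytes.fromhex("3C000000"), "UTF-32LE"),
    (bytes.fromhex("003C003F"), "UTF-16BE"),
    (bytes.fromhex("3C003F00"), "UTF-16LE"),
    (bytes.fromhex("4C6FA794"), "EBCDIC"),
]

# Encoding declaration as it appears in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding (or encoding family) of an XML entity."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:4]
    for pattern, name in _PATTERNS:
        if head == pattern:
            return name
    if head == b"<?xm":
        # ASCII-compatible so far: the declaration can be read directly.
        match = _DECL.match(data)
        if match:
            return match.group(1).decode("ascii")
    # No BOM and no usable declaration: XML's default is UTF-8.
    return "UTF-8"


# When the entity is already Unicode, no detection is needed. The
# declaration below is "wrong" (ISO-8859-1 has no Euro sign), but the
# string is parsed as the characters it already contains.
already_unicode = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u20ac</a>'
text = ET.fromstring(already_unicode).text
```

Because the declaration travels inside the bytes, a file written by any
ordinary string-to-file API stays self-describing, with no reliance on
metadata carried beside it.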
As with many standards, XML describes its input in concrete terms,
but its output only in abstract terms. Indeed, the output of an
XML parser function was defined in such vague terms that an ancillary
standard, XML Infoset, was written to provide help for subsequent
standards.
I hope this is useful, even for aspiring trolls,
Rick Jelliffe