   Re: [xml-dev] fuzzy end of this lolly-pop OR Why Latin Rocks


Tedd wrote:

> Now, to draw this thread back to on-topic, I know how code-points are
> used in url's, html, and such, but I would like to see how xml
> incorporates/uses Unicode code-points. Anyone? Please enlighten me.

XML character handling is best thought of as having two parts:

1) A parsing function: it has Unicode characters in text entities
as its input, and a document containing data and markup as its
output. This is XML proper.

  So:
  * A numeric character reference (e.g., &#xABCD;) in XML is
  always in terms of Unicode characters (not UTF-8, ISO 8859-1,
  UTF-16, etc.)

  * XML does not constrain the actual coded character set that
 an implementation uses internally.

  * Being defined in terms of Unicode, XML is text, not
  binary. In other words, the character € means exactly
  what ISO 10646 or the Unicode Consortium says it means.
  If I provide an XML document in CP1252, as used in the US
  and here in Australia, and I encode the byte 0x80 as binary
  data, that byte will be mapped to U+20AC, the Euro character,
  before the XML parser ever sees it (assuming that the XML
  processor accepts CP1252; see the sketch after this list).

  In that case, I will need to know which encoding
  was actually used for my data in order for my application to
  map the data back to the original code. Consequently, attempting
  to overload XML characters as binary code points is probably
  unworkable except for tightly coupled processes or where
  UTF-16 is used. Even using UTF-16 for binary overloaded
  transmission is probably not reliable for Japanese systems (see
  the Japanese XML Profile at the W3C technical reports site). And
  XML 1.0 does not allow every Unicode code point, notably U+0000.
  (The lack of U+0000 is often portrayed, typically by Microsoft
  users, as an antique carbuncle that should be removed; however,
  I see that protecting \00 is alive in the Java JNI interface;
  see the second-to-last paragraph of
http://www.dil.univ-mrs.fr/docs/j2sdk/1.5/guide/jni/spec/types.html)
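
To make this concrete, here is a minimal Java sketch (Java only
because it comes up again below; the class name and the choice of
the JDK's built-in DOM parser are mine, nothing the XML spec
mandates). It hands the parser a document declared as windows-1252
whose character data is a single raw 0x80 byte, and the parser
reports U+20AC:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EuroDemo {
    public static void main(String[] args) throws Exception {
        // A document declared as windows-1252, whose only character
        // data is the raw byte 0x80 (the CP1252 Euro sign).
        byte[] head =
            "<?xml version=\"1.0\" encoding=\"windows-1252\"?><a>"
                .getBytes("US-ASCII");
        byte[] tail = "</a>".getBytes("US-ASCII");
        byte[] doc = new byte[head.length + 1 + tail.length];
        System.arraycopy(head, 0, doc, 0, head.length);
        doc[head.length] = (byte) 0x80;  // raw CP1252 Euro byte
        System.arraycopy(tail, 0, doc, head.length + 1, tail.length);

        Document d = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(doc));
        int c = d.getDocumentElement().getTextContent().codePointAt(0);
        System.out.printf("U+%04X%n", c);  // prints U+20AC
    }
}

Swap the raw byte for the reference &#x20AC; and the output is
identical: the reference is defined against Unicode code points,
not against whatever encoding the entity happens to use.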


2) An algorithm that will typically be used to select the transcoding
function: it has bytes from a Web resource as its input, and Unicode
characters in an entity as its output. This is the auto-detection
algorithm of Appendix F (sketched in code after the list below).

  So:
  * The auto-detection algorithm is never required when your XML
  implementation has the entity available as Unicode: for example,
  an XML document held in a single Java String does not require
  (indeed, should ignore) any encoding information in the XML
  header. The same holds when a text resource is accessed over the
  Web, and all the different protocols' encoding defaults match,
  and the server and intermediate caches etc. are configured
  correctly, and your client converts the bytes correctly into a
  form your system can treat as Unicode. (The unworkability of this
  parallel chain of metadata is what makes sending XML as
  application/xml, using the XML headers and auto-detection, more
  prudent than sending the XML as text/xml and relying on the Web
  infrastructure and protocols to get it right.)

  * When people discover that the MIME infrastructure is rather
  broken (for reliable transmission of non-ASCII text across
  different locales, or when using a character encoding different
  from the locale default, notably UTF-8), the typical reaction is
  not to pull together and make sure everything is configured
  correctly, but to hack together something that seems to work.
  Invocations of the incompetence of the people who make standards
  are never particularly convincing when made by people who
  deliberately break them.

  * The nail in the coffin for external transmission of character
  encoding (rather than using auto-detection) is that standard APIs
  for writing strings to files do not provide a built-in mechanism
  for transmitting the encoding. (This would require something like
  my XText format, which basically generalizes Appendix F for use
  with almost any textual data format.)

  * XML does not constrain the actual coded character set that
  an implementation accepts externally, except that all
  implementations must accept UTF-8 and UTF-16.
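
For concreteness, a minimal sketch of the first stage of the
Appendix F detection (the UCS-4 cases are omitted, and the method
and class names are mine; this illustrates the published byte
patterns, not any standard API):

public class AutoDetect {
    // Returns either a definite encoding (from a byte-order mark)
    // or the family whose encoding declaration must then be read
    // for the exact name. Assumes at least four bytes of input.
    static String sniff(byte[] b) {
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF;
        int b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;
        // With a byte-order mark:
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        // Without one, the bytes of '<?xm' narrow the family:
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F)
            return "UTF-16BE family: read the encoding declaration";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00)
            return "UTF-16LE family: read the encoding declaration";
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D)
            return "ASCII family: read the encoding declaration";
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94)
            return "EBCDIC family: read the encoding declaration";
        return "UTF-8";  // the default when nothing matches
    }

    public static void main(String[] args) throws Exception {
        byte[] doc =
            "<?xml version=\"1.0\" encoding=\"windows-1252\"?><a/>"
                .getBytes("US-ASCII");
        System.out.println(sniff(doc));
    }
}

This is also why the encoding declaration must come at the very
start of the entity: the detection bootstraps itself from the
fixed bytes of '<?xml' before it can read anything else.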

As with many standards, XML describes its input in concrete terms,
but its output only in abstract terms. Indeed, the output of an
XML parser function was defined in such vague terms that an
ancillary standard, XML Infoset, was written to provide help for
subsequent standards.

I hope this is useful, even for aspiring trolls,
Rick Jelliffe




 
