OASIS Mailing List Archives

 



Re: How to specify a Processing Instruction? (better: how to control encoding on saving)



From: "Chris Bayes" <chris@bayes.co.uk>
 
> P.s. your original UPS document is invalid. It is declared as 
> <?xml version="1.0"?> and yet contains "UPS ONLINER TOOLS ACCESS USER
> TERMS".
> R is invalid in a utf-8 document.

I don't understand this comment. The 8-bit code used for LATIN CAPITAL LETTER R in ASCII and ISO 8859-1 is the same code in UTF-8.
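
A quick way to check this, as a minimal Python sketch (the language and the variable names are illustrative choices, not part of the original thread): the letter R encodes to the same single byte in all three encodings. For contrast, the sketch also encodes one character outside ASCII, U+00AE, chosen purely as an example of a character whose bytes do differ between ISO 8859-1 and UTF-8.

```python
# LATIN CAPITAL LETTER R is U+0052; each of these encodings stores it as byte 0x52.
encodings = ("ascii", "iso-8859-1", "utf-8")
encoded_r = {name: "R".encode(name) for name in encodings}

for name, data in encoded_r.items():
    print(f"{name:>10}: {data.hex()}")

# Contrast (illustrative only): a character above U+007F is one byte in
# ISO 8859-1 but a two-byte sequence in UTF-8.
latin1_bytes = "\u00ae".encode("iso-8859-1")
utf8_bytes = "\u00ae".encode("utf-8")
print(latin1_bytes.hex(), utf8_bytes.hex())
```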

But it is good to understand how things work.

1) An XML parseable text entity can be encoded in almost any encoding
(that has an IANA registered charset.)  The encoding declaration lets you
say what encoding your entity is in. (It may be stripped by a parser: you certainly
cannot rely on the data coming out in the same encoding when it is re-serialized from the DOM: that is a matter of how the software has been designed.)

2) An XML parser operates in terms of Unicode characters, so it will convert
from the external encoding into some kind of Unicode. This includes treating
numeric character references as the corresponding Unicode character number.

3) Inside any software, the Unicode characters will be represented in some way.
This is typically using 8-bit variable-length encodings (e.g. UTF-8) or 16-bit
variable-length encodings (e.g. UTF-16, loosely a.k.a. "Unicode" proper or UCS-2, no flames from codeheads please).  Almost all characters in the Unicode Character Set are < 2^16 at the moment, so to most intents and purposes you can take it that a Unicode character is 16 bits. (This assumption will change, but that will not affect many people.)

4) DOM is defined in terms of UTF-16.  Apparently COM is too. That is, strings
are indexed by 16-bit storage units rather than by whole characters.

5) XPath, however, is defined in terms of full characters. For characters < 2^16 in Unicode, this is the same as the DOM's storage index.

6) If a DOM serializes an XML header which still has the original encoding parameter, but actually outputs the document in a different encoding (e.g. its default), then
the document is likely to fail to parse as soon as any codes appear that are not valid in the declared encoding.

7) The default encoding for XML is UTF-8 (or UTF-16, if there is a special
Byte Order Mark at the beginning of the XML entity). The default encoding
for HTML is ISO 8859-1.

8) The idea is that the only way systems that have multiple encodings and different
defaults can work together is
   a) by making data carry around explicit labels so that there is no guesswork, and
   b) we all move to UTF-* sooner or later, since that is what modern systems use internally anyway (Java, Microsoft)
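
Several of the points above can be tried out directly. The following is only a rough sketch in Python using the standard library's xml.etree.ElementTree (not the DOM or the MSXML software discussed in this thread); the sample documents and variable names are made up for illustration. It shows the parser turning ISO 8859-1 bytes and a numeric character reference into the same Unicode character (point 2), re-serialization into a different encoding (point 1), the difference between counting characters and counting 16-bit storage units (points 4 and 5), and a parse failure when the header claims an encoding the bytes do not match (point 6).

```python
import xml.etree.ElementTree as ET

# Point 2: the parser decodes the declared external encoding and resolves
# numeric character references, so both spellings become U+00E9.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9 &#233;</a>'
root = ET.fromstring(latin1_doc)
text = root.text

# Point 1: re-serializing need not preserve the original encoding.
# Here the same data is written back out as UTF-8 bytes.
utf8_out = ET.tostring(root, encoding="utf-8")

# Points 4 and 5: a character above U+FFFF counts as one character,
# but occupies two 16-bit storage units in UTF-16.
astral = "\U00010400"
char_count = len(astral)
utf16_units = len(astral.encode("utf-16-le")) // 2

# Point 6: the header claims UTF-8, but the content byte 0xE9 is
# ISO 8859-1, which is not a valid UTF-8 sequence here.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe9</a>'
try:
    ET.fromstring(mislabeled)
    mislabeled_parsed = True
except ET.ParseError:
    mislabeled_parsed = False

print(repr(text), utf8_out, char_count, utf16_units, mislabeled_parsed)
```

Other toolkits will differ in detail, but the same distinction between the external encoding, the internal Unicode representation, and the encoding label on output applies.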

Cheers
Rick Jelliffe