RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

On Sat, 2012-12-29 at 14:38 +0000, Costello, Roger L. wrote:

> 1. If encoding="..." does not match the actual encoding of the
> characters in the XML document, then the XML parser should raise an
> error.

This isn't actually true. It only happens if your document contains
sequences of bytes ("octets") that are illegal in the character encoding
you claim to be using.
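
A small sketch of that distinction, assuming Python's standard
xml.etree.ElementTree parser (the <word> element and its content are
made up for illustration). The first document contains a byte sequence
that is illegal in the declared encoding and is rejected; the second is
just as mislabelled, but every byte is legal, so it parses without
complaint and returns the wrong characters.

```python
import xml.etree.ElementTree as ET

# "é" stored as the single byte 0xE9 (ISO-8859-1) but declared as UTF-8.
# 0xE9 followed by "<" is not a legal UTF-8 sequence, so the parser objects.
illegal_bytes = b'<?xml version="1.0" encoding="UTF-8"?><word>caf\xe9</word>'
try:
    ET.fromstring(illegal_bytes)
    rejected = False
except ET.ParseError:
    rejected = True

# "é" stored as UTF-8 (0xC3 0xA9) but declared as ISO-8859-1.
# Every byte is legal in ISO-8859-1, so nothing is reported; the parser
# simply hands back two wrong characters in place of one right one.
legal_bytes = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xc3\xa9</word>'
text = ET.fromstring(legal_bytes).text

print("first document rejected:", rejected)
print("second document text matches 'café':", text == "café")
```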

If byte-value 65 represents "A" (Unicode code point 65) in encoding A,
and "C" (Unicode code point 67) in encoding C (and byte-value 67 the
other way round), then saying your document is in encoding "A" when it's
really in encoding "C" will mean everything is still well-formed. It
will work OK in most cases if encoding "A" is your system encoding. But
if you send the document to someone else and they convert it to UTF-8,
say, or if you start working with the in-memory representation, you'll
find those <HAPPYSOCK> elements might be reported by the parser as
<HCPPYSOAK> elements instead. It's not something the computer can
detect.

To experiment with this, make a document in ISO 8859-7 containing some
Greek characters but change the encoding declaration to be ISO-8859-8
instead, and see how the Greek characters turn into Hebrew ones!
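
The same experiment can be approximated without an XML parser, using
only Python's built-in codecs. This is a rough sketch: it reinterprets
the raw bytes directly, which is what a parser trusting the wrong
declaration would effectively do. The sample string is arbitrary.

```python
greek = "αβγ"

# In ISO 8859-7 these three letters are the bytes E1 E2 E3.
raw = greek.encode("iso8859_7")

# The same bytes are all legal in ISO 8859-8, where they mean Hebrew letters.
as_hebrew = raw.decode("iso8859_8")

print(raw.hex())                               # e1e2e3
print([f"U+{ord(ch):04X}" for ch in greek])      # Greek code points
print([f"U+{ord(ch):04X}" for ch in as_hebrew])  # Hebrew code points
```

No exception is raised at any point, since the bytes are valid in both
encodings.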

> Is the solution to the problems to simply eliminate the need for
> conversions by mandating that every application, every IDE, every text
> editor, and every system worldwide adopt one character encoding,
> UTF-8? Is that a realistic solution? If so, what is the timeframe in
> which it could be achieved?

The trouble with that is that UTF-8 makes larger files than UTF-16 for
great numbers of people who use ideographic scripts such as Chinese:
most of those characters take three bytes in UTF-8 but two in UTF-16.
The real choice for them is between UTF-16 and UTF-32.
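
A quick way to see the size difference, sketched with Python's built-in
codecs. The sample strings are arbitrary, and the "-le" codec variants
are used only so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` under each encoding."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

chinese = "中文编码"   # four ideographs
english = "encoding"   # eight ASCII letters

chinese_sizes = encoded_sizes(chinese)
english_sizes = encoded_sizes(english)

print(chinese_sizes)  # UTF-16 is the smallest here
print(english_sizes)  # UTF-8 is the smallest here
```

For mostly-ASCII text the ranking reverses, which is part of why no
single encoding wins for everyone.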

UTF-16 is also somewhat harder to process in some older programming
languages, most notably C and C++, where a zero-valued byte (NUL, as
opposed to the zero-valued machine address, NULL) is used as a string
terminator.
There isn't a single solution today that's best for everyone, as far as
I know.


Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
