Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)
- From: Mike Brown <firstname.lastname@example.org>
- To: email@example.com
- Date: Mon, 11 Jun 2001 11:12:28 -0600 (MDT)
Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it. It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
Yes, and the XML spec even hints that it is wrong to accept "UTF8" as
being synonymous with "UTF-8". Section 4.3.3 of the XML Rec is pretty
clear on this point, but uses "should" language instead of "must":
    All XML processors must be able to read entities in both the UTF-8 and
    UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification
    do not apply to character encodings with any other labels, even if the
    encodings or labels are very similar to UTF-8 or UTF-16.
    In an encoding declaration, the values "UTF-8", "UTF-16", [...]
    should be used for the various encodings and transformations of
    Unicode / ISO/IEC 10646 [...]
    It is recommended that character encodings registered (as charsets)
    with the Internet Assigned Numbers Authority [IANA-CHARSETS], other
    than those just listed, be referred to using their registered names;
    other encodings should use names starting with an "x-" prefix. XML
    processors should match character encoding names in a case-insensitive
    way and should either interpret an IANA-registered name as the
    encoding registered at IANA for that name or treat it as unknown [...]
Given that only "UTF-8" -- not "UTF8" -- is listed in
http://www.iana.org/assignments/character-sets, the label "UTF8" goes
against the first "should" recommendation here (it should be "x-UTF8").
Furthermore, a processor that accepts it as if it were "UTF-8" goes
against the third "should" recommendation, that a name not registered
with IANA be treated as unknown, and thus produce a fatal error.
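To make the two behaviours concrete, here is a rough Python sketch of a name lookup that can run either strictly (unregistered names are unknown, hence fatal) or leniently (a few near-miss spellings are tolerated). Everything in it is an assumption for illustration: the REGISTERED set is a tiny hand-picked subset of the IANA list, and resolve_encoding, LENIENT_ALIASES and UnknownEncodingError are made-up names, not part of any real parser.

```python
# Sketch only: the table below is a tiny hand-picked subset of the IANA
# registry, and every name in this block is made up for illustration.
REGISTERED = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}
_BY_UPPER = {name.upper(): name for name in REGISTERED}

# Unregistered spellings that a lenient processor might choose to tolerate.
LENIENT_ALIASES = {"UTF8": "UTF-8"}


class UnknownEncodingError(Exception):
    """Stands in for the fatal error reported for an unknown encoding."""


def resolve_encoding(declared, strict=True):
    """Return the canonical name for an encoding declaration value."""
    key = declared.upper()  # names are matched case-insensitively
    if key in _BY_UPPER:
        return _BY_UPPER[key]
    if not strict and key in LENIENT_ALIASES:
        return LENIENT_ALIASES[key]
    # Anything else is treated as unknown.
    raise UnknownEncodingError(declared)
```

With these definitions, resolve_encoding("utf-8") returns "UTF-8", resolve_encoding("UTF8") raises UnknownEncodingError, and resolve_encoding("UTF8", strict=False) returns "UTF-8". The strict flag is just the "should" versus "must" question expressed as a parameter.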
My question is, must the XML parser developer honor these "shoulds" as if
they were "musts" and produce a fatal error rather than accepting "UTF8"?
The IANA registry is for character maps that may be used on the Internet.
An XML parser is not necessarily "on the Internet", so I can see an
argument, especially in light of the fact that the EncName production is
not constrained to IANA-registered values, for the acceptance of
unregistered charset names.
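For reference, the EncName production in the Rec is, as far as I recall, [A-Za-z] ([A-Za-z0-9._] | '-')*, which is purely syntactic. A small Python sketch of that check follows; the function name is_enc_name is hypothetical, and the regular expression is my transcription of the production rather than anything taken from a parser.

```python
import re

# Transcription of the EncName production: a Latin letter followed by any
# number of Latin letters, digits, periods, underscores, or hyphens.
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(value):
    """Return True if value is syntactically a valid EncName."""
    return ENC_NAME.fullmatch(value) is not None
```

Under this check "UTF-8", "UTF8" and "x-UTF8" are all well-formed names, while a value starting with a digit is not. Well-formedness alone therefore cannot decide whether "UTF8" should be accepted; that is left to the "should" recommendations.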
Other opinions appreciated.
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA | personal: http://hyperreal.org/~mike/