Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)
- From: Mike Brown <mike@skew.org>
- To: xml-dev@lists.xml.org
- Date: Mon, 11 Jun 2001 11:12:28 -0600 (MDT)
Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it. It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
> point.
Yes, and the XML spec even hints that it is wrong to accept "UTF8" as a
synonym for "UTF-8". Section 4.3.3 of the XML Rec is fairly clear on this
point but, unfortunately, uses "should" language rather than "must":
    All XML processors must be able to read entities in both the UTF-8 and
    UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification
    do not apply to character encodings with any other labels, even if the
    encodings or labels are very similar to UTF-8 or UTF-16.
    [...]
    In an encoding declaration, the values "UTF-8", "UTF-16", [...]
    should be used for the various encodings and transformations of
    Unicode / ISO/IEC 10646 [...]
    [...]
    It is recommended that character encodings registered (as charsets)
    with the Internet Assigned Numbers Authority [IANA-CHARSETS], other
    than those just listed, be referred to using their registered names;
    other encodings should use names starting with an "x-" prefix. XML
    processors should match character encoding names in a case-insensitive
    way and should either interpret an IANA-registered name as the
    encoding registered at IANA for that name or treat it as unknown [...]
Given that only "UTF-8" -- not "UTF8" -- is listed in
http://www.iana.org/assignments/character-sets, "UTF8" violates the first
"should" recommendation here (it should be "x-UTF8"). Furthermore, a
processor that accepts it as if it were "UTF-8" is violating the third
"should" recommendation, that a non-IANA-registered encoding name be
treated as unknown and thus produce a fatal error.
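To make the two behaviours concrete, here is a rough Python sketch of a
strict versus a lenient lookup. Everything in it is hypothetical: the
function name resolve_label, the tiny REGISTERED subset (not the full IANA
list), and the LENIENT_ALIASES table are mine for illustration, not taken
from any real parser.

```python
# Hypothetical sketch: strict vs. lenient handling of an encoding label.
# REGISTERED is a tiny illustrative subset of IANA names/aliases, lower-cased.
REGISTERED = {"utf-8", "utf-16", "us-ascii", "iso-8859-1"}

# Unregistered spellings that a lenient parser might choose to tolerate.
LENIENT_ALIASES = {"utf8": "UTF-8"}


def resolve_label(label, strict=True):
    """Return a canonical encoding name, or raise ValueError (fatal error)."""
    name = label.lower()  # names are matched case-insensitively
    if name in REGISTERED:
        return name.upper()
    if not strict and name in LENIENT_ALIASES:
        return LENIENT_ALIASES[name]
    # Unknown label: a processor that cannot handle it reports a fatal error.
    raise ValueError("unsupported encoding label: %r" % label)
```

With strict=True, resolve_label("UTF8") raises, which is the behaviour the
"shoulds" seem to ask for; with strict=False it quietly returns "UTF-8",
which is what the parser under discussion does.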
My question is, must the XML parser developer honor these "shoulds" as if
they were "musts" and produce a fatal error rather than accepting "UTF8"?
The IANA registry is for character maps that may be used on the Internet.
An XML parser is not necessarily "on the Internet", so I can see an
argument for accepting unregistered charset names, especially since the
EncName production is not constrained to IANA-registered values.
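For reference, the EncName production in XML 1.0 is
[A-Za-z] ([A-Za-z0-9._] | '-')*, so it constrains only the syntax of the
label. A minimal sketch of that check in Python (the helper name
is_enc_name is mine, for illustration only):

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   (XML 1.0, production 81)
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(label):
    """True if the label is syntactically valid as an EncName."""
    return ENC_NAME.fullmatch(label) is not None
```

Under this check "UTF8", "UTF-8" and "x-UTF8" are all well-formed; whether
a label is registered is a separate question that the grammar does not
address.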
Other opinions appreciated.
- Mike
_____________________________________________________________________________
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA | personal: http://hyperreal.org/~mike/