OASIS Mailing List Archives


Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)

Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it.  It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
> point.

Yes, and the XML spec even hints that it is wrong to accept "UTF8" as
being synonymous with "UTF-8". Section 4.3.3 of the XML Rec is pretty
clear on this point, but uses "should" language instead of "must":

   All XML processors must be able to read entities in both the UTF-8 and 
   UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification 
   do not apply to character encodings with any other labels, even if the 
   encodings or labels are very similar to UTF-8 or UTF-16.


   In an encoding declaration, the values "UTF-8", "UTF-16", [...]
   should be used for the various encodings and transformations of
   Unicode / ISO/IEC 10646 [...]


   It is recommended that character encodings registered (as charsets) 
   with the Internet Assigned Numbers Authority [IANA-CHARSETS], other 
   than those just listed, be referred to using their registered names; 
   other encodings  should use names starting with an "x-" prefix. XML 
   processors should match character encoding names in a case-insensitive 
   way and should either interpret an IANA-registered name as the 
   encoding registered at IANA for that name or treat it as unknown [...]

Given that only "UTF-8" -- not "UTF8" -- is listed in
http://www.iana.org/assignments/character-sets, "UTF8" violates the first
"should" recommendation here (it should be "x-UTF8"). Furthermore the
processor that accepts it as if it were "UTF-8" is violating the third
"should" recommendation that the non-IANA-registered encoding actually be
treated as unknown, and thus produce a fatal error.
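
To make the two behaviours concrete, here is a minimal sketch of the
choice a processor faces. Everything in it is an assumption for
illustration: the REGISTERED set is a hand-picked subset rather than the
real IANA registry, and LENIENT_ALIASES, UnknownEncoding and
resolve_encoding are hypothetical names, not part of any real parser.

```python
# Hypothetical, hand-picked subset of IANA-registered names (NOT the full registry).
REGISTERED = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}

# Hypothetical aliases that a tolerant processor might choose to accept.
LENIENT_ALIASES = {"UTF8": "UTF-8", "UTF16": "UTF-16"}


class UnknownEncoding(Exception):
    """Stands in for the fatal error a processor would report."""


def resolve_encoding(declared, strict=True):
    """Map a declared encoding name to a canonical name, or raise."""
    name = declared.upper()  # match names case-insensitively
    if name in REGISTERED:
        return name
    # Lenient mode: accept near-miss labels such as "UTF8".
    if not strict and name in LENIENT_ALIASES:
        return LENIENT_ALIASES[name]
    # Strict mode: an unregistered name is treated as unknown.
    raise UnknownEncoding(declared)
```

With strict=True, "UTF8" raises UnknownEncoding, which is the reading
where the "shoulds" are honored as if they were "musts". With
strict=False it resolves to "UTF-8", which is what the parser under
discussion does.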

My question is, must the XML parser developer honor these "shoulds" as if
they were "musts" and produce a fatal error rather than accepting "UTF8"?

The IANA registry is for character maps that may be used on the Internet.  
An XML parser is not necessarily "on the Internet", so I can see an
argument, especially in light of the fact that the EncName production is
not constrained to IANA-registered values, for the acceptance of
unregistered charset names.
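
As a rough illustration of that last point, the EncName production is
purely syntactic, so it can be checked with a regular expression. The
helper name is_enc_name is hypothetical; the pattern is my transcription
of the production from the XML 1.0 Rec.

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(value):
    """Return True if value matches the EncName production as a whole."""
    return ENC_NAME.fullmatch(value) is not None
```

Under this check "UTF-8", "UTF8" and "x-UTF8" are all syntactically
valid encoding names, so the grammar alone does not rule out "UTF8";
only the "should" recommendations do.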

Other opinions appreciated.

   - Mike
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/