OASIS Mailing List Archives

Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)



Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it.  It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
> point.

Yes, and the XML spec even hints that it is wrong to accept "UTF8" as
being synonymous with "UTF-8". Section 4.3.3 of the XML Rec is pretty 
clear on this point, but uses "should" language instead of "must", 
unfortunately:

   All XML processors must be able to read entities in both the UTF-8 and 
   UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification 
   do not apply to character encodings with any other labels, even if the 
   encodings or labels are very similar to UTF-8 or UTF-16.

   [...]

   In an encoding declaration, the values "UTF-8", "UTF-16", [...]
   should be used for the various encodings and transformations of
   Unicode / ISO/IEC 10646 [...]

   [...]

   It is recommended that character encodings registered (as charsets) 
   with the Internet Assigned Numbers Authority [IANA-CHARSETS], other 
   than those just listed, be referred to using their registered names; 
   other encodings  should use names starting with an "x-" prefix. XML 
   processors should match character encoding names in a case-insensitive 
   way and should either interpret an IANA-registered name as the 
   encoding registered at IANA for that name or treat it as unknown [...]

Given that only "UTF-8" -- not "UTF8" -- is listed in
http://www.iana.org/assignments/character-sets, "UTF8" goes against the
recommendation that unregistered encodings use names starting with "x-"
(it should be "x-UTF8"). Furthermore, a processor that accepts it as if it
were "UTF-8" is arguably going against the last "should" quoted above:
since "UTF8" is not the IANA-registered name, it would be treated as
unknown, and thus produce a fatal error.
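
To make the two readings concrete, here is a minimal Python sketch. It is
illustrative only, not taken from any real parser: the names REGISTERED,
LENIENT_ALIASES and resolve_encoding are hypothetical, and the table holds
a tiny hand-picked subset of the IANA registry rather than the full list.

```python
from typing import Optional

# Hypothetical, deliberately tiny subset of IANA-registered names,
# keyed by lowercase form so that matching is case-insensitive.
REGISTERED = {
    "utf-8": "UTF-8",
    "utf-16": "UTF-16",
    "iso-8859-1": "ISO-8859-1",
}

# Unregistered spellings that a lenient processor might choose to accept.
LENIENT_ALIASES = {"utf8": "UTF-8"}


def resolve_encoding(declared: str, lenient: bool = False) -> Optional[str]:
    """Map a declared encoding name to a canonical name, or None if unknown."""
    key = declared.lower()
    if key in REGISTERED:
        return REGISTERED[key]
    if lenient and key in LENIENT_ALIASES:
        return LENIENT_ALIASES[key]
    # Strict reading of the "shoulds": anything else is unknown, and the
    # caller would report a fatal error.
    return None
```

With lenient=False, resolve_encoding("UTF8") returns None; with
lenient=True it returns "UTF-8". The question below is which of those two
behaviours a conforming processor is obliged to have.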

My question is, must the XML parser developer honor these "shoulds" as if
they were "musts" and produce a fatal error rather than accepting "UTF8"?

The IANA registry is for character sets that may be used on the Internet.
An XML parser is not necessarily "on the Internet", and the EncName
production is not constrained to IANA-registered values, so I can see an
argument for accepting unregistered charset names.
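
For reference, the EncName production in XML 1.0 is
[A-Za-z] ([A-Za-z0-9._] | '-')*, which is purely syntactic. A rough Python
sketch of that check follows; the helper name is_enc_name is hypothetical,
chosen only for this illustration.

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
# The trailing '-' inside the character class is a literal hyphen.
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(name: str) -> bool:
    """Return True if the whole string matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None
```

By this check "UTF-8", "UTF8" and "x-UTF8" are all syntactically valid
encoding names, so the grammar by itself does not rule out "UTF8"; only
the "should" recommendations do.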

Other opinions appreciated.

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/