OASIS Mailing List Archives

 



RE: Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)



A potentially useful data point: the open source ICU project
(International Components for Unicode) [1], which provides a large
character encoding conversion API in C/C++, has the following policy
for matching names of character encodings (from the distribution file
icu/data/convrtrs.txt):

   Name matching is case-insensitive. Also, dashes '-',
   underscores '_' and spaces ' ' are ignored in names
   (thus cs-iso-latin-1 and csisolatin1 are the same).

Under this regime, "UTF-8" = "utf-8" = "utf_8" = "UTF8" = ...
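
For illustration, here is a minimal sketch of that matching rule in
Python. It implements only the three-sentence policy quoted above; the
function names are my own invention, not ICU's API, and ICU's real
converter lookup may do more than this.

```python
def normalize_encoding_name(name: str) -> str:
    """Reduce an encoding name to a comparison key (hypothetical helper).

    Follows the quoted rule: matching is case-insensitive, and
    dashes, underscores and spaces are ignored.
    """
    return "".join(ch for ch in name.lower() if ch not in "-_ ")


def same_encoding_name(a: str, b: str) -> bool:
    """Return True if two names match under the quoted rule."""
    return normalize_encoding_name(a) == normalize_encoding_name(b)
```

With this, same_encoding_name("UTF-8", "UTF8") is true, while "UTF-8"
and "UTF-16" remain distinct, since only case and the three separator
characters are discarded.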

It seems to me that it is exactly these variations that humans are
likely to produce; given the human-legible/producible aspect of the
design of XML, it's nice to see an algorithmically simple and
unambiguous method to accept authors' expressed intent.

Steve Rowe
MNIS-TextWise Labs

[1] http://oss.software.ibm.com/developerworks/opensource/icu/

Mike Brown wrote:
> Richard Tobin wrote:
> > I don't think it's wrong for you to accept "UTF8", but I
> > think it's wrong that the test uses it.  It's not required
> > that a parser recognize it, and one that doesn't will
> > reject the document at that point.
>
> Yes, and the XML spec even hints that it is wrong to accept
> "UTF8" as being synonymous with "UTF-8". Section 4.3.3 of
> the XML Rec is pretty clear on this point, but uses "should"
> language instead of "must", unfortunately:
>
>    All XML processors must be able to read entities in both
>    the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
>    "UTF-16" in this specification do not apply to character
>    encodings with any other labels, even if the encodings or
>    labels are very similar to UTF-8 or UTF-16.
>
>    [...]