RE: Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)
- From: Steve Rowe <sarowe@textwise.com>
- To: xml-dev@lists.xml.org
- Date: Mon, 11 Jun 2001 17:17:50 -0400
A potentially useful data point: the open source ICU project
(International Components for Unicode) [1], which provides a large
character encoding conversion API in C/C++, has the following policy
for matching names of character encodings (from the distribution file
icu/data/convrtrs.txt):
    Name matching is case-insensitive. Also, dashes '-',
    underscores '_' and spaces ' ' are ignored in names
    (thus cs-iso-latin-1 and csisolatin1 are the same).
Under this regime, "UTF-8" = "utf-8" = "utf_8" = "UTF8" = ...
It seems to me that it is exactly these variations that humans are
likely to produce; given the human-legible/producible aspect of the
design of XML, it's nice to see an algorithmically simple and
unambiguous method to accept authors' expressed intent.
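
To illustrate how little code the quoted rule needs, here is a minimal
Python sketch. It is not ICU's code: the function names
normalize_encoding_name and same_encoding_name are hypothetical, and it
implements only what the convrtrs.txt text above states (case folding,
plus ignoring '-', '_' and ' ').

```python
def normalize_encoding_name(name: str) -> str:
    """Return a comparison key for an encoding name.

    Sketch of the quoted policy only: lowercase the name and drop
    dashes, underscores and spaces. Hypothetical helper, not ICU API.
    """
    ignored = "-_ "
    return "".join(ch for ch in name.lower() if ch not in ignored)


def same_encoding_name(a: str, b: str) -> bool:
    """True if two encoding names match under the quoted policy."""
    return normalize_encoding_name(a) == normalize_encoding_name(b)


if __name__ == "__main__":
    for label in ("UTF-8", "utf-8", "utf_8", "UTF8"):
        print(label, same_encoding_name(label, "UTF-8"))
```

Under this sketch "UTF-8" and "UTF-16" still compare as different names,
since only case and the three ignored characters are folded away.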
Steve Rowe
MNIS-TextWise Labs
[1] http://oss.software.ibm.com/developerworks/opensource/icu/
Mike Brown wrote:
> Richard Tobin wrote:
> > I don't think it's wrong for you to accept "UTF8", but I
> > think it's wrong that the test uses it. It's not required
> > that a parser recognize it, and one that doesn't will
> > reject the document at that point.
>
> Yes, and the XML spec even hints that it is wrong to accept
> "UTF8" as being synonymous with "UTF-8". Section 4.3.3 of
> the XML Rec is pretty clear on this point, but uses "should"
> language instead of "must", unfortunately:
>
>     All XML processors must be able to read entities in both
>     the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
>     "UTF-16" in this specification do not apply to character
>     encodings with any other labels, even if the encodings or
>     labels are very similar to UTF-8 or UTF-16.
>
> [...]