Re: Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)

From: Rob Lugt <roblugt@elcel.com>

To: Eric Vermetten <EVermetten@nl.alpnet.com>, 'Tim Bray' <tbray@textuality.com>

Date: Mon, 11 Jun 2001 23:55:01 +0100

Firstly, I have to admit that the ElCel validator does not accept UTF8 as an alias for UTF-8. In my earlier post I stated that it accepts some encoding aliases. In fact, it doesn't currently accept any aliases, only the IANA names. I should look at the code before making such assertions!
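
To make the strict behaviour concrete, here is a minimal sketch in Python (not the validator's own C++ code). The function name and the list of accepted names are assumptions made for illustration, and the case-insensitive match follows the XML 1.0 advice on comparing encoding names.

```python
# Illustrative list only; this is NOT the ElCel validator's actual table.
ACCEPTED_ENCODINGS = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}


def is_accepted_encoding(declared: str) -> bool:
    """Return True only when the declared name is one of the listed names.

    Matching ignores case, but no aliases (such as "UTF8") are resolved.
    """
    return declared.upper() in ACCEPTED_ENCODINGS


print(is_accepted_encoding("UTF-8"))  # True
print(is_accepted_encoding("utf-8"))  # True
print(is_accepted_encoding("UTF8"))   # False: an alias, not a listed name
```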

I was curious where this false memory came from. Looking back over our decisions, I see that we did consider accepting aliases, mainly because Java's InputStreamReader works this way and we modelled some of our C++ I/O classes on Java. However, we decided that the XML 1.0 Recommendation favours being strict, so that is what we implemented. Tim Bray's comments have raised some doubt about whether this is the best approach.
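
For comparison, the lenient, alias-resolving style can be sketched with Python's standard codec registry, which here merely stands in for the Java behaviour described above. The helper name resolve_alias is hypothetical.

```python
import codecs


def resolve_alias(declared: str) -> str:
    """Return the codec registry's canonical name for a label.

    Raises LookupError when the registry does not know the label at all.
    """
    return codecs.lookup(declared).name


# Both spellings resolve to the same codec, so "UTF8" is silently accepted.
for label in ("UTF-8", "UTF8"):
    print(label, "->", resolve_alias(label))
```

A reader built this way never reports the non-standard label, which is exactly the permissiveness we decided against.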

Our general philosophy when writing the XML Validator was to be as strict as possible. After all, one task of the XML validator is to give as much assurance as possible that documents which pass will not be rejected by another conforming processor down the line. However, we do accept the ISO-8859-1 and US-ASCII encodings, which other processors are not guaranteed to accept, so that partially diminishes our validity guarantee.

Regards

Rob Lugt

----- Original Message -----

From: Eric Vermetten

To: 'Tim Bray'

Cc: 'xml-dev@lists.xml.org'

Sent: 11 June 2001 22:42

Subject: RE: Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)

Tim Bray wrote:
>is the word "should". In any case, I'd write software to accept
>UTF8, but I'd complain at anyone who sent me data so labeled. -Tim
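
The "accept it, but complain" policy quoted above could look roughly like the following Python sketch. The alias table, the set of registered names, and the function name are all hypothetical and exist only to illustrate the idea.

```python
import warnings

# Hypothetical tables; only the alias discussed in this thread is listed.
REGISTERED = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}
ALIASES = {"UTF8": "UTF-8"}


def normalise_encoding(declared: str) -> str:
    """Map a declared label to a registered name, warning on known aliases."""
    name = declared.upper()
    if name in REGISTERED:
        return name
    if name in ALIASES:
        # Accept the data, but complain about the label.
        warnings.warn(
            f"non-standard encoding label {declared!r}; "
            f"treating it as {ALIASES[name]}"
        )
        return ALIASES[name]
    # Anything else is rejected outright.
    raise ValueError(f"unsupported encoding: {declared!r}")


print(normalise_encoding("UTF8"))  # prints UTF-8 and emits a warning
```

A strict processor would simply drop the ALIASES branch, so that "UTF8" falls through to the error.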

Perhaps it is a bit hard to argue with a veteran such as Tim Bray, but
from what I know of the history of SGML and
XML, I wonder: when designing XML, was not one of
the main goals to make something with
fewer optional features than SGML?
XML has made a clear choice for standard support of
the Unicode/UCS character set.
Shouldn't the (most commonly used?) Unicode
encodings "UTF-8" and "UTF-16" and their labeling
be treated as one of the cornerstones of XML (parsers)?

Personally, I like it when something complains
heavily (i.e. with a fatal error). It contributes
to clarity and stability, for XML parser writers
as well as for users who switch between first
this and then that brand of XML parser.
For such issues, flexibility leads to less security, IMHO.

Furthermore, I don't quite see the difference between:
a) writing flexible software (by one's own hand, I presume)
while at the same time
b) complaining when a not-so-accurate encoding label
is received.
Perhaps this is perceived as a bit more personal?

Regards,
Eric Vermetten