Re: SAX InputSource and character streams

David Megginson wrote:-

> Mike Brown writes:
>  > My question was, when supplying a character stream to the parser, is it
>  > reasonable to expect that the parser will not complain if the encoding
>  > declaration says the encoding is (was) something the parser does not
>  > support?
>  > XML seems to assume that every parsed entity that a processor
>  > consists of encoded characters (bytes, essentially), whereas in
>  > we obviously have parsers that accept the entities as characters.
>
> Hmm -- I can see two reasonable arguments here:
>
> 1. With a Java character stream, there's no way to know what the
> original encoding might have been, so the encoding declaration is
> moot.
>
> 2. A Java character stream is presented (more-or-less) in UTF-16, so
> the encoding declaration, if present, should agree with that.

I don't agree with this suggestion for the following reasons:-

1) What's the point?  The XML processor has no need to do anything with the
encoding declaration since it already has a character stream.  A SAX
processor doesn't even have to report the encoding to the application.

2) Perhaps most importantly, this undermines the responsibility on the
application to provide a valid character stream.  I would argue that by
passing a character stream, the application undertakes to perform all the
encoding-related tasks of an XML processor and thereby relieves the SAX
processor of that task.

3) The process that created the character stream (possibly by decoding a
byte stream) might have to search for and replace the encoding declaration.
This is both undesirable additional work and requires an understanding of
XML syntax which, IMHO, is not appropriate for a (possibly generic) decoding
process.

4) You focus on Java.  This is understandable given the origins of SAX but,
thanks to the simplicity and elegance of the SAX interface, it has broken
free and is now implemented in a number of languages.  On some platforms C++
is not constrained by 16-bit characters and can present the application with
full UCS-4 characters.
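
To make points 1 to 3 concrete, here is a minimal Java sketch, assuming a
JAXP SAX parser is on the classpath. The class name CharStreamDemo and the
method parseDecoded are hypothetical, invented for illustration only. The
application decodes the bytes itself and hands the parser a character stream
through InputSource, so the encoding declaration left over in the text no
longer describes anything the parser has to decode.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of SAX or JAXP.
class CharStreamDemo {

    // Decodes the bytes in the application (not in the parser), then parses
    // the resulting character stream and returns the text content seen.
    static String parseDecoded(byte[] bytes, Charset actual) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The application takes on the decoding work here.
        Reader chars = new InputStreamReader(new ByteArrayInputStream(bytes), actual);
        // InputSource(Reader) supplies a character stream, not a byte stream.
        InputSource source = new InputSource(chars);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The declaration says ISO-8859-1, but the bytes are really UTF-8 and
        // contain U+20AC (euro sign), which ISO-8859-1 cannot represent.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<price>\u20AC10</price>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String text = parseDecoded(bytes, StandardCharsets.UTF_8);
        // Print the first character as a hex code point to avoid console
        // encoding issues.
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

The generic decoding step (InputStreamReader) knows nothing about XML and
leaves the stale declaration untouched, which is the situation described in
point 3. Had the same bytes been passed as a byte stream instead, a parser
trusting the declaration would misread the three UTF-8 bytes of the euro sign
as three separate ISO-8859-1 characters.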

Rob Lugt
ElCel Technology