OASIS Mailing List Archives

 



Re: SAX InputSource and character streams



> > When constructing a SAX InputSource from a character stream
> > (java.io.Reader), is it correct to assume that any encoding
> > declaration given in the document will be ignored?

No ... the XML 1.0 spec does however say (near the end of 4.3.3) that

    it is an error for an entity including an encoding declaration to be presented
    to the XML processor in an encoding other than that named in the declaration.

Translated to English, a SAX processor MAY report an error if the encoding
declaration is wrong, but it's not required to.  In the XML spec, "error" is
wording that accommodates variations in vendor implementations (except "fatal"
ones, which MUST be reported, and validity errors).
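
For reference, here is roughly how those categories surface through SAX's
ErrorHandler callbacks. This is only a sketch: the class name
RecordingErrorHandler and its parse() helper are made up for illustration,
and it assumes the JAXP parser bundled with the JDK. Which conditions a
parser routes to warning() or error() is up to the implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts how often each ErrorHandler callback fires.
final class RecordingErrorHandler extends DefaultHandler {
    int warnings;
    int errors;
    int fatals;

    @Override
    public void warning(SAXParseException e) {
        warnings++; // purely advisory; the parse continues
    }

    @Override
    public void error(SAXParseException e) {
        errors++; // recoverable "error" in the spec's sense; reporting is optional
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        fatals++; // well-formedness violations must be reported
        throw e;  // stop the parse
    }

    // Parses the given text from a character stream and returns the counts.
    static RecordingErrorHandler parse(String xml) {
        RecordingErrorHandler handler = new RecordingErrorHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // A fatal error ends the parse; it has already been counted above.
        }
        return handler;
    }
}
```

Calling RecordingErrorHandler.parse("<a><b></a>") should leave fatals at 1,
since a mismatched end tag is a well-formedness violation.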


> Obviously it is an implementational thing, but I would argue that it makes
> no sense for a SAX parser to try to validate the encoding string contained
> within a character stream (java.io.Reader).

Java doesn't really make it easy to figure out what the input encoding was,
unless you just happen to be using an InputStreamReader so you can use
getEncoding() ... and then can translate from those "Java encoding names"
(not really documented last I checked) back to the real world.  So there's
no "100% reliable" check for whether the encoding name matches.
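
A best-effort version of that check might look like the sketch below. The
class and method names (EncodingCheck, matchesDeclared) are invented for
this example, and it assumes that java.nio.charset.Charset.forName()
accepts the historical name returned by getEncoding() as an alias, which
holds for the common encodings but is not guaranteed for every one. It
returns null whenever no comparison is possible, which is exactly the
"not 100% reliable" case.

```java
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of SAX or any parser.
final class EncodingCheck {
    /**
     * Returns TRUE if the reader's encoding names the same charset as the
     * declared one, FALSE if they differ, and null if that can't be determined.
     */
    static Boolean matchesDeclared(Reader reader, String declaredEncoding) {
        if (!(reader instanceof InputStreamReader)) {
            return null; // an arbitrary Reader doesn't expose its encoding
        }
        String javaName = ((InputStreamReader) reader).getEncoding();
        if (javaName == null) {
            return null; // getEncoding() returns null once the stream is closed
        }
        try {
            // Charset.equals() compares canonical names, so aliases and
            // differences in case on either side are handled here.
            return Charset.forName(javaName).equals(Charset.forName(declaredEncoding));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name on either side
        }
    }
}
```

A parser could call something like this after reading the XML declaration
and issue a warning on FALSE, while staying silent on null.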

My conclusion is that it's worth a warning if things don't check out, since
it's easy enough to create a Reader that's using the wrong encoding.  It's
just bits ... and so long as '<', '>', '&' and a few other characters get read
correctly, the XML might actually parse ... but give garbage because the
non-markup characters were misinterpreted.  Consider the different
ISO-8859 encodings --- I could easily see that happening.
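
To make the ISO-8859 case concrete, the sketch below encodes a small
document as ISO-8859-2 and then hands the parser a Reader that decodes
those bytes as ISO-8859-1. The class name WrongReaderDemo and the sample
document are made up for illustration, and it assumes the JRE supports
ISO-8859-2 (typical, though not required by the platform). Whether the
parser accepts the input or reports an error is implementation-dependent,
so the method returns null if the parse fails.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
final class WrongReaderDemo {
    // A Polish city name, written with escapes so this source file's own
    // encoding doesn't matter. Two of its letters are outside ISO-8859-1.
    static final String INTENDED = "\u0141\u00f3d\u017a";

    /** Returns the character data the parser saw, or null if it rejected the input. */
    static String parseThroughWrongReader() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?>"
                + "<city>" + INTENDED + "</city>";
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));

        // The mistake: the bytes are ISO-8859-2, but the Reader says ISO-8859-1.
        // All the markup characters are ASCII, so they decode identically.
        Reader wrong = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(wrong), new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            return null; // this parser chose to report the problem
        }
        return text.toString();
    }
}
```

If the parser accepts the input, the returned text is the ISO-8859-1
reading of the same bytes, not the intended name, and nothing in the parse
signals that anything went wrong.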

It's pretty clear that the XML spec allows that to be treated as a fatal
error, so I'd never assume that an encoding would be ignored.

- Dave