   Re: [xml-dev] Fast text output from SAX?

John Cowan wrote:
> Robin Berjon scripsit:
>>I think what Dennis is looking for is for something to fairly compare 
>>the output from XBIS et al. with that of XML properly written at the end 
>>of a SAX stream. Properly written may or may not involve (depending on 
>>how paranoid you want to be -- I'd go for maximal because broken XML 
>>isn't XML anymore): transcoding, checking that Names are Names, blowing 
>>up if they contain characters that can't be transcoded to the target 
>>encoding, checking that comments and PI data don't contain -- or ?>, 
>>checking that text contains no forbidden character, that namespaces are 
>>properly used, that you're using the proper repertoires for the version 
>>of XML you said you were using, etc.
> Most of these checks are representation-independent: I can barely imagine
> that anyone would bother to develop an optimized representation that
> depended on whether Names were Names, for example.  (Yeah, you could
> save 1 bit by relying on the fact that there are exactly 35122
> valid Name characters in XML 1.0, but really!)
> In practice, an XML writer and an ORX (newly coined generic acronym
> for "optimized representation of XML") writer would be suitable for
> comparison purposes if they did the same set of checks.

If you go back and read what I said, you'll notice that I wasn't comparing 
XML with an ORX (I like the name :), simply listing a few things that I 
thought Dennis -- and certainly I -- would look for in a quality XML 
serialiser. Just dumping bytes "by hand" works when you know the kind of 
data you'll be dumping -- just as using regexen on XML is fine if you 
really know what your input will look like -- but it's not acceptable as 
a general-purpose approach.
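
To make a few of the checks quoted above concrete, here is a minimal 
sketch in Python of what a careful writer might do for text and comments. 
The function and class names are made up for illustration, and it covers 
only the XML 1.0 character range, the "--" rule for comments, and 
transcoding to the target encoding. Name, PI and namespace checks are left 
out, so treat it as an outline rather than a complete serialiser.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_FORBIDDEN = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


class SerialisationError(ValueError):
    """Hypothetical error type: the data cannot be written as well-formed XML."""


def check_chars(data):
    """Blow up if data contains a character XML 1.0 does not allow."""
    match = _FORBIDDEN.search(data)
    if match:
        raise SerialisationError(
            "forbidden character U+%04X" % ord(match.group())
        )
    return data


def write_text(data, encoding="utf-8"):
    """Escape character data; unencodable characters become character references."""
    check_chars(data)
    escaped = (
        data.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    )
    return escaped.encode(encoding, "xmlcharrefreplace")


def write_comment(data, encoding="utf-8"):
    """Write a comment; no escaping is possible here, so bad input is an error."""
    check_chars(data)
    if "--" in data or data.endswith("-"):
        raise SerialisationError("comment may not contain '--' or end with '-'")
    try:
        return ("<!--" + data + "-->").encode(encoding)
    except UnicodeEncodeError as exc:
        raise SerialisationError("comment not encodable in " + encoding) from exc
```

The asymmetry is the point: in text a character that can't be transcoded 
can fall back to a character reference, whereas in a comment there is no 
escape mechanism, so the only honest option is to fail.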

Since you bring the topic up, however, I agree that you are right for 
some ORX but not all, and the serialisation method is a large part of 
determining the trade-offs you may or may not wish to make. Many ORX 
would use a single text encoding, for instance, so the transcoding 
checks simply go away. Schema-based ones would only need to check names 
when reading the schema, not when serialising. If you encode {ns,ln} 
pairs instead of QNames you also skip a few checks.
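
As a rough illustration of that last point, the sketch below contrasts the 
two in Python. Both `resolve_qname` and `NameTable` are hypothetical names, 
not taken from XBIS or any other format: the QName path has to verify that 
the prefix is actually in scope, while a writer that is handed (namespace, 
local name) pairs can just intern them and emit an index.

```python
def resolve_qname(qname, in_scope):
    """QName path: the prefix must be checked against in-scope declarations.

    in_scope maps prefix -> namespace URI; "" is the default namespace.
    """
    prefix, _, local = qname.rpartition(":")
    if prefix == "":
        # Unprefixed name: default namespace if declared, else no namespace.
        return (in_scope.get("", ""), local)
    if prefix not in in_scope:
        raise ValueError("undeclared namespace prefix: " + prefix)
    return (in_scope[prefix], local)


class NameTable:
    """{ns,ln} path: no prefix to validate, just hand out a stable index."""

    def __init__(self):
        self._ids = {}

    def intern(self, ns, local):
        # A new pair gets the next free index; a known pair keeps its own.
        return self._ids.setdefault((ns, local), len(self._ids))
```

Whether the local name itself is a valid name still has to be checked 
somewhere, of course; the pairs only remove the prefix bookkeeping.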

I'm not making assumptions as to which choices are the best, or even 
whether they are worth making (though empirical data would seem to 
suggest they are), simply showing that there are potential targets for 
optimisation worth exploring.

Robin Berjon

