John Cowan wrote:
> Robin Berjon scripsit:
>>I think what Dennis is looking for is for something to fairly compare
>>the output from XBIS et al. with that of XML properly written at the end
>>of a SAX stream. Properly written may or may not involve (depending on
>>how paranoid you want to be -- I'd go for maximal because broken XML
>>isn't XML anymore): transcoding, checking that Names are Names, blowing
>>up if they contain characters that can't be transcoded to the target
>>encoding, checking that comments and PI data don't contain -- or ?>,
>>checking that text contains no forbidden character, that namespaces are
>>properly used, that you're using the proper repertoires for the version
>>of XML you said you were using, etc.
>
> Most of these checks are representation-independent: I can barely imagine
> that anyone would bother to develop an optimized representation that
> depended on whether Names were Names, for example. (Yeah, you could
> save 1 bit by relying on the fact that there are exactly 35122
> valid Name characters in XML 1.0, but really!)
>
> In practice, an XML writer and an ORX (newly coined generic acronym
> for "optimized representation of XML") writer would be suitable for
> comparison purposes if they did the same set of checks.
If you go read what I said, you'll notice that I wasn't comparing XML
with an ORX (I like the name :), simply listing a few things that I
thought Dennis -- and certainly I -- would look for in a quality XML
serialiser. Just dumping bytes "by hand" works when you know the kind of
data you'll be dumping -- just as using regexen on XML is fine if you
really know what your input will look like -- but it's not acceptable as
a general-purpose approach.
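
To make a few of those checks concrete, here is a minimal Python sketch of the kind of guards a careful writer might run before emitting anything. The function names are hypothetical and not taken from any particular serialiser; Name validation and namespace checks are left out, and the character class follows the XML 1.0 Char production only.

```python
import re

# Anything outside the XML 1.0 Char production is forbidden in output.
_FORBIDDEN = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_text(text: str) -> None:
    """Blow up if text contains a character XML 1.0 does not allow."""
    match = _FORBIDDEN.search(text)
    if match:
        raise ValueError(f"forbidden character U+{ord(match.group()):04X}")


def check_comment(data: str) -> None:
    """Comments may not contain '--' and may not end with '-'."""
    check_text(data)
    if "--" in data or data.endswith("-"):
        raise ValueError("comment contains '--' or ends with '-'")


def check_pi_data(data: str) -> None:
    """Processing instruction data may not contain '?>'."""
    check_text(data)
    if "?>" in data:
        raise ValueError("PI data contains '?>'")


def check_encodable(value: str, encoding: str) -> None:
    """Blow up if value cannot be transcoded to the target encoding."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(f"cannot encode as {encoding}: {exc}") from exc
```

A writer built on SAX events would call these from the corresponding callbacks (for example `check_comment` from the comment handler) and refuse to write anything that fails.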
Since you bring the topic up, however, I agree that you are right for
some ORX but not all, and the serialisation method is a large part of
determining the trade-offs you may or may not wish to make. Many ORX
would use a single text encoding for instance, not requiring one to
check a few things in that area. Schema-based ones would only need to
check names when reading the schema, not when serialising. If you encode
{ns,ln} pairs instead of QNames you also skip a few checks.
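
As a rough illustration of the {ns,ln} point, the sketch below interns (namespace URI, local name) pairs into small integer ids, so that a writer emits an id instead of a prefixed QName and never has to verify that a prefix is declared and in scope. The `NameTable` class is hypothetical and does not describe XBIS or any other existing format.

```python
from typing import Dict, List, Tuple

ExpandedName = Tuple[str, str]  # (namespace URI, local name)


class NameTable:
    """Maps expanded names to small integer ids and back (hypothetical)."""

    def __init__(self) -> None:
        self._ids: Dict[ExpandedName, int] = {}
        self._names: List[ExpandedName] = []

    def intern(self, ns: str, local: str) -> int:
        """Return the id for (ns, local), assigning a new one if needed."""
        key = (ns, local)
        if key not in self._ids:
            # Any per-name validation would only need to run here, once.
            self._ids[key] = len(self._names)
            self._names.append(key)
        return self._ids[key]

    def lookup(self, name_id: int) -> ExpandedName:
        """Recover the expanded name for an id when reading."""
        return self._names[name_id]


table = NameTable()
p_id = table.intern("http://www.w3.org/1999/xhtml", "p")
a_id = table.intern("http://www.w3.org/1999/xhtml", "a")
```

Repeated occurrences of the same element name map to the same id, and no prefix ever appears in the output.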
I'm not making assumptions as to which choices are the best, or even
whether they are worth making (though empirical data would seem to
suggest they are), simply showing that there are potential targets for
optimisation worth exploring.
--
Robin Berjon