OASIS Mailing List Archives

 



Re: Request for Comments: XML binary encoding



Big discussion on binary XML.  I don't know how some people on this
list get work done!  I listen but rarely contribute to this list
because it takes me so long to formulate postings...



I've been designing a "next-gen" XML-like language for a while now,
though all my implementation time has been soaked up by a parser
compiler I've been writing (a different project), so I still haven't
released a reference implementation of the language.  It's called
"reticular structure language" (RSL): http://www.inxar.org/rsl

Anyhow, RSL may be expressed using a binary "compiled" representation
under certain circumstances.  I initially thought this was a cool idea
because of the performance gains to be had from not having to parse
the text.  As has been discussed previously, punctuated by Tim Bray's
comments, the gains in this area are pretty limited.  It's not really
worth it except under very specific conditions.

However, RSL has the additional feature that validation is considered
the norm -- most RSL documents should be validated.  What I discovered
is that by compiling the "source" text form into a binary
representation, you can organize the information such that structural
patterns in the document can be grouped.  This pattern grouping allows
future validation (of the binary representation) to be significantly
faster, which is important for RSL (which determines validity at
run-time, not compile-time).

For an extreme example, consider an XML representation of a log file.
The log file has 10,000 entries, each of which is an element with no
attributes and a content model defined in a DTD.  Typical processing
would involve parsing the text and then 10,000 regexp challenges, one
per entry, to confirm the validity of each entry against the DTD.

A compiled representation allows one to recognize that all 10,000
entries have the same pattern.  Validation of this document would
require only a single regexp challenge to validate all structures in
the document.
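
To make the log file example concrete, here is a rough Python sketch
of the idea.  It is not the RSL binary format (which is unpublished);
the element names, the content model, and the function names are all
made up for illustration.  Each entry is reduced to a "signature" (the
sequence of its child element names), and the "compiled" form is
assumed to be just a table of distinct signatures with a count for
each.

```python
import re
from collections import Counter

# Hypothetical content model, standing in for a DTD declaration such as
# <!ELEMENT entry (timestamp, level, message)>.
CONTENT_MODEL = re.compile(r"timestamp level message")


def child_signature(children):
    """Reduce an entry to its structural pattern: its child element names."""
    return " ".join(children)


def validate_naive(entries):
    """Check every entry against the content model, one regexp challenge each."""
    checks = 0
    for children in entries:
        checks += 1
        if not CONTENT_MODEL.fullmatch(child_signature(children)):
            return False, checks
    return True, checks


def compile_entries(entries):
    """Assumed 'compiled' form: each distinct signature with its entry count."""
    return Counter(child_signature(children) for children in entries)


def validate_compiled(compiled):
    """Check each distinct signature once, however many entries share it."""
    checks = 0
    for signature in compiled:
        checks += 1
        if not CONTENT_MODEL.fullmatch(signature):
            return False, checks
    return True, checks


# 10,000 log entries that all share one structure.
entries = [("timestamp", "level", "message")] * 10_000

print(validate_naive(entries))                      # (True, 10000)
print(validate_compiled(compile_entries(entries)))  # (True, 1)
```

The grouping cost is paid once, when the text form is compiled; every
later validation of the compiled form only touches the distinct
patterns.  A document with many different entry structures would gain
correspondingly less.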

One potential drawback to the current design of this representation
(unpublished) is that it is not stream-based.  This would prohibit
SAX-like processing of the binary representation.  The point is that
there are trade-offs in whatever you do.  The simplest things are
almost always best.

Paul