OASIS Mailing List Archives

STAX Re: Abbreviated Tag Names



If anyone is interested, there is a very simple compression scheme called
STAX.
It could be built into XML parsers trivially, or sit as a separate layer on
top of XML.
It converts

<?xml version="1.0"?>
<x>
  <y>aaaa</y>
  <y>aaaa</y>
</x>

to

<?stax?>
<?xml version="1.0"?>
<x>
 <y>aaaa</>
 <>aaaa</>
</x>
It does not need a stack or a large reserved header space (though it could
have a stack, e.g. a fixed-size stack of the deepest 16 elements would be nice).
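
As a rough illustration of why no stack is needed, here is a minimal Python
sketch of an expander. It assumes rules inferred from the example above, not
taken from the C sources: "</>" stands for an end tag with the name of the
most recent start tag, and "<>" stands for a start tag with the name of the
most recently closed element. The function name stax_expand is made up for
this sketch, and the tokenizer assumes no CDATA sections, no DOCTYPE internal
subset, and no ">" inside attribute values.

```python
import re

# Processing instructions, comments, or any tag (including "<>" and "</>").
TOKEN = re.compile(r"<\?.*?\?>|<!--.*?-->|<[^>]*>", re.DOTALL)
NAME = re.compile(r"[^\s/>]+")


def stax_expand(text):
    """Expand assumed STAX short tags using two remembered names, no stack."""
    last_open = None    # name of the most recent start tag
    last_closed = None  # name of the most recent end tag

    def expand(match):
        nonlocal last_open, last_closed
        tag = match.group(0)
        if tag.startswith(("<?", "<!")):
            return tag  # pass PIs, comments and declarations through
        if tag == "</>":
            if last_open is None:
                raise ValueError("'</>' before any start tag")
            last_closed = last_open
            return "</%s>" % last_open
        if tag == "<>":
            if last_closed is None:
                raise ValueError("'<>' before any end tag")
            last_open = last_closed
            return "<%s>" % last_closed
        if tag.startswith("</"):
            last_closed = NAME.match(tag, 2).group(0)
            return tag
        name = NAME.match(tag, 1).group(0)
        last_open = name
        if tag.endswith("/>"):  # an empty element opens and closes at once
            last_closed = name
        return tag

    # Drop the leading <?stax?> marker line, then rewrite the tags.
    text = re.sub(r"^<\?stax\?>\r?\n?", "", text, count=1)
    return TOKEN.sub(expand, text)
```

The only state is two names, which is the "extra transition or two" a parser
would need; a full SGML parser would of course resolve "</>" against its own
open-element stack instead.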

It works best on documents with long names/data, repeated elements, and
fairly blunt nesting. Obviously it gives good compression only for those
documents, and even then it cannot compare with binary compression. But there
are three other considerations: first, it does not compress to binary but
keeps the document as text (recoverable, readable, and MIME email does not
have to base64-encode it); second, the code is trivial to implement on even a
very lightweight system (e.g., rolled into the parser, it is just an extra
transition or two); third, one can use text-processing tools (e.g. perl) to
perform the uncompression without going into a binary mode.
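
To give a feel for how little code the text-only approach takes, here is a
minimal Python sketch of the compressing direction (Python rather than perl,
but the same regex-substitution idea). The rules are my reading of the example
earlier in this message, not the logic of the C sources: an end tag is
shortened to "</>" when its name equals the most recent start tag, and an
attribute-free start tag is shortened to "<>" when its name equals the most
recently closed element. stax_compress is a hypothetical name, and the
tokenizer assumes no CDATA sections, no DOCTYPE internal subset, and no ">"
inside attribute values.

```python
import re

# Processing instructions, comments, or any ordinary tag.
TOKEN = re.compile(r"<\?.*?\?>|<!--.*?-->|<[^>]*>", re.DOTALL)
NAME = re.compile(r"[^\s/>]+")


def stax_compress(text):
    """Shorten tags under the assumed STAX rules; only two names are kept."""
    last_open = None    # name of the most recent start tag
    last_closed = None  # name of the most recent end tag

    def shorten(match):
        nonlocal last_open, last_closed
        tag = match.group(0)
        if tag.startswith(("<?", "<!")):
            return tag  # leave PIs, comments and declarations alone
        if tag.startswith("</"):
            name = NAME.match(tag, 2).group(0)
            last_closed = name
            return "</>" if name == last_open else tag
        name = NAME.match(tag, 1).group(0)
        # Only a bare "<name>" can be shortened: attributes must be kept.
        short = tag == "<%s>" % name and name == last_closed
        last_open = name
        if tag.endswith("/>"):  # an empty element opens and closes at once
            last_closed = name
        return "<>" if short else tag

    # Mark the output as STAX, as in the example in this message.
    return "<?stax?>\n" + TOKEN.sub(shorten, text)


doc = '<?xml version="1.0"?>\n<x>\n  <y>aaaa</y>\n  <y>aaaa</y>\n</x>\n'
print(stax_compress(doc))
```

Run on the two-element document from the example, this produces the short
form shown there (with the input's own indentation preserved).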

Obviously there are lots of other extensions possible, but I wanted to stay
SGML-compliant (STAX is still SGML, caveat emptor) and avoid headers (to
keep it streaming and lightweight).

Fairly old source code for a compressor and uncompressor based on this (STAX
format = Short TAgged Xml) is at
 http://www.ascc.net/~ricko/src/short-tag-compress.c
 http://www.ascc.net/~ricko/src/short-tag-uncompress.c

I think it would be good to have (something like) this kind of ultra-low-end
compression available (i.e. as a matter of compression negotiation), because
I think many servers are too busy to compress data going out (STAX can be
generated by the XML-generating API, and read directly into a SAX stream).
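
To make the "generated by the XML-generating API" point concrete, here is a
small Python sketch of a writer that emits the short form directly, so no
separate compression pass is needed on the server. StaxWriter and its methods
are hypothetical names invented for this sketch, attributes are left out for
brevity, and the shortening rules are inferred from the example earlier in
this message.

```python
from xml.sax.saxutils import escape


class StaxWriter:
    """Hypothetical element writer that emits assumed STAX short tags."""

    def __init__(self):
        self.parts = ["<?stax?>\n"]
        self.last_open = None    # name of the most recent start tag
        self.last_closed = None  # name of the most recent end tag

    def start(self, name):
        # "<>" repeats the name of the most recently closed element.
        self.parts.append("<>" if name == self.last_closed else "<%s>" % name)
        self.last_open = name

    def text(self, data):
        # Escape &, < and > so character data cannot be mistaken for tags.
        self.parts.append(escape(data))

    def end(self, name):
        # "</>" closes an element whose start tag was the most recent one.
        self.parts.append("</>" if name == self.last_open else "</%s>" % name)
        self.last_closed = name

    def getvalue(self):
        return "".join(self.parts)


w = StaxWriter()
w.start("x")
for value in ("aaaa", "aaaa"):
    w.start("y")
    w.text(value)
    w.end("y")
w.end("x")
print(w.getvalue())
```

The per-tag cost is one string comparison, which is the reason this could
suit servers that cannot afford a general-purpose compressor.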

I think it would be useful to have several different compression methods
widely deployed to suit different situations, with STAX fitting into the
extreme low end.

If anyone is interested in taking this further, I think it would be good.
And it is probably the kind of small infrastructure upgrade that could be
fun and doable for open-source and collaborative development.

Cheers
Rick Jelliffe