David Megginson wrote:
>
> You're probably right. Java objects have an awful lot of built-in
> memory overhead just for the java.lang.Object base class, and if you
> naively create a separate object for every element, attribute,
> attribute value, text chunk, and so on, you end up with a very large
> in-memory data structure. Memory aside, Java object creation and
> deletion is also very slow (that's why it takes so long to load an XML
> document into a DOM).
>
Two proposed ways out.
1 - Would the following qualify as a non-naive solution (one that deviates
considerably from DOM)?
- use classes instead of an element name field. Elements are then
instances of those classes, and the JVM is optimized for dealing with classes
- throw away pointers to parents and siblings
The first measure saves a lot of redundant information (it is something
like an abuse of the "flyweight" pattern), at the cost of first
converting your WXS or Relax NG schema to classes
(a one-time operation). I have a tool that does this for DTDs; it ships
with the Scala distribution. In general, data binding would carry you a long
way, although I cannot say that Sun's JAXB worked out for me.
This will get you savings, but not on an order-of-magnitude scale.
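To make the first proposal concrete, here is a minimal sketch in Java of what classes generated from a DTD such as <!ELEMENT book (title, author+)> might look like. The DTD, the class names, and the name() method are all hypothetical; this is not the output of any particular tool.

```java
import java.util.List;

// Hypothetical base class. The element name is stored once per class
// (as a method), not once per instance as in a generic DOM node.
abstract class Element {
    abstract String name();
}

final class Title extends Element {
    final String text;
    Title(String text) { this.text = text; }
    @Override String name() { return "title"; }
}

final class Author extends Element {
    final String text;
    Author(String text) { this.text = text; }
    @Override String name() { return "author"; }
}

final class Book extends Element {
    // Children are typed fields taken from the content model.
    // There are deliberately no parent or sibling pointers.
    final Title title;
    final List<Author> authors;

    Book(Title title, List<Author> authors) {
        this.title = title;
        this.authors = authors;
    }
    @Override String name() { return "book"; }
}
```

Each instance carries only its content and its references to children, so the per-node cost is the object header plus those fields. Navigating upwards is no longer possible without keeping a path during traversal, which is the price of dropping the parent pointers.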
2 - To compress XML data (with all querying operations still
possible, but no updates), there are the techniques described in
http://lists.xml.org/archives/xml-dev/200311/msg00690.html - for
immutable data, order-of-magnitude savings are reported.
That means they outperform gzip in certain cases, while you can still deal
with the whole tree in memory in the usual way (with slight overhead for
on-demand decompression). It would be interesting to see what happens if
these techniques are applied to mutable representations. The algorithm
seems to be implementable on top of the SAX API. For a little excursion
into theory, check out Sebastian's paper "Tree transducers and tree
compressions" ( http://lampwww.epfl.ch/~maneth/ )
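As a rough illustration of how an immutable tree can be compressed while it is being built from SAX events, here is a sketch in Java that shares identical subtrees, so that each distinct subtree is stored only once. Subtree sharing is my assumption of a representative technique, not necessarily the algorithm from the linked message; the names Node and SharingHandler are made up, and text and attributes are ignored for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Immutable element node (hypothetical; element structure only).
final class Node {
    final String name;
    final List<Node> children;
    Node(String name, List<Node> children) { this.name = name; this.children = children; }

    // Children are already shared, so comparing them by identity is enough.
    @Override public boolean equals(Object o) {
        if (!(o instanceof Node)) return false;
        Node n = (Node) o;
        if (!name.equals(n.name) || children.size() != n.children.size()) return false;
        for (int i = 0; i < children.size(); i++)
            if (children.get(i) != n.children.get(i)) return false;
        return true;
    }
    @Override public int hashCode() {
        int h = name.hashCode();
        for (Node c : children) h = 31 * h + System.identityHashCode(c);
        return h;
    }
}

// Builds the tree bottom-up from SAX events, reusing equal subtrees.
final class SharingHandler extends DefaultHandler {
    private final Map<Node, Node> pool = new HashMap<>();
    private final List<List<Node>> stack = new ArrayList<>();
    Node root;

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        stack.add(new ArrayList<>()); // collects the children of this element
    }

    @Override public void endElement(String uri, String local, String qName) {
        List<Node> kids = stack.remove(stack.size() - 1);
        Node fresh = new Node(qName, kids);
        Node shared = pool.get(fresh);
        if (shared == null) { pool.put(fresh, fresh); shared = fresh; }
        if (stack.isEmpty()) root = shared;
        else stack.get(stack.size() - 1).add(shared);
    }

    int distinctNodes() { return pool.size(); }

    static SharingHandler parse(String xml) {
        try {
            SharingHandler h = new SharingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling SharingHandler.parse on a document with repeated structure yields a root whose equal children are the same object, and distinctNodes() reports how many nodes are actually kept. Queries can walk the result like an ordinary tree; updates are not safe, because a change to a shared node would show up in every place it is used.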
cheers,
Burak