I am pleased to announce that version 0.5 of
VTD-XML -- a new, non-extractive, Java-based XML processing API licensed under
the GPL -- is now freely available on sourceforge.net. For source
code, documentation, a detailed description of the API and code examples, please
visit
Capable of random access, VTD-XML attempts
to be both memory-efficient and high-performance. The starting point of this
project is the observation that, for XML documents that don't declare
entities in a DTD, tokenization can be done by recording only the
starting offset and length of each token.
The core technology of VTD-XML is a binary
format specification called Virtual Token Descriptor (VTD). A VTD record is a
64-bit integer that encodes the starting offset, length, type and nesting
depth of a token in an XML document. Because VTD records don't contain
the actual token content, they work alongside the original XML document,
which is kept intact in memory by the processing model.
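To make the idea concrete, here is a minimal sketch of packing a token's type, depth, length and offset into one 64-bit long and reading them back. The field widths (4/8/20/32 bits), the class name VtdRecordSketch and its method names are assumptions chosen for illustration; they are not taken from the VTD specification or the VTD-XML API.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the bit layout below is an assumption, not the official VTD format.
final class VtdRecordSketch {
    // Hypothetical layout, high to low bits: type (4) | depth (8) | length (20) | offset (32)
    static final int LENGTH_SHIFT = 32;
    static final int DEPTH_SHIFT = 52;
    static final int TYPE_SHIFT = 60;

    static final long OFFSET_MASK = 0xFFFFFFFFL;
    static final long LENGTH_MASK = 0xFFFFFL;
    static final long DEPTH_MASK = 0xFFL;
    static final long TYPE_MASK = 0xFL;

    // Pack the four fields into a single primitive long (no object is allocated).
    static long pack(int type, int depth, int length, long offset) {
        if (type < 0 || type > TYPE_MASK || depth < 0 || depth > DEPTH_MASK
                || length < 0 || length > LENGTH_MASK || offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("field out of range for this sketch");
        }
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | offset;
    }

    static int type(long record) { return (int) ((record >>> TYPE_SHIFT) & TYPE_MASK); }
    static int depth(long record) { return (int) ((record >>> DEPTH_SHIFT) & DEPTH_MASK); }
    static int length(long record) { return (int) ((record >>> LENGTH_SHIFT) & LENGTH_MASK); }
    static long offset(long record) { return record & OFFSET_MASK; }

    // The record holds no text; the token is read from the untouched document bytes.
    static String tokenText(byte[] doc, long record) {
        int start = Math.toIntExact(offset(record));
        return new String(doc, start, length(record), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>hi</b></a>".getBytes(StandardCharsets.UTF_8);
        // Text token "hi": starts at byte 6, is 2 bytes long, nested at depth 2.
        // The type code 5 is an arbitrary value picked for this example.
        long record = pack(5, 2, 2, 6);
        System.out.println(tokenText(doc, record)); // prints "hi"
    }
}
```

Because each record is a primitive long, reading a token is a pair of shifts and masks followed by a lookup into the original byte array.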
VTD's memory-conserving features can be
summarized as follows:
* Avoid per-object overhead -- In many
VM-based object-oriented programming languages, per-object
allocation incurs a small amount of memory overhead. A VTD
record is immune to the overhead because it is not an
object.
* Bulk-allocation of storage -- Fixed in length, VTD records
can be stored in large memory blocks, which are more
efficient to allocate and GC. By allocating a large array
for 4096 VTD records, one incurs the per-array overhead
(16 bytes in JDK 1.4) only once across 4096 records, thus
reducing per-record overhead to very little.
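The bulk-allocation point can be sketched as an append-only buffer that stores records in long[] blocks of 4096 entries. The class VtdBufferSketch and its methods are hypothetical names for this illustration and are not part of the VTD-XML API; the 16-byte array header is the JDK 1.4 figure quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a block-based store for fixed-length 64-bit records.
final class VtdBufferSketch {
    static final int BLOCK_SIZE = 4096; // records per block, as in the text above

    private final List<long[]> blocks = new ArrayList<>();
    private int size;

    // Append one record, allocating a new 4096-entry block only when the last one is full.
    void append(long record) {
        if (size == blocks.size() * BLOCK_SIZE) {
            blocks.add(new long[BLOCK_SIZE]);
        }
        blocks.get(size / BLOCK_SIZE)[size % BLOCK_SIZE] = record;
        size++;
    }

    // Random access by record index.
    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return blocks.get(index / BLOCK_SIZE)[index % BLOCK_SIZE];
    }

    int size() { return size; }

    int blockCount() { return blocks.size(); }

    // Array header cost spread over a full block, e.g. 16 bytes / 4096 records.
    static double headerBytesPerRecord(int arrayHeaderBytes) {
        return (double) arrayHeaderBytes / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        VtdBufferSketch buffer = new VtdBufferSketch();
        for (int i = 0; i < 5000; i++) {
            buffer.append(i);
        }
        System.out.println(buffer.blockCount());      // prints 2
        System.out.println(headerBytesPerRecord(16)); // prints 0.00390625
    }
}
```

With a 16-byte array header, a full block costs about 0.004 bytes of overhead per 8-byte record, compared with a full object header for every token in an object-per-token design.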
Our benchmark indicates that VTD-XML
processes XML at a performance level similar to (and often better than) SAX
with a null content handler. Memory usage is typically between 1.3x and 1.6x
the size of the document, with "1" being the document itself.
Other features included in this release
are:
* Incremental update -- VTD-XML allows one
to modify the content of an XML document without touching irrelevant parts
of the document.
* Content extraction -- VTD-XML also allows one to
pull an element out of XML in its serialized format. This
can be an important feature for partial signing/encryption
of SOAP payloads for WS-Security.
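Both features follow from keeping the original bytes intact and knowing each token's offset and length. The sketch below shows the underlying byte-range operations only; ByteRangeSketch, extract and replace are hypothetical names for this illustration and do not reflect how the VTD-XML API exposes these features.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: offset/length-based extraction and splicing on raw document bytes.
final class ByteRangeSketch {
    // Content extraction: copy an element's serialized bytes out unchanged.
    static byte[] extract(byte[] doc, int offset, int length) {
        return Arrays.copyOfRange(doc, offset, offset + length);
    }

    // Incremental update: replace one byte range and copy the rest of the document verbatim.
    static byte[] replace(byte[] doc, int offset, int length, byte[] replacement) {
        byte[] out = new byte[doc.length - length + replacement.length];
        System.arraycopy(doc, 0, out, 0, offset);
        System.arraycopy(replacement, 0, out, offset, replacement.length);
        System.arraycopy(doc, offset + length, out, offset + replacement.length,
                doc.length - offset - length);
        return out;
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>hi</b><c/></a>".getBytes(StandardCharsets.UTF_8);

        // Element <b>hi</b> starts at byte 3 and is 9 bytes long.
        byte[] fragment = extract(doc, 3, 9);
        System.out.println(new String(fragment, StandardCharsets.UTF_8)); // prints <b>hi</b>

        // Text "hi" starts at byte 6 and is 2 bytes long.
        byte[] updated = replace(doc, 6, 2, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(updated, StandardCharsets.UTF_8)); // prints <a><b>hello</b><c/></a>
    }
}
```

The extracted fragment is byte-for-byte identical to the source, which is the property that matters when signing or encrypting part of a message.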
In upcoming releases, we plan to add
persistence support so that one can save/load VTD records to/from disk along with
the XML documents to avoid repetitive parsing in read-only situations. XPath
support is also on the development roadmap. However, we would like to collect
as many suggestions and bug reports as possible before taking the next step.
Your input and suggestions are very
important in making VTD-XML a truly useful XML processor.
All the best,
Harry