   Re: [xml-dev] Fast text output from SAX?


Elliotte Rusty Harold wrote:

> At 9:37 PM -0400 4/13/04, Stephen D. Williams wrote:
> ...
>> input document, delta, perform read/update/delete, insert, append
>> output new delta
> In practice input and output are the only things that matter. Compared 
> to this all manipulations of XML in memory are in the noise by several 
> orders of magnitude. Optimizing read, update, delete, insert, etc. is 
> pointless. These things all happen in effectively zero time compared 
> to parsing, serialization, and sometimes the program's unique 
> algorithms. (If the input is an XML document containing 5000 cities 
> for which to solve the travelling salesman problem, no XML 
> optimization will help you.) But the XML manipulations and traversal 
> are trivial.

If that were strictly true, then esXML would win on efficiency by 
default: in pure esXML/esDOM processing there is never any parsing, 
serialization, object creation, or object population, and the memory 
allocation needed is minimal.  (There can be an optional 'condensation' 
step on 'output' to fully optimize space use, though I don't think it 
is needed in most cases.)

The fact is that creating, populating, and manipulating a data model has 
costs.  This is true of DOM, of SAX (where the data model is managed by 
the application), of esXML (where the data model is also the 
'serialized' format, so all costs are manipulation costs), and of every 
other system that moves data between internal and external 
representations (CORBA, DCOM, ONC-RPC, ASN.1/xER, etc.).  It's not fair 
to ignore part of the processing cycle for a format (esXML) that trades 
a modest manipulation overhead for the elimination of all parsing, 
serialization, object-creation, and object-population overhead.

The performance equation for esXML in 'native' mode, relative to 
conventional XML processing, is:

  cost(esXML) = cost(XML)
    - parsing
    - serialization
    - memory allocation
    - object creation
    - object population
    - extra memory use (5x for DOM, right?)
    + per-element processing overhead
    - improved locality of reference

Additionally, in DOM cases and many SAX cases, the entire 
parse/serialize pipeline must be run for every element of a 
document/object.  With esXML, if a 3000-element document/object is read 
in and 5 elements are manipulated, you spend only 
5 * element-manipulation-overhead.  Are there pathological cases where 
converting to a native representation beats managed data because of the 
sheer magnitude of processing?  Certainly, and the traveling salesman 
problem may be a good example (if it's a single-process algorithm).  I 
am trying to push the tradeoff crossover point far past where that is 
true for most business-type applications.  Certainly a distributed 
N-tier SOA environment with many hops through specialized service steps 
should show large gains.

For anyone who doubts that something like esXML is possible, there is a 
simple 'works as if' design that, although suboptimal in space and 
other ways, avoids parsing and serialization entirely:

Imagine a DOM library where the document object is a growable buffer, 
or a growable array of blocks.  Each node you create, rather than 
living on the global heap (i.e., malloc'd/new'd), is allocated from the 
data space of the document block.  Nodes are connected not by native 
pointers/references but by network-byte-order relative 
pointers/offsets.  Character data is stored as part of a node in an 
XML-compatible character encoding (by default), rather than as a 
pointer/reference to a separate native object (like a Java String).
Traversing, reading, updating, and deleting all work much as in a 
regular DOM data structure, but at any time you can write the data 
space directly 'on the wire' (i.e., over the network), to a file, to 
shared memory, etc.  Once read back in, no preprocessing is needed 
before access begins.  Some data formats have been used in a 
read-in-and-traverse model, but few if any have been both general 
purpose (a la XML) and able to support in-place update.

If you cared nothing about space use and the other semantics and 
constraints that I do care about, this would work.  It is not just a 
'serialized DOM', but a directly accessible and manipulable DOM.
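The 'works as if' design above can be sketched in a few lines.  This is 
a hypothetical illustration, not esXML code: names like new_doc and 
add_node are invented, and for simplicity the offsets are 
buffer-absolute rather than self-relative, though they are stored in 
network byte order and the idea is the same.

```python
import struct

LINKS = struct.Struct(">iiH")  # next-sibling off, first-child off, name length

def new_doc():
    """A document is just a growable byte buffer; a 4-byte header
    ensures offset 0 can mean 'no node'."""
    return bytearray(b"esX\0")

def add_node(buf, parent_off, name):
    """Append a node record inside the document's own data space and
    link it as the last child of parent_off (None = root). Returns
    the new node's offset."""
    off = len(buf)
    nb = name.encode("utf-8")
    buf += LINKS.pack(0, 0, len(nb)) + nb
    if parent_off is not None:
        # splice into the parent's child chain using offsets only
        first = struct.unpack_from(">i", buf, parent_off + 4)[0]
        if first == 0:
            struct.pack_into(">i", buf, parent_off + 4, off)
        else:
            cur = first
            while struct.unpack_from(">i", buf, cur)[0] != 0:
                cur = struct.unpack_from(">i", buf, cur)[0]
            struct.pack_into(">i", buf, cur, off)
    return off

def children(buf, off):
    cur = struct.unpack_from(">i", buf, off + 4)[0]
    while cur:
        yield cur
        cur = struct.unpack_from(">i", buf, cur)[0]

def name_of(buf, off):
    n = struct.unpack_from(">H", buf, off + 8)[0]
    return buf[off + 10 : off + 10 + n].decode("utf-8")

doc = new_doc()
root = add_node(doc, None, "order")
add_node(doc, root, "item")
add_node(doc, root, "total")

wire = bytes(doc)          # 'serialization' is a plain copy, no traversal
reread = bytearray(wire)   # 'parsing' is zero work: offsets are still valid
print([name_of(reread, c) for c in children(reread, root)])  # ['item', 'total']
```

Because the links are offsets within the buffer rather than native 
pointers, the buffer can be written to a socket, file, or shared memory 
at any moment and traversed immediately on the other side.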

esXML does not use a linked-node DOM model as above, but rather a 
packed, tokenized, length-counted format over an underlying data 
structure, 'elastic memory', that minimizes data movement on updates.  
Optional tokenization tables and element indexes, along with a 
low-level delta method, add further efficiency.  I've been writing the 
code, but slowly: this is still a personal project, so that I can 
retain control and keep the main version open source while my 
professional life is all-consuming.  I'm sure community development 
will change things a lot, but I want to get it close to right to begin 
with.
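To make 'packed, tokenized, length-counted' concrete, here is a 
minimal, hypothetical encoding in that spirit (this is not the actual 
esXML wire format; encode, find, and the record layout are invented for 
illustration).  Each record is [2-byte token id][4-byte payload 
length][payload]; a tokenization table maps names to ids, and the 
length counts let a reader hop from record to record, decoding only the 
elements it actually touches:

```python
import struct

REC = struct.Struct(">HI")  # token id, payload length (network byte order)

def encode(records, table):
    """records: list of (name, payload bytes); table: name -> token id."""
    out = bytearray()
    for name, payload in records:
        out += REC.pack(table[name], len(payload)) + payload
    return bytes(out)

def find(buf, token):
    """Skip-scan by length counts; only the matching payload is decoded."""
    off = 0
    while off < len(buf):
        tok, n = REC.unpack_from(buf, off)
        if tok == token:
            return buf[off + REC.size : off + REC.size + n]
        off += REC.size + n  # length-counting: skip without parsing
    return None

table = {"sku": 1, "qty": 2, "price": 3}
wire = encode([("sku", b"A-100"), ("qty", b"2"), ("price", b"9.99")], table)
print(find(wire, table["price"]))  # b'9.99'
```

This is where the 3000-elements-read, 5-elements-touched economics come 
from: untouched records cost one length lookup each, not a parse.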


swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw


