Elliotte Rusty Harold wrote:
> At 9:37 PM -0400 4/13/04, Stephen D. Williams wrote:
> ...
>
>> input document, delta, perform read/update/delete, insert, append
>> output new delta
>
> In practice input and output are the only things that matter. Compared
> to this all manipulations of XML in memory are in the noise by several
> orders of magnitude. Optimizing read, update, delete, insert, etc. is
> pointless. These things all happen in effectively zero time compared
> to parsing, serialization, and sometimes the program's unique
> algorithms. (If the input is an XML document containing 5000 cities
> for which to solve the travelling salesman problem, no XML
> optimization will help you.) But the XML manipulations and traversal
> are trivial.
If this were strictly true, then esXML would be the only solution from
an efficiency perspective. In pure esXML/esDOM processing, there is
never any parsing, serialization, object creation, or object
population, and the memory allocation needed is minimal. (There can
be an optional 'condensation' step on 'output' to fully optimize
space use, but I think it's not needed in most cases.)
The fact is that creating, populating, and manipulating a data model
have costs. This is true of DOM, SAX (where the data model is managed
by the application), esXML (where the data model is also the
'serialized' format, so all costs are manipulation costs), and every
other system that involves internal and external data (Corba, DCOM,
ONC-RPC, ASN.1/xER, etc.). It's not fair to ignore part of the
processing cycle for a format (esXML) that trades a little
manipulation overhead for the elimination of all parsing,
serialization, object-creation, and object-population overhead.
The performance equation for esXML in 'native' mode is:

    cost(esXML) = cost(XML)
                  - parsing
                  - serialization
                  - memory allocation
                  - object creation
                  - object population
                  - extra memory use (5x for DOM, right?)
                  + per-element processing overhead
                  - better locality-of-reference factor
Additionally, with DOM (and in many SAX cases), the whole
parsing/serialization stream must be completely performed for every
element of a document/object, regardless of how many elements are
actually used. With esXML, if a 3000-element document/object is read
in and 5 elements are manipulated, you spend only
5 * element-manipulation-overhead. Are there pathological cases where
conversion-to-native beats managed-data because of the sheer magnitude
of processing? Certainly, and the traveling salesman problem may be a
good example (if it's a single-process algorithm). I am trying to push
the tradeoff crossover point far past where that is true for most
business-type applications. Certainly a distributed N-tier SOA
processing environment, with many hops through specialized service
steps, should show large gains.
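To make that arithmetic concrete, here is a back-of-the-envelope
comparison in Java. The per-element unit costs are invented
placeholders, purely for illustration; only the scaling matters:

    // Hypothetical unit costs, for illustration only.
    public class TouchVsParse {
        public static void main(String[] args) {
            long elements = 3000;           // elements in the document
            long touched  = 5;              // elements actually manipulated
            long parseCostPerElement = 10;  // invented cost units
            long manipCostPerElement = 15;  // in-place edits may cost more apiece

            // Conventional DOM/SAX path: every element is parsed and built,
            // then the touched elements are manipulated.
            long conventional = elements * parseCostPerElement
                              + touched  * manipCostPerElement;

            // esXML-style path: pay only for the elements you touch.
            long inPlace = touched * manipCostPerElement;

            System.out.println("conventional: " + conventional);  // 30075
            System.out.println("in-place:     " + inPlace);       // 75
        }
    }

Even pricing in-place manipulation higher per element, the
touched-elements-only path wins by orders of magnitude whenever the
touched count is tiny compared to the document size.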
For anyone who doesn't think that something like esXML is possible,
there is a simple 'works as if' design that, although suboptimal in
space and other ways, still avoids parsing and serialization entirely:
Imagine a DOM library where the document object is a growable buffer,
or a growable array of blocks. Each node you create, rather than
living on the global heap (i.e. malloced/new'd), is allocated out of
the data space of the document buffer. Each node, instead of being
connected to other nodes by pointers/references, is connected by
network-byte-order relative pointers/offsets. The data is stored as
part of a node in an XML-compatible character encoding (by default),
rather than as a pointer/reference to a separate native object (like a
Java string).
Traversing, reading, updating, and deleting all work much as they do
on a regular DOM data structure, but at any time you can perform a
simple write of the data space 'on the wire' (i.e. over the network),
to a file, to shared memory, etc. Once read in again, no preprocessing
is necessary before access begins. Some data formats have been used in
a read-in-and-traverse model, but few if any have been both general
purpose (a la XML) and capable of in-place update.
If you cared nothing about space use and the other semantics and
constraints that I do care about, this would work. It is not just a
'serialized DOM', but a directly accessible and manipulable DOM.
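Here is a minimal Java sketch of that 'works as if' design. Everything
in it (the class name, the node layout, the fixed-width fields) is my
own illustrative choice, not an actual esXML format or a standard DOM
API; it just demonstrates nodes allocated inside one growable buffer,
linked by buffer-relative big-endian offsets, where 'serialization' is
a single write of the data space:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Illustrative node layout (all ints big-endian, ByteBuffer's default,
    // i.e. network byte order; offsets are relative to the buffer start):
    //   int  firstChild    offset of first child node, or 0 for none
    //   int  nextSibling   offset of next sibling node, or 0 for none
    //   int  nameLen
    //   byte[nameLen]      name, UTF-8 (an XML-compatible encoding)
    public class BufferDoc {
        private ByteBuffer buf = ByteBuffer.allocate(256);

        public BufferDoc() {
            buf.putInt(0);  // reserve offset 0 so 0 can mean "no node"
        }

        // Grow the backing buffer when the next allocation won't fit.
        private void ensure(int extra) {
            if (buf.remaining() < extra) {
                ByteBuffer bigger = ByteBuffer.allocate(
                        Math.max(buf.capacity() * 2, buf.position() + extra));
                buf.flip();
                bigger.put(buf);
                buf = bigger;
            }
        }

        // Allocate a node inside the document's data space; return its offset.
        public int newNode(String name) {
            byte[] n = name.getBytes(StandardCharsets.UTF_8);
            ensure(12 + n.length);
            int off = buf.position();
            buf.putInt(0);          // firstChild: none yet
            buf.putInt(0);          // nextSibling: none yet
            buf.putInt(n.length);
            buf.put(n);
            return off;
        }

        // Append 'child' to 'parent' by patching offsets in place.
        public void appendChild(int parent, int child) {
            int cur = buf.getInt(parent);              // firstChild slot
            if (cur == 0) {
                buf.putInt(parent, child);
            } else {
                while (buf.getInt(cur + 4) != 0)       // walk sibling chain
                    cur = buf.getInt(cur + 4);
                buf.putInt(cur + 4, child);
            }
        }

        // Read a node's name straight out of the data space.
        public String name(int node) {
            int len = buf.getInt(node + 8);
            byte[] n = new byte[len];
            ByteBuffer ro = buf.duplicate();
            ro.position(node + 12);
            ro.get(n);
            return new String(n, StandardCharsets.UTF_8);
        }

        // 'Serialization' is a plain write of the data space: no tree walk,
        // no text-encoding pass.
        public void writeTo(OutputStream out) throws IOException {
            out.write(buf.array(), 0, buf.position());
        }
    }

Reading it back in is just ByteBuffer.wrap(bytes): node access can
begin immediately, with no parsing or preprocessing step.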
esXML does not use a DOM-style linked-node model like the one above;
rather, it uses a packed, tokenized, length-counted format over an
underlying data structure, 'elastic memory', that minimizes data
movement for updates. Optional tokenization tables and element
indexes, along with a low-level delta method, add further efficiency.
I've been writing the code, but I'm slow: this is still a personal
project, so that I can retain control and keep the main version open
source, and my professional life is all-consuming. I'm sure that
community development will change things a lot, but I want to get it
close to right to begin with.
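For the flavor of 'minimizes data movement for updates', here is a
classic gap buffer in Java. To be clear, this is only an analogy I'm
choosing for illustration; it is NOT the actual 'elastic memory'
structure, whose details aren't described in this message:

    // Illustration only: a classic gap buffer. It shows one standard way
    // a flat format can absorb edits with minimal data movement.
    public class GapBuffer {
        private byte[] data = new byte[64];
        private int gapStart = 0;            // first free byte
        private int gapEnd = data.length;    // one past the last free byte

        // Slide the gap to 'pos' so an edit there touches only nearby bytes.
        private void moveGap(int pos) {
            if (pos < gapStart) {
                int n = gapStart - pos;
                System.arraycopy(data, pos, data, gapEnd - n, n);
                gapStart = pos;
                gapEnd -= n;
            } else if (pos > gapStart) {
                int n = pos - gapStart;
                System.arraycopy(data, gapEnd, data, gapStart, n);
                gapStart += n;
                gapEnd += n;
            }
        }

        // Insert bytes at 'pos'; cost is proportional to the gap move,
        // not to the size of the whole document.
        public void insert(int pos, byte[] bytes) {
            moveGap(pos);
            if (gapEnd - gapStart < bytes.length) grow(bytes.length);
            System.arraycopy(bytes, 0, data, gapStart, bytes.length);
            gapStart += bytes.length;
        }

        // Delete 'len' bytes at 'pos': just widen the gap, move nothing.
        // (Bounds checks omitted for brevity.)
        public void delete(int pos, int len) {
            moveGap(pos);
            gapEnd += len;
        }

        private void grow(int need) {
            byte[] bigger = new byte[Math.max(data.length * 2, data.length + need)];
            System.arraycopy(data, 0, bigger, 0, gapStart);
            int tail = data.length - gapEnd;
            System.arraycopy(data, gapEnd, bigger, bigger.length - tail, tail);
            gapEnd = bigger.length - tail;
            data = bigger;
        }

        public int length() {
            return data.length - (gapEnd - gapStart);
        }
    }

An update near the last edit point moves almost nothing, which is the
property that matters when a few elements of a large document change.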
sdw
--
swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw