- To: xml-dev@lists.xml.org
- Subject: Re: [xml-dev] XML Performance in a Transaction
- From: Tatu Saloranta <cowtowncoder@yahoo.com>
- Date: Thu, 23 Mar 2006 11:18:07 -0800 (PST)
- In-reply-to: <31980.60.229.226.213.1143118064.squirrel@intranet.allette.com.au>
--- Rick Jelliffe <rjelliffe@allette.com.au> wrote:
...
> By rights, it seems that there should be some market for a highly
> optimized XML parser. You need high performance, you seek high
> performance libraries; if there are none, you get them made internally
> or externally. But I don't recall ever having seen any requests on
> XML-DEV for high speed parsers: certainly none with any dollars behind
> them.
I guess that lack of demand (not just money, but interest) has
something to do with it. But I think that of the two main approaches
(improving the general case; focusing on a specific subset, whether by
domain or by feature set), the general route would be mostly fruitless.
However, the 'specific solution' path is a route less travelled (at
least in public); and as you point out, there are lots of options one
can try.
...
> Hyper-efficient design is not an optimization that can be tacked on
> after, it has to be the core of the
Very true. And:
> design; you cannot expect a general-purpose, cross-platform parser to
> be optimal. (For example, one trick that goes as far
This is also exactly right; and perhaps it does suggest domain-specific
(or at least feature-set-specific) parsers.
One problem I have seen is that there are no publicly accepted subsets.
Although SOAP (and others with less understanding, like the silliness
of XMPP) went ahead and limited the subset of XML it accepts, there's
lots of resistance to using ad hoc subsets, yet very little effort at
coming up with 'standard' ones. The distinction between validating and
non-validating parsers seems like the only accepted division, but for
practical purposes this is not good enough. Many earlier pull parsers
obviously just went ahead and chose some subset that made sense to
them.
Also: XML is at its foundation a hierarchical textual format. So much
effort is spent (wasted?) on adding type systems (like W3C Schema,
formerly thought of as a validation system), typing, and constraints
that it should not be surprising that binding non-textual data
(numbers, dates, etc.) is inefficient, at least when going the generic
parsing route.
For tighter type binding one can devise specific parsers, but one
problem is that the mechanisms for feeding in type information are
themselves sources of major overhead. Who cares if you can get some
speedup on accessing that int value, when just using a W3C Schema
instance halves your processing speed? At least DTD processing only
adds 50% to the processing time (when the DTD instance is cached).
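(To make the numbers concrete, here's a quick timing sketch of my own,
using nothing but the standard JAXP 1.3 APIs: the same SAX parse run
once without validation and once with a W3C Schema attached. File paths
are placeholders; the gap between the two timings is the overhead I
mean.)

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class ParseCostSketch {
        public static void main(String[] args) throws Exception {
            File doc = new File(args[0]); // document to parse
            File xsd = new File(args[1]); // W3C Schema to validate against

            // Baseline: plain, non-validating SAX parse.
            SAXParserFactory plain = SAXParserFactory.newInstance();
            plain.setNamespaceAware(true);
            time("no validation", plain.newSAXParser(), doc);

            // Same parse with a W3C Schema attached (JAXP 1.3 validation);
            // this is where throughput tends to drop by half.
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(xsd);
            SAXParserFactory validating = SAXParserFactory.newInstance();
            validating.setNamespaceAware(true);
            validating.setSchema(schema);
            time("W3C Schema validation", validating.newSAXParser(), doc);
        }

        // Crude wall-clock timing; good enough to show the relative cost.
        private static void time(String label, SAXParser parser, File doc)
                throws Exception {
            long start = System.currentTimeMillis();
            parser.parse(doc, new DefaultHandler()); // discard all events
            System.out.println(label + ": "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }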
Another route is of course to forget the textual background and use a
binary encoding (Fast Infoset, Bnux). This will result in faster
operation, at least in the context of message processing where a
significant amount of processing is done by middlemen. But is the
Infoset really an optimal representation for (object) data? It still
has all the impedance of a hierarchic data model, compared to object or
relational data models, even if primitives can be typed. Plus they
still need a schema... which is not supported/integrated with these
binary encoding efforts (i.e. there's still schema overhead at one or
both endpoints).
> back as OmniMark's predecessor in the late 80s (I believe) was for
> parsers to have two parsers: one optimized for the most common case and
> encoding--in XML this would be for an entity-less document--, and
> another to handle all the other cases.)
Yes. If you can make use of the fact that there will not be nested
input streams, you can optimize many things differently. It would be
good to see how much improvement this could yield.
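(Here's a rough sketch of what that dispatch could look like, using the
standard StAX factory properties rather than anything OmniMark-specific:
peek at the prologue, and if there is no DOCTYPE (hence no entity
declarations to worry about) use a reader with DTD support switched
off; otherwise fall back to the fully general reader. The class and
method names are made up for illustration.)

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.io.PushbackInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class TwoParserDispatch {
        private static final int PEEK = 512; // how much prologue to inspect

        public static XMLStreamReader createReader(InputStream in)
                throws Exception {
            PushbackInputStream pin = new PushbackInputStream(in, PEEK);
            byte[] head = new byte[PEEK];
            int n = pin.read(head);
            if (n > 0) {
                pin.unread(head, 0, n); // put peeked bytes back before parsing
            }
            String prologue = new String(head, 0, Math.max(n, 0), "ISO-8859-1");
            boolean hasDoctype = prologue.indexOf("<!DOCTYPE") >= 0;

            XMLInputFactory f = XMLInputFactory.newInstance();
            if (!hasDoctype) {
                // Common case: no internal subset, no custom entities,
                // so DTD handling can be switched off entirely.
                f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            }
            return f.createXMLStreamReader(pin);
        }

        public static void main(String[] args) throws Exception {
            String doc = "<root><value>42</value></root>";
            XMLStreamReader r = createReader(
                new ByteArrayInputStream(doc.getBytes("UTF-8")));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("element: " + r.getLocalName());
                }
            }
        }
    }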
Of course, at the end of the day, one could also consider whether it
is all that important to handle both the traditional text markup use
case (for which XML was designed, and where it is a reasonably good
choice) and the later data binding use case (where XML just stinks,
even after tons of lipstick).
Why not solve these using different serializations and data models?
For data binding, why not use something more natural to object binding,
like, say, JSON? Primitives, arrays/lists, maps/objects... what more do
you need? There is little use for mixed content; no need for obscure
macro expansion (entities) beyond encoding purposes... and thanks to
native support for primitive types (i.e. the parser knows the primitive
types without needing external information), it's very simple to avoid
a purely textual approach.
In fact, JSON is so trivially simple to parse and output that it's even
weirder that no money is behind it.
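(A toy example of what I mean by the syntax itself carrying the type:
the first character of a JSON value tells you which native type to bind
to, no schema needed. Scalars only, purely for illustration.)

    public class JsonScalarSketch {
        // Map a JSON scalar token to a native Java value based on syntax
        // alone: strings are quoted, booleans and null are literals, and
        // anything else is a number.
        public static Object parseScalar(String token) {
            char c = token.charAt(0);
            if (c == '"') {
                return token.substring(1, token.length() - 1);
            }
            if (c == 't' || c == 'f') {
                return Boolean.valueOf(token);
            }
            if (c == 'n') {
                return null;
            }
            boolean isFloat = token.indexOf('.') >= 0
                || token.indexOf('e') >= 0 || token.indexOf('E') >= 0;
            return isFloat ? (Object) Double.valueOf(token)
                           : (Object) Long.valueOf(token);
        }

        public static void main(String[] args) {
            System.out.println(parseScalar("42").getClass().getSimpleName());       // Long
            System.out.println(parseScalar("3.14").getClass().getSimpleName());     // Double
            System.out.println(parseScalar("\"text\"").getClass().getSimpleName()); // String
            System.out.println(parseScalar("true").getClass().getSimpleName());     // Boolean
        }
    }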
But what the hey; hammer is a hammer, and maybe them dang weird
spiraled nails just need a bigger hammer! ;-)
-+ Tatu +-