- To: XML Developers List <xml-dev@lists.xml.org>
- Subject: SAX/Java Character Buffers (was Re: [xml-dev] SAX and parallel processing)
- From: David Megginson <david.megginson@gmail.com>
- Date: Sat, 1 Jan 2005 15:51:54 -0500
- In-reply-to: <1104611359.3038.148.camel@borgia>
- References: <830178CE7378FC40BC6F1DDADCFDD1D10276723C@RED-MSG-31.redmond.corp.microsoft.com> <200412310131.52268.miles@milessabin.com> <1104460276.3038.23.camel@borgia> <41D4CC8C.1080200@objfac.com> <1104465543.3038.28.camel@borgia> <20041231165744.GA20756@maribor.izzy.net> <75cb920c04123110151af471f9@mail.gmail.com> <41D5DAD0.7000909@objfac.com> <20041231234449.GA21911@maribor.izzy.net> <1104611359.3038.148.camel@borgia>
- Reply-to: David Megginson <david.megginson@gmail.com>
On Sat, 01 Jan 2005 13:29:18 -0700, Uche Ogbuji
<Uche.Ogbuji@fourthought.com> wrote:
[about SAX/Java character events providing an offset into an array
rather than a text/string object]
> I know the original SAX idea was optimization, but I do think this is
> exactly one of those areas where perhaps (IMO) premature optimization
> ends up limiting design evolution, and I also think that it interferes
> with the "Simple" part.
That was a tough choice at the time. I think it was James Clark who
suggested it -- he is justly famous for fast code, but as anyone who
ever tried to work with SP (his C++ SGML parsing library) can attest,
he's not famous for readable code. Here are the pros and cons with
the benefit of six or so years of hindsight:
Pro:
Buffer copying is a killer for high-performance apps. SAX does not
allow a parser to avoid *all* buffer copying -- it's still necessary
to copy attribute values, for example, unless the parser happens to
know that they're tokenized -- but otherwise, a SAX parser can provide
direct offsets into its own buffer for character events and use
internalized strings for Namespace URIs and element and attribute
names, avoiding most thrashing around in the heap. It's worth noting
that even today, when Java heap operations are much faster than they
used to be, SAX-based parsers are still remarkably fast. In any case,
without this speed advantage back in the late 1990s, when people were
still scared of Java (much less XML) because it was so slow, SAX might
not have gained widespread acceptance in the commercial world. Who
wants an API that makes your parser run even slower?
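To make the zero-copy point concrete, here is a minimal sketch (my
hypothetical handler, not anything from the SAX distribution) showing
the characters() callback the paragraph above describes: the parser
hands you (buffer, start, length) pointing into its own buffer, and
you only pay for a String if you choose to build one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts character data without ever copying
// it out of the parser's buffer.
public class CharCount extends DefaultHandler {
    long chars = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // A copying consumer would do: new String(ch, start, length).
        // Here we just accumulate the length -- no allocation at all.
        chars += length;
    }

    public static void main(String[] args) throws Exception {
        CharCount handler = new CharCount();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader("<doc>hello world</doc>")),
            handler);
        // "hello world" is 11 characters (possibly delivered across
        // several characters() calls -- the count is the same).
        System.out.println(handler.chars);
    }
}
```

Note that characters() may fire more than once per text node, which is
another consequence of the buffer-oriented design: the parser reports
whatever happens to be in its buffer, and the handler accumulates.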
Con:
In most XML applications that actually do anything significant with
the parse events, parsing overhead is a tiny fraction of total
processing time, say 1% of the total. In other words, making the XML
parser twice as fast might reduce processing time by 1/200. In any
case, there's usually at least one round of buffer copying anyway,
when the byte buffer (say, from an HTTP packet) gets converted to
Unicode.
I'm not sure what I would do, even if I were starting fresh. A good
API should stay out of people's way, and SAX was always meant to be
low-level. I had assumed that most developers would use fancy
toolkits on top, like the original SAXON, which provided friendlier
events, element stacks, etc.; instead, almost everyone went straight
to the basic API. XML developers always seem to like to stay close to
the metal.
All the best,
David
--
http://www.megginson.com/