xml-dev - Re: [xml-dev] JITTs and DOM

To: Patrick Durusau <pdurusau@emory.edu>
Subject: Re: [xml-dev] JITTs and DOM
From: Jeni Tennison <jeni@jenitennison.com>
Date: Fri, 11 Oct 2002 12:56:30 +0100
Cc: LMNL-DEV <lmnl-dev@lmnl.org>, "Matthew Brook O'Donnell" <mtrout@nycap.rr.com>, <xml-dev@lists.xml.org>
In-reply-to: <3DA69F39.3040302@emory.edu>
Organization: Jeni Tennison Consulting Ltd
References: <200210101305.JAA29379@mail2.reutershealth.com><19514849372.20021010140724@jenitennison.com> <3DA58333.7080001@emory.edu><19218876953.20021010151432@jenitennison.com> <3DA69F39.3040302@emory.edu>
Reply-to: Jeni Tennison <jeni@jenitennison.com>

Hi Patrick,

>>I'd be *very* careful about drawing any conclusions about speed up
>>from these observations. What you've done for these observations is
>>replace markup-significant characters (e.g. '<') with
>>markup-insignificant characters (i.e. '@'), effectively turning
>>whole regions of the document into plain text.
>
> I said in my post that these were observations that suggest further
> investigation. The replacement was noted on the webpage as
> simulating the result of a JITTs parser. Yes, the operation of a
> JITTs parser would be to treat regions of the document as plain
> text. Sorry if that was not explicit in our earlier treatments of
> JITTs parsing.

I understood that the *output* would be plain text, but I thought that
the *input* would be marked-up text. This wasn't the case in the
samples that you were using for your observations. I did see that you
characterised them as "observations" and said that you would do more
investigation; I just didn't want you or anyone else to get too
hopeful about 30x speedup on the basis of these particular
observations.

>>It wouldn't be enough to just ignore all the tags that the parser
>>came across (which is what you've done in effect). Instead, the
>>parser would have to read the tag, look at the name of the tag,
>>check that against a list (from a DTD or schema) in order to work
>>out what to do, and then either generate a "start/endElement" event
>>or generate a "characters" event (to report the tag as a string)
>>depending on the tag's status. If anything, I imagine that this will
>>*add* time to the parsing of the document.
>
> Parsers already build a tree from the DTD or schema in order to
> "recognize" the markup they encounter in the document. All JITTs
> would require is a change in the lookup step, where a parser now
> looks for the token in the tree: upon failure, the parser starts
> reading input again. (That assumes you are using the suggested
> ignore option; with delete, it would drop the token from the input
> string and continue reading input.)

Absolutely. I think I wasn't clear -- I was describing the extra work
that a parser would have to do on top of the "scanning plain text
until you come to a '<'" parsing that your observations were
demonstrating, not the extra work on top of XML parsing.

For what it's worth, the lookup step is not hard to implement as a SAX
filter on top of an existing XML parser (that's basically what I did
when I implemented basic filtering from LMNL documents into XML). As
Rick pointed out, filtering-by-namespace is a very easy place to start
and wins you a lot immediately, but one of the things that I think
we're both interested in is filtering-by-schema/DTD, which is harder
but more powerful and interesting.
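
To make the lookup step concrete, here is a minimal sketch of such a
filter in Python, using the standard xml.sax module. It is not the
filter I wrote for LMNL; the class name TagFilter, the helper
filter_xml, the "known" set (standing in for whatever a DTD or schema
would tell you) and the "ignore"/"delete" mode names are all
assumptions made for illustration. Tags whose names are in the set are
passed through as element events; any other tag is either reported as
a string ("ignore") or dropped ("delete").

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator, quoteattr


class TagFilter(XMLFilterBase):
    """Hypothetical filter: keep known elements, demote the rest."""

    def __init__(self, parent, known, mode="ignore"):
        super().__init__(parent)
        self.known = set(known)  # assumed stand-in for a DTD/schema lookup
        self.mode = mode         # "ignore": report tag as text; "delete": drop it

    def startElement(self, name, attrs):
        if name in self.known:
            super().startElement(name, attrs)
        elif self.mode == "ignore":
            # Rebuild an approximation of the start tag and report it as text.
            attr_text = "".join(
                f" {key}={quoteattr(value)}" for key, value in attrs.items()
            )
            super().characters(f"<{name}{attr_text}>")

    def endElement(self, name):
        if name in self.known:
            super().endElement(name)
        elif self.mode == "ignore":
            super().characters(f"</{name}>")


def filter_xml(source, known, mode="ignore"):
    """Run `source` (a string) through TagFilter and return serialised XML."""
    out = io.StringIO()
    reader = TagFilter(xml.sax.make_parser(), known, mode)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.StringIO(source))
    return out.getvalue()
```

Calling filter_xml("<doc><line>In the <b>beginning</b></line></doc>",
{"doc", "line"}) keeps doc and line as elements and reports the b tags
as escaped character data; with mode="delete" the b tags vanish and
only their content remains. Note that the underlying parser has still
parsed every tag, so this shows the filtering logic only and says
nothing about speed.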

The other thing that I think is promising about the JITTs approach is
the ability to parse just the bits of the document that you're
interested in, on the fly, during processing. A DOM implementation
that did this behind the scenes could be very effective. (I'm sure
that native XML databases / content management systems do this kind of
thing all the time; I don't know if any in-memory DOM implementations
do, or if it's been tried and for some reason rejected?)
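
As a rough illustration of what "behind the scenes" might mean, here
is a small Python sketch that defers parsing until a node is first
touched. It is an assumption-laden toy, not a description of any
existing DOM implementation: the class LazyElement and its parse_count
attribute are hypothetical, and it assumes the document has already
been split into well-formed fragments by some earlier, cheaper step.

```python
from xml.dom import minidom


def _text(node):
    """Concatenate the text-node descendants of a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(_text(child) for child in node.childNodes)


class LazyElement:
    """Hypothetical wrapper: holds raw markup, parses it on first access."""

    def __init__(self, raw):
        self.raw = raw          # unparsed, well-formed XML fragment
        self._dom = None        # filled in the first time .dom is used
        self.parse_count = 0    # illustration only: how often we parsed

    @property
    def dom(self):
        if self._dom is None:
            self._dom = minidom.parseString(self.raw).documentElement
            self.parse_count += 1
        return self._dom

    def text(self):
        return _text(self.dom)


# Only the fragment that is actually consulted gets parsed.
sections = {
    "intro": LazyElement("<sec>Hello <b>world</b></sec>"),
    "body": LazyElement("<sec>never looked at</sec>"),
}
greeting = sections["intro"].text()
```

After this runs, sections["intro"] has been parsed once and
sections["body"] not at all. A real implementation would also have to
find fragment boundaries without a full parse, which is the hard part
and is not attempted here.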

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
