Hi Patrick,
>>I'd be *very* careful about drawing any conclusions about speed up
>>from these observations. What you've done for these observations is
>>replace markup-significant characters (e.g. '<') with
>>markup-insignificant characters (i.e. '@'), effectively turning
>>whole regions of the document into plain text.
>
> I said in my post that these were observations that suggest further
> investigation. The replacement was noted on the webpage as
> simulating the result of a JITTs parser. Yes, the operation of a
> JITTs parser would be to turn regions of the document into plain
> text. Sorry if that was not explicit in our earlier treatments of
> JITTs parsing.
I understood that the *output* would be plain text, but I thought that
the *input* would be marked-up text. This wasn't the case in the
samples that you were using for your observations. I did see that you
characterised them as "observations" and said that you would do more
investigation, I just didn't want you or anyone else to get too
hopeful about 30x speedup on the basis of these particular
observations.
>>It wouldn't be enough to just ignore all the tags that the parser
>>came across (which is what you've done in effect). Instead, the
>>parser would have to read the tag, look at the name of the tag,
>>check that against a list (from a DTD or schema) in order to work
>>out what to do, and then either generate a "start/endElement" event
>>or generate a "characters" event (to report the tag as a string)
>>depending on the tag's status. If anything, I imagine that this will
>>*add* time to the parsing of the document.
>
> Parsers already build a tree from the DTD or schema in order to
> "recognize" the markup they encounter in the document. All JITTs
> would require is a change in the lookup step, where a parser now
> looks for the token in the tree: upon failure, the parser would
> start reading input again. (That assumes you are using the
> suggested ignore option; with the delete option, it would drop the
> token from the input string and continue reading input.)
Absolutely. I think I wasn't clear -- I was describing the extra work
that a parser would have to do on top of the "scanning plain text
until you come to a '<'" parsing that your observations were
demonstrating, not the extra work on top of XML parsing.
For what it's worth, the lookup step is not hard to implement as a SAX
filter on top of an existing XML parser (that's basically what I did
when I implemented basic filtering from LMNL documents into XML). As
Rick pointed out, filtering-by-namespace is a very easy place to start
and wins you a lot immediately, but one of the things that I think
we're both interested in is filtering-by-schema/DTD, which is harder
but more powerful and interesting.
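For the filtering-by-namespace case, the SAX-filter shape is something like the following sketch (my own illustration, not the LMNL code): a filter that passes through only the start/end events for elements in the namespace you care about, so elements in other namespaces lose their tags but keep their text content, i.e. the ignore behaviour:

```python
# Sketch of filtering-by-namespace as a SAX filter (illustrative only).
# Elements outside `keep_ns` are not reported as elements, but their
# character content still passes through -- the "ignore" behaviour.
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class NamespaceFilter(XMLFilterBase):
    def __init__(self, parent, keep_ns):
        super().__init__(parent)
        self.keep_ns = keep_ns

    def startElementNS(self, name, qname, attrs):
        if name[0] == self.keep_ns:          # name is a (uri, localname) pair
            super().startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        if name[0] == self.keep_ns:
            super().endElementNS(name, qname)

def filter_by_namespace(xml_text, keep_ns):
    """Serialise xml_text keeping only elements in keep_ns."""
    out = io.StringIO()
    reader = xml.sax.make_parser()
    reader.setFeature(xml.sax.handler.feature_namespaces, True)
    filt = NamespaceFilter(reader, keep_ns)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

Filtering-by-schema/DTD would slot into the same shape: the test in `startElementNS` just becomes a lookup against whatever the schema says, rather than a namespace comparison.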
The other thing that I think is promising about the JITTs approach is
the ability to parse just the bits of the document that you're
interested in, on the fly, during processing. A DOM implementation
that did this behind the scenes could be very effective. (I'm sure
that native XML databases / content management systems do this kind of
thing all the time; I don't know if any in-memory DOM implementations
do, or if it's been tried and for some reason rejected?)
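To sketch what I mean by parsing on the fly (again purely illustrative; `LazyRecords` is an invented name and the offset-scanning shortcut assumes un-nested, attribute-free-of-'>' elements, fine for a sketch but not for production): index where each interesting element starts with a cheap scan, and only run a real parser over an element when it is actually accessed:

```python
# Speculative sketch of lazy, parse-on-demand access to parts of a
# document. A cheap scan records the span of each <tag> element; the
# real parse of any one element is deferred until first access.
import re
import xml.etree.ElementTree as ET

class LazyRecords:
    def __init__(self, doc, tag):
        self.doc = doc
        # Record (start, end) offsets only; no element is parsed yet.
        close = "</%s>" % tag
        self._spans = [(m.start(), doc.index(close, m.start()) + len(close))
                       for m in re.finditer("<%s[ >]" % tag, doc)]
        self._cache = {}

    def __len__(self):
        return len(self._spans)

    def __getitem__(self, i):
        if i not in self._cache:             # parse only on first access
            start, end = self._spans[i]
            self._cache[i] = ET.fromstring(self.doc[start:end])
        return self._cache[i]
```

With `doc = "<db><record>a</record><record>b</record></db>"`, `len(LazyRecords(doc, "record"))` answers "how many records?" without parsing any of them, and asking for `recs[1]` parses only that one record. A DOM that did this behind the scenes is the kind of thing I have in mind.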
Cheers,
Jeni
---
Jeni Tennison
http://www.jenitennison.com/