xml-dev - Re: [xml-dev] JITTs and DOM


To: LMNL-DEV <lmnl-dev@lmnl.org>
Subject: Re: [xml-dev] JITTs and DOM
From: Gavin Thomas Nicol <gtn@rbii.com>
Date: Sat, 12 Oct 2002 10:05:04 -0400
Cc: xml-dev@lists.xml.org
In-reply-to: <3DA80DE2.1090703@emory.edu>
Organization: Red Bridge Interactive, Inc.
References: <200210101305.JAA29379@mail2.reutershealth.com> <E1800rd-0002Td-00@server2000.ebizhostingsolutions.com> <3DA80DE2.1090703@emory.edu>
Reply-to: gtn@rbii.com

On Saturday 12 October 2002 07:56 am, Patrick Durusau wrote:
> Gavin Thomas Nicol wrote:
> >Part of the value of ARA is that it was explicitly designed to support
> > parallel parsing of documents. I'm not sure that JITT can be used in
> > quite the same way... or at least it'd be more complex because the
> > implicit assumption is that you are operating in the context of a tree.
>
> I am not sure what measure you are using for "complex" 

In this case, the cost of manipulating a tree in parallel (adding nodes, 
etc.). Purely from an implementation perspective, it complicates things 
considerably because you need to handle synchronization, etc. In ARA, the 
output is a stream of discrete ranges, so synchronization isn't a major 
problem. In other words, it's not the model that is complex, but (based on a 
set of possibly faulty assumptions!) the implementation.
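
To make the synchronization point concrete, here is a minimal Python sketch. 
It is not ARA itself: the Range tuple, the RECOGNIZERS table and both 
function names are hypothetical, and the patterns are deliberately loose. 
Each recognizer only reads the shared text and returns its own list of 
ranges, so the workers share no mutable state and need no locks; the 
results are simply merged and sorted at the end.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple


class Range(NamedTuple):
    """A discrete range over the source text (hypothetical structure)."""
    start: int
    end: int
    kind: str


# Hypothetical recognizers, one per kind of range. The patterns are loose
# approximations for illustration, not a conforming XML grammar.
RECOGNIZERS = {
    "start-tag": re.compile(r"<[A-Za-z_][\w.-]*[^<>]*>"),
    "end-tag": re.compile(r"</[A-Za-z_][\w.-]*\s*>"),
}


def scan(kind, pattern, text):
    """Scan the whole (immutable) text for one kind of range."""
    return [Range(m.start(), m.end(), kind) for m in pattern.finditer(text)]


def parallel_ranges(text):
    """Run every recognizer independently, then merge the range lists."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(scan, kind, pattern, text)
                   for kind, pattern in RECOGNIZERS.items()]
        ranges = [r for f in futures for r in f.result()]
    # Sorting by (start, end, kind) is the only "merge" step required.
    return sorted(ranges)
```

Adding nodes to a shared tree from several workers would instead need 
locking, or some other coordination, around every insertion.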

> In our investigations of overlapping texts, it appears that most overlap
> is what we characterized as "localized" and hence, one need only parse a
> fragment in the alternate hierarchy to compare the alternative
> hierarchies.

This is an interesting observation... and I think a fairly important insight 
into markup. Perhaps there's some "proximity" factor in markup, where 
long-range overlapping markup structures are uncommon because most people 
cannot track them? It might be similar to depth of XML trees...

> ARA parallel parses the entire document in order to build its internal
> representation of the ranges in the document.

If you use the term "parse" in the sense "examine every character", that is 
true... but unless JITT has an external addressing mechanism, it will need to 
do that too. You do not have to construct all ranges, all at once however 
(though that is what I do in my work so far). For example, the regular 
expression:

  "<"{NameStart}{NameChar}{S}.*">"

could be used to discover/parse all the start tag ranges, but not attribute 
or attribute value ranges. Those can be discovered in parallel, or lazily 
later. In terms of tree construction, my current approach is to use something 
akin to feature logic for "range stop lists" so that certain ranges are 
suppressed... though this could just as easily be something based on forest 
regular expressions, or XPath (that's one part of my work I still need to do: 
typing range constructors to forest regular expressions/schemas).
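
A rough Python sketch of that two-stage idea follows. The regular 
expressions are loose approximations of the lex-style pattern above (NAME 
stands in for {NameStart}{NameChar}), and Range, start_tag_ranges and 
attribute_ranges are hypothetical names, not part of any ARA 
implementation. The first pass yields only start-tag ranges; attribute 
ranges are computed on demand, and only inside a start tag that has 
already been found.

```python
import re
from typing import Iterator, NamedTuple


class Range(NamedTuple):
    """A discrete range over the source text (hypothetical structure)."""
    start: int
    end: int
    kind: str


# Simplified stand-in for {NameStart}{NameChar}*; not the full XML Name rule.
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
NAME_RE = re.compile(NAME)
START_TAG = re.compile(rf"<{NAME}(?:\s[^<>]*)?>")
ATTRIBUTE = re.compile("(" + NAME + r""")\s*=\s*("[^"]*"|'[^']*')""")


def start_tag_ranges(text: str) -> Iterator[Range]:
    """First pass: discover start-tag ranges only."""
    for m in START_TAG.finditer(text):
        yield Range(m.start(), m.end(), "start-tag")


def attribute_ranges(text: str, tag: Range) -> Iterator[Range]:
    """Lazy second pass: attribute ranges inside one known start tag."""
    # Skip "<" and the element name so the name is not taken for an attribute.
    element_name = NAME_RE.match(text, tag.start + 1)
    for m in ATTRIBUTE.finditer(text, element_name.end(), tag.end):
        yield Range(m.start(1), m.end(1), "attribute-name")
        yield Range(m.start(2), m.end(2), "attribute-value")
```

Because both functions are generators, nothing is scanned until a consumer 
asks for ranges, and attribute_ranges can be called for only those start 
tags a query actually touches.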

> In some sense that is not complex, but it certainly poses a certain
> overhead to using the ARA approach. Once the entire document has been
> processed, I would expect querying of the ranges to be quite fast. That
> would not be a drawback with largely static documents and versions of
> documents, but could pose problems with documents and sets of documents
> that are not fairly stable. 

Right. My current work is with very large sets of mostly static documents, 
and with large documents (> 100MB) that are essentially static. This is not 
so much a limitation of ARA as a constraint of my problem domain. For single 
documents that are changing frequently, ARA would operate more-or-less 
equivalently to SAX, though with filtering capabilities like JITT.

I should have my papers online today or tomorrow.
