OASIS Mailing List Archives

   Re: [xml-dev] JITTs and DOM


On Saturday 12 October 2002 07:56 am, Patrick Durusau wrote:
> Gavin Thomas Nicol wrote:
> > Part of the value of ARA is that it was explicitly designed to support
> > parallel parsing of documents. I'm not sure that JITT can be used in
> > quite the same way... or at least it'd be more complex because the
> > implicit assumption is that you are operating in the context of a tree.
>
> I am not sure what measure you are using for "complex" 

In this case, the cost of manipulating a tree in parallel (adding nodes, 
etc.). Purely from an implementation perspective, it complicates things 
considerably because you need to work on synchronization etc. In ARA, the 
output is a stream of discrete ranges, so synchronization isn't a major 
problem. In other words, it's not the model that is complex, but (based on a 
set of possibly faulty assumptions!) the implementation.
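
To make the synchronization point concrete, here is a minimal sketch in Python. 
It is not ARA itself: the TAG pattern, scan_slice and parallel_ranges are 
hypothetical names invented for this illustration, and the pattern is a crude 
ASCII-only approximation of an XML tag. Each worker emits (start, end) ranges 
for the tags that begin in its slice of the text, and the only coordination is 
concatenating the per-worker lists at the end. No shared tree is mutated.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Crude ASCII-only tag pattern (an assumption for this sketch, not the XML grammar).
# It never matches across a second "<", so ranges cannot overlap each other.
TAG = re.compile(r"<[A-Za-z_][A-Za-z0-9_.\-]*[^<>]*>")

def scan_slice(text, lo, hi):
    """Return (start, end) ranges for tags that begin in [lo, hi)."""
    found = []
    # Scan from lo over the whole text so a tag that straddles hi is still seen.
    for m in TAG.finditer(text, lo):
        if m.start() >= hi:
            break
        found.append((m.start(), m.end()))
    return found

def parallel_ranges(text, workers=4):
    """Split the text into slices, scan them independently, concatenate results."""
    step = max(1, -(-len(text) // workers))  # ceiling division
    bounds = [(lo, min(lo + step, len(text))) for lo in range(0, len(text), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves slice order, so the merged list is in document order.
        parts = pool.map(lambda b: scan_slice(text, *b), bounds)
        return [r for part in parts for r in part]
```

Threads are used here only to show the structure; in CPython they will not 
speed up the regular expression work. A tree-building equivalent would need 
locking or a merge step around every node insertion.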

> In our investigations of overlapping texts, it appears that most overlap
> is what we characterized as "localized" and hence, one need only parse a
> fragment in the alternate hierarchy to compare the alternative
> hierarchies.

This is an interesting observation... and I think a fairly important insight 
into markup. Perhaps there's some "proximity" factor in markup, where 
long-range overlapping markup structures are uncommon because most people 
cannot track them? It might be similar to depth of XML trees...

> ARA parallel parses the entire document in order to build its internal
> representation of the ranges in the document.

If you use the term "parse" in the sense "examine every character", that is 
true... but unless JITT has an external addressing mechanism, it will need to 
do that too. You do not have to construct all ranges all at once, however 
(though that is what I do in my work so far). For example, the regular 
expression:

  "<"{NameStart}{NameChar}{S}.*">"

could be used to discover/parse all the start tag ranges, but not attribute, 
or attribute value ranges. That can happen in parallel, or lazily later. In 
terms of tree construction, my current approach is to use something akin to 
feature logic for "range stop lists" so that certain ranges are suppressed... 
though this could just as easily be something based on forest regular 
expressions, or XPath (that's one part of my work I still need to do: typing 
range constructors to forest regular expressions/schemas).
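
A rough Python sketch of the two-stage idea above: find start tag ranges 
first, and construct attribute ranges only when asked. Everything here is an 
assumption made for illustration. NAME is an ASCII-only stand-in for the 
NameStart/NameChar productions, the function names are hypothetical, and 
attribute values containing ">" are not handled.

```python
import re

# ASCII-only stand-in for the XML Name production (an assumption; the real
# NameStart/NameChar productions cover much more of Unicode).
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
START_TAG = re.compile(r"<(" + NAME + r")(\s[^<>]*)?>")
ATTRIBUTE = re.compile(r"(" + NAME + r")\s*=\s*(\"[^\"]*\"|'[^']*')")

def start_tag_ranges(text):
    """Yield (start, end, name) for each start tag; attributes are not examined."""
    for m in START_TAG.finditer(text):
        yield (m.start(), m.end(), m.group(1))

def attribute_ranges(text, tag_range):
    """Lazily yield (start, end, name) for attributes inside one start tag range."""
    start, end, _name = tag_range
    # pos/endpos restrict the search to the already discovered tag range.
    for m in ATTRIBUTE.finditer(text, start, end):
        yield (m.start(), m.end(), m.group(1))
```

For example, with text = '<doc id="d1"><p class="x" n="2">text</p></doc>', 
list(start_tag_ranges(text)) gives the ranges for the doc and p start tags, and 
attribute_ranges(text, r) can be called later, or from another worker, for 
just the ranges r that are of interest.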
 
> In some sense that is not complex, but it certainly poses a certain
> overhead to using the ARA approach. Once the entire document has been
> processed, I would expect querying of the ranges to be quite fast. That
> would not be a drawback with largely static documents and versions of
> documents, but could pose problems with documents and sets of documents
> that are not fairly stable. 

Right. My current work is with very large sets of mostly static documents, 
and on large documents (> 100MB) that are essentially static. This is not a 
limitation of ARA, so much as a constraint of my problem domain. For single 
documents that are changing frequently, ARA would operate more-or-less 
equivalently to SAX, though with filtering capabilities like JITT.

I should have my papers online today or tomorrow.