On Saturday 12 October 2002 07:56 am, Patrick Durusau wrote:
> Gavin Thomas Nicol wrote:
> >Part of the value of ARA is that it was explicitly designed to support
> > parallel parsing of documents. I'm not sure that JITT can be used in
> > quite the same way... or at least it'd be more complex because the
> > implicit assumption is that you are operating in the context of a tree.
>
> I am not sure what measure you are using for "complex".
In this case, the cost of manipulating a tree in parallel (adding nodes,
etc.). Purely from an implementation perspective, it complicates things
considerably because you need to handle synchronization. In ARA, the
output is a stream of discrete ranges, so synchronization isn't a major
problem. In other words, it's not the model that is complex, but (based on
a set of possibly faulty assumptions!) the implementation.
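
A minimal Python sketch of that point, not the actual ARA implementation: each
hypothetical "range constructor" scans the same immutable text and returns its
own list of (kind, start, end) ranges, so the workers share no mutable state
and need no locking. The names CONSTRUCTORS, find_ranges and parallel_ranges
and the simplified ASCII-only tag patterns are assumptions made for
illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical range constructors: kind -> pattern (simplified, not full XML).
CONSTRUCTORS = {
    "start-tag": re.compile(r"<[A-Za-z_][\w.\-]*[^<>]*>"),
    "end-tag": re.compile(r"</[A-Za-z_][\w.\-]*\s*>"),
}

def find_ranges(kind, pattern, text):
    """Return (kind, start, end) ranges; only reads text, mutates nothing shared."""
    return [(kind, m.start(), m.end()) for m in pattern.finditer(text)]

def parallel_ranges(text):
    """Run every constructor concurrently, then merge the independent results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(find_ranges, kind, pattern, text)
                   for kind, pattern in CONSTRUCTORS.items()]
        ranges = [r for f in futures for r in f.result()]
    # The only coordination step: order the merged ranges by position.
    return sorted(ranges, key=lambda r: (r[1], r[2]))
```

Threads are used here only to show the absence of shared state; in CPython a
process pool would probably be needed to see a real speedup. Note that the
ranges are free to overlap, since nothing forces them into a tree.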
> In our investigations of overlapping texts, it appears that most overlap
> is what we characterized as "localized" and hence, one need only parse a
> fragment in the alternate hierarchy to compare the alternative
> hierarchies.
This is an interesting observation... and I think a fairly important insight
into markup. Perhaps there's some "proximity" factor in markup, where
long-range overlapping markup structures are uncommon because most people
cannot track them? It might be similar to depth of XML trees...
> ARA parallel parses the entire document in order to build its internal
> representation of the ranges in the document.
If you use the term "parse" in the sense "examine every character", that is
true... but unless JITT has an external addressing mechanism, it will need to
do that too. You do not have to construct all ranges at once, however
(though that is what I do in my work so far). For example, the regular
expression:
"<"{NameStart}{NameChar}{S}.*">"
could be used to discover/parse all the start tag ranges, but not attribute
or attribute value ranges. Those can be constructed in parallel, or lazily
later. In terms of tree construction, my current approach is to use something
akin to feature logic for "range stop lists" so that certain ranges are
suppressed... though this could just as easily be something based on forest
regular expressions, or XPath (that's one part of my work I still need to do:
typing range constructors to forest regular expressions/schemas).
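
As a rough illustration of the eager/lazy split, here is a Python sketch under
stated assumptions. NAME_START and NAME_CHAR are simplified ASCII stand-ins for
the XML name productions, the start tag pattern is a loose rendering of the
lex-style expression above (it does not handle ">" inside attribute values),
and the function names are hypothetical. Start tag ranges are built in one
pass; attribute ranges are only produced when a caller iterates over a given
tag.

```python
import re

# Simplified ASCII stand-ins for the XML NameStartChar / NameChar productions.
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9_:.\-]"
# Quoted attribute value, double or single quotes.
VALUE = r'"[^"]*"' + "|" + r"'[^']*'"

START_TAG = re.compile(rf"<{NAME_START}{NAME_CHAR}*(?:\s[^<>]*)?/?>")
ATTRIBUTE = re.compile(rf"({NAME_START}{NAME_CHAR}*)\s*=\s*({VALUE})")

def start_tag_ranges(text):
    """Eager pass: (start, end) offsets of every start tag."""
    return [(m.start(), m.end()) for m in START_TAG.finditer(text)]

def attribute_ranges(text, tag_range):
    """Lazy pass: yield (name_range, value_range) inside one start tag range."""
    start, end = tag_range
    for m in ATTRIBUTE.finditer(text, start, end):
        yield (m.start(1), m.end(1)), (m.start(2), m.end(2))
```

A caller that never asks for attributes never pays for finding them, e.g.
`for name, value in attribute_ranges(text, tags[0]): ...` touches only the
characters of the first tag.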
> In some sense that is not complex, but it certainly poses a certain
> overhead to using the ARA approach. Once the entire document has been
> processed, I would expect querying of the ranges to be quite fast. That
> would not be a drawback with largely static documents and versions of
> documents, but could pose problems with documents and sets of documents
> that are not fairly stable.
Right. My current work is with very large sets of mostly static documents,
and on large documents (> 100MB) that are essentially static. This is not a
limitation of ARA, so much as a constraint of my problem domain. For single
documents that are changing frequently, ARA would operate more or less
equivalently to SAX, though with filtering capabilities like JITT.
I should have my papers online today or tomorrow.