> > pull- kXML: the parser says where it is, the DH tells the
> parser which
> > way
> > to go. In other words, the document handler has the parser
> > pull only the
> > data it's interested in; branches get skipped. The document does not
> > have to
> > be in-memory
> I see where you are going with this, but do you mean that only the
> branches that are marked as relevant are kept in memory, or neither?
> Because if nothing is cached, then this goes back to the pull idea.
Hmmm. The pull-parser sends events, same as the push-parser. The data for
the current node (name, plus maybe attributes if the node is an element) is
stored in the event. It may not be much data, but it is "in-memory" at that
point in time.
The document handler may take that event and store it in the application's
internal cache. Or it may just call the event's toString() method that spits
the event information out to, say, stdout. In that case, the "in-memory"
component is very transitory.
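To make that concrete, here's a minimal sketch of the push side using the standard JAXP SAX API (the element names are made up). The handler just records each start-element event; unless it caches something like this, the event data is gone once the callback returns:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PushDemo {
    // Parse the document and record every start-element event the parser
    // pushes at us; the parser, not the handler, drives the iteration.
    static List<String> startTags(String xml) throws Exception {
        List<String> tags = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                // Event data (name, attributes) is in memory only for the
                // duration of this callback unless the handler caches it.
                tags.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return tags;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startTags(
                "<order id=\"1\"><item>tea</item><item>cup</item></order>"));
    }
}
```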
The main difference between push and pull is that the push-parser iterates
depth-first through every node in the document. The pull parser can be
directed to skip branches, so you don't get the subevents generated for
nodes on those branches. If, say, you skip processing of an element, the
pull parser just zips through the document content looking for the skipped
element's end tag, and then generates its next event from the tag that
follows. This can be more efficient for grabbing document data that's
sparsely distributed. Fewer events generated means less overhead.
So I don't think the two are in any way equivalent, unless you plan on
visiting every node in the document anyway, in which case they function
pretty much the same.
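A pull sketch using the standard StAX API (kXML's interface differs, but the idea is the same; the element names here are invented). The handler tells the reader to skip an entire branch, so no events are surfaced for the nodes underneath it:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullDemo {
    // Pull events from the reader, directing it past the <skipme> branch;
    // the subevents for that branch are never generated for us.
    static String readWanted(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("skipme")) {
                    skipSubtree(r);             // zip past the whole branch
                } else if (r.getLocalName().equals("want")) {
                    return r.getElementText();  // the data we pulled for
                }
            }
        }
        return null;
    }

    // Consume events until the current element's matching end tag.
    static void skipSubtree(XMLStreamReader r) throws Exception {
        int depth = 1;
        while (depth > 0) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) depth++;
            else if (ev == XMLStreamConstants.END_ELEMENT) depth--;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWanted(
                "<doc><skipme><a/><b/></skipme><want>hello</want></doc>"));
    }
}
```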
> > batch (or ??)- DOM: A single method loads the document
> in-memory. The
> > app
> > then navigates at its leisure
> Ok, but I would consider this pull as well. Why do you not
> think that the
> pull concept applies here. Instead of telling the parser
> what to load as in
> your pull explanation above, you are in a way telling it to
> load the whole
> document and make all nodes relevant. Then I think we are in the pull
> concept area again?
Right, but with a pull-parser, you can choose what to cache; you can grab
parts of a document. With a DOM parse method, you get the whole document.
It's the difference between being served a seven-course meal (batch) and
choosing from a buffet (pull).
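The seven-course meal, sketched with the standard JAXP DOM API (document content invented). Note there's no way to ask for less than everything; a single call loads the whole tree:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomDemo {
    // One parse() call loads the entire document into memory; the
    // application then navigates the tree at its leisure.
    static List<String> dishNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList dishes = doc.getElementsByTagName("dish");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < dishes.getLength(); i++) {
            names.add(dishes.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dishNames(
                "<menu><dish>soup</dish><dish>fish</dish></menu>"));
    }
}
```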
> > fully directed- (XQuery??)- the document handler builds instructions
> > that
> > the parser uses to navigate and return data from the document.
> That's a new way to look at it for me. Have to give it a
> thought. I would
> think XQuery would fit into the DOM/pull (batch) category,
> since it just
> uses a different access syntax, but the document is accessed
> the same way.
Similar to pull, but here you're telling the parser up front what it is
you're looking for, rather than directing it in real time. The problem with
pull parsing is that you may skip over nodes that you're later interested in
if you're not careful (sloppy coding); with the fully-directed approach, you
get everything you intend to get because you specified it up front. The
directed parser can then order your queries so that they are optimized for
a single pass-through (unless there are navigation dependencies on the
document content which may preclude a single pass-through). This assumes, in
both cases, you know in advance what you're looking for.
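The fully-directed idea can be sketched with XPath (a stand-in, since a streaming XQuery engine isn't in the standard library; document and query are invented). The point is that the navigation instructions are handed over up front and the engine decides how to execute them:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectedDemo {
    // The "query" is specified in advance; the engine decides how to
    // navigate. (XPath here runs over an in-memory DOM; a streaming
    // engine could instead plan a single pass over the file.)
    static List<String> recentTitles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/lib/book[@year > 2000]/title",
                          doc, XPathConstants.NODESET);
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            titles.add(hits.item(i).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(recentTitles(
                "<lib><book year=\"1999\"><title>A</title></book>"
              + "<book year=\"2005\"><title>B</title></book></lib>"));
    }
}
```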
Now, I've used three of these approaches in my own work. I haven't used a
fully-directed parser (maybe the term pre-directed is better), although I
envisage it as being something similar to SQL queries on a database. That's
why I mention XQuery. The key difference is that a document in a file is
sequential access, not random access as in a database, so I'm not fully
cognizant of what optimizations are possible in a sequential-access
environment.