> if you want minimal memory overhead (and not just create DOM and
> navigate it) you can record XML context of one position in file (that
> would include in-scope namespace declarations, stack of start tags,
> attributes etc.) and use it to move back parser and then restart
> parsing from this position though i have not seen parser that can do
> this ...
My parser does that. For example, when I parse an ebook, I lay it out
a page at a time, and mark the position in the XML of the content that
starts each page. I then write an index file containing all the marks
for each page.
Once a document has been indexed, it's very quick to, say, open the
document and jump to the 200th page, or to jump back quickly page by
page, without storing all the XML for each page.
The drawback is that you have to reindex everything if either a) the
document changes or b) any of the display attributes (e.g. text size,
line spacing) changes.
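One way to notice when a reindex is needed is to store a fingerprint of the document and the display attributes alongside the index. The sketch below is only an illustration of that idea, not what my parser does: the function names, the use of SHA-256 and the `layout` dictionary are all assumptions.

```python
import hashlib
import json


def index_key(xml_bytes: bytes, layout: dict) -> str:
    """Fingerprint of the document plus the display attributes (hypothetical helper)."""
    h = hashlib.sha256()
    h.update(xml_bytes)
    # sort_keys makes the key independent of the order of the layout settings
    h.update(json.dumps(layout, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def index_is_stale(stored_key: str, xml_bytes: bytes, layout: dict) -> bool:
    """True if the document or any display attribute changed since indexing."""
    return stored_key != index_key(xml_bytes, layout)
```

The key would be written as the first line of the index file and checked when the document is opened; any mismatch means the page marks can no longer be trusted.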
All I record for a mark is the offset in the file, the read depth and
the tags of each level of nesting. I don't know anything about
in-scope namespace declarations (I said I was hopelessly naive!)
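To make the shape of a mark concrete, here is a minimal sketch of the record described above (file offset, depth, and the tag at each level of nesting) with a one-line-per-page text encoding. The `Mark` class, the function names and the index line format are assumptions made up for illustration; they are not the actual format my parser writes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Mark:
    offset: int              # byte offset in the XML file where the page starts
    tags: Tuple[str, ...]    # names of the open elements, outermost first

    @property
    def depth(self) -> int:
        # The read depth is just the number of open elements.
        return len(self.tags)


def encode_mark(mark: Mark) -> str:
    """One index line per page: offset, depth, then the open tags."""
    return " ".join([str(mark.offset), str(mark.depth), *mark.tags])


def decode_mark(line: str) -> Mark:
    fields = line.split()
    offset, depth = int(fields[0]), int(fields[1])
    tags = tuple(fields[2:])
    if len(tags) != depth:
        raise ValueError("depth does not match number of tags: %r" % line)
    return Mark(offset, tags)


def resume(data: bytes, mark: Mark) -> Tuple[List[str], bytes]:
    """Return the restored tag stack and the unread bytes starting at the mark."""
    return list(mark.tags), data[mark.offset:]
```

Jumping to page N then amounts to reading the Nth line of the index, decoding it, and handing the restored tag stack and the remaining input to the parser. Spaces are safe as a separator because XML names cannot contain whitespace.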
> and here is how it could be done in XmlPull (for details see:
> http://www.extreme.indiana.edu/~aslom/xmlpull/patterns.html#ANY_ORDER)
[...]
> wrapper.skipSubTree();
I think the advantage of having the nesting level explicit in the
parsing is that the parser is in a position to deal reasonably
robustly with malformed XML, without aborting.
I started off aborting with an error on any mismatched tag, but I
found that in practice the files I was finding on the net had a plethora
of minor errors, and fixing them is much easier if the parser gives
warnings for many errors in the same document (sometimes there are
hundreds of errors) rather than aborting at the first one...
Of course, skipSubTree could do something like that, but it has not
got the option of ascending further up in the tree than the level at
which it was called, which is sometimes the best thing to do
(depending on your recovery heuristics, obviously).
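As a rough illustration of the kind of recovery I mean, the sketch below keeps an explicit stack of open tags and, on a mismatched end tag, either ascends the stack to the nearest matching open tag (warning about each element it closes on the way) or ignores the end tag if nothing matches. The regular expression tokeniser, the function name and this particular heuristic are assumptions chosen to keep the example short; a real parser would need proper handling of comments, CDATA and attribute values containing `>`.

```python
import re
from typing import List, Tuple

# Very rough tag matcher: start tags, end tags and empty-element tags only.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")


def check_nesting(text: str) -> List[str]:
    """Collect warnings for every nesting error instead of aborting at the first."""
    warnings: List[str] = []
    stack: List[Tuple[str, int]] = []   # (tag name, offset where it was opened)
    for m in TAG.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if not closing:
            if not empty:               # <br/> opens and closes in one go
                stack.append((name, m.start()))
            continue
        if name not in [n for n, _ in stack]:
            # No open element matches: drop the end tag and carry on.
            warnings.append("offset %d: stray </%s> ignored" % (m.start(), name))
            continue
        # Ascend as far as needed, closing unclosed elements on the way.
        while stack:
            open_name, open_at = stack.pop()
            if open_name == name:
                break
            warnings.append("offset %d: </%s> closes unclosed <%s> opened at offset %d"
                            % (m.start(), name, open_name, open_at))
    for open_name, open_at in stack:
        warnings.append("end of input: <%s> opened at offset %d never closed"
                        % (open_name, open_at))
    return warnings
```

For example, `check_nesting("<doc><p><b>bold</p><i>x</i></em></doc>")` reports two problems (the unclosed `<b>` and the stray `</em>`) and still reaches the end of the document.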
cheers,
rog.