rog@vitanuova.com wrote:
>Hi,
>I'm afraid I'm new to this list, so am probably breaking 300 list
>taboos...
>
hi,
i am sure i am breaking some too so you are not alone :-)
>It's just a simple and small solution to an XML parsing problem I had,
>which was satisfactory at the time, seems to me to be generally
>applicable, and I haven't seen anything similar. Sorry in advance
>about the length.
>
i think so too, and it is one of the reasons i worked on the XmlPull API
(www.xmlpull.org)
>First thing: this is no panacea, and to be quite honest, I'm actually
>(heretic!) not at all keen on XML, but this API at least made things
>bearable; the structure of the code dealing with the XML was quite
>logical, and the space used in doing so was largely bounded.
>
that is true of pull parsing in general: the code that does the
parsing tends to mirror the XML structure, and i even went so far as
to describe it as a common pattern in xml pull parsing
(http://www.extreme.indiana.edu/~aslom/xmlpull/patterns.html#MIRROR)
>I developed it when I was writing a browser for the Open Ebook
>standard to run on limited memory platforms. Obviously DOM was out of
>the question, and SAX became really awkward as it would have been
>necessary to traverse the whole XML tree from the start when moving
>back a page; moreover I found it difficult to write code that
>corresponded directly to the DTDs.
>
>The idea is very simple: treat the XML as a multi-level stream, and
>provide an interface that allows one to *mark* a place in the stream,
>and *go to* a previously marked place.
>
that goes one step beyond streaming pull parsing. however, i was
already experimenting with something like that in XPP2: XmlPullNode
allowed building the XML tree in memory on demand, and even accessing
the XPP2 event stream directly for sub-trees. i was very happy with the
capabilities of such a "mixed" API.
>The basic API looked something like:
>
<snip/>
>Open()ing an XML file produces a parser p; then p.next() produces the
>next XML item in the file *at the same nesting level*.
>
that is the main difference compared with XmlPull: next() in XmlPull
moves the stream to the next event, doing depth-first iteration
(exactly like SAX).
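to make the difference concrete, here is a minimal sketch of getting "same nesting level" behaviour out of a depth-first event stream by counting depth. note the assumptions: it uses the JDK's StAX XMLStreamReader (a different pull API, swapped in only so the example runs without extra jars), and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of XmlPull or of the API described in the quoted post.
class SameLevelIteration {
    /** Returns the names of the root element's direct children only. */
    static List<String> directChildren(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0; // number of currently open elements
            while (reader.hasNext()) {
                int event = reader.next(); // depth-first: visits every nested element
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) { // depth 1 is the root, depth 2 its children
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(directChildren("<a><b><c/></b><d/></a>"));
    }
}
```

the parser still walks through <c/>; the caller just ignores events that are deeper than the level it cares about.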
>Therefore,
>
> p := xml->open("foo.xml");
> while ((i := p.next()) != nil)
> process_item(i);
>
>
>will only process the top level elements.
>
this will process top level elements in XmlPull:
while (parser.nextTag() == XmlPullParser.START_TAG) {
    processItem(parser); // must consume the element up to its END_TAG
}
>A crucial point is that when you get to the end of the current nesting
>level, next() returns nil; this allows one to easily write a
>recursive-descent-style parser, for instance (from the ebook reader)
>parses a <head> tag:
>
nextTag() in XmlPull allows you to do the same, as it returns only two
values, START_TAG or END_TAG, and an exception is thrown if the input
contains anything else (apart from white space)
> e_head(p: ref Parser, i: ref Item.Tag)
> {
> p.down();
> while ((t0 := nexttag(p)) != nil) {
> case t0.name {
> "title" =>
> e_title(p, t0);
> "link" =>
> e_link(p, t0);
> "style" =>
> e_style(p, t0);
> }
> }
> p.up();
> }
>
and here is how it could be done in XmlPull (for details see:
http://www.extreme.indiana.edu/~aslom/xmlpull/patterns.html#ANY_ORDER)
void e_head(XmlPullParser parser) throws XmlPullParserException, IOException
{
    parser.require(XmlPullParser.START_TAG, null, "head");
    // nextTag() returns END_TAG once the closing </head> is reached
    while (parser.nextTag() == XmlPullParser.START_TAG) {
        if ("title".equals(parser.getName())) {
            e_title(parser);
        } else if ("link".equals(parser.getName())) {
            e_link(parser);
        } else if ("style".equals(parser.getName())) {
            e_style(parser);
        } else { // ignore unknown elements
            wrapper.skipSubTree(); // wrapper: utility object around this parser
        }
    }
    parser.require(XmlPullParser.END_TAG, null, "head");
}
>Here, nexttag() is a locally defined function that returns the next
>Item that's a Tag, ignoring everything else. The various e_*
>functions deal with the various kinds of tags that can be found within
>an XHTML <head> tag.
>
in XmlPull nextTag() is more restrictive and will skip only white space
text content.
>This style of interface means that it's possible to write code that
>matches fairly closely the DTD, does not parse the whole document into
>one in-core data structure, and avoids having to write abstruse state
>machines!
>
i agree completely :-)
>If the XML is in a seekable file, you can mark a place in the file
>(which records all the state of the XML parser at that point in the
>file, and a place to seek to), and then return to it later, or even
>store the mark externally and use it as an index for rapid
>retrieval at a later date.
>
>This means that for files containing a large dataset (e.g. Ebooks!)
>you don't necessarily have to store all the dataset, even in a derived
>data structure.
>
>
if you want minimal memory overhead (and not just create a DOM and
navigate it) you can record the XML context of one position in the file
(that would include in-scope namespace declarations, the stack of start
tags, attributes etc.) and use it to move the parser back and then
restart parsing from this position, though i have not seen a parser
that can do this ...
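as a rough sketch of what such a recorded context could hold (purely hypothetical: ContextTracker and Mark are names invented here, attributes are left out for brevity, and no parser i know of accepts such an object to resume from):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the state a parser would need to save at a mark.
class ContextTracker {
    /** Immutable snapshot: where to seek to, and what was open at that point. */
    static final class Mark {
        final long offset;                    // position in the file to seek back to
        final List<String> openTags;          // stack of start tags, outermost first
        final Map<String, String> namespaces; // in-scope prefix -> URI bindings

        Mark(long offset, List<String> openTags, Map<String, String> namespaces) {
            this.offset = offset;
            this.openTags = openTags;
            this.namespaces = namespaces;
        }
    }

    private final Deque<String> openTags = new ArrayDeque<>();
    // One map per open element, so its declarations go out of scope at the end tag.
    private final Deque<Map<String, String>> nsScopes = new ArrayDeque<>();

    /** Call when the parser reports a start tag with its namespace declarations. */
    void startTag(String name, Map<String, String> declaredNamespaces) {
        openTags.addLast(name);
        nsScopes.addLast(Map.copyOf(declaredNamespaces));
    }

    /** Call when the parser reports the matching end tag. */
    void endTag() {
        openTags.removeLast();
        nsScopes.removeLast();
    }

    /** Captures the current context together with a caller-supplied file offset. */
    Mark mark(long offset) {
        Map<String, String> inScope = new HashMap<>();
        for (Map<String, String> scope : nsScopes) { // outermost first, inner overrides
            inScope.putAll(scope);
        }
        return new Mark(offset, List.copyOf(openTags), Map.copyOf(inScope));
    }
}
```

the snapshot is copied, so it stays valid after the parser moves on; a mark taken inside <head> would still list html and head even once both are closed.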
>I'm aware that my parsing of XML is probably hopelessly naive, and
>perhaps there is some facet of XML that makes this approach impossible
>for XML in general (I came up with this for a specific problem, after
>all). If so, I'd love to know why.
>
this approach means that parsing is done again and again each time you
move back in the stream (and this can only work with a stream that
supports efficient marking and going back - this works fine for files
but is not the case for network sockets ...)
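for the file case, going back is just a seek. a minimal sketch, assuming the mark is a plain byte offset (the class name, the sample document and the temp file are made up for illustration; a real parser would then restart from that offset with its saved context rather than just read bytes):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo of returning to a marked position in a seekable file.
class SeekBackDemo {
    /** Seeks to a previously recorded byte offset and reads len bytes from there. */
    static String readFromMark(Path file, long markOffset, int len) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(markOffset); // cheap on a file; a socket stream has no equivalent
            byte[] buf = new byte[len];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small document to a temp file and re-reads its second page. */
    static String demo() {
        String xml = "<book><page>1</page><page>2</page></book>";
        String second = "<page>2</page>";
        try {
            Path tmp = Files.createTempFile("mark-demo", ".xml");
            try {
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
                // ASCII-only content, so the char index equals the byte offset.
                long mark = xml.indexOf(second);
                return readFromMark(tmp, mark, second.length());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

with a socket you would have to buffer everything since the mark yourself, which is the memory cost this approach tries to avoid.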
>If not, I hope I've managed to contribute a thought or two to the
>debate...
>
it was a very interesting post, and it shows that we have similar
problems and come up with similar approaches to solve them.
thanks,
alek