[
Lists Home |
Date Index |
Thread Index
]
Hi,
I'm afraid I'm new to this list, so am probably breaking 300 list
taboos...
I sent the below to Tim Bray in response to his article (the one
presumably spawned this thread); he suggested I post it to this list.
It's just a simple and small solution to an XML parsing problem I had,
which was satisfactory at the time, seems to me to be generally
applicable, and I haven't seen anything similar. Sorry in advance
about the length.
First thing: this is no panacea, and to be quite honest, I'm actually
(heretic!) not at all keen on XML, but this API at least made things
bearable; the structure of the code dealing with the XML was quite
logical, and the space used in doing so was largely bounded.
I developed it when I was writing a browser for the Open Ebook
standard to run on limited memory platforms. Obviously DOM was out of
the question, and SAX became really awkward as it would have been
necessary to traverse the whole XML tree from the start when moving
back a page; moreover I found it difficult to write code that
corresponded directly to the DTDs.
The idea is very simple: treat the XML as a multi-level stream, and
provide an interface that allows one to *mark* a place in the stream,
and *go to* a previously marked place.
The basic API looked something like:
open: fn(f: string): ref Parser;
Parser: adt {
next: fn(p: self ref Parser): ref Item;
up: fn(p: self ref Parser);
down: fn(p: self ref Parser);
mark: fn(p: self ref Parser): ref Mark;
goto: fn(p: self ref Parser, m: ref Mark);
atmark: fn(p: self ref Parser, m: ref Mark): int;
};
Item: adt {
pick {
Tag =>
name: string;
attrs: Attributes;
Text =>
ch: string;
Process =>
target: string;
data: string;
Doctype =>
name: string;
public: int;
params: list of string;
Stylesheet =>
attrs: Attributes;
Error =>
loc: Locator;
msg: string;
}
};
Attributes: adt {
all: fn(a: self Attributes): list of Attribute;
get: fn(a: self Attributes, name: string): string;
};
Attribute: adt {
name: string;
value: string;
};
The unconventional language is Limbo (see
http://www.vitanuova.com/inferno/papers/limbo.html), but the concepts
are familiar: an "adt" is basically a struct with function members,
and a "pick" is a type-safe discriminated union; the type declarations
are Pascal-like (i.e. type after identifier name).
Open()ing an XML file produces a parser p; then p.next() produces the
next XML item in the file *at the same nesting level*.
Therefore,
p := xml->open("foo.xml");
while ((i := p.next()) != nil)
process_item(i);
will only process the top level elements.
e.g. given the following document:
<item>
<title>MetaData</title>
<date>2003-01-12T00:18:05-05:00</date>
<link>http://bitworking.org/news/8</link>
<description>Upon waking, the dinosaur...</description>
</item>
you'll get one item of type Tag, with name "item".
A crucial point is that when you get to the end of the current nesting
level, next() returns nil; this allows one to easily write a
recursive-descent-style parser, for instance (from the ebook reader)
parses a <head> tag:
e_head(p: ref Parser, i: ref Item.Tag)
{
p.down();
while ((t0 := nexttag(p)) != nil) {
case t0.name {
"title" =>
e_title(p, t0);
"link" =>
e_link(p, t0);
"style" =>
e_style(p, t0);
}
}
p.up();
}
Here, nexttag() is a locally defined function that returns the next
Item that's a Tag, ignoring everything else. The various e_*
functions deal with the various kinds of tags that can be found within
an XHTML <head> tag.
This style of interface means that it's possible to write code that
matches fairly closely the DTD, does not parse the whole document into
one in-core data sstructure, and avoids having to write abstruse state
machines!
If the XML is in a seekable file, you can mark a place in the file
(which records all the state of the XML parser at that point in the
file, and a place to seek to), and then return to it later, or even
store the mark externally and use it as an index for rapid
retrieval at a later date.
This means that for files containing a large dataset (e.g. Ebooks!)
you don't necessarily have to store all the dataset, even in a derived
data structure.
I'm aware that my parsing of XML is probably hopelessly naive, and
perhaps there is some facet of XML that makes this approach impossible
for XML in general (I came up with this for a specific problem, after
all). If so, I'd love to know why.
If not, I hope I've managed to contribute a thought or two to the
debate...
cheers,
rog.
|