[
Lists Home |
Date Index |
Thread Index
]
I've designed a hybrid API for XML parsing. Oh no, not another one.
Here's the two bullet summary:
o Document trees are parsed lazily; it looks like a tree model
o _Consuming iterators_ operate on the children of a node, removing
each child from its parent before processing
In this model, you can flip back and forth between random access to a
subtree for convenience, and consuming iterators for all the
performance advantages of event pull---with minimal changes to code.
An initial implementation of this and some notes are at
http://lua-users.org/wiki/LazyKit . The closest thing to this model
that I've found is XPP2's XmlPullNode; I'm looking for other
references.
This was inspired by XML::Twig and pulldom. In particular, pulldom's
terse documentation includes this quote:
As you can see, the events stream object is pretty
sophisticated. Still, it has limits. You can only expand the
current node because events and nodes relating to any other node
are probably lost in the mists of time. An XML document into a
random access data structure!
"Lost in the mists of time" sounded a little like how garbage
collectors work: they free memory when they can prove that the memory
can't affect the future of the program. That thought led to the
obvious Lua code:
for i,subtree in xpairs(parent) do
parent[i] = nil
do_something_with(subtree)
end
The xpairs_c() iterator is effectively a shortcut for that:
for i,subtree in xpairs_c(parent) do
do_something_with(subtree)
end
Note that up until you begin consuming a tree, you can use all the
conventional iterators and random access you want. But once you start
consuming, you've promised that you no longer care about addressing
its (direct) contents.
From an event pull perspective, then, the tree view is an
unlimited-length lookahead buffer of events associated with an
element.
One concrete example is XML-RPC <value> nodes. If they contain an
element such as <integer>, that's the value of the <value>
node. Otherwise, the character data content of the <value> element is
a string-typed value. The code to process the element must potentially
look ahead an arbitrary number of initial character data nodes to find
out whether there is an element lurking inside, or it must default to
string content.
Even though you're consuming a tree, you can still save references to
subtrees:
for i,subtree in xpairs_c(parent) do
if subtree.name == "xref" then
table.insert(references, subtree)
end
end
(pedantic: subtree is also lazily parsed, so at the time it's added to
the references table, it only physically contains a start node. But
as soon as the next element in parent is touched, it will be
completed.)
In the above example, non-<xref> nodes will still be fully loaded into
memory, as the iterator can't tell that you haven't saved a reference
to them. The current remedy is to manually call consume() on nodes
you don't want to keep, and filtering iterators will do that for
skipped nodes. But in a different implementation environment,
examination of refcounts inside the iterator implementation could
determine that the non-<xref> nodes were unreachable, and hence
skippable.
Although I've not yet implemented it, consuming can interact with the
underlying parser. Consider:
<document>
<firstname>Jay</firstname>
<lastname>Carlson</lastname>
<bodytext>Spending too much time listening to <ref>In Utero</ref>
can be [...]
</bodytext>
<title>I Think I'm DOM</title>
</document>
lastname, title = xmlextract.strings_consume(tree, "lastname", "title")
The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.
Finally, an example. This converts a tiny subset of XHTML to the Lua
Wiki format. A few things to note:
o This uses a consuming iterator (switch_c) but would be just as
correct using a non-consuming iterator (switch).
o Many handlers (like body/h1) use xstring(), which is a conventional
access tool for trees. It knows nothing about consuming.
o The only memory used is for the current tree being processed and
its parents. Feel free to run it on a 4G file!
(...well, as long as those <h1> nodes are small...xstring() doesn't
consume, so their contents will be fully loaded. The price of
convenience and reuse...)
Jay Carlson
nop@nop.com
--- snip ---
local root=lazytree.parsefile(filename)
local function printi(s)
-- no newline
io.stdout:write(s)
end
local ftable = {
head=function () end;
body={
h1=function (h1)
print("=== "..xstring(h1).." ===")
end;
h2=function (h2)
print("=== "..xstring(h2).." ===")
end;
h3=function (h3)
print("== "..xstring(h3).." ==")
end;
pre=function (pre)
print(" {{{"..xstring(pre).." }}}")
end;
p={
[""]=function (s)
s = string.gsub(s, "\n[ ]+", "\n")
s = string.gsub(s, "[ ]+", " ")
printi(s)
end;
code=function (code)
local s = xstring(code)
s = string.gsub(s, "\n", " ")
s = string.gsub(s, " [ ]+", " ")
printi("{{"..s.."}}")
end;
dfn=function (dfn)
printi("''"..xstring(dfn).."''")
end;
[-1]=function () print() print() end;
}
}
}
xmliter.switch_c(root, ftable, {no_tags=true})
|