On Mon, Mar 17, 2003 at 04:43:51PM +0100, Robin Berjon wrote:
>
> Parts of it may be interesting to this
> conversation, notably stuff done by Barrie Slaymaker. You may wish to
> take a look at EventPath as described and implemented here:
>
> http://search.cpan.org/author/RBS/XML-Filter-Dispatcher-0.47/lib/XML/Filter/Dispatcher.pm
> http://search.cpan.org/author/RBS/XML-Filter-Dispatcher-0.47/
First let me say that Tim's posting resonates very strongly with me;
I've spent a goodly number of man-months recently prototyping tools to
address exactly his issues, with some success and some way still to go.
I want to summarize XML::Filter::Dispatcher (adding to what Robin has
said), and then mention XML::Essex below. Essex is a prototype of a
pull-mode scripting environment like what Tim shows; it is also event
driven, so it allows while( pull ) style processing without reading the
entire document into memory (via Perl's newish ithreads). Both are
pragmatic prototypes and not efforts to specify new standards. Both
have newer versions on my hard drives than are on CPAN (I don't think
there's much uptake by other programmers beyond tire kicking; these
things are too new and too prototype-y, developed for my own needs
rather than as a public service).
X::F::D is a pattern/action streaming rules processor that uses an
XPath superset I refer to as EventPath. EventPath is essentially XPath
+ SAXish axes + aggregation functions that work like string() but
return structures (Perl hashes and nested hashes so far). It is not
like STX in that it does not specify a new query language, templating
language, and mixed-mode processing model for SAX (part streaming and
part buffering); it is explicitly an attempt to apply XPath-like
queries to SAX streams, to fire callbacks when rules match, and to
allow transforms to natural (for Perl) data structures. STX is in many
ways a more respectable and better thought out effort; it's just not a
tool I've needed.
X::F::D only implements the subset of EventPath that I've needed thus
far, but there is no theoretical limit (see also the more academically
/ research oriented Xaos project). X::F::D buffers the SAX events just
enough to get in-order delivery for expressions like 'a[c]/b', where
the c may occur before or after the b. X::F::D has a few known bugs
and needs some good optimization, but I am using it in production
tools.
X::F::D does allow you to collect incoming spans of events into Perl
data structures and then fire rules like (paraphrasing here):

    '/path/to/twig' => 'hash()' => \&sub_to_call

That gathers the attributes and content under each matching <twig>
element into a flat hash keyed on names like "@foo" and "foo/bar",
with the string() values of the content (see the X::F::D docs for more
details).
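To make that concrete, here's a rough, untested sketch of how I'd wire
such a rule up; it assumes the Rules => [ ... ] constructor argument
and the xvalue() export described in the X::F::D docs, and the file
name and paths are made up for illustration:

    use XML::SAX::ParserFactory;
    use XML::Filter::Dispatcher qw( :all );

    my @twigs;

    my $dispatcher = XML::Filter::Dispatcher->new(
        Rules => [
            ## Collect each matching <twig> as a flat hash keyed on
            ## names like "@foo" and "foo/bar", then stash it away.
            '/path/to/twig' => [ 'hash()' => sub { push @twigs, xvalue } ],
        ],
    );

    XML::SAX::ParserFactory->parser( Handler => $dispatcher )
        ->parse_uri( "twigs.xml" );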
Experience shows that, for me at least (and perhaps only me :), X::F::D
is a decent way to approach streaming XML processing when you want to
fire callbacks at selected points, which can be as granular as
individual SAX events or as coarse as a more refined selection of
events. By allowing arbitrary XML trees to be converted into flat
(non-nested) or hierarchical (nested) Perl data structures, you can mix
event based callbacks with automated tree building. X::F::D does not
use any sort of DOM, but it is extensible: you could easily use your
own tree builders or other summary functions (like string(), etc.) with
it.
It is especially effective when building OO hierarchies or validating /
preprocessing on the fly, because the existing Perl tools don't let you
insert callbacks at useful parsing points to munge values or to
construct and manipulate object hierarchies.
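For instance, a rule can normalize a value and build an object per
matching element as the document streams by; this is a made-up sketch
using the same Rules shape as above, and My::Message and the paths are
hypothetical:

    my @messages;

    my $d = XML::Filter::Dispatcher->new(
        Rules => [
            '/messages/message' => [ 'hash()' => sub {
                my $h = xvalue;
                ## Validate / munge on the fly...
                die "message missing \@name\n"
                    unless defined $h->{'@name'} && length $h->{'@name'};
                ## ...then build up the object hierarchy.
                push @messages, My::Message->new( %$h );
            } ],
        ],
    );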
XML::Essex, on the other hand, is a step up the spectrum from X::F::D
(it uses X::F::D). It's an attempt to build a toolkit that lets you
process XML files in much the same way that you would process a text
file, modulo the fact that text files are treated as flat sequences of
records in Perl while XML is hierarchical.
http://search.cpan.org/author/RBS/XML-Filter-Essex/lib/XML/Essex.pm
XML::Essex uses threads to provide a mixed pull style and rules based
processing environment. It also provides a SAX-oriented DOM that lets
you cope with single events either by firing callbacks or by pulling
them with get( $event_path_expr ), which skips to the next matching
event. You can get DOM twigs using get_element() to fetch the next
element as an XML::SAX::Element instance.
By doing things like get( "start::*" ) in a while loop, you can skip
through the input stream until you find events of interest and then
decide whether / how to process them or to skip on. Using rules based
processing (see below), you can intercept them as they occur. You can
mix the metaphors so that you can use rules where convenient and then
have a main loop that sweeps up after them, etc.
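For example, a skip-through loop might look roughly like this
(untested sketch; the <record> element, the name() check, and the
assumption that the loop ends when the document runs out of events are
all mine):

    use XML::Essex;

    while ( 1 ) {
        my $start = get "start_element::*";    # skip to the next start tag
        next unless $start->name eq "record";  # only care about <record>s
        my $elt = get_element;                 # pull the whole element in
        ## ... decide how to process $elt, or just loop on ...
    }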
The primary interface is a scripting one (see below), but if SAX works
better than a scripting approach, there are XML::Handler::Essex,
XML::Generator::Essex, and XML::Filter::Essex to allow you to use
input/output/both subsets of XML::Essex functionality in SAX contexts.
XML::Essex also provides a SAX DOM plus a SAX based output method
similar to print()ing XML but with proper semantics based on what data
types you put(). This API is designed to allow "rollup" style emission,
where you build a tree to emit inline as a data structure or in Perl
variables, and to support incremental emission, where you emit chunks
of the document as you go. The former is useful for documents and
pieces of documents you can build in batches; the latter is useful for
emitting things in loops.
This all allows you to get as close to a SAX streaming approach as you
need to (or at least as close as *I've* needed to ;), or go to the sort
of event loop that Tim shows. Here's a trivial event pull loop:
    use XML::Essex;

    my $count = 0;
    while (1) {
        get "start_element::*";
        ++$count;
    }
    print $count;
I haven't yet needed the pull style event loop in the real world, so I
don't have any detailed examples to show; hopefully the get( $expr ) and
get_element() ideas are clear enough for now. I do use the rules based
stuff and the output stuff a lot.
Here's some real code that reads an XML document, using rules to
convert it into a Perl data structure rooted in $example, and then
starts to put() out a result document:
    my $example;

    ## Parsing to perl data structures works well with rules,
    ## this happens to read in the whole document because I
    ## know it's small a priori.
    on
        '/*'   => [ 'struct()' => sub { $example = xvalue } ],
        '/*/*' => [ 'struct()' => sub {
            xvalue->{type} = $_[1]->name;
            push @{$example->{chunks}}, xvalue;
        } ];

    parse_doc;

    ## Construct some events to send. The output may be huge so
    ## we emit the header, then the (possibly large) main content
    ## then the footer.
    ##
    ## Emit the header
    ##
    my $pageml_start_elt = start_elt "pageml";

    my $description = $example->{'@description'};
    my $format      = $example->{'@format'};

    $description = "N/A" unless defined $description && length $description;

    put(
        $pageml_start_elt,
        [ title =>
            "Example Message: ",
            [ b => $example->{'@name'} ],
            " in ",
            $example->{'@format'},
            " Format"
        ],
        [ nav =>
            [ a => { href => "index" },          "Table of Contents" ],
            [ a => { href => "messages_index" }, "Example Messages" ],
        ],
        [ div =>
            [ h2 => "Message Description" ],
            [ p  => $description ],
        ],
    );

    my $start_elt = start_elt "div" => {
        class          => "example",
        message_name   => $example->{'@name'},
        message_format => $example->{'@format'},
    };

    put $start_elt, "\n";

    ## ...emit lots and lots of data elements here...

    ##
    ## Emit the footer
    ##
    put end_elt $start_elt;
    put end_elt $pageml_start_elt;