   Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0


On Thu, 2004-12-23 at 00:53 -0500, John Cowan wrote:
> Uche Ogbuji scripsit:
> > Tenorsax (amara.saxtools.tenorsax) is a framework for "linearizing"
> > SAX logic so that it flows more naturally, and needs a lot less state
> > machine wizardry.
> This sounds *very* interesting.  Is there a more detailed writeup somewhere?

Heh.  I should have known.  My focus in documentation was the Bindery
(data binding) stuff (which I think is very well documented) because I
figured the initial audience for Amara would be the typical Python
programmer who grimaces any time he has to deal with anything that
smells too XMLish (SAX and DOM are contemptible Java-isms to many
Pythoneers, and don't even get them started on that bloated XSLT thingy).

Anyway, in focusing on documenting the ultra-Python-friendly Bindery I
did end up neglecting the other parts a bit.  I plan to catch up, and in
fact, I plan to treat Tenorsax as a main topic in my upcoming O'Reilly
article [1], which will cover Amara.

Just to give an idea of the technique, however, I'll post a few methods
of a sample Tenorsax handler.

First a trivial case, just to set the scene:

    def handle_meta(self, end_condition):
        name = self.params.get((None, 'name'))
        content = self.params.get((None, 'content'))
        print "Meta name:", name, " content:"
        print content
        yield None
        raise StopIteration

This method handles XHTML meta tags: it looks only at the attributes
and ignores content.

end_condition is Tenorsax plumbing.  More on it in a bit.  The first 4
function body lines just grab attribute values and print them to the
console.  self.params within a Tenorsax handler always holds the
parameters of the current SAX event.  Of course, the key to Tenorsax
linearization is that you actually see multiple SAX events within a
single method call [2].  Even in this simple handler you see 2 events.
The start meta tag comes, then the "yield None" hands control back to
Tenorsax, and upon the end meta tag the code immediately after that
line resumes, with all the local state intact.  This means that a lot
of variables you would usually have had to manage across methods in
plain old SAX become local variables in Tenorsax.  The "raise
StopIteration" basically signals back to the framework "we're done
here".
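To make the suspend-and-resume idea concrete, here is a toy sketch.  It
is not Tenorsax's actual code: the drive function, the MetaHandler
class and the START/END markers are all made up for illustration, and
it uses Python 3 syntax, where simply falling off the end of a
generator plays the role of "raise StopIteration".

```python
# Toy illustration only; none of these names come from Tenorsax.
START, END = "start", "end"   # hypothetical event type markers


class MetaHandler:
    def __init__(self):
        self.event = None     # current event, set by the driver
        self.params = None    # payload of the current event
        self.seen = []

    def handle_meta(self, end_condition):
        # First event: the start tag, with its attributes in self.params
        name = self.params.get("name")
        content = self.params.get("content")
        yield None            # hand control back to the driver
        # Resumed on the next event; name and content are still in scope
        assert self.event == end_condition
        self.seen.append((name, content))
        # Falling off the end finishes the generator


def drive(handler, events):
    """Feed events to one generator-based handler, one event per resume."""
    gen = None
    for event, params in events:
        handler.event, handler.params = event, params
        if gen is None:
            gen = handler.handle_meta((END, "meta"))
        try:
            next(gen)
        except StopIteration:
            gen = None        # the handler said "we're done here"


handler = MetaHandler()
drive(handler, [
    ((START, "meta"), {"name": "author", "content": "Uche"}),
    ((END, "meta"), None),
])
```

The point is that name and content are ordinary locals that survive
between the two events, with no instance attributes needed to carry
them across.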

On to a more interesting handler:

    def handle_p(self, end_condition):
        yield None
        content = u''
        while not self.event == end_condition:
            if self.event[0] == saxtools.CHARACTER_DATA:
                content += self.params
            yield None
        #Element closed.  Wrap up
        print "Document content para:", content
        raise StopIteration

This time it's a p element, and it has content, so we get to see
multiple interesting events in one handler.

The start tag isn't interesting, so we immediately pass control back to
Tenorsax ("yield None").  Then content is a local variable that will
aggregate the text content of the p, which could come in multiple text
events.  end_condition now comes into play: it's Tenorsax's way of
letting each handler method know what event signals the end of its scope
(e.g. the event for close p tag in this case) [3].  Each child text
event results in another iteration of the loop, and once the end tag is
seen, we print the accumulated content.
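Here is a small self-contained sketch of the same accumulate-until-end
pattern, runnable without Amara.  The State class, the run driver and
the event constants are stand-ins I made up for illustration (Python 3
syntax); only the shape of handle_p follows the handler above.

```python
# Toy illustration only; the driver and constants are hypothetical.
CHARACTER_DATA, START_ELEMENT, END_ELEMENT = "chars", "start", "end"


class State:
    event = None    # current event tuple
    params = None   # payload of the current event


def handle_p(state, end_condition, out):
    yield None                       # the start tag itself is not interesting
    content = ""
    while state.event != end_condition:
        if state.event[0] == CHARACTER_DATA:
            content += state.params  # text may arrive in several chunks
        yield None
    out.append(content)              # end tag seen: wrap up


def run(events):
    state, out = State(), []
    gen = None
    for event, params in events:
        state.event, state.params = event, params
        if gen is None and event == (START_ELEMENT, "p"):
            gen = handle_p(state, (END_ELEMENT, "p"), out)
        if gen is not None:
            try:
                next(gen)
            except StopIteration:
                gen = None
    return out


paragraphs = run([
    ((START_ELEMENT, "p"), None),
    ((CHARACTER_DATA,), "Hello, "),
    ((CHARACTER_DATA,), "world"),
    ((END_ELEMENT, "p"), None),
])
```

Each text chunk costs one trip around the loop, and the running total
lives in a plain local variable.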

Finally, to show more of how handlers are invoked, here's the html:html
handler:

    def handle_html(self, end_condition):
        dispatcher = {
            (pulldom.START_ELEMENT, XHTML_NS, u'head'): self.handle_head,
            (pulldom.START_ELEMENT, XHTML_NS, u'body'): self.handle_body,
            }
        #Initial call corresponds to the start html element
        curr_gen = None
        yield None
        while not self.event == end_condition:
            curr_gen = tenorsax.standard_body(dispatcher, curr_gen,
            yield None
        #Element closed.  Wrap up
        raise StopIteration

dispatcher is a Python dictionary which maps events to handlers.  In
this case, head start tags get delegated to the self.handle_head method
and body start tags to the self.handle_body method.  The curr_gen stuff
is an unfortunate bit of boilerplate I have not yet been able to refine
away (working on it).  Every now and then I wish Python had macros.
They would help a lot here.  tenorsax.standard_body automatically checks
the current event to see if there's a match for delegating to one of the
methods indicated in dispatcher.
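For a rough idea of what this kind of delegation involves, here is a
toy sketch.  dispatch_step is a hypothetical stand-in I wrote for
illustration; it is a guess at the general technique and not the code
or signature of tenorsax.standard_body.  The Doc class, the run driver
and the event tuples are likewise invented (Python 3 syntax).

```python
# Toy illustration only; not Tenorsax's implementation.
START, END = "start", "end"


def dispatch_step(dispatcher, curr_gen, event):
    """Start a delegated handler on a matching event, else advance the active one."""
    if curr_gen is None and event in dispatcher:
        # The delegated handler's scope ends at the matching end tag
        curr_gen = dispatcher[event]((END,) + event[1:])
    if curr_gen is not None:
        try:
            next(curr_gen)
        except StopIteration:
            curr_gen = None          # delegated handler finished
    return curr_gen


class Doc:
    def __init__(self):
        self.event = None
        self.log = []

    def _section(self, label, end_condition):
        self.log.append(label + " open")
        yield None
        while self.event != end_condition:
            yield None
        self.log.append(label + " closed")

    def handle_head(self, end_condition):
        return self._section("head", end_condition)

    def handle_body(self, end_condition):
        return self._section("body", end_condition)

    def handle_html(self, end_condition):
        dispatcher = {
            (START, "head"): self.handle_head,
            (START, "body"): self.handle_body,
        }
        curr_gen = None
        yield None                   # start html element
        while self.event != end_condition:
            curr_gen = dispatch_step(dispatcher, curr_gen, self.event)
            yield None


def run(doc, events):
    top = None
    for event in events:
        doc.event = event
        if top is None:
            top = doc.handle_html((END, "html"))
        try:
            next(top)
        except StopIteration:
            top = None


doc = Doc()
run(doc, [(START, "html"), (START, "head"), (END, "head"),
          (START, "body"), (END, "body"), (END, "html")])
```

The curr_gen variable is what lets the outer handler remember which
delegated generator is in flight between events.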

I'd like to tidy things up a tad bit more, but as it is, I have found
Tenorsax to be a huge help in writing SAX programs quickly.  The
Scimitar code that translates Schematron to Python code is implemented
in only about 400 lines of Python code (excluding comments, spacing,
etc.), and this includes all the Python skeleton code for emitted
validator scripts.  I tried implementing it in plain SAX at first.  It
was running to 2-3 times the code length and my brain was on the verge
of explosion from the state machine logic.
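For contrast, here is roughly what collecting paragraph text looks like
in plain SAX, using only the standard library's xml.sax (Python 3
syntax).  The ParaHandler class and the sample document are made up for
illustration; this is not code from Scimitar.

```python
import xml.sax


class ParaHandler(xml.sax.ContentHandler):
    """Collect the text of each p element using explicit state flags."""

    def __init__(self):
        super().__init__()
        self.in_p = False        # state that must survive between callbacks
        self.buffer = []
        self.paragraphs = []

    def startElement(self, name, attrs):
        if name == "p":
            self.in_p = True
            self.buffer = []

    def characters(self, content):
        if self.in_p:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "p":
            self.paragraphs.append("".join(self.buffer))
            self.in_p = False


sax_handler = ParaHandler()
xml.sax.parseString(b"<doc><p>Hello, world</p><p>Bye</p></doc>", sax_handler)
```

Even for this one element the logic is spread over three methods and
two instance attributes; every additional element type adds more flags
of the same kind.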

Anyway, thanks for asking, and thus helping me seed the documentation.
More on Tenorsax to come, for sure, because I do think many will find it
very useful.

[1] http://www.xml.com/pub/au/84

[2] For those who care about the nuts and bolts, the trick here is
basically a semi-co-routine arrangement between the Tenorsax framework
and each handler method in turn.  This is made possible by Python
generators.  Full co-routines are not really in the cards with Python at
present, but I'm not convinced they'd make more than a cosmetic
difference.

[3] This is a simplified case that doesn't handle nested p tags.
Supporting nesting is a pretty simple matter.
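One possible way to cope with nesting is a depth counter kept as a
local variable.  This is only a sketch of one approach under my own
assumptions (Python 3 syntax; the State class, the run_nested driver
and the event constants are invented for illustration), and other
designs, such as delegating nested elements through a dispatcher,
would work too.

```python
# Toy illustration only; names and event shapes are hypothetical.
CHARACTER_DATA, START_ELEMENT, END_ELEMENT = "chars", "start", "end"


class State:
    event = None
    params = None


def handle_p(state, out):
    depth = 1                        # we are inside the outermost p
    content = ""
    yield None                       # start tag consumed
    while True:
        if state.event == (START_ELEMENT, "p"):
            depth += 1               # nested p opened
        elif state.event == (END_ELEMENT, "p"):
            depth -= 1
            if depth == 0:
                break                # outermost p closed
        elif state.event[0] == CHARACTER_DATA:
            content += state.params
        yield None
    out.append(content)


def run_nested(events):
    state, out = State(), []
    gen = None
    for event, params in events:
        state.event, state.params = event, params
        if gen is None and event == (START_ELEMENT, "p"):
            gen = handle_p(state, out)
        if gen is not None:
            try:
                next(gen)
            except StopIteration:
                gen = None
    return out


nested = run_nested([
    ((START_ELEMENT, "p"), None),
    ((CHARACTER_DATA,), "a"),
    ((START_ELEMENT, "p"), None),
    ((CHARACTER_DATA,), "b"),
    ((END_ELEMENT, "p"), None),
    ((CHARACTER_DATA,), "c"),
    ((END_ELEMENT, "p"), None),
])
```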

Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Use CSS to display XML - http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
Full XML Indexes with Gnosis - http://www.xml.com/pub/a/2004/12/08/py-xml.html
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
UBL 1.0 - http://www-106.ibm.com/developerworks/xml/library/x-think28.html
Use Universal Feed Parser to tame RSS - http://www.ibm.com/developerworks/xml/library/x-tipufp.html
Default and error handling in XSLT lookup tables - http://www.ibm.com/developerworks/xml/library/x-tiplook.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/
The State of Python-XML in 2004 - http://www.xml.com/pub/a/2004/10/13/py-xml.html

