Interesting. We seem to be rediscovering co-routines, plus a lot of other
machinery from Jackson structured programming. It's a powerful solution to
the push-pull dilemma, but it does need support at the programming language
level (because the process has multiple stacks). I tried to do something
similar in a very early version of Saxon, but it relied on Java threads and
became very unwieldy.
Of course if you move to a higher level of programming (say XSLT or XQuery)
then the push-pull decisions, and the mechanisms used to handle push-pull
conflicts, get hidden under the covers and programmers don't need to worry
about them.
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Uche Ogbuji [mailto:uche.ogbuji@fourthought.com]
> Sent: 23 December 2004 08:45
> To: John Cowan
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0
>
> On Thu, 2004-12-23 at 00:53 -0500, John Cowan wrote:
> > Uche Ogbuji scripsit:
> >
> > > Tenorsax (amara.saxtools.tenorsax) is a framework for "linearizing"
> > > SAX logic so that it flows more naturally, and needs a lot less state
> > > machine wizardry.
> >
> > This sounds *very* interesting.  Is there a more detailed writeup
> > somewhere?
>
> Heh.  I should have known.  My focus in documentation was the Bindery
> (data binding) stuff (which I think is very well documented) because I
> figured the initial audience for Amara would be the typical Python
> programmer who grimaces any time he has to deal with anything that smells
> too XMLish (SAX and DOM are contemptible Java-isms to many Pythoneers, and
> don't even get them started on that bloated XSLT thingy).
>
> Anyway, in focusing on documenting the ultra-Python-friendly Bindery I
> did end up neglecting the other parts a bit.  I plan to catch up, and in
> fact, I plan to treat Tenorsax as a main topic in my upcoming O'Reilly
> article [1], which will cover Amara.
>
> Just to give an idea of the technique, however, I'll post a few methods
> of a sample Tenorsax handler.
>
> First a trivial case, just to set the scene:
>
> def handle_meta(self, end_condition):
>     name = self.params.get((None, 'name'))
>     content = self.params.get((None, 'content'))
>     print "Meta name:", name, " content:"
>     print content
>     yield None
>     raise StopIteration
>
> This method handles XHTML meta tags: it cares only about the attributes
> and ignores any content.
>
> end_condition is Tenorsax plumbing.  More on it in a bit.  The first 4
> function body lines just grab attribute values and print them to the
> console.  self.params within a Tenorsax handler always holds the
> parameters of the current SAX event (here, the attributes).  Of course,
> the key to Tenorsax linearization is that you actually see multiple SAX
> events within a single method call [2].  Even in this simple handler you
> see 2 events.  The start meta tag comes, then the "yield None" hands
> control back to Tenorsax, and then upon the end meta tag, the code
> immediately after that line resumes, with all the local state intact.
> This means that a lot of variables you would usually have had to manage
> across methods in plain old SAX become local variables in Tenorsax.  The
> "raise StopIteration" basically signals back to the framework "we're
> done here".
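
To make the resume-with-locals-intact behaviour concrete, here is a
minimal, self-contained sketch in current Python 3.  It is not Tenorsax
code: the event tuples, the MetaHandler class and the drive() function are
all hypothetical stand-ins for the framework.  Note also that since Python
3.7 (PEP 479) a generator should simply return, or fall off the end,
rather than "raise StopIteration".

```python
# Hypothetical event model for illustration only; not Tenorsax's real one.
START, END = "start", "end"


class MetaHandler:
    """Toy handler: a single generator sees both the start and end event."""

    def __init__(self):
        self.event = None  # current event, set by the driver before each step
        self.log = []

    def handle_meta(self):
        # First step runs on the start event; name and attrs are plain locals.
        _, name, attrs = self.event
        yield  # hand control back to the driver
        # Resumed on the next event (the end tag) with the locals intact.
        self.log.append((name, attrs.get("name"), attrs.get("content")))
        # Falling off the end finishes the generator (modern spelling of
        # the post's "raise StopIteration").


def drive(handler, events):
    """Feed events one at a time, advancing the generator once per event."""
    gen = None
    for event in events:
        handler.event = event
        if gen is None:
            gen = handler.handle_meta()
        try:
            next(gen)
        except StopIteration:
            gen = None  # handler is done; a later event would start afresh


handler = MetaHandler()
drive(handler, [
    (START, "meta", {"name": "author", "content": "Uche"}),
    (END, "meta", {}),
])
print(handler.log)
```

The point of the sketch is only that name and attrs never need to be
stored on the handler object between the two events.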
>
> On to a more interesting handler:
>
> def handle_p(self, end_condition):
>     yield None
>     content = u''
>     while not self.event == end_condition:
>         if self.event[0] == saxtools.CHARACTER_DATA:
>             content += self.params
>         yield None
>     #Element closed.  Wrap up
>     print "Document content para:", content
>     raise StopIteration
>
> This time it's a p element, and it has content, so we get to see
> multiple interesting events in one handler.
>
> The start tag isn't interesting, so we immediately pass control back to
> Tenorsax ("yield None").  Then content is a local variable that will
> aggregate the text content of the p, which could come in multiple text
> events.  end_condition now comes into play: it's Tenorsax's way of
> letting each handler method know what event signals the end of its scope
> (e.g. the event for the close p tag in this case) [3].  Each child text
> event results in another iteration of the loop, and once the end tag is
> seen, we print the accumulated content.
>
> Finally, to show more of how handlers are invoked, here's the html:html
> handler:
>
> def handle_html(self, end_condition):
>     dispatcher = {
>         (pulldom.START_ELEMENT, XHTML_NS, u'head'): self.handle_head,
>         (pulldom.START_ELEMENT, XHTML_NS, u'body'): self.handle_body,
>     }
>     #Initial call corresponds to the start html element
>     curr_gen = None
>     yield None
>     while not self.event == end_condition:
>         curr_gen = tenorsax.standard_body(dispatcher, curr_gen, self.event)
>         yield None
>     #Element closed.  Wrap up
>     raise StopIteration
>
> dispatcher is a Python dictionary which maps events to handlers.  In
> this case, head start tags get delegated to the self.handle_head method
> and body start tags to the self.handle_body method.  The curr_gen stuff
> is an unfortunate bit of boilerplate I have not yet been able to refine
> away (working on it).  Every now and then I wish Python had macros.
> They would help a lot here.  tenorsax.standard_body automatically checks
> the current event to see if there's a match for delegating to one of the
> methods indicated in dispatcher.
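
As a rough illustration of the delegation idea, the sketch below maps
start events to child generator methods and hands control to them.  It
does not show how tenorsax.standard_body works internally.  It assumes a
hypothetical (kind, name) event tuple and uses "yield from", which only
arrived in Python 3.3 (PEP 380), well after this post; all class and
function names here are invented for the example.

```python
# Hypothetical (kind, name-or-text) events; not the pulldom/Tenorsax tuples.
START, END, TEXT = "start", "end", "text"


class Handler:
    def __init__(self):
        self.event = None  # current event, set by run() before each step
        self.seen = []

    def handle_head(self, end_condition):
        yield  # consume the head start event
        while self.event != end_condition:
            yield  # ignore everything inside head
        self.seen.append("head closed")

    def handle_body(self, end_condition):
        yield  # consume the body start event
        while self.event != end_condition:
            if self.event[0] == TEXT:
                self.seen.append("body text: " + self.event[1])
            yield
        self.seen.append("body closed")

    def handle_html(self, end_condition):
        dispatcher = {
            (START, "head"): self.handle_head,
            (START, "body"): self.handle_body,
        }
        yield  # consume the html start event
        while self.event != end_condition:
            child = dispatcher.get(self.event)
            if child is not None:
                # Delegate every event up to and including the child's end tag.
                yield from child((END, self.event[1]))
            yield
        self.seen.append("html closed")


def run(handler, events):
    """Advance the top-level generator once per incoming event."""
    gen = handler.handle_html((END, "html"))
    for event in events:
        handler.event = event
        try:
            next(gen)
        except StopIteration:
            break


handler = Handler()
run(handler, [
    (START, "html"),
    (START, "head"), (END, "head"),
    (START, "body"), (TEXT, "hi"), (END, "body"),
    (END, "html"),
])
print(handler.seen)
```

With "yield from" the parent does not have to carry a current-generator
variable between events, which is the role curr_gen plays in the handler
quoted above.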
>
> I'd like to tidy things up a tad bit more, but as it is, I have found
> Tenorsax to be a huge help in writing SAX programs quickly. The
> Scimitar code that translates Schematron to Python code is implemented
> in only about 400 lines of Python code (excluding comments, spacing,
> etc.), and this includes all the Python skeleton code for emitted
> validator scripts. I tried implementing it in plain SAX at first. It
> was running to 2-3 times the code length and my brain was on the verge
> of explosion from the state machine logic.
>
> Anyway, thanks for asking, and thus helping me seed the documentation.
> More on Tenorsax to come, for sure, because I do think many will find it
> very useful.
>
> [1] http://www.xml.com/pub/au/84
>
> [2] For those who care about the nuts and bolts, the trick here is
> basically a semi-co-routine arrangement between the Tenorsax framework
> and each handler method in turn.  This is made possible by Python
> generators.  Full co-routines are not really in the cards with Python at
> present, but I'm not convinced they'd make more than a cosmetic
> difference.
>
> [3] This is a simplified case that doesn't handle nested p tags.
> Supporting nesting is a pretty simple matter.
>
> --
> Uche Ogbuji Fourthought, Inc.
> http://uche.ogbuji.net http://4Suite.org http://fourthought.com
> Use CSS to display XML -
> http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
> Full XML Indexes with Gnosis -
> http://www.xml.com/pub/a/2004/12/08/py-xml.html
> Be humble, not imperial (in design) -
> http://www.adtmag.com/article.asp?id=10286
> UBL 1.0 -
> http://www-106.ibm.com/developerworks/xml/library/x-think28.html
> Use Universal Feed Parser to tame RSS -
> http://www.ibm.com/developerworks/xml/library/x-tipufp.html
> Default and error handling in XSLT lookup tables -
> http://www.ibm.com/developerworks/xml/library/x-tiplook.html
> A survey of XML standards -
> http://www-106.ibm.com/developerworks/xml/library/x-stand4/
> The State of Python-XML in 2004 -
> http://www.xml.com/pub/a/2004/10/13/py-xml.html