xml-dev - More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0)


To: John Cowan <jcowan@reutershealth.com>
Subject: More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0)
From: Uche Ogbuji <uche.ogbuji@fourthought.com>
Date: Thu, 23 Dec 2004 02:05:54 -0700
Cc: xml-dev@lists.xml.org
In-reply-to: <1103791522.4600.68.camel@borgia>
Organization: Fourthought, Inc.
References: <1103757598.10272.67.camel@borgia> <20041223055359.GK25900@skunk.reutershealth.com> <1103791522.4600.68.camel@borgia>

On Thu, 2004-12-23 at 01:45 -0700, Uche Ogbuji wrote:
> On Thu, 2004-12-23 at 00:53 -0500, John Cowan wrote:
> > Uche Ogbuji scripsit:
> > 
> > > Tenorsax (amara.saxtools.tenorsax) is a framework for "linearizing"
> > > SAX logic so that it flows more naturally, and needs a lot less state
> > > machine wizardry.
> > 
> > This sounds *very* interesting.  Is there a more detailed writeup somewhere?

While on the topic of SAX-taming features in Amara, there is also
amara.saxtools.xpattern_sax_state_machine, which I didn't even bother
mentioning in the announcement (too much to cram in).

This module takes an XPattern (e.g. "/xbel/folder/bookmark") and
generates a state machine which can be plugged into any regular SAX
handler.  In this way, you can automatically watch for the XPatterns
that mark the interesting parts of the document for your code to
process, and ignore the rest.  This is sort of the opposite of Tenorsax:
embrace the state machine, but automate it, rather than sweeping it
into a fancy framework.
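
To give a rough feel for what "automating the state machine" buys you,
here is a hand-rolled sketch using only the standard library's xml.sax.
It is not the Amara module and does not use its API: the class name
PathMatchHandler, the on_match callback and the sample XBEL data are all
made up for illustration, and it only handles plain absolute paths, not
full XPattern syntax.

```python
import xml.sax
from io import BytesIO


class PathMatchHandler(xml.sax.ContentHandler):
    """Hypothetical helper: call on_match for every element whose absolute
    path equals a fixed pattern such as /xbel/folder/bookmark."""

    def __init__(self, pattern, on_match):
        xml.sax.ContentHandler.__init__(self)
        # "/xbel/folder/bookmark" -> ["xbel", "folder", "bookmark"]
        self.steps = pattern.strip("/").split("/")
        self.on_match = on_match
        self.path = []  # names of the currently open elements

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path == self.steps:
            self.on_match(name, attrs)

    def endElement(self, name):
        self.path.pop()


# Made-up sample document, for illustration only
XBEL = b"""<xbel>
  <folder>
    <bookmark href="http://example.org/a"><title>A</title></bookmark>
    <bookmark href="http://example.org/b"><title>B</title></bookmark>
  </folder>
  <bookmark href="http://example.org/top"><title>Top</title></bookmark>
</xbel>"""

hrefs = []
handler = PathMatchHandler(
    "/xbel/folder/bookmark",
    lambda name, attrs: hrefs.append(attrs.get("href")),
)
xml.sax.parse(BytesIO(XBEL), handler)
print(hrefs)
```

The path stack here is the state you would otherwise thread through your
own handler by hand; a generated state machine does the same bookkeeping
for you, for richer patterns than this toy supports.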

amara.domtools.pushdom uses this state machine generator to provide a
function where you specify a set of XPatterns, and get back a series of
DOM chunks from the SAX parse.  It's like a pulldom, but a
*lot* simpler (and more declarative).  So the following three lines are
*complete* code for printing all links in an XBEL file:

from amara.domtools import pushdom
for docfrag in pushdom("bookmark", xbel_file):
    print docfrag.firstChild.getAttributeNS(None, 'href')

And what's more, no more than the amount of DOM needed to represent each
bookmark node is in memory at any given time (i.e. memory usage is
similarly friendly to SAX's).  If you had a terabyte XBEL file, this code
would still only take up a few KB of RAM.
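
For comparison, here is roughly what the same job looks like with the
standard library's xml.dom.pulldom, which is the approach pushdom is
being contrasted with.  This is only a sketch: dom_chunks is a
hypothetical helper name, the sample data is made up, and it yields the
element itself rather than a document fragment, so it is not a drop-in
replacement for pushdom.

```python
from io import BytesIO
from xml.dom import pulldom


def dom_chunks(name, source):
    """Hypothetical helper: yield one small DOM subtree per element
    called `name`, building nothing else from the stream."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == name:
            # Pull in just this element's subtree before handing it out
            events.expandNode(node)
            yield node


# Made-up sample document, for illustration only
XBEL = b"""<xbel>
  <folder>
    <bookmark href="http://example.org/a"><title>A</title></bookmark>
    <bookmark href="http://example.org/b"><title>B</title></bookmark>
  </folder>
  <bookmark href="http://example.org/top"><title>Top</title></bookmark>
</xbel>"""

links = [el.getAttribute("href") for el in dom_chunks("bookmark", BytesIO(XBEL))]
print(links)
```

Note the imperative event loop and the explicit expandNode call; with
pushdom you just name the pattern and iterate.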

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Use CSS to display XML - http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
Full XML Indexes with Gnosis - http://www.xml.com/pub/a/2004/12/08/py-xml.html
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
UBL 1.0 - http://www-106.ibm.com/developerworks/xml/library/x-think28.html
Use Universal Feed Parser to tame RSS - http://www.ibm.com/developerworks/xml/library/x-tipufp.html
Default and error handling in XSLT lookup tables - http://www.ibm.com/developerworks/xml/library/x-tiplook.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/
The State of Python-XML in 2004 - http://www.xml.com/pub/a/2004/10/13/py-xml.html
