xml-dev - Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XMLToolki

Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XMLToolki

[ Lists Home | Date Index | Thread Index ]

To: lists@jeffrafter.com
Subject: Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XMLToolkit 0.9.0)
From: Uche Ogbuji <uche.ogbuji@fourthought.com>
Date: Thu, 23 Dec 2004 17:46:32 -0700
Cc: xml-dev@lists.xml.org
In-reply-to: <41CB11D1.50600@jeffrafter.com>
Organization: Fourthought, Inc.
References: <1103757598.10272.67.camel@borgia> <20041223055359.GK25900@skunk.reutershealth.com> <1103791522.4600.68.camel@borgia> <1103792754.4600.81.camel@borgia> <41CB11D1.50600@jeffrafter.com>
User-agent: Mozilla Thunderbird 0.9 (X11/20041127)

Jeff Rafter wrote:

>> While on the topic of SAX taming features in Amara, there is also
>> amara.saxtools.xpattern_sax_state_machine, which I didn't even bother
>> mentioning in the announcement (too much to cram in).
>
>
> Can you expand on your expansion?

Why not?  This is fun.

> As I was reading this I was thinking that in the Java/C# world an 
> interesting approach would be to keep a pseudo DOM stack for the event 
> hierarchy. Maybe something where you keep everything at an ancestral 
> level intact while parsing
>
>
> <foo>
>   <bar1>
>     <baz1/>
>     <baz2/>
>   </bar1>
>   <bar2>
>     <baz1>
>       <sub/>
>     </baz1>
>     <baz2>text</baz2>
>   </bar2>
> </foo>
>
> So when the event stream reached /foo/bar2/baz2/text() you would have 
> the following in a DOM like structure:
>
>   foo
>     \
>      bar1 (... no children)
>      bar2
>        \
>         baz1 (... no children, just the previous sibling and attrs)
>         baz2 (only the StartTag)
>
> I am not sure that the preceding siblings would be very useful and 
> have more chances for pathological cases but when I construct 
> mini-trees this is the subset I find handy.

This would save space in a deep, but not a wide tree, no?

> It is useful when working with an editor to
> understand the immediate context. Unfortunately by requiring the 
> previous siblings you have to maintain quite a bit more... the whole 
> preceding branch of the tree.

Yep.

Anyway, this whole idea is different from the pushdom.  I considered it, 
but decided the complexity/benefit ration was too poor.  pushdom only 
instantiates the nodes at the "end" of the XPattern: only baz2 in your 
example.  If you try to go up the tree from that node you get a dead end 
at the document fragement node.

As to the behavior of amara.saxtools.xpattern_sax_state_machine, it 
doesn't have anything to do with instantiating nodes or anything like 
that: it's a pure state machine, and the SAX handler code can choose to 
use the states to do such processing, if it likes (which is what pushdom 
does).

Using your example to illustrates the workings of 
amara.saxtools.xpattern_sax_state_machine:

from amara.saxtools import xpattern_sax_state_machine
patterns = ['/foo/bar2/baz2']
machine = xpattern_sax_state_machine(patterns, {})  #{} = no namespace
import pprint #Python standard data structure pretty-printer
pprint.pprint(machine._state_table)

The output of this is:

{0: {(1, None, None): 1},
 1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
 2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
 3: {(0, '?', False): 1},
 4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
 5: {(0, '?', False): 2},
 6: {(0, None, u'baz2'): 4},
 7: {(0, '?', False): 4}}

This is the state machine that was constructed for you from the pattern 
'/foo/bar2/baz2'.  As you can see, 
amara.saxtools.xpattern_sax_state_machine does its most interesting work 
even before you know what the XML instance to be processed looks like.  
A SAX handler can then use the resulting machine object to automatically 
manage the state as it is processing the instance.  All it has to do it 
notify the machine object of the SAX events as they come, and the 
machine will take care of state transitions.  The machine object will 
notify the SAX handler of when it has found an event that matches one of 
the registered patterns.

All this does if free the programmer from managing all the familiar 
hair-pulling logic "OK, I know I'm supposed to bargle when I see a baz 
within a bar, but only if it's a child of a top-level foo" and all 
that.  If the programmer can express things in XSLT patterns (and I find 
you usually can), amara.saxtools.xpattern_sax_state_machine does the 
heavy lifting for them.  Pushdom, for example, uses such a machine 
object to trigger "start creating a DOm node for this subtree" based on 
the patterns provided by the caller.

The state table structure is probably not clear unless you're pretty 
familiar with Python, so let me expand on that a bit:

{0: {(1, None, None): 1},
 1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
 2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
 3: {(0, '?', False): 1},
 4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
 5: {(0, '?', False): 2},
 6: {(0, None, u'baz2'): 4},
 7: {(0, '?', False): 4}}

This is a dictionary within a dictionary.  A Python dictionary is 
basically what some languages call an associative array and some 
(confusingly, IMO) call a hashtable.  In {foo: bar, spam: eggs}, the key 
foo is mapped to bar and the key spam to eggs. In the above all the 
keys  of the outer dict are states.  The value is a nested dictionary 
that defines the transitions possible from that state.  Each transition 
is defined as a mapping from a SAX event (the tuple that serves as the 
key) to a next state (the number that serves as the value).

Syntax for the event tuples is as follows:

(1, None, None) = start document
(0, None, None) = end document
(1, None, u'foo') = start element "foo" (no namespace)
(0, None, u'foo') = end element "foo" (no namespace)
(1, '?', False) = wildcard for any element that is not "expected" in the 
pattern

The wildcard is needed so that, say the pattern 'foo/baz'

Does not unexpectedly match any node in:

<foo>
  <bar>
    <baz/>
  </bar>
</foo>

When the state machine sees foo, it's state becomes effect "waiting for 
baz", but we don't want to match the baz within the bar, so we switch to 
a new state (triggered by a wildcard) that's in effect: "waiting for the 
end of this unexpected element".

Clear as mud?  I hope it at least covers the ground.

>> This module takes an XPattern (e.g. "/xbel/folder/bookmark") and
>> generates a state machine which can be plugged into any regular SAX
>> handler.  In this way, you can automatically look for certain XPatterns
>> which have interesting bits of code for you to process, and ignore the
>> rest.  This is sort of the opposite of Tenorsax: embrace the state
>> machine, but automate it, rather than sweeping it unto a fancy
>> framework.
>
>
> Karl Waclawek has done some work in this area in both Delphi and C# in 
> his toolkit XPEA. But I am sure he will take some ideas from this 
> thread as well... it is all very interesting.

http://sourceforge.net/projects/xpea/ ?  Interesting.  Thanks for the 
ref.  Yes this seems along the general lines of what I'm doing.

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Use CSS to display XML - http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
Full XML Indexes with Gnosis - http://www.xml.com/pub/a/2004/12/08/py-xml.html
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
UBL 1.0 - http://www-106.ibm.com/developerworks/xml/library/x-think28.html
Use Universal Feed Parser to tame RSS - http://www.ibm.com/developerworks/xml/library/x-tipufp.html
Default and error handling in XSLT lookup tables - http://www.ibm.com/developerworks/xml/library/x-tiplook.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/
The State of Python-XML in 2004 - http://www.xml.com/pub/a/2004/10/13/py-xml.html

References:
- ANN: Amara XML Toolkit 0.9.0
  - From: Uche Ogbuji <uche.ogbuji@fourthought.com>
- Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0
  - From: Uche Ogbuji <uche.ogbuji@fourthought.com>
- More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0)
  - From: Uche Ogbuji <uche.ogbuji@fourthought.com>
- Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XMLToolkit 0.9.0)
  - From: Jeff Rafter <lists@jeffrafter.com>

Prev by Date: Re: [xml-dev] XPath and XPattern (was Re: [xml-dev] More on taming SAX)
Next by Date: Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XMLToolkit 0.9.0)
Previous by thread: RE: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0)
Next by thread: Re: [xml-dev] More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0)
Index(es):
- Date
- Thread