[
Lists Home |
Date Index |
Thread Index
]
Jeff Rafter wrote:
>> While on the topic of SAX taming features in Amara, there is also
>> amara.saxtools.xpattern_sax_state_machine, which I didn't even bother
>> mentioning in the announcement (too much to cram in).
>
>
> Can you expand on your expansion?
Why not? This is fun.
> As I was reading this I was thinking that in the Java/C# world an
> interesting approach would be to keep a pseudo DOM stack for the event
> hierarchy. Maybe something where you keep everything at an ancestral
> level intact while parsing
>
>
> <foo>
> <bar1>
> <baz1/>
> <baz2/>
> </bar1>
> <bar2>
> <baz1>
> <sub/>
> </baz1>
> <baz2>text</baz2>
> </bar2>
> </foo>
>
> So when the event stream reached /foo/bar2/baz2/text() you would have
> the following in a DOM like structure:
>
> foo
> \
> bar1 (... no children)
> bar2
> \
> baz1 (... no children, just the previous sibling and attrs)
> baz2 (only the StartTag)
>
> I am not sure that the preceding siblings would be very useful and
> have more chances for pathological cases but when I construct
> mini-trees this is the subset I find handy.
This would save space in a deep, but not a wide tree, no?
> It is useful when working with an editor to
> understand the immediate context. Unfortunately by requiring the
> previous siblings you have to maintain quite a bit more... the whole
> preceding branch of the tree.
Yep.
Anyway, this whole idea is different from the pushdom. I considered it,
but decided the complexity/benefit ration was too poor. pushdom only
instantiates the nodes at the "end" of the XPattern: only baz2 in your
example. If you try to go up the tree from that node you get a dead end
at the document fragement node.
As to the behavior of amara.saxtools.xpattern_sax_state_machine, it
doesn't have anything to do with instantiating nodes or anything like
that: it's a pure state machine, and the SAX handler code can choose to
use the states to do such processing, if it likes (which is what pushdom
does).
Using your example to illustrates the workings of
amara.saxtools.xpattern_sax_state_machine:
from amara.saxtools import xpattern_sax_state_machine
patterns = ['/foo/bar2/baz2']
machine = xpattern_sax_state_machine(patterns, {}) #{} = no namespace
import pprint #Python standard data structure pretty-printer
pprint.pprint(machine._state_table)
The output of this is:
{0: {(1, None, None): 1},
1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
3: {(0, '?', False): 1},
4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
5: {(0, '?', False): 2},
6: {(0, None, u'baz2'): 4},
7: {(0, '?', False): 4}}
This is the state machine that was constructed for you from the pattern
'/foo/bar2/baz2'. As you can see,
amara.saxtools.xpattern_sax_state_machine does its most interesting work
even before you know what the XML instance to be processed looks like.
A SAX handler can then use the resulting machine object to automatically
manage the state as it is processing the instance. All it has to do it
notify the machine object of the SAX events as they come, and the
machine will take care of state transitions. The machine object will
notify the SAX handler of when it has found an event that matches one of
the registered patterns.
All this does if free the programmer from managing all the familiar
hair-pulling logic "OK, I know I'm supposed to bargle when I see a baz
within a bar, but only if it's a child of a top-level foo" and all
that. If the programmer can express things in XSLT patterns (and I find
you usually can), amara.saxtools.xpattern_sax_state_machine does the
heavy lifting for them. Pushdom, for example, uses such a machine
object to trigger "start creating a DOm node for this subtree" based on
the patterns provided by the caller.
The state table structure is probably not clear unless you're pretty
familiar with Python, so let me expand on that a bit:
{0: {(1, None, None): 1},
1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
3: {(0, '?', False): 1},
4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
5: {(0, '?', False): 2},
6: {(0, None, u'baz2'): 4},
7: {(0, '?', False): 4}}
This is a dictionary within a dictionary. A Python dictionary is
basically what some languages call an associative array and some
(confusingly, IMO) call a hashtable. In {foo: bar, spam: eggs}, the key
foo is mapped to bar and the key spam to eggs. In the above all the
keys of the outer dict are states. The value is a nested dictionary
that defines the transitions possible from that state. Each transition
is defined as a mapping from a SAX event (the tuple that serves as the
key) to a next state (the number that serves as the value).
Syntax for the event tuples is as follows:
(1, None, None) = start document
(0, None, None) = end document
(1, None, u'foo') = start element "foo" (no namespace)
(0, None, u'foo') = end element "foo" (no namespace)
(1, '?', False) = wildcard for any element that is not "expected" in the
pattern
The wildcard is needed so that, say the pattern 'foo/baz'
Does not unexpectedly match any node in:
<foo>
<bar>
<baz/>
</bar>
</foo>
When the state machine sees foo, it's state becomes effect "waiting for
baz", but we don't want to match the baz within the bar, so we switch to
a new state (triggered by a wildcard) that's in effect: "waiting for the
end of this unexpected element".
Clear as mud? I hope it at least covers the ground.
>> This module takes an XPattern (e.g. "/xbel/folder/bookmark") and
>> generates a state machine which can be plugged into any regular SAX
>> handler. In this way, you can automatically look for certain XPatterns
>> which have interesting bits of code for you to process, and ignore the
>> rest. This is sort of the opposite of Tenorsax: embrace the state
>> machine, but automate it, rather than sweeping it unto a fancy
>> framework.
>
>
> Karl Waclawek has done some work in this area in both Delphi and C# in
> his toolkit XPEA. But I am sure he will take some ideas from this
> thread as well... it is all very interesting.
http://sourceforge.net/projects/xpea/ ? Interesting. Thanks for the
ref. Yes this seems along the general lines of what I'm doing.
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Use CSS to display XML - http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
Full XML Indexes with Gnosis - http://www.xml.com/pub/a/2004/12/08/py-xml.html
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
UBL 1.0 - http://www-106.ibm.com/developerworks/xml/library/x-think28.html
Use Universal Feed Parser to tame RSS - http://www.ibm.com/developerworks/xml/library/x-tipufp.html
Default and error handling in XSLT lookup tables - http://www.ibm.com/developerworks/xml/library/x-tiplook.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/
The State of Python-XML in 2004 - http://www.xml.com/pub/a/2004/10/13/py-xml.html
|