xml-dev - Re: [xml-dev] parser models

To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] parser models
From: Arjun Ray <aray@nyct.net>
Date: Tue, 24 Sep 2002 10:29:56 +0000
In-reply-to: <200209232145.RAA03126@mail2.reutershealth.com>
References: <r01050300-1015-399AD282CE5611D6B76F0003937A08C2@[192.168.124.21]> <200209232145.RAA03126@mail2.reutershealth.com>

John Cowan <jcowan@reutershealth.com> wrote:

| During a startElement callback, Shemp's client can ask it to start 
| creating a tree; all SAX events are then consumed until (but not 
| including) the matching endElement callback, at which time the tree 
| is available.

Could Shemp support an "ignore subtree" option, i.e. consume and throw
away all events until the matching endElement callback?  This would allow
the client to skip over uninteresting subtrees without hassling with the
details.
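
For comparison, here is a minimal sketch of what "ignore subtree" costs when
done by hand on plain SAX2.  Shemp's actual API isn't shown in this thread, so
nothing here is Shemp code; the class name SubtreeSkipper, the `seen` log and
the match-by-element-name trigger are all made up for illustration (a real
client would presumably decide inside the startElement callback).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: reports the start and end of a skipped element,
// but swallows every event in between by counting nesting depth.
class SubtreeSkipper extends DefaultHandler {
    private final String skipName;  // element type whose content is ignored
    private int skipDepth = 0;      // 0 = not skipping; 1 = directly inside skipped element
    final List<String> seen = new ArrayList<>();  // events the client actually got

    SubtreeSkipper(String skipName) { this.skipName = skipName; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (skipDepth > 0) { skipDepth++; return; }   // nested inside skipped subtree
        seen.add("start:" + qName);
        if (qName.equals(skipName)) skipDepth = 1;    // begin skipping after this callback
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 1) { skipDepth--; return; }   // end of a swallowed descendant
        skipDepth = 0;                                // matching end-tag: deliver it
        seen.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth > 0) return;                    // swallowed data
        seen.add("text:" + new String(ch, start, length));
    }

    // Convenience driver using the JDK's default (non-namespace-aware) SAX parser.
    static List<String> run(String xml, String skipName) throws Exception {
        SubtreeSkipper h = new SubtreeSkipper(skipName);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), h);
        return h.seen;
    }
}
```

The depth counter is exactly the kind of bookkeeping the client shouldn't
have to write itself.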

Which brings me to my long-standing beef with SAX: state maintenance.  A
lot of the time, what we call state dependence is more accurately context
dependence.  In SAX, if you split up the event handling in the app among
various classes, you have to swap handlers explicitly with
setContentHandler() calls and track the element stack yourself at the same
time to get this right.  This is a pain.

In search of a more "natural" idiom, I've been experimenting with a pure
push API which supports stack-based delivery of events.  It's built around
two mutually dependent interfaces that the consumer will have to implement
which look like this (many details omitted):

  interface Element {
      void      gi( String name ) ; // the element type
      AttList   attlist( ) ; // an interface to push attribute info
      Content   content( ) ;
  }

  interface Content {
      Element   newChild( ) ;
      Content   endChild( Content child ) ;
      void      text( char[] buf, int len, int off, boolean pcData ) ;
      void      endContent( ) ;
  }

The parser will maintain a stack of deferred Content instances and a
"current Content instance", tracking the open element hierarchy.  The
normal operation goes like this:

  1.  Parser has a start-tag:
       - calls currContent.newChild() to get an Element instance (which 
         is basically a context sensitive factory-like constructor).
       - pumps starttag info to this Element instance to let it do its
         thing (typically, build some application defined object).
       - calls content() on this Element instance to get a new Content
         instance for the new child.
       - stacks the current Content instance, makes the new one current.

  2.  Parser has data events:
       - delivers them to current Content instance.

  3.  Parser has an end-tag:
       - calls endContent() on current Content instance for its cleanup.
       - pops parent Content instance off stack.
       - calls endChild(child) on parent, with the completed child Content
         instance as the argument, to allow parent-child communication
         and synch.  Note that the return value allows the parent to
         replace itself if needed.  The returned instance becomes the
         current Content instance.
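
The three steps above can be sketched as a tiny driver.  This is only an
illustration under assumptions, not the experimental parser itself:
StackDriver and Node are made-up names, the AttList part of Element is left
out, and the "parser" is driven by hand instead of from real markup.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// The two consumer interfaces, reduced to what the driver needs (no AttList).
interface Element {
    void gi(String name);
    Content content();
}

interface Content {
    Element newChild();
    Content endChild(Content child);
    void text(char[] buf, int len, int off, boolean pcData);
    void endContent();
}

// Hypothetical stand-in for the parser side of the protocol.
class StackDriver {
    private final Deque<Content> deferred = new ArrayDeque<>(); // open ancestors
    private Content current;

    StackDriver(Content root) { current = root; }

    void startTag(String name) {            // step 1
        Element e = current.newChild();     // context-sensitive factory
        e.gi(name);                         // pump start-tag info
        Content child = e.content();
        deferred.push(current);             // stack current
        current = child;                    // new Content becomes current
    }

    void data(String s) {                   // step 2
        char[] buf = s.toCharArray();
        current.text(buf, buf.length, 0, true);
    }

    void endTag() {                         // step 3
        Content child = current;
        child.endContent();                 // child cleanup
        Content parent = deferred.pop();
        current = parent.endChild(child);   // parent may replace itself
    }
}

// Hypothetical consumer: one class plays both roles and records what it saw.
class Node implements Element, Content {
    String name;
    final StringBuilder body = new StringBuilder();

    public void gi(String name) { this.name = name; }
    public Content content() { return this; }
    public Element newChild() { return new Node(); }
    public Content endChild(Content child) {
        Node n = (Node) child;              // all state handling is local to the parent
        body.append(n.name).append('(').append(n.body).append(')');
        return this;
    }
    public void text(char[] buf, int len, int off, boolean pcData) {
        body.append(buf, off, len);
    }
    public void endContent() { }
}
```

Note that the consumer never touches the stack; it only sees newChild() on
the way down and endChild(child) on the way up.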

It sounds more complicated than it actually is.  The SAX ContentHandler
has been split into two pieces, separating the "constructor" information
from the content handling information.  This allows you to combine app
specific content classes with more generic constructor packages without
interference.  

More importantly, there is the critical endChild(child) call - something
missing entirely from the SAX interface.  This is where all the state
management can take place in a contextually local fashion (as it almost
always is in practice).  So you get to split up the state machine into
appropriate classes/objects also.

Cheesy example:

 public class HtmlTable implements Element, Content {
     ...
     public Element newChild( ) {
         return new HtmlTr( ) ; // sexier constructors possible
     }
     public Content endChild( Content child ) {
         if ( child instanceof HtmlTr ) {
             // etc 
         }
         return this ;
     }
 }

 public class HtmlTr implements Element, Content {
     public void gi( String name ) {
         if ( ! "tr".equals( name ) ) 
             throw new ScreamAndDieException( ) ; // assumed unchecked
     }
     public Content content( ) {
         return this ;
     }
     // etc
 }
      
A lot of this is boilerplate code that can also be "hoisted".  For deep or
deeply recursive structures in the XML, this works very well, I've found.
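
One possible reading of "hoisted", sketched under assumptions: the do-nothing
defaults move into a shared base class, and an "ignore subtree" consumer falls
out almost for free.  BaseNode and IgnoredSubtree are made-up names, the
AttList part of Element is again omitted, and the `dropped` counter exists
only to make the behaviour visible.

```java
interface Element {
    void gi(String name);
    Content content();
}

interface Content {
    Element newChild();
    Content endChild(Content child);
    void text(char[] buf, int len, int off, boolean pcData);
    void endContent();
}

// Hypothetical base class holding the boilerplate; subclasses must still
// supply newChild(), since that is where the context-specific choice lives.
abstract class BaseNode implements Element, Content {
    public void gi(String name) { }
    public Content content() { return this; }
    public Content endChild(Content child) { return this; }
    public void text(char[] buf, int len, int off, boolean pcData) { }
    public void endContent() { }
}

// Swallows an entire subtree: every descendant is handed this same instance,
// so nothing below it ever reaches application objects.
final class IgnoredSubtree extends BaseNode {
    int dropped = 0;   // illustration only: counts swallowed child/text events

    public Element newChild() { dropped++; return this; }

    @Override
    public void text(char[] buf, int len, int off, boolean pcData) { dropped++; }
}
```

A parent that finds a subtree uninteresting just returns an IgnoredSubtree
from its own newChild().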
