xml-dev - RE: [xml-dev] DESIGN PROPOSAL: Java XMLIterator

To: "'James Clark'" <jjc@jclark.com>, "'John Cowan'" <jcowan@reutershealth.com>, <xml-dev@lists.xml.org>
Subject: RE: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
From: "Chris Fry" <cfry@bea.com>
Date: Tue, 18 Dec 2001 09:45:03 -0800
Importance: Normal
In-reply-to: <001701c18785$9ea07300$0e00a8c0@bkk.thaiopensource.com>
Reply-to: <cfry@bea.com>
Hi Everyone,

We have been working on a similar API at BEA to be used for serializing and
deserializing arbitrary XML.  Currently we have an implementation that is
bidirectional with SAX and DOM, i.e. you can iterate over DOM trees, or
build a DOM tree from the iterator.  You can generate SAX events or iterate
over SAX events.  There is also a pull parser implementation to feed the
iterator.  I chose the names XMLInputStream and XMLOutputStream to convey
that this API is meant for streaming XML.  We are getting ready to
post the code publicly, but here is a first taste.  The API is a loose object
wrapper for SAX events that allows you to pull SAX events from the parser.

First some design information:

1) I decided to type everything, rather than extend Java's generics.
	a) This means there is an attribute interface and an XMLName (qname)
	   interface
2) I don't use the SAXAttributes class for attributes.
3) I have an output stream, an input stream, a buffered stream, a filtered
   stream, and an input/output stream
	a) output stream: allows you to write as well as read XML
	b) input stream: reads/iterates over XML
	c) filtered stream: you can filter only the events you want
	d) input/output stream: an output stream you can write to and then read
	   from the stream
4) The API was designed to be read only to allow very efficient
implementations.
5) The API was designed in the context of web services
	a) currently there isn't much support for validation
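
To make item 3c concrete, here is a rough sketch of how a filtered stream
could wrap another stream and pass through only the wanted event types.
It is not the BEA implementation: Event, EventStream, ListStream,
TypeFilterStream and the integer type codes are all hypothetical stand-ins
invented for this example, and checked exceptions are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical, simplified event: a type code plus optional text.
final class Event {
    static final int START_ELEMENT = 1;   // arbitrary codes for this sketch
    static final int END_ELEMENT = 2;
    static final int CHARACTER_DATA = 3;
    static final int COMMENT = 4;
    final int type;
    final String text;
    Event(int type, String text) { this.type = type; this.text = text; }
}

// Hypothetical stand-in for the input stream (checked exceptions omitted).
interface EventStream {
    boolean hasNext();
    Event next();
}

// Stream backed by a pre-built list instead of a parser.
final class ListStream implements EventStream {
    private final Iterator<Event> it;
    ListStream(List<Event> events) { this.it = events.iterator(); }
    public boolean hasNext() { return it.hasNext(); }
    public Event next() { return it.next(); }
}

// Filtered stream: wraps another stream and passes only the wanted types.
final class TypeFilterStream implements EventStream {
    private final EventStream source;
    private final Set<Integer> wanted;
    private Event pending;   // next matching event, already pulled from source

    TypeFilterStream(EventStream source, Integer... wantedTypes) {
        this.source = source;
        this.wanted = new HashSet<>(Arrays.asList(wantedTypes));
    }

    public boolean hasNext() {
        // Pull from the source until a wanted event turns up or it runs dry.
        while (pending == null && source.hasNext()) {
            Event e = source.next();
            if (wanted.contains(e.type)) pending = e;
        }
        return pending != null;
    }

    public Event next() {
        if (!hasNext()) throw new NoSuchElementException();
        Event e = pending;
        pending = null;
        return e;
    }
}

class FilterSketch {
    // Type codes that survive when comments are filtered out of
    // the events for <doc><!-- c -->42</doc>.
    static List<Integer> demo() {
        EventStream raw = new ListStream(Arrays.asList(
            new Event(Event.START_ELEMENT, "doc"),
            new Event(Event.COMMENT, " c "),
            new Event(Event.CHARACTER_DATA, "42"),
            new Event(Event.END_ELEMENT, "doc")));
        EventStream filtered = new TypeFilterStream(raw,
            Event.START_ELEMENT, Event.CHARACTER_DATA, Event.END_ELEMENT);
        List<Integer> types = new ArrayList<>();
        while (filtered.hasNext()) types.add(filtered.next().type);
        return types;
    }
}
```

Because the filter implements the same interface it consumes, filters can
be stacked on top of each other without the caller noticing.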

The main work for the API is done in three interfaces: XMLInputStream,
XMLEvent (the base interface) and StartElement.  I'm a big fan of typed
interfaces, though some people may not share this inclination.  Anyway, here
are the interfaces; I would appreciate any feedback you have.

public interface XMLInputStream {
  /**
   * Get the next Element on the stream
   * @see weblogic.xml.stream.Element
   */
  public XMLEvent next() throws XMLStreamException;
  /**
   * Check if there are more Elements to pull off the stream
   * @see weblogic.xml.stream.Element
   */
  public boolean hasNext() throws XMLStreamException;
  /**
   * Skip the next stream event
   */
  public void skip() throws XMLStreamException;
  /**
   * Skips the entire next start tag / end tag pair.
   */
  public void skipElement() throws XMLStreamException;
  /**
   * Check the next element without reading it from the stream.
   * Returns null if the stream is at EOF or has no more elements.
   * @see weblogic.xml.stream.XMLEvent
   */
  public XMLEvent peek() throws XMLStreamException;
  /**
   * Position the stream at the next element of this type.  The method
   * returns true if the stream contains another element of this type
   * and false otherwise.
   * @param eventType An integer code that indicates the element type.
   * @see weblogic.xml.stream.XMLEvent
   */
  public boolean skip(int eventType) throws XMLStreamException;
  /**
   * Position the stream at the next element of this name.  The method
   * returns true if the stream contains another element with this name
   * and false otherwise.  Skip is a forward operator only.  It does
   * not look backward in the stream.
   * @param name An object that defines an XML name.
   * If the XMLName.getNameSpaceName() method on the XMLName argument
   * returns null the XMLName will match just the local name.  Prefixes are
   * not checked for equality.
   * @see weblogic.xml.stream.XMLName
   */
  public boolean skip(XMLName name) throws XMLStreamException;
  /**
   * Position the stream at the next element of this name and this type.
   * The method returns true if the stream contains another element
   * with this name of this type and false otherwise.
   * @param name An object that defines an XML name.
   * If the XMLName.getNameSpaceName() method on the XMLName argument
   * returns null the XMLName will match just the local name.  Prefixes are
   * not checked for equality.
   * @param eventType An integer code that indicates the event type.
   * @see weblogic.xml.stream.XMLEvent
   * @see weblogic.xml.stream.XMLName
   */
  public boolean skip(XMLName name, int eventType)
    throws XMLStreamException;
  /**
   * Closes this input stream and releases any system resources associated
   * with the stream.
   */
  public void close() throws XMLStreamException;
}
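
The name-matching rule in the skip(XMLName) comment may be easier to see in
code.  What follows is only a sketch: Name and NameStream are hypothetical
stand-ins (the real API uses the XMLName interface and XMLEvent objects),
and I am assuming that after a successful skip the matching element is the
one returned by the following next() call.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class standing in for the XMLName interface.
final class Name {
    final String namespaceUri;   // may be null
    final String localName;
    Name(String namespaceUri, String localName) {
        this.namespaceUri = namespaceUri;
        this.localName = localName;
    }

    // Rule from the skip(XMLName) comment: a null namespace on the pattern
    // matches on local name alone; prefixes never take part.
    static boolean matches(Name pattern, Name actual) {
        if (!pattern.localName.equals(actual.localName)) return false;
        return pattern.namespaceUri == null
            || pattern.namespaceUri.equals(actual.namespaceUri);
    }
}

// Hypothetical list-backed stream of names; only peek/next/skip are modelled.
final class NameStream {
    private final List<Name> names;
    private int pos = 0;
    NameStream(List<Name> names) { this.names = names; }

    // Returns the next name without consuming it, or null at end of stream.
    Name peek() { return pos < names.size() ? names.get(pos) : null; }

    Name next() { return names.get(pos++); }

    // Forward-only skip: leaves the stream positioned so that next() returns
    // the matching element (an assumption about the intended semantics),
    // or at the end of the stream if nothing matches.
    boolean skip(Name pattern) {
        while (peek() != null) {
            if (Name.matches(pattern, peek())) return true;
            pos++;
        }
        return false;
    }
}

class SkipSketch {
    static NameStream sample() {
        return new NameStream(Arrays.asList(
            new Name("urn:a", "envelope"),
            new Name("urn:a", "body"),
            new Name("urn:b", "body")));
    }
}
```

With the sample stream, skipping to a pattern of (null, "body") stops at the
"urn:a" body element, and a later skip to "envelope" returns false because
skip never looks backward.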

public interface XMLEvent {

  // Event type codes (the constant values are omitted in this post)
  public static final int ELEMENT;
  public static final int START_ELEMENT;
  public static final int END_ELEMENT;
  public static final int PROCESSING_INSTRUCTION;
  public static final int CHARACTER_DATA;
  public static final int COMMENT;
  public static final int SPACE;
  public static final int NULL_ELEMENT;
  public static final int START_DOCUMENT;
  public static final int END_DOCUMENT;
  public static final int START_PREFIX_MAPPING;
  public static final int END_PREFIX_MAPPING;
  public static final int CHANGE_PREFIX_MAPPING;
  public static final int ENTITY_REFERENCE;
  /**
   * Get the element type of the current element,
   * returns an integer so that switch statements
   * can be written on the result
   */
  public int getType();
  /**
   * Get the string value of the type name
   */
  public String getTypeAsString();
  /**
   * Get the XMLName of the current element
   * @see weblogic.xml.stream.XMLName
   */
  public XMLName getName();

  /**
   * Check if this Element has a name
   */
  public boolean hasName();

  /**
   * Return the location of this Element
   */
  public Location getLocation();
}
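
Since getType() returns an int precisely so that callers can switch on it,
a consumer might look roughly like the sketch below.  The type codes are
arbitrary values picked for the example (the real values are not given in
this post), and the events are reduced to a list of codes so that the
snippet stands alone.

```java
import java.util.List;

class EventTypeSwitchSketch {
    // Arbitrary codes for this sketch; the post does not give the real values.
    static final int START_ELEMENT = 1;
    static final int END_ELEMENT = 2;
    static final int CHARACTER_DATA = 3;

    // Returns the deepest element nesting seen in a sequence of type codes,
    // as getType() might report them for successive events.
    static int maxDepth(List<Integer> eventTypes) {
        int depth = 0;
        int max = 0;
        for (int type : eventTypes) {
            switch (type) {
                case START_ELEMENT:
                    depth++;
                    max = Math.max(max, depth);
                    break;
                case END_ELEMENT:
                    depth--;
                    break;
                default:
                    // Text, comments and so on do not change the depth.
                    break;
            }
        }
        return max;
    }
}
```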

public interface StartElement extends XMLEvent {
  /**
   * Returns an AttributeIterator of non-namespace declared attributes
   */
  public AttributeIterator getAttributes();
  /**
   * Returns the attribute referred to by this name
   */
  public Attribute getAttributeByName(XMLName name);
  /**
   * Gets the value that the prefix is bound to in the
   * context of this element.  Returns null if
   * the prefix is not bound in this context
   */
  public String getNamespaceUri(String prefix);
  /**
   * Gets a java.util.Map from prefixes to URIs in scope for this
   * element.
   */
  public Map getNamespaceMap();
}
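
One way to back getNamespaceUri() and getNamespaceMap() is a chain of
scopes, one per open element.  The NamespaceScope class below is a
hypothetical helper written for this message, not part of the API; it only
illustrates the documented behaviour (innermost binding wins, null when a
prefix is unbound).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: one scope per open element, chained to its parent.
final class NamespaceScope {
    private final NamespaceScope parent;   // null for the outermost scope
    private final Map<String, String> declared = new HashMap<String, String>();

    NamespaceScope(NamespaceScope parent) { this.parent = parent; }

    // Records a prefix binding declared on this element.
    void declare(String prefix, String uri) { declared.put(prefix, uri); }

    // Mirrors getNamespaceUri(prefix): innermost binding wins,
    // null if the prefix is not bound in this context.
    String getNamespaceUri(String prefix) {
        for (NamespaceScope s = this; s != null; s = s.parent) {
            String uri = s.declared.get(prefix);
            if (uri != null) return uri;
        }
        return null;
    }

    // Mirrors getNamespaceMap(): every prefix in scope for this element,
    // with inner declarations overriding outer ones.
    Map<String, String> getNamespaceMap() {
        Map<String, String> all = new HashMap<String, String>();
        if (parent != null) all.putAll(parent.getNamespaceMap());
        all.putAll(declared);
        return all;
    }
}
```

Many elements declare no namespaces at all, so an implementation along
these lines could let them share their parent's scope object.
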
> -----Original Message-----
> From: James Clark [mailto:jjc@jclark.com]
> Sent: Monday, December 17, 2001 9:34 PM
> To: John Cowan; xml-dev@lists.xml.org
> Subject: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
>
>
> > This is a first design for XMLIterator, a third base-level API
> > which allows an application to pull content from XML.  This
> > avoids the memory demand and navigation issues of DOM, and
> > is a more straightforward programming model than SAX, which
> > requires magic data connections between the event handlers in
> > order to maintain application state.  XMLIterator extends
> > the familiar Iterator interface, so it models an XML document
> > as a linear collection of partially specified nodes.
>
> I very much agree that we need such an API.  SAX works great for some
> kinds of application.  In particular, it works well for generic XML
> applications which do not have to parse a particular XML vocabulary.
> However, SAX is really awkward for some applications, particularly
> applications that parse a particular XML vocabulary with a complex,
> highly nested structure.
>
> As it happens, I have been working on a similar API for the last few
> months.  One impetus for doing this was my experience in implementing
> Jing. I was struck by how painful it was to parse a RELAX NG schema
> into an internal form using SAX.  The equivalent non-XML syntax was
> easily parsed using a straightforward recursive descent parser.  By
> contrast, the parser for the XML syntax was a warped and twisted mess.
>
> My API is currently called "pullax" (pull API for XML). This is still
> very much work in progress.  I hadn't been planning to release for a
> month or two yet.  But since you have started this discussion, I think
> the most constructive thing I can do is to release what I have now.  I
> do have quite a comprehensive API and I do have a fairly complete
> sample implementation.  I have made this available at
>
>   http://www.thaiopensource.com/pullax/
>
> I chose to do my initial sample implementation on top of Xerces 2
> because it provides a native interface (XNI) with a "pull" parser
> API. (I would call it a "controlled push" rather than a "pull"
> API. Roughly, it has a variant of XMLReader.parse which you call
> multiple times; on each call, it parses some portion of the document
> making SAX-like callbacks on handlers.)  This allows an implementation
> that neither requires the whole document in memory (as would an
> implementation on top of DOM), nor the use of threads (as would an
> implementation on top of SAX).  XNI also provides a very rich set of
> information. You'll need Xerces 2 Beta 3 if you want to play with my
> implementation.  See
>
>    http://xml.apache.org/xerces2-j/index.html
>
> Obviously, SAX and DOM adapters are on my list of things to do.
>
> The bad news is that the API documentation is pretty pathetic at the
> moment and still needs a lot of work. This message will have to serve
> as an overview of the API for now.
>
> In designing pullax, I have tried to follow modern Java best
> practices, for example, in favoring immutability and using classes for
> type-safe enumerations. One of my main guides here has been Joshua
> Bloch's book "Effective Java"
> (http://java.sun.com/docs/books/effective/).  This is a truly
> excellent book done by the guy who designed several of the better
> recent Java platform APIs (including the Collections API).
>
> Perhaps the most fundamental decision in designing a pull API is
> whether the properties for each node are provided
>
> (a) by methods on some sort of node object returned by the
> scanner/parser/iterator object
>
> (b) by methods on the scanner/parser object itself; the scanner/parser
> object has methods to move to the next node
>
> You've chosen (a).  A couple of notable pull APIs use (b):
>
> - the XmlReader API in .NET; this is the principal XML parser API for
> .NET (see
> http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)
>
> - XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)
>
> I tried it both ways in pullax.  I ended up, like you, with (a), for
> the following reasons:
>
> 1. Handling attributes in (b) is messy
>
> 2. (a) works more like the java.util.Iterator and
> java.util.Enumeration that are familiar to every Java programmer
>
> 3. (a) makes it much easier to construct filters/processing pipelines;
> for example, writing a RELAX NG validator that wraps around a
> non-validating parser.
>
> The main argument against (a) is that it involves more object
> creation, which, according to Java folklore, is a performance killer.
>
> Now, you've minimized object creation by having next() implicitly
> invalidate any previously returned nodes. I don't think this is an
> acceptable design for an API intended for widespread public use:
>
> 1. It's a common requirement to need to look ahead in the document when
> deciding how to process the current node.  Your design makes this
> awkward.  It also makes it very awkward to write a filter that needs
> lookahead in doing its filtering (imagine a filter that merges
> adjacent text nodes).
>
> 2. This behavior would be a big surprise to the average Java user.
> The Iterators and Enumerations which a typical Java user will be
> familiar with just don't work like this.
>
> 3. It's the kind of API that leads to "Write Once, Debug Everywhere"
> rather than "Write Once, Run Everywhere".  A typical scenario is that
> a user writes an application that needs lookahead; they incorrectly
> access an XMLNode object after another call to next(); they test their
> application with an implementation that allocates a new XMLNode object
> for each next() call; their application appears to work fine. Then
> somebody else tries to use the application with a parser
> implementation that reuses XMLNode objects and the application
> mysteriously and silently gives the wrong results.
>
> In summary, this design does not promote reliability.  I believe
> priority should be given to reliability over performance.
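
The portability trap described in point 3 can be reproduced in a few lines.
This is a sketch only: MutableNode, AllocatingIterator and RecyclingIterator
are hypothetical classes made up for the illustration and do not come from
either API.

```java
// Hypothetical mutable node that a parser could recycle between calls.
final class MutableNode {
    String name;
}

// Implementation 1: allocates a fresh node for every next() call.
final class AllocatingIterator {
    private final String[] names;
    private int pos = 0;
    AllocatingIterator(String... names) { this.names = names; }
    MutableNode next() {
        MutableNode n = new MutableNode();
        n.name = names[pos++];
        return n;
    }
}

// Implementation 2: reuses one node, invalidating what it returned before.
final class RecyclingIterator {
    private final String[] names;
    private final MutableNode shared = new MutableNode();
    private int pos = 0;
    RecyclingIterator(String... names) { this.names = names; }
    MutableNode next() {
        shared.name = names[pos++];
        return shared;
    }
}

class LookaheadHazardSketch {
    // Client code that keeps the current node while looking one node ahead.
    static String withAllocating() {
        AllocatingIterator it = new AllocatingIterator("a", "b");
        MutableNode current = it.next();
        MutableNode ahead = it.next();
        return current.name + ahead.name;   // "ab", as the client expects
    }

    // The same client logic against the recycling implementation.
    static String withRecycling() {
        RecyclingIterator it = new RecyclingIterator("a", "b");
        MutableNode current = it.next();
        MutableNode ahead = it.next();      // overwrites the node held in current
        return current.name + ahead.name;   // "bb": silently wrong
    }
}
```

Nothing throws in the second case, which is why the mistake tends to
surface only when the application is run against a different parser.
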
>
> My "solution" is simply to accept the object creation.  Modern Java
> VMs (like Hotspot) do a fantastic job of efficient allocation of
> short-lived objects; object creation has much less performance
> overhead with modern VMs than it used to with classic VMs.  In any
> case, a user that is prepared to sacrifice programming convenience for
> an extra ounce of performance can use SAX. (Also, since the objects
> returned are immutable, there is an opportunity for reducing object
> creation by sharing.)
>
> The central interface in my API is XmlScanner. (I'm planning a
> companion XmlPrinter interface for writing XML.) This corresponds to
> your XMLIterator interface.  This interface is similar to
> java.util.Iterator but I chose not to derive XmlScanner from Iterator,
> for two reasons:
>
> 1. the equivalents of the next() and hasNext() methods need to be
> able to throw a java.io.IOException
>
> 2. it's awkward and inefficient to have always to cast the return
> value of next()
>
> My XmlScanner object returns XmlItem objects.  I call these objects
> "items" rather than "nodes" because "node" to me suggests a tree view
> where elements have children rather than a flat view with
> start-element and end-element objects.
>
> My XmlItem object has similar methods to your XMLNode object to return
> the item type, the local name, namespace URI, QName, prefix, value
> etc.  The method names are chosen based on the Infoset and XPath.
>
> I toyed with the approach to attributes that you took, that is, having
> ATTRIBUTE items following the START_ELEMENT item. This has the
> advantage of being simple. However, I found it inconvenient to work
> with and felt it would seem rather strange to anybody with exposure to
> SAX or DOM.  So instead an XmlItem of type START_ELEMENT has
> getAttribute() methods that return an XmlItem for an attribute
> identified by name or index.
>
> XmlItem has a getContext() method returning an XmlContext object.
> This provides information about the context of the item, such as the
> in-scope namespaces.  Typically, many XmlItem objects can share the
> same XmlContext object.
>
> A major challenge in designing a general-purpose XML API is to deal
> with the diversity of XML applications.  At one end of the spectrum
> are simple applications that need no more than elements, attributes
> and text (the "holy trinity of XML" as I think David Megginson once
> called them).  At the other end of the spectrum are applications such
> as XML editors that want as much detail about the markup as they can
> get including things like comments and entities.  Just as there is a
> diversity of XML applications, so is there a diversity of XML
> processors/parsers.  There are large, complex parsers like Xerces that
> provide a very rich set of information but take a corresponding hit in terms
> of size and speed.  There is also a need for simpler parsers that do
> less but can be smaller and faster.
>
> The solution I use in pullax is based on the "feature" concept of
> SAX2.  An implementation of the pullax API implements the
> XmlScannerFactory interface. By default an XmlScanner created by an
> XmlScannerFactory returns exactly three types of XmlItem:
> START_ELEMENT, END_ELEMENT, TEXT.  Also by default TEXT items are
> maximal.  So, for example, the document
>
>   <doc>4<!-- a silly comment -->2</doc>
>
> will be returned as three items: a START_ELEMENT item, a TEXT item
> with string value "42", and an END_ELEMENT item. If an application
> wishes to see, for example, comments, it must request the SHOW_COMMENT
> feature from the XmlScannerFactory before creating the XmlScanner.  If
> the parser cannot satisfy the request, it must throw an exception.
> XmlScannerFactory objects are designed to be dynamically discoverable
> using the service provider mechanism (like JAXP).
> XmlScannerFactoryFinder is a utility class that takes a set of
> features and dynamically finds an XmlScannerFactory implementation
> that supports those features.  This approach ensures that the support
> for a rich information set in pullax does not get in the way of simple
> applications or simple XML processors.
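
The default view described above (comments hidden, TEXT items maximal) can
be approximated with a small post-processing step.  This is a sketch under
assumptions: Item and MaximalTextSketch are hypothetical names, and pullax
itself selects features through XmlScannerFactory rather than by filtering
a list after the fact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical immutable item, loosely modelled on the XmlItem idea.
final class Item {
    enum Type { START_ELEMENT, END_ELEMENT, TEXT, COMMENT }
    final Type type;
    final String value;
    Item(Type type, String value) { this.type = type; this.value = value; }
}

class MaximalTextSketch {
    // Drops COMMENT items and merges the TEXT items on either side of them,
    // so that every text run is as long as possible ("maximal").
    static List<Item> defaultView(List<Item> raw) {
        List<Item> out = new ArrayList<Item>();
        StringBuilder text = null;   // non-null while a text run is open
        for (Item item : raw) {
            if (item.type == Item.Type.COMMENT) continue;   // hidden by default
            if (item.type == Item.Type.TEXT) {
                if (text == null) text = new StringBuilder();
                text.append(item.value);
                continue;
            }
            if (text != null) {      // a non-text item closes the open run
                out.add(new Item(Item.Type.TEXT, text.toString()));
                text = null;
            }
            out.add(item);
        }
        if (text != null) out.add(new Item(Item.Type.TEXT, text.toString()));
        return out;
    }

    // Items for <doc>4<!-- a silly comment -->2</doc> with nothing hidden.
    static List<Item> sample() {
        return Arrays.asList(
            new Item(Item.Type.START_ELEMENT, "doc"),
            new Item(Item.Type.TEXT, "4"),
            new Item(Item.Type.COMMENT, " a silly comment "),
            new Item(Item.Type.TEXT, "2"),
            new Item(Item.Type.END_ELEMENT, "doc"));
    }
}
```

Applied to the sample, defaultView yields three items, with the middle one
a TEXT item whose value is "42".
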
>
> The pullax API aims to provide a very rich information set.  As far as
> the document instance is concerned, it is intended to support the
> union of SAX2, DOM2 core, and the XML infoset and then some.  As far
> as the DTD is concerned, pullax currently provides approximately the
> same information as the union of the XML Infoset and DOM Level 2 core.
> I have opted not to provide the detailed lexical information about the
> DTD that SAX2 provides. It seems to me that it is not much use having
> lexical information about DTDs if you lose information about parameter
> entities within declarations; but dealing with parameter entities
> within declarations is just too hard for a general-purpose API,
> especially when you consider nested parameter entity references. I believe
> DTD editor type applications really require specialized APIs and
> parsers (e.g. DTDinst, see http://www.thaiopensource.com/dtdinst).
>
> Another respect in which pullax's approach to DTDs differs from SAX is
> that it represents the DOCTYPE declaration as a single item.  There
> does not seem much point in breaking it down into multiple items.  Most
> of the information is in the XmlDtd object which is available from the
> XmlContext.  Note that the XmlDtd object is immutable.  I'm planning
> to extend the API to allow straightforward DTD caching: the idea is
> that a user-supplied XmlDtdResolver object will map the system id,
> public id and internal subset to an XmlDtd object.
>
> I've written too much already.  I'll be happy to answer any questions
> people may have about the design and I'll try to get the API doc into
> shape as soon as possible.
>
> James