xml-dev - Re: [xml-dev] Schema-type-aware SAX processing


To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Schema-type-aware SAX processing
From: Jeff Greif <jgreif@alumni.princeton.edu>
Date: Thu, 26 May 2005 15:01:33 -0700
In-reply-to: <43571.64.81.75.3.1117141555.squirrel@webmail.maden.org>
References: <43571.64.81.75.3.1117141555.squirrel@webmail.maden.org>
User-agent: Mozilla Thunderbird 1.0 (Windows/20041206)

Christopher R. Maden wrote:

>Surely I am not the first person to try doing this, but I can't seem to
>find any prior art nor any straightforward way to do this.
>
>I have data that may be arbitrarily large and may conform to arbitrary
>XSDL schemata.  Because of the size, I want to process the document as an
>event stream (hence SAX), and I want to make different processing
>decisions based on the declared types from the schema and based on the
>ultimate base types, if there's any type inheritance.

Here's an outline of one way to proceed using Xerces (I've only used 
Xerces-J; I don't know if what follows applies to Xerces-P):

It's unclear from your post whether you have all the schemas available 
in advance.  However, it suffices to have parsed the XSD grammars 
relevant to a particular document (into a grammar pool) before doing 
what follows.  This might involve looking at the namespace of the root 
element and any xsi:schemaLocation attribute on that element and/or 
using some custom entity resolver and fetching the relevant grammar and 
anything it imports or includes.
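
As a rough illustration of that first step, here is a Java sketch that reads only the root start tag with a namespace-aware SAX parser and records the root element's namespace plus any xsi:schemaLocation pairs. The names RootSchemaHints, RootSniffer, StopParsing and sniff are made up for this example, and it uses only the standard JAXP/SAX classes rather than anything Xerces-specific. Actually fetching the hinted grammars, and whatever they import or include, is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class RootSchemaHints {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    /** Hypothetical marker exception, thrown to abandon the parse after the root start tag. */
    static final class StopParsing extends SAXException {
        private static final long serialVersionUID = 1L;
        StopParsing() { super("root element seen"); }
    }

    /** Records the root element's name and its schema location hints. */
    static final class RootSniffer extends DefaultHandler {
        String rootNamespace;
        String rootLocalName;
        // namespace URI -> location hint, in document order
        final Map<String, String> locations = new LinkedHashMap<String, String>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            rootNamespace = uri;
            rootLocalName = local;
            String hint = atts.getValue(XSI, "schemaLocation");
            if (hint != null) {
                // The attribute is a whitespace-separated list of (namespace, location) pairs.
                String[] tokens = hint.trim().split("\\s+");
                for (int i = 0; i + 1 < tokens.length; i += 2) {
                    locations.put(tokens[i], tokens[i + 1]);
                }
            }
            throw new StopParsing(); // nothing beyond the root start tag is needed
        }
    }

    static RootSniffer sniff(InputSource in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        RootSniffer sniffer = new RootSniffer();
        reader.setContentHandler(sniffer);
        try {
            reader.parse(in);
        } catch (StopParsing expected) {
            // normal exit: the root start tag has been processed
        }
        return sniffer;
    }
}
```

The recorded namespace and location hints could then be handed to whatever entity resolver or grammar loader you use to fill the pool.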

Having found all the grammars, you retrieve the grammar for the root 
element's namespace from the pool, and convert it to an XSModel (from 
the XML Schema API as specified on the World Wide Web Consortium web 
site).  Given the root element's qualified name, you can get its 
XSElementDeclaration from the XSModel, from there its type declaration, 
and from there the base types.  You might also need to look at any 
xsi:type attribute on the root element in case the content is specified 
by a derived type of the declared type.  If so, you can examine that 
derived type declaration also from the information in the XSModel.  This 
can all be done in handling startElement() for the root element.

The problem is harder if you want to handle elements deeper down in the 
document whose association with components in the schema depends upon the 
details of the grammar.  The easiest way to handle these would be to 
turn on validation and PSVI annotation in your parser, and get the 
XSElementDeclaration for any element from the PSVI information.  
Probably you would have to access the PSVI from endElement().
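
For what it's worth, here is a small Java sketch of the same idea, validating while streaming and asking for each element's schema type. Note that it swaps in the standard JAXP validation API (ValidatorHandler and TypeInfoProvider) in place of the Xerces PSVI interfaces, so it exposes only the type name and a derivation test, not the full XSElementDeclaration. The class name TypedSaxSketch, the method collectTypes, and the tiny inline schema and document are all invented for the example. The JAXP documentation allows getElementTypeInfo() to be called from either startElement() or endElement(); the sketch uses startElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TypedSaxSketch {
    // Illustrative schema: EmployeeType extends PersonType; <person> is declared as PersonType.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:complexType name='PersonType'><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/></xs:sequence></xs:complexType>"
      + "<xs:complexType name='EmployeeType'><xs:complexContent>"
      + "<xs:extension base='PersonType'><xs:sequence>"
      + "<xs:element name='id' type='xs:int'/></xs:sequence></xs:extension>"
      + "</xs:complexContent></xs:complexType>"
      + "<xs:element name='person' type='PersonType'/>"
      + "</xs:schema>";

    // Illustrative instance: xsi:type selects the derived type for the root element.
    static final String XML =
        "<person xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' "
      + "xsi:type='EmployeeType'><name>Ann</name><id>7</id></person>";

    /** Returns one "element:typeName:extendsPersonType" entry per start tag. */
    static List<String> collectTypes(String xsd, String xml) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));

        // The ValidatorHandler sits between the parser and our own handler.
        ValidatorHandler validator = schema.newValidatorHandler();
        final TypeInfoProvider types = validator.getTypeInfoProvider();
        final List<String> seen = new ArrayList<String>();

        validator.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                TypeInfo type = types.getElementTypeInfo();
                if (type == null) {
                    seen.add(local + ":?:false");
                    return;
                }
                // The schema has no target namespace, hence the null namespace argument.
                boolean extendsPerson =
                    type.isDerivedFrom(null, "PersonType", TypeInfo.DERIVATION_EXTENSION);
                seen.add(local + ":" + type.getTypeName() + ":" + extendsPerson);
            }
        });

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(validator);
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        for (String entry : collectTypes(XSD, XML)) {
            System.out.println(entry);
        }
    }
}
```

If you need the element declaration itself or the full chain of base types, you would still have to go through the Xerces PSVI and XSModel route described above.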

Jeff

References:
- Schema-type-aware SAX processing
  - From: "Christopher R. Maden" <crism@maden.org>
