- From: Eric Bohlman <ebohlman@netcom.com>
- To: Sean McGrath <sean@digitome.com>
- Date: Thu, 8 Jun 2000 13:32:07 -0700 (PDT)
In my email to Sean that started this discussion, I mentioned that I had
some ideas about integrating PYX with RAX to provide a very simple
"pull-mode" interface for parsing XML. Many XML documents include one or
more "records" that are processable with the RAX API, but also include
some "loose" elements. For example, an RSS file includes "records" like
<image> and <item> but also includes elements like <title>, <description>,
and <managingEditor>. A simple way to parse such a document would be to
read the "loose" elements as PYX lines and the "record" elements as RAX
records.
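For readers who haven't seen PYX, here is a minimal sketch of how the line stream for a document might be produced. It uses Python's standard expat bindings; the function name xml_to_pyx is made up for this illustration, and escaping only newlines is an assumption of the sketch rather than a statement about any existing PYX tool.

```python
import xml.parsers.expat


def xml_to_pyx(xml_text):
    """Return the PYX lines for xml_text as a list of strings (illustrative helper)."""
    lines = []

    def start(name, attrs):
        # "(" marks a start-tag; each attribute follows on its own "A" line.
        lines.append("(" + name)
        for key, value in attrs.items():
            lines.append("A" + key + " " + value)

    def end(name):
        # ")" marks an end-tag.
        lines.append(")" + name)

    def chars(data):
        # "-" marks character data. PYX is line-oriented, so this sketch
        # escapes embedded newlines (an assumption, see the lead-in).
        lines.append("-" + data.replace("\n", "\\n"))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return lines


SAMPLE = '<rss><title>News</title><item id="1"><title>A</title></item></rss>'
for pyx_line in xml_to_pyx(SAMPLE):
    print(pyx_line)
```

In this stream the <title> directly under <rss> is a "loose" element, while everything between "(item" and ")item" is what a RAX record would cover.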
I'm going to call this the "PYXRAX" interface, which will be identical to
the RAX interface with the addition of one method, ReadPYX(), which
returns the next PYX line from the input as a string. ReadPYX() will
return the lines of the PYX stream corresponding to the input being
parsed, except that prior to returning the start-tag event for an element
that has been defined as a record delimiter (using SetRecord()), it will
return a special PYX line, consisting of the letter 'R' followed by the
element name, indicating that a RAX record is waiting to be processed.
At this point the caller has two choices. It may call ReadRecord(), in which
case the record will be read and the next call to ReadPYX() will return the
next PYX event for the portion of the input after the closing tag for the
record delimiter (i.e. the contents of the record will have been
"swallowed" as far as ReadPYX() is concerned; note that this event could
be another 'R' event if there are consecutive records). Or it may continue
calling ReadPYX(), in which case no record will be recognized and the PYX
events corresponding to the record's contents will be returned as if no
record had been set.
Calling ReadRecord() before a record delimiter has been seen, or in the
middle of a record that has been partially read by ReadPYX(), will skip to
the next opening record delimiter, if any; this corresponds to the
current ReadRecord() behavior.
Thus in parsing a typical RSS file where "image", "item", and
"textinput" were set as record types, ReadPYX() would return standard PYX
events for all the elements prior to the <image> and would then return an
"Rimage" event. If ReadPYX() were immediately called again, it would
return a start-tag event "(image", a start-tag event "(title",
etc. Calling ReadRecord() at this point would *not* return an "image"
record; it would return the first "item" record. Calling
ReadRecord() immediately after reading the "Rimage" event *would* return
an "image" record, and calling ReadPYX() after reading the record would
return an "Ritem" event.
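The behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RAX implementation: the class name PYXRAX, reading from an in-memory list of PYX lines, returning a record as a (name, dict of fields) pair, and returning None at end of input are all choices made for the sketch. Only the method names SetRecord(), ReadPYX(), and ReadRecord() come from the proposal.

```python
class PYXRAX:
    """Toy pull-mode reader over a list of PYX lines (hypothetical sketch)."""

    def __init__(self, pyx_lines):
        self._lines = list(pyx_lines)
        self._pos = 0
        self._records = set()
        self._announced = False  # True right after an 'R' line was returned

    def SetRecord(self, name):
        self._records.add(name)

    def _at_record_start(self):
        line = self._lines[self._pos]
        return line[0] == "(" and line[1:] in self._records

    def ReadPYX(self):
        """Return the next PYX line, or None at end of input."""
        if self._pos >= len(self._lines):
            return None
        if self._at_record_start() and not self._announced:
            # Announce the record without consuming its start-tag.
            self._announced = True
            return "R" + self._lines[self._pos][1:]
        self._announced = False
        line = self._lines[self._pos]
        self._pos += 1
        return line

    def ReadRecord(self):
        """Return (name, fields) for the next record, or None if none is left."""
        # Skip to the next opening record delimiter, if any.
        while self._pos < len(self._lines) and not self._at_record_start():
            self._pos += 1
        if self._pos >= len(self._lines):
            return None
        self._announced = False
        name = self._lines[self._pos][1:]
        self._pos += 1
        fields, field, depth = {}, None, 1  # depth 1 = inside the record element
        while depth > 0 and self._pos < len(self._lines):
            kind, rest = self._lines[self._pos][0], self._lines[self._pos][1:]
            self._pos += 1
            if kind == "(":
                depth += 1
                if depth == 2:  # a direct child of the record is a field
                    field = rest
                    fields.setdefault(field, "")
            elif kind == ")":
                depth -= 1
                if depth == 1:
                    field = None
            elif kind == "-" and field is not None:
                fields[field] += rest  # nested text is concatenated in order
        return name, fields


# Hand-written PYX for a tiny RSS-like document (made-up content).
RSS_PYX = [
    "(rss", "(channel",
    "(title", "-Example Channel", ")title",
    "(image", "(title", "-Logo", ")title",
    "(url", "-http://example.com/logo.gif", ")url", ")image",
    "(item", "(title", "-First story", ")title", ")item",
    "(item", "(title", "-Second story", ")title", ")item",
    ")channel", ")rss",
]


def make_reader():
    reader = PYXRAX(RSS_PYX)
    for name in ("image", "item", "textinput"):
        reader.SetRecord(name)
    return reader


def read_until_record(reader):
    """Collect PYX lines up to and including the next 'R' line."""
    seen = []
    line = reader.ReadPYX()
    while line is not None:
        seen.append(line)
        if line.startswith("R"):
            break
        line = reader.ReadPYX()
    return seen
```

With this sketch, calling read_until_record() on a fresh reader ends on "Rimage". Calling ReadRecord() right then yields the "image" record and the following ReadPYX() yields "Ritem"; calling ReadPYX() twice instead yields "(image" and "(title", after which ReadRecord() skips ahead to the first "item" record.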
Is everybody hopelessly confused by now? Should I present an example of
reading an OCS file?
On another matter, the documentation for RAX doesn't clearly specify what
should be done if a "field" level element contains nested elements. If I
understand correctly, Sean's Python implementation omits the content of
nested elements, whereas Robert Hanson's Perl implementation concatenates
their text values in document order, similar to the way XPath computes the
value of a node with children. I tend to favor the latter (a third
alternative would stringify the nested tags as well as their content,
resulting in a value that could be "microparsed" by another parser).
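To make the three alternatives concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The function names are invented for the illustration, and direct_text() is only one possible reading of "omits the content of nested elements"; it is not taken from either implementation.

```python
import xml.etree.ElementTree as ET


def direct_text(field):
    """Alternative 1: keep the field's own text, drop nested elements' content."""
    parts = [field.text or ""]
    parts.extend(child.tail or "" for child in field)
    return "".join(parts)


def concatenated_text(field):
    """Alternative 2: all descendant text in document order (XPath-like value)."""
    return "".join(field.itertext())


def stringified(field):
    """Alternative 3: text plus nested tags serialized back into the value."""
    parts = [field.text or ""]
    # ET.tostring() also emits each child's trailing text (its tail).
    parts.extend(ET.tostring(child, encoding="unicode") for child in field)
    return "".join(parts)


field = ET.fromstring("<description>See <b>bold</b> text</description>")
print(repr(direct_text(field)))
print(repr(concatenated_text(field)))
print(repr(stringified(field)))
```

The third form keeps enough markup that the value could be handed to another parser, at the cost of making simple fields harder to use as plain strings.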
***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************