Parsing Streaming XML Incrementally
Hello,
I am working with a device which implements a sort of proprietary
RPC mechanism, using XML, to retrieve data and control its operation
(actual party names will be omitted to protect the guilty).
Unfortunately, the device uses XML in a way which seems very non-standard
to me, although I must admit I am fairly new to XML. I am of a mind to go
and scream at the vendor of this device for making the RPC mechanism work
the way it does, but first I'd like some feedback to ensure that I'm on
sound theoretical footing.
Here's a brief example of how an exchange with this thing works, so that
I can explain the problems with it:
CLIENT: connects to server over TCP stream, emits:
<?xml version="1.0" encoding="us-ascii"?>
<handshake version="1.0" foo="bar" other_params="blah">
SERVER: accepts client TCP connection, emits:
<?xml version="1.0" encoding="us-ascii"?>
<handshake version="1.0" foo="bar" other_params="blah">
At this point, per the spec of the vendor's RPC protocol, server and
client must compare each other's XML and RPC protocol versions and decide
whether to continue or terminate the session.
I really don't have to go any further to illustrate the problem. Standard
XML parsers, at least the Perl implementations I've been working with (all
the standard ones available from CPAN, e.g. XML::DOM, XML::SAX::PurePerl,
and plain old XML::Parser), cannot seem to handle the XML fragments
emitted by the server, since they are not well-formed. First of all, a
client must parse the output on a line-by-line basis, since all output
is newline-terminated and there is no way to identify a single "chunk"
or message-block. So the client has to look for complete "chunks" by
scanning for newlines in the stream ... the server output never uses
newlines in such a way that tags might be broken over lines, e.g.:
<some_tag
attr1="foo" attr2="bar">
so the client can pass each newline-terminated chunk to a parser. The
problem is obvious: the first "document" in the above example would
consist only of the XML declaration ... not well formed since no
root element. The second "doc" is an opening tag missing its closing
tag ... again a well-formedness error. But let's continue nonetheless:
(client and server agree on protocol version, both continue session)
CLIENT:
<request request-id="blah blah">
<!-- tags which indicate the type and parameters of the request -->
</request>
SERVER:
<response request-id="blah blah">
<!-- content tags indicating status of response and requested info -->
</response>
CLIENT:
(continues as above, making more requests in the same manner until
finished)
SERVER:
(continues as above, responding to requests in the same manner or issuing
error-indicator tags if appropriate)
CLIENT: (requesting session close)
<request>
<end-session/>
</request>
SERVER: (ack'ing session close request)
<response>
<session-ended/>
</response>
CLIENT: (closing session)
</handshake>
SERVER: (closing session)
</handshake>
CLIENT and SERVER: (TCP connections closed)
Same problems as before ... even though the request/response
sequences here include both the opening and closing tags, forming a
root element, we are dealing with newlines and so have to pass each
line to the parser separately. It sees "documents" which are not
well formed, and barfs ... whether DOM or SAX. Then finally we get
the closing "handshake" tag ... only a closing tag, so the parser dies.
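To make the failure concrete, here is a small sketch (in Python rather
than Perl, purely for illustration) that hands individual lines from the
exchange above to a whole-document parser. The helper name is_document
is made up for this example; the sample lines are copied from the
exchange.

```python
import xml.etree.ElementTree as ET


def is_document(text):
    """Return True if text parses as a complete, well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each of these is one newline-terminated line from the session.
lines = [
    '<?xml version="1.0" encoding="us-ascii"?>',                 # declaration only
    '<handshake version="1.0" foo="bar" other_params="blah">',  # unclosed start tag
    '</handshake>',                                              # stray end tag
]

results = [is_document(line) for line in lines]

# For contrast: a request sent as one piece is a document in its own right.
whole_request = '<request><end-session/></request>'
```

None of the three single lines is accepted as a document, while the
request element passed as a whole is.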
It seems the only way to handle such a stream of half-baked XML would
be to write a parser which extracts XML "atoms" out of the stream and
passes them to another layer (user program), which would be responsible
for ensuring matched opening/closing tags, etc. These atoms would be
the indivisible "units" of XML: opening tags, closing tags, open-close
tags, PIs, character sequences, etc. And in fact this is what I have begun
to do: implement an XML parser which only looks for "atoms", a sort of
really dumb SAX.
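A rough sketch of what such an "atom" extractor could look like, again
in Python only for illustration. The regular expression and the names
ATOM and atoms are assumptions made for this sketch, not anything from
the vendor's spec. It relies on the server never breaking a tag across
lines, and it would be fooled by a literal ">" inside an attribute
value or by CDATA sections.

```python
import re

# One alternative per kind of "atom"; order matters, since the more
# specific forms must be tried before the generic opening tag.
ATOM = re.compile(
    r"""
      (?P<pi><\?.*?\?>)          # processing instruction / XML declaration
    | (?P<comment><!--.*?-->)    # comment
    | (?P<close></[^<>]+>)       # closing tag
    | (?P<empty><[^<>]+/>)       # open-close (empty-element) tag
    | (?P<open><[^<>]+>)         # opening tag
    | (?P<text>[^<]+)            # character data
    """,
    re.VERBOSE | re.DOTALL,
)


def atoms(line):
    """Split one line of the stream into (kind, raw_text) pairs.

    No well-formedness checking is done here; matching opening and
    closing tags is left to the caller.
    """
    out = []
    for match in ATOM.finditer(line):
        kind = match.lastgroup
        raw = match.group()
        if kind == "text" and not raw.strip():
            continue  # skip whitespace-only runs such as the trailing newline
        out.append((kind, raw))
    return out
```

For example, atoms('<end-session/>\n') yields a single "empty" atom, and
atoms('</handshake>\n') a single "close" atom, which the user program
can then match against whatever it has seen opened so far.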
It seems to me the right way to do this would be to get rid of the
intervening newlines, inserting only one at the end of well-formed
"chunks" of XML that would serve as the protocol message units, or
keep the newlines but indicate the message boundary with a CR-LF
combination or something similar. And each protocol exchange would
involve only complete elements, with root, say
<handshake blah="blah"></handshake>
instead of the way it is currently done.
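If the protocol were framed that way, with one complete element per
newline-terminated line, the receiving side could be as simple as the
following sketch (Python for illustration; read_messages is a
hypothetical helper, and the sample messages are made up from the tags
shown earlier).

```python
import xml.etree.ElementTree as ET


def read_messages(stream):
    """Yield one parsed element per non-empty line of the stream.

    Assumes each line holds exactly one complete, well-formed element,
    which is the framing proposed above, not what the device does today.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield ET.fromstring(line)


# Any iterable of lines will do; a socket file object would work the same way.
sample = [
    '<handshake blah="blah"></handshake>\n',
    '<request><end-session/></request>\n',
]
messages = list(read_messages(sample))
```

Every message then stands on its own, and any off-the-shelf DOM or SAX
parser can handle it without special treatment.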
A couple of questions: first, does anybody know of an existing *Perl*
library which parses XML incrementally, as a stream, looking only for
"atoms" and leaving it up to the library user to make sure the doc is
well formed? That would save me the work of writing my own. Or am I
missing something in the existing Perl DOM or SAX modules which would
let me set a mode to parse incrementally?
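To illustrate what "parse incrementally" means here, the sketch below
uses Python's binding to Expat, the same C library that Perl's
XML::Parser is built on. It treats everything one side sends during the
session as a single document and feeds it to the parser a line at a
time, with events reported as soon as each tag is complete. The class
name IncrementalReader is made up for this example, and this is only a
sketch under the assumption that the whole stream, from the opening
handshake tag to the closing one, is well formed when taken together.

```python
import xml.parsers.expat


class IncrementalReader:
    """Feed a stream to Expat piece by piece and collect tag events."""

    def __init__(self):
        self._events = []
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end

    def _start(self, name, attrs):
        self._events.append(("start", name, attrs))

    def _end(self, name):
        self._events.append(("end", name))

    def feed(self, line):
        """Parse one more piece of the stream; return the events it produced."""
        # The second argument (False) tells Expat that more data will follow,
        # so an element that is still open is not treated as an error.
        self._parser.Parse(line, False)
        new_events, self._events = self._events, []
        return new_events


reader = IncrementalReader()
first = reader.feed('<?xml version="1.0" encoding="us-ascii"?>\n')
second = reader.feed('<handshake version="1.0" foo="bar" other_params="blah">\n')
```

After the second line the caller already has the handshake attributes
in hand and can compare versions before reading anything further.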
Second, is there any precedent for the way the vendor is using XML
streams in this manner? Is there anything else like this in common
use/acceptance? How does the experienced XML development community
feel about it? Is it a valid "use" of XML, or an "abuse"?
I am curiously awaiting comments ...
Thanks,
Stephen J. Scheck