xml-dev - Re: [xml-dev] Partial documents in tree-based APIs

To: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Subject: Re: [xml-dev] Partial documents in tree-based APIs
From: Robin Berjon <robin.berjon@expway.fr>
Date: Mon, 07 Apr 2003 14:05:41 +0200
Cc: xml-dev@lists.xml.org,Laurent Bihanic <laurent.bihanic@atosorigin.com>
In-reply-to: <p04330103bab494833f6d@[192.168.254.4]>
Organization: Expway
References: <p04330103bab494833f6d@[192.168.254.4]>
Reply-to: robin.berjon@expway.fr
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3) Gecko/20030312

Elliotte Rusty Harold wrote:
> Now consider the case of a tree-based API such as DOM, JDOM, or XOM 
> which encounters a malformedness error. Traditionally, these APIs have 
> reported no information from a malformed document to the client 
> application. However, recently Laurent Bihanic submitted a patch to JDOM 
> in which as much of the document as had been able to be successfully 
> parsed was made available through the exception that was thrown to 
> indicate the malformedness error. This was quite clever. It had never 
> occurred to me, and I had never noticed any other API do anything similar.
> 
> What I'd like to get broader discussion of is whether this is a good 
> idea. There are certainly use cases for it.

I think libxml2 supports this (or at least its Perl wrapper, XML::LibXML, does) 
through a "recover" option. I don't know the actual details, but I believe it 
parses as much as it can and renders that as a DOM. I wanted to use it once to 
rescue a couple thousand documents from a variety of errors produced by the 
generating tool, but unfortunately all the errors were on or very close to the 
root element, so I had to regex my way out of it instead. I certainly would 
have found that option useful if it had been possible to recover more fully 
from the errors I was getting.
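
To make the "parse as much as you can, then hand back the tree" idea concrete, 
here is a minimal sketch using only Python's standard library. It is not how 
libxml2 or the JDOM patch is implemented; the names PartialDocumentError and 
parse_or_partial are hypothetical, chosen for illustration. It builds an 
ElementTree from expat callbacks and, on a well-formedness error, attaches 
whatever was built so far to the exception.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class PartialDocumentError(Exception):
    """Hypothetical exception that carries the tree built before the error."""

    def __init__(self, cause, partial_root):
        super().__init__(str(cause))
        self.partial_root = partial_root  # None if the error precedes the root


def parse_or_partial(text):
    """Return the root element, or raise PartialDocumentError with a partial tree."""
    root = None
    stack = []  # elements that are currently open

    def start(name, attrs):
        nonlocal root
        elem = ET.Element(name, attrs)
        if stack:
            stack[-1].append(elem)
        else:
            root = elem
        stack.append(elem)

    def end(name):
        stack.pop()

    def chars(data):
        if not stack:
            return
        elem = stack[-1]
        if len(elem):
            # Text after a child element belongs to that child's tail.
            last = elem[-1]
            last.tail = (last.tail or "") + data
        else:
            elem.text = (elem.text or "") + data

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    try:
        parser.Parse(text, True)
    except expat.ExpatError as err:
        # Everything reported before the error is still in the tree.
        raise PartialDocumentError(err, root) from err
    return root


if __name__ == "__main__":
    try:
        parse_or_partial("<a><b>ok</b><c>bad</a>")
    except PartialDocumentError as exc:
        print("error:", exc)
        print("recovered:", ET.tostring(exc.partial_root, encoding="unicode"))
```

Note that this shows the same limitation I ran into: if the error sits on or 
near the root element, the partial tree is close to empty.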

> Is this approach something to be encouraged? Should other tree-based 
> APIs like XOM and DOM copy this innovation? What advantages and 
> disadvantages have I not thought of?

I don't see any disadvantages, given that it throws an exception anyway. That 
approach does have one limitation, though: you don't get all of the (corrupted) 
data back, only what precedes the corruption. I'm currently working on a SAX 
parser wrapped around an HTML parser to try to provide a way to recover bad XML 
(one that is generic to any XML, unlike TagSoup, which is incredibly useful but 
currently targeted at HTML). I believe that non-well-formed XML sadly still 
happens too often, and providing users with tools to recover from such 
situations is definitely helpful.
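
As a rough illustration of the general shape of such a wrapper (not the actual 
parser described above), the sketch below drives a SAX ContentHandler from 
Python's lenient html.parser.HTMLParser. The class names LenientSaxDriver and 
EventLog are hypothetical. The recovery policy is an assumption: a mismatched 
end tag closes every element opened since its matching start tag, stray end 
tags are dropped, and anything still open at the end of input is closed. One 
known caveat is that HTMLParser lowercases element and attribute names, so this 
is not faithful for case-sensitive XML vocabularies, and it ignores namespaces.

```python
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class LenientSaxDriver(HTMLParser):
    """Hypothetical driver: feeds SAX events from a forgiving tokenizer."""

    def __init__(self, handler):
        super().__init__(convert_charrefs=True)
        self.handler = handler
        self.open_tags = []  # names of elements not yet closed

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        # Attributes without a value arrive as None; SAX expects strings.
        values = {k: (v if v is not None else "") for k, v in attrs}
        self.handler.startElement(tag, AttributesImpl(values))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            name = self.open_tags.pop()
            self.handler.endElement(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside any element
            self.handler.characters(data)

    def run(self, text):
        self.handler.startDocument()
        self.feed(text)
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever the input left open
            self.handler.endElement(self.open_tags.pop())
        self.handler.endDocument()


class EventLog(ContentHandler):
    """Records the events it receives, for inspection."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


if __name__ == "__main__":
    log = EventLog()
    LenientSaxDriver(log).run("<a><b>ok</b><c>bad</a>")
    for event in log.events:
        print(event)
```

Unlike the exception-based approach, the consumer here sees the data after the 
damaged spot as well, at the cost of the driver guessing where elements should 
have been closed.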

-- 
Robin Berjon <robin.berjon@expway.fr>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488
