
   Re: [xml-dev] XML Parsing using DOM parser


>At 10:29 AM 02/01/02 -0800, Deepa Venkatesan wrote:
> >           Kindly share your thoughts.. I recently
> >wrote a code to parse an XML file containing catalog
> >content (as big as 10MB) using DOM parser. The
> >performance has been miserable particularly when the
> >XML file size increased. The problem is that using a
> >SAX parser (the only other alternative that strikes
> >me) I would have to rewrite the complete XML and the
> >code for this would be really elaborate. My final
> >objective of parsing is to change 2 lines for every
> >catalog item (the XML file has as many as 3000 catalog
> >items).

[Tim Bray]

>This may be a job for perl or python.  Both have XML parsers;
>in perl and I assume python these can be set up with a bit of work
>to pass everything through and let you fiddle with just the
>pieces you want.  If the incoming data was generated by a
>machine it's quite likely sufficiently regular that you don't
>even need to use the XML parser, just pattern-match for the
>tags you care about. This will run faster and be less work
>to write. -Tim

...with the caveat that both innocently and malevolently crafted,
fully 1.0-compliant XML may blow your application out
of the water if you bypass WF parsing in this way.

Let's say your pattern matcher is triggering on
<invoice> start-tags. Likely candidates for problems:
         Comments
         CDATA sections
         General Entity Refs

Comments:
         <!-- this ain't no <invoice> start-tag -->

CDATA sections:
         <![CDATA[ this ain't no <invoice> start-tag ]]>

General Entity Refs:
         <!DOCTYPE foo [
         <!ENTITY bar SYSTEM "bar.xml">
         <!-- lots of invoices in here but your pattern-matcher will never see them -->
         ]>
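
A minimal Python sketch of the comment and CDATA cases, assuming a made-up
three-line document and a naive regular expression (neither comes from the
original thread). It only illustrates how a pattern matcher and a WF parser
can disagree about how many <invoice> elements exist; the external entity
case is not covered here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one real <invoice> element, plus two decoys that
# merely look like start-tags to a pattern matcher.
DOC = """<invoices>
  <!-- this ain't no <invoice> start-tag -->
  <![CDATA[ this ain't no <invoice> start-tag ]]>
  <invoice id="1"/>
</invoices>"""


def count_by_pattern(text):
    """Count things that look like <invoice> start-tags, ignoring XML syntax."""
    return len(re.findall(r"<invoice[\s/>]", text))


def count_by_parser(text):
    """Count actual invoice elements as reported by a well-formedness parser."""
    root = ET.fromstring(text)
    return sum(1 for _ in root.iter("invoice"))


if __name__ == "__main__":
    print("pattern matcher:", count_by_pattern(DOC))
    print("parser:", count_by_parser(DOC))
```

The pattern matcher reports three hits where the parser reports one element,
because the comment and the CDATA section are character data, not markup.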

Oh and by the way, if your app needs to trigger on namespace-qualified
tags, pattern matching gets you into deep trouble if there are
default namespace decls around.
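
The namespace problem can be sketched the same way. The namespace URI, the
"inv" prefix and the element names below are hypothetical. The two documents
are equivalent as far as a namespace-aware parser is concerned, but a pattern
written against the prefixed form misses the one that uses a default
namespace declaration.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://example.com/invoice"  # hypothetical namespace URI

# Same element names in the same namespace, spelled two different ways.
PREFIXED = f'<inv:batch xmlns:inv="{NS}"><inv:invoice/></inv:batch>'
DEFAULTED = f'<batch xmlns="{NS}"><invoice/></batch>'


def pattern_hits(text):
    """Naive match on the prefixed tag name as it appears in the text."""
    return len(re.findall(r"<inv:invoice[\s/>]", text))


def parser_hits(text):
    """Namespace-aware lookup: match on {namespace-URI}local-name."""
    root = ET.fromstring(text)
    return len(root.findall(f".//{{{NS}}}invoice"))


if __name__ == "__main__":
    for label, doc in (("prefixed", PREFIXED), ("defaulted", DEFAULTED)):
        print(label, "pattern:", pattern_hits(doc), "parser:", parser_hits(doc))
```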

In my opinion, skipping WF parsing is too dangerous to countenance in
all but "throwaway" apps where you can live with the gotchas. For all other
cases, I'd advocate using a parser, and/or being more specific than saying
"use XML" when tying down interchange notations.
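
For the original task (changing a couple of values per catalog item without
building a DOM), a parser-based pass-through might look something like the
sketch below. It uses Python's standard xml.sax filter classes; the <price>
element name and the replacement value are assumptions made for illustration,
since the thread does not say which lines need changing.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class PriceRewriter(XMLFilterBase):
    """Pass every SAX event through unchanged, except text inside <price>.

    "price" is a hypothetical element name chosen for this sketch.
    """

    def __init__(self, parent, new_price):
        super().__init__(parent)
        self._new_price = new_price
        self._in_price = False

    def startElement(self, name, attrs):
        if name == "price":
            self._in_price = True
        super().startElement(name, attrs)

    def endElement(self, name):
        if name == "price":
            # Emit the replacement text just before the end-tag.
            super().characters(self._new_price)
            self._in_price = False
        super().endElement(name)

    def characters(self, content):
        # Drop the original text of <price>; forward everything else.
        if not self._in_price:
            super().characters(content)


def rewrite(xml_text, new_price):
    """Stream xml_text through the filter and return the rewritten document."""
    out = io.StringIO()
    reader = PriceRewriter(xml.sax.make_parser(), new_price)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.StringIO(xml_text))
    return out.getvalue()


if __name__ == "__main__":
    sample = "<catalog><item><name>A</name><price>1</price></item></catalog>"
    print(rewrite(sample, "9.99"))
```

Memory use stays roughly constant as the file grows, since nothing is held
beyond the current event. Note that this particular wiring does not carry
comments or the DOCTYPE through to the output; if those matter, a lexical
handler would need to be added.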

I go on about this periodically [1] and am delighted to have this opportunity
to re-wind my broken record so early in 2002:-)



