OASIS Mailing List Archives

   If XML is too hard for regular expressions, perhaps he'd be better off w


At 07:40 PM 3/29/2003 -0500, Elliotte Rusty Harold wrote:
>No, validation doesn't help because it has absolutely nothing to say about 
>comments, processing instructions, CDATA sections, white space in tags, or 
>character entities and very little to say about entity references. It's 
>just too hard to tell what is and isn't the string you're looking for 
>without using a genuine parser.

This is what I had thought most people would expect: regular expressions 
are not normally what you use to parse something described by a BNF grammar. 
It also agrees with other things I have read, e.g. [1]:

         This is a example of how NOT to process XML using Perl. Please
         don't use regular expressions on XML, in the very short run
         you will be bitten. This was by far the most painful example
         to write and although it does the job it will break for the
         next version of the RFC. Entity resolution especially is much
         easier if you use a parser.
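
To make the point concrete, here is a minimal Python sketch (the sample 
document and the pattern are my own assumptions for illustration, not taken 
from either source). A naive pattern finds the "title" inside a comment and 
inside a CDATA section, and misses the real element because of white space 
in the tag and an entity reference in the content:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the only real <title> element is on the third line.
DOC = """<doc>
  <!-- <title>Old title</title> -->
  <title >Fish &amp; Chips</title >
  <note><![CDATA[<title>Not markup</title>]]></note>
</doc>"""

# Naive textual match: knows nothing about comments, CDATA, or entities.
naive = re.findall(r"<title>(.*?)</title>", DOC)

# A genuine parser skips the comment, treats CDATA as text, and
# resolves &amp; to "&".
root = ET.fromstring(DOC)
parsed = [el.text for el in root.iter("title")]

print(naive)   # strings the regex believes are titles
print(parsed)  # text of the actual <title> elements
```

The pattern could be patched for each case, but every patch is another 
piece of the XML grammar reimplemented by hand.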

And of course, with entities, default values, namespace prefixes, and CDATA 
sections, it's really quite difficult to interpret a document based only on 
textual patterns in the document instance itself. Many people here will 
remember the following example from [2], where these two elements must be 
treated as identical:

<item xmlns:dc="http://purl.org/dc/elements/1.1/">
   <description>Upon waking, the dinosaur...</description>

<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
   <description>Upon waking, the dinosaur...</description>
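
As a rough illustration of why prefixes defeat textual matching, here is a 
small Python sketch. The two documents are simplified variants I made up 
(they are not the fragments from [2]): the same Dublin Core namespace is 
bound to "dc" in one and "bc" in the other, and the helper names are 
hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Assumed sample documents: same namespace URI, different prefixes.
DOC_A = (f'<item xmlns:dc="{DC}">'
         '<dc:description>Upon waking, the dinosaur...</dc:description></item>')
DOC_B = (f'<item xmlns:bc="{DC}">'
         '<bc:description>Upon waking, the dinosaur...</bc:description></item>')

def description_via_regex(text):
    # Matches on the literal prefix "dc", which is only a local alias.
    m = re.search(r"<dc:description>(.*?)</dc:description>", text)
    return m.group(1) if m else None

def description_via_parser(text):
    # The parser resolves each prefix to its namespace URI, so the
    # lookup is by {uri}localname regardless of the prefix used.
    el = ET.fromstring(text).find(f"{{{DC}}}description")
    return el.text if el is not None else None

for doc in (DOC_A, DOC_B):
    print(description_via_regex(doc), "|", description_via_parser(doc))
```

The regex finds the description only in the document that happens to use 
the "dc" prefix, while the parser-based lookup treats both the same.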

Of course, Joe wanted to solve this problem with an XML subset based on the 
following rules:

1. All namespace declarations must be done in the root element.

2. Never a declaration for the "" namespace. I.e. if an element sits in
the "" namespace then the element name will never have a namespace prefix.

3. No CDATA sections.

4. No DTDs.

Those rules are fine if you control the production of the XML as well as 
the consumption. If you don't, you need to be able to interpret whatever 
XML someone throws at you, and that may be too hard for regular expressions.
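
If you do not control production, you could at least check whether an 
incoming document stays inside the subset. The following is only a sketch 
under my own assumptions: it uses Python's expat binding, the function name 
subset_violations is hypothetical, and it covers rules 1, 3 and 4 only 
(rule 2 is left out). Note that even this check relies on a parser:

```python
import xml.parsers.expat

def subset_violations(xml_text):
    """Return descriptions of violations of rules 1, 3 and 4 (sketch)."""
    violations = []
    depth = 0  # number of elements currently open

    def start_ns(prefix, uri):
        # expat reports a declaration before the start of the element
        # that carries it, so depth == 0 means "on the root element".
        if depth > 0:
            violations.append("rule 1: namespace declared below the root")

    def start_element(name, attrs):
        nonlocal depth
        depth += 1

    def end_element(name):
        nonlocal depth
        depth -= 1

    def start_cdata():
        violations.append("rule 3: CDATA section")

    def start_doctype(name, system_id, public_id, has_internal_subset):
        violations.append("rule 4: DTD")

    # A namespace separator switches on namespace processing.
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = start_ns
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.StartCdataSectionHandler = start_cdata
    parser.StartDoctypeDeclHandler = start_doctype
    parser.Parse(xml_text, True)
    return violations

print(subset_violations('<a xmlns:x="urn:x"><x:b>hi</x:b></a>'))
print(subset_violations(
    '<!DOCTYPE a><a><b xmlns:x="urn:x"><![CDATA[<]]></b></a>'))
```

A document that fails the check still has to be handled somehow, which 
brings you back to needing a full parser.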

Isn't the lesson simply that you need a parser to interpret XML? And if so, 
why is that a problem? Most languages I use require a parser...


[1] Ways to Rome: Processing XML with Perl
     Original version by Ingo Macherius, <macherius@gmd.de>
     Maintained by Michel Rodriguez, <m.v.rodriguez@ieee.org>
     Version: 2.1: 2002-09-17

[2] Regex-able XML: Is there a Regex-able subset of XML?
     Joe Gregorio, http://bitworking.org/news/40


