xml-dev - If XML is too hard for regular expressions, perhaps he'd be better off with a parser

To: Elliotte Rusty Harold <elharo@metalab.unc.edu>,Jeff Lowery <Jeff.Lowery@creo.com>
Subject: If XML is too hard for regular expressions, perhaps he'd be better off with a parser
From: Jonathan Robie <jonathan.robie@datadirect-technologies.com>
Date: Mon, 31 Mar 2003 11:12:02 -0500
Cc: xml-dev@lists.xml.org
In-reply-to: <p04330100baabed0ce9eb@[192.168.254.4]>
References: <4827EC373E03DB44AC2AB3900430896B02666F@csiex01.scenicsoft.com>

At 07:40 PM 3/29/2003 -0500, Elliotte Rusty Harold wrote:
>No, validation doesn't help because it has absolutely nothing to say about 
>comments, processing instructions, CDATA sections, white space in tags, or 
>character entities and very little to say about entity references. It's 
>just too hard to tell what is and isn't the string you're looking for 
>without using a genuine parser.

This is what I thought most people would expect: regular expressions 
are not normally what you use to parse something described by a BNF. It 
also agrees with other things I have read, e.g. [1]:

         This is an example of how NOT to process XML using Perl. Please
         don't use regular expressions on XML, in the very short run
         you will be bitten. This was by far the most painful example
         to write and although it does the job it will break for the
         next version of the RFC. Entity resolution especially is much
         easier if you use a parser.
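
To make the point concrete, here is a minimal sketch in Python (my choice 
of language, not the one used in [1]). The three sample documents and the 
naive pattern are made up for illustration; all three documents carry the 
same title as far as an XML parser is concerned.

```python
import re
import xml.etree.ElementTree as ET

# Three spellings of the same content (hypothetical sample data).
DOCS = [
    "<item><title>Q&amp;A</title></item>",
    "<item><title >Q&#38;A</title ></item>",
    "<item><!-- <title>old</title> --><title><![CDATA[Q&A]]></title></item>",
]

# A naive pattern of the kind one might write by hand (an assumption,
# not a pattern taken from the thread).
NAIVE = re.compile(r"<title>(.*?)</title>")


def title_by_regex(doc):
    """Return whatever the naive pattern captures, or None."""
    match = NAIVE.search(doc)
    return match.group(1) if match else None


def title_by_parser(doc):
    """Return the text of the <title> child as the parser reports it."""
    return ET.fromstring(doc).findtext("title")


regex_titles = [title_by_regex(d) for d in DOCS]
parser_titles = [title_by_parser(d) for d in DOCS]
```

The parser returns the same string for all three documents. The pattern 
returns the unresolved entity for the first, nothing for the second 
(white space inside the tags), and the commented-out title for the third.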

And of course, with entities, default values, namespace prefixes, and CDATA 
sections, it's really quite difficult to interpret a document based only on 
textual patterns in the document instance itself. Many people here will 
remember the following example from [2], where these two elements must be 
treated as identical:

<item xmlns:dc="http://purl.org/dc/elements/1.1/">
   <title>MetaData</title>
   <dc:date>2003-01-12T00:18:05-05:00</dc:date>
   <link>http://bitworking.org/news/8</link>
   <description>Upon waking, the dinosaur...</description>
</item>

<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
   <root:title>MetaData</root:title>
   <bc:date>2003-01-12T00:18:05-05:00</bc:date>
   <root:link>http://bitworking.org/news/8</root:link>
   <description>Upon waking, the dinosaur...</description>
</root:item>
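
A minimal sketch of why the prefix cannot be trusted, using Python's 
standard ElementTree. The two strings are cut-down versions of the 
elements above (only the date child, and without the `root` prefix), so 
treat them as illustrative rather than as the example from [2].

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Same namespace URI, different prefixes (simplified sample data).
A = ('<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
     '<dc:date>2003-01-12T00:18:05-05:00</dc:date></item>')
B = ('<item xmlns:bc="http://purl.org/dc/elements/1.1/">'
     '<bc:date>2003-01-12T00:18:05-05:00</bc:date></item>')


def expanded_names(doc):
    """List element names as {namespace-uri}local-name, in document order."""
    return [el.tag for el in ET.fromstring(doc).iter()]


def dc_date(doc):
    """Look the date up by namespace URI, ignoring whatever prefix was used."""
    return ET.fromstring(doc).findtext("{%s}date" % DC)
```

Both documents yield the same expanded names and the same date, while a 
textual search for `dc:date` finds it only in the first one.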

Of course, Joe wanted to solve this problem with an XML subset based on the 
following rules:

1.  All namespace declarations must be done in the root element.

2. Never a declaration for the "" namespace, i.e. if an element sits in
the "" namespace then the element name will never have a namespace
qualifier.

3. No CDATA sections.

4. No DTDs.

Those rules are fine if you control the production of the XML as well as 
the consumption. If you don't, you need to be able to interpret whatever 
XML someone throws at you, and that may be too hard for regular expressions.
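
As a rough illustration, a consumer that does not control production 
could at least check incoming documents against rules 1, 3 and 4 before 
trusting any pattern-based code. The sketch below uses Python's standard 
expat binding; the function name and the violation messages are my own, 
and rule 2 is left unchecked. Note that the check itself needs a parser.

```python
import xml.parsers.expat


def subset_violations(doc):
    """Return messages for each breach of rules 1, 3 and 4 found in doc."""
    violations = []
    depth = 0  # number of currently open elements

    def start_ns(prefix, uri):
        # expat reports a declaration before the start of the element
        # that carries it, so depth == 0 means it sits on the root.
        if depth > 0:
            violations.append("rule 1: namespace declared below the root")

    def start_element(name, attrs):
        nonlocal depth
        depth += 1

    def end_element(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = start_ns
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.StartCdataSectionHandler = (
        lambda: violations.append("rule 3: CDATA section"))
    parser.StartDoctypeDeclHandler = (
        lambda *args: violations.append("rule 4: DTD"))
    parser.Parse(doc, True)
    return violations
```

An empty list means the document stays within the checked rules; anything 
else is a document the restricted approach was never meant to handle.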

Isn't the lesson simply that you need a parser to interpret XML? And if so, 
why is that a problem? Most languages I use require a parser...

Jonathan

[1] Ways to Rome: Processing XML with Perl
     Original version by Ingo Macherius, <macherius@gmd.de>
     Maintained by Michel Rodriguez, <m.v.rodriguez@ieee.org>
     Version: 2.1: 2002-09-17
     http://xmltwig.cx/perl_survey/perl_survey.html

[2] Regex-able XML: Is there a Regex-able subset of XML?
     Joe Gregorio
     http://bitworking.org/news/40
