At 07:40 PM 3/29/2003 -0500, Elliotte Rusty Harold wrote:
>No, validation doesn't help because it has absolutely nothing to say about
>comments, processing instructions, CDATA sections, white space in tags, or
>character entities and very little to say about entity references. It's
>just too hard to tell what is and isn't the string you're looking for
>without using a genuine parser.
This is what I had thought most people would expect - regular expressions
are not normally what you use to parse something described by a BNF. And it
also agrees with other things I have read, e.g. [1]:
This is a example of how NOT to process XML using Perl. Please
don't use regular expressions on XML, in the very short run
you will be bitten. This was by far the most painful example
to write and although it does the job it will break for the
next version of the RFC. Entity resolution especially is much
easier if you use a parser.
And of course, with entities, default values, namespace prefixes, and CDATA
sections, it's really quite difficult to interpret a document based only on
textual patterns in the document instance itself. Many people here will
remember the following example from [2], where these two elements must be
treated as identical:
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<dc:date>2003-01-12T00:18:05-05:00</dc:date>
<link>http://bitworking.org/news/8</link>
<description>Upon waking, the dinosaur...</description>
</item>
<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
<root:title>MetaData</root:title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
<root:link>http://bitworking.org/news/8</root:link>
<description>Upon waking, the dinosaur...</description>
</root:item>
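To make the point concrete, here is a rough sketch in Python using the standard library's xml.etree.ElementTree (my choice for illustration, not something from the thread). The two fragments are trimmed versions of the elements above. I have left out the xmlns:root="" declaration, on the assumption that a namespace-aware parser may reject a prefix bound to the empty string. The helper names date_via_parser and date_via_regex are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI shared by both fragments; only the prefix differs.
DC = "http://purl.org/dc/elements/1.1/"

first = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<title>MetaData</title>"
    "<dc:date>2003-01-12T00:18:05-05:00</dc:date>"
    "</item>"
)
second = (
    '<item xmlns:bc="http://purl.org/dc/elements/1.1/">'
    "<title>MetaData</title>"
    "<bc:date>2003-01-12T00:18:05-05:00</bc:date>"
    "</item>"
)


def date_via_parser(xml_text):
    """Look the element up by namespace URI, ignoring the prefix."""
    root = ET.fromstring(xml_text)
    node = root.find("{%s}date" % DC)
    return None if node is None else node.text


def date_via_regex(xml_text):
    """Naive textual match that hard-codes the 'dc' prefix."""
    match = re.search(r"<dc:date>(.*?)</dc:date>", xml_text)
    return match.group(1) if match else None
```

With these definitions, date_via_parser returns the same date for both fragments, while date_via_regex finds it only in the first, because the second spells the prefix "bc".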
Of course, Joe wanted to solve this problem with an XML subset based on the
following rules:
1. All namespace declarations must be done in the root element.
2. Never a declaration for the "" namespace. I.e. if an element sits in
the "" namespace then the element name will never have a namespace
qualifier.
3. No CDATA sections.
4. No DTDs.
Those rules are fine if you control the production of the XML as well as
the consumption. If you don't, you need to be able to interpret whatever
XML someone throws at you, and that may be too hard for regular expressions.
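As a small sketch of what "whatever XML someone throws at you" can look like, the Python below (standard library only) compares a naive pattern with a parser on a made-up document containing a comment, a CDATA section, and an entity reference. The document and the helper names text_via_regex and text_via_parser are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a commented-out element, a CDATA section,
# and a predefined entity reference.
doc = (
    "<item>"
    "<!-- <title>Old title</title> -->"
    "<title><![CDATA[Tom & Jerry]]></title>"
    "<description>Fish &amp; chips</description>"
    "</item>"
)


def text_via_regex(tag, xml_text):
    """Return the first textual match for <tag>...</tag>."""
    match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), xml_text)
    return match.group(1) if match else None


def text_via_parser(tag, xml_text):
    """Return the text of the first child element named tag."""
    node = ET.fromstring(xml_text).find(tag)
    return None if node is None else node.text
```

Here the pattern picks up the title inside the comment and leaves &amp; unresolved, while the parser skips the comment, unwraps the CDATA section, and resolves the entity. The pattern could be patched for each case, but the patches add up to writing a parser.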
Isn't the lesson simply that you need a parser to interpret XML? And if so,
why is that a problem? Most languages I use require a parser...
Jonathan
[1] Ways to Rome: Processing XML with Perl
Original version by Ingo Macherius, <macherius@gmd.de>
Maintained by Michel Rodriguez, <m.v.rodriguez@ieee.org>
Version: 2.1: 2002-09-17
http://xmltwig.cx/perl_survey/perl_survey.html
[2] Regex-able XML: Is there a Regex-able subset of XML?
Joe Gregorio
http://bitworking.org/news/40