- To: <firstname.lastname@example.org>
- Subject: Re: [xml-dev] If XML is too hard for regular expressions, perhaps he'd be better off with a parser
- From: "Rick Jelliffe" <email@example.com>
- Date: Tue, 1 Apr 2003 15:20:33 +1000
- References: <4827EC373E03DB44AC2AB3900430896B02666F@csiex01.scenicsoft.com> <4827EC373E03DB44AC2AB3900430896B02666F@csiex01.scenicsoft.com> <firstname.lastname@example.org>
From: "Jonathan Robie" <email@example.com>
> This is what I had thought most people would expect - regular expressions
> are not normally what you use to parse something described by a BNF.
> Isn't the lesson simply that you need a parser to interpret XML? And if so,
> why is that a problem? Most languages I use require a parser...
The usual approach for people who want to do text processing on
an XML document is to accumulate a set of standard normalizing
tools, so that the XML is in a single format. This simplifies the
expressions needed for text processing. (For example, Omnimark
provided a script for normalizing, and I think SPAM was the normalizing
tool for SP.)
A conclusion that you cannot use text processing on XML and should
always use a parser is just wrong, and against experience. People hear
that XML simplifies SGML to make it more parseable, but what XML actually
addresses is the need for a DTD or SGML declaration in order to parse the
document (RCDATA content, shortrefs, minimization, etc.). The result is that
a document has a more simply parseable form once it is canonicalized
(which used to be called normalized, but other things use that term now).
Desperate Perl Hackers should canonicalize first: get rid of
CDATA sections, use a single literal delimiter, play with the namespaces,
and handle entity and character references.
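
As a rough illustration of that canonicalize-first step, here is a minimal sketch in Python rather than Perl. It covers only three of the items above: CDATA sections, a single literal delimiter for attribute values, and numeric character references. The function name canonicalize and the regular expressions are my own assumptions for this example, not any standard tool. Namespaces, named entities, comments and processing instructions are deliberately left alone, so treat it as a starting point and not as a conforming canonicalizer.

```python
import re
from xml.sax.saxutils import escape

# CDATA sections: <![CDATA[ ... ]]>
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Start tags and empty-element tags, allowing '>' inside quoted attribute values.
TAG_RE = re.compile(r"""<[A-Za-z_:](?:[^<>"']|"[^"]*"|'[^']*')*>""")
# One attribute value, in either kind of quote.
ATTR_RE = re.compile(r"""\s*=\s*(?:"([^"]*)"|'([^']*)')""")
# Numeric character references, hexadecimal or decimal.
CHARREF_RE = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

# Characters that must stay escaped after a reference is expanded.
PREDEFINED = {"<": "&lt;", ">": "&gt;", "&": "&amp;",
              '"': "&quot;", "'": "&apos;"}


def _requote(match):
    """Rewrite one attribute value so it is always delimited by double quotes."""
    if match.group(1) is not None:
        return '="' + match.group(1) + '"'
    return '="' + match.group(2).replace('"', "&quot;") + '"'


def _charref(match):
    """Expand a numeric character reference, keeping markup characters escaped."""
    if match.group(1):
        code = int(match.group(1), 16)
    else:
        code = int(match.group(2))
    ch = chr(code)
    return PREDEFINED.get(ch, ch)


def canonicalize(text):
    """Hypothetical helper: put XML text into one predictable surface form."""
    # 1. Replace each CDATA section with equivalent escaped character data.
    text = CDATA_RE.sub(lambda m: escape(m.group(1)), text)
    # 2. Use a single literal delimiter (double quotes) for attribute values.
    text = TAG_RE.sub(lambda m: ATTR_RE.sub(_requote, m.group(0)), text)
    # 3. Expand numeric character references.
    text = CHARREF_RE.sub(_charref, text)
    return text


if __name__ == "__main__":
    sample = "<a x='1' y = \"it's\"><![CDATA[1 < 2 & 3]]>&#65;&#x3C;</a>"
    print(canonicalize(sample))
```

Once the text is in this form, later expressions can assume double-quoted attributes and no CDATA sections, which is the simplification the normalizing tools are meant to buy.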
Of course, you might see this as splitting the parsing up into separate stages.
The reason for doing text processing rather than 'parsing' (i.e. into objects)
is usually to take advantage of text streams rather than objects, for example
where you want to process a very large document without creating zillions