Re: [xml-dev] parsing markup with Perl

My experience with Perl is positive, to put it mildly: I used it as a preprocessor for XML processing, an "XMLifier" as I would call it. The task at hand was a very versatile reporting tool for log data distributed over a dozen log files, each with a different log event format, all non-XML but most containing embedded XML fragments. (The XML was never parsed: Perl just identified the beginning and end of the XML fragments and transferred each as a chunk into the XML output.) It worked wonderfully: Perl raced through the lines of (typically) 50-200 MB of non-XML log messages in less than a minute, applying a few dozen regular expressions which never caused any trouble, and emitted XML. Saxon then stepped in and wrought its miracles, which would have been impossible to imagine (not to mention implement) had they not been designed and defined in terms of XML processing. The Perl-enabled transition to the XML data model was not the implementation of a predefined task; it enabled the very discovery of a task, a radically new perspective on querying and reporting capabilities no one had imagined. (Hitherto, developers had thought that the natural way to analyze log data was grepping.)
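For readers who want to picture the XMLifier step, here is a minimal sketch, written in Python rather than Perl and not the original code. The log line layout ("date time LEVEL message"), the element names (log, event, text, raw) and the function names are all assumptions made up for illustration. As in the description above, the embedded fragment is never parsed: a regular expression only finds its first start tag and the last matching end tag, and the chunk is copied through unchanged.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical log layout: "<date> <time> <LEVEL> <message>".
LINE_RE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

# Find an embedded fragment by its first start tag and the last end tag
# carrying the same name. The fragment is located, not parsed.
# Limitation of this sketch: self-closing fragments such as <x/> are not found.
FRAGMENT_RE = re.compile(r"<([A-Za-z_][\w.-]*)\b[^>]*>.*</\1\s*>", re.DOTALL)


def xmlify_line(line):
    """Turn one log line into one <event> element (element names are assumed)."""
    line = line.rstrip("\n")
    m = LINE_RE.match(line)
    if m is None:
        # Unknown format: keep the line, escaped, so nothing is lost.
        return "<raw>%s</raw>" % escape(line)
    timestamp, level, message = m.groups()
    frag = FRAGMENT_RE.search(message)
    if frag is None:
        body = "<text>%s</text>" % escape(message)
    else:
        # Escape the non-XML text around the fragment; copy the fragment as is.
        before = escape(message[:frag.start()])
        after = escape(message[frag.end():])
        body = "<text>%s</text>%s<text>%s</text>" % (before, frag.group(0), after)
    return "<event time=%s level=%s>%s</event>" % (
        quoteattr(timestamp), quoteattr(level), body)


def xmlify(lines):
    """Wrap all events in a single root element so the output is one document."""
    yield "<log>"
    for line in lines:
        yield xmlify_line(line)
    yield "</log>"


if __name__ == "__main__":
    sample = [
        '2014-02-08 15:20:01 INFO payload=<order id="7"><item>x</item></order> done',
        "2014-02-08 15:20:02 WARN disk usage > 90% & rising",
    ]
    print("\n".join(xmlify(sample)))
```

The output is well-formed only if the embedded fragments themselves are, which is the price of not parsing them. Once the data is in this shape, an XQuery or XSLT processor such as Saxon can take over.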

To summarize - regular expressions are a grand way to solve one's dependence on regular expressions - "parse once and for (XQuery) all". (The only alternative would be to write a Domain Specific Language, of course.)

One last remark: the main problem with regular expressions for most people is that they never take the time to learn the little language completely, believing that they save time by just "looking up" solutions on Stack Overflow & Co., which is an illusion. This is an aspect regex has in common with XPath.

As in so many places - integration is the magic word. Not XML or something else, but XML and something else.

Hans-Juergen

From: Michael Kay <mike@saxonica.com>
To: Rick Jelliffe <rjelliffe@allette.com.au>
CC: "xml-dev@lists.xml.org OASIS" <xml-dev@lists.xml.org>
Sent: Saturday, 8 February 2014, 15:20
Subject: Re: [xml-dev] parsing markup with Perl

They had unreadable code and it was driving them into the ground. My takeaway was that Perl as it stood then required infeasibly much commenting to be maintainable.

My only encounter with Perl was equally negative. I was called in as a consultant to rescue a system that had serious performance problems (like response times of two minutes for customers checking the balance on their accounts). It all turned out to be due to one module, written in Perl, which was doing regex-based transformations on XHTML pages. It took a while to work out what the 500 lines of Perl were trying to do, but in the end we rewrote it (using Java DOM, I believe - the project wasn't a good place right then for anything innovative) and solved all the problems at a stroke.

Regular expressions seem to have two problems. The first is that they are unreadable. Anything but the simplest regex is impenetrable to anyone reading the code, and often to the person writing it, which is why debugging is so hard. The second is that performance is highly unpredictable except to people who really understand the technology extremely well.

I think we should treat Perl a bit like certain pesticides; something you're only allowed to use if you've been through the right training courses and have acquired a license, which has to be renewed every year by passing exams.

Michael Kay

Saxonica