Re: [xml-dev] Parsing XML with anything but

From: Amelia A Lewis <amyzing@talsever.com>
To: liam@w3.org
Date: Mon, 9 Dec 2013 23:10:04 -0500

Hey, Liam!

On Mon, 09 Dec 2013 22:07:09 -0500, Liam R E Quin wrote:
> The "desperate perl hacker" was a significant and much-discussed use
> case during XML development, and was part of why we chose a self-evident
> empty element syntax.

Mmmmm. I suggest that you didn't succeed. XML, in the general case, 
cannot be reliably handled with regular expressions. This is 
unsurprising; the problem of parity (matching balanced, nested pairs) 
is literally a textbook case for the limitations of regular 
expressions (regular languages, regular grammars, finite state 
automata) in parsing. XML's reliance on parity both for tag delimiters 
(<>) and for start/end semantics (<></>) is fairly unquestionable.
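
A minimal sketch of that limitation, in Python. It is not from the original exchange; the sample documents and variable names are made up for illustration. It compares two naive regexes with a real parser (the standard library's xml.etree.ElementTree) on same-named nested and sibling elements:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: an element nested inside another of the same name.
doc = "<a><a>inner</a>tail</a>"

# Lazy regex: pairs the outer start tag with the FIRST end tag it sees.
lazy = re.search(r"<a>(.*?)</a>", doc).group(1)

# Greedy regex: copes with nesting, but runs straight across siblings.
siblings = "<r><a>x</a><a>y</a></r>"
greedy = re.search(r"<a>(.*)</a>", siblings).group(1)

# A parser keeps a stack of open elements, so nesting comes out right.
root = ET.fromstring(doc)
inner = root[0]

print(repr(lazy))              # '<a>inner'
print(repr(greedy))            # 'x</a><a>y'
print(inner.text, inner.tail)  # inner tail
```

Neither quantifier choice works for both inputs, because a regular expression has no way to count how many start tags are still open.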

Developing a library of regular expressions that handles a series of 
special cases in XML is a good way of falling prey to the classic Perl 
programmer's virtue of hubris. That code may be safe in your own 
(desperate hackish) hands; it isn't safe in someone else's.

One of my earliest experiences of this, around 2001, had to do with a 
processor for handling SOAP (probably 1.1). The designer, a developer 
who is *significantly* smarter, better-trained (in computer sciences in 
general, though not in XML or markup in particular), and more 
experienced than I am (or was), decided that a namespace declaration 
binding the default (empty) prefix *necessarily* changed the namespace 
of attributes-without-prefix. Gentle (and less-gentle) remonstration 
based on specifications failed to change his mind. Since SAX wasn't 
doing the right thing, he implemented code that caught the events, 
changed the prefixes appropriately, and passed them on. And on output, it 
did-the-right-thing for generating attributes. Since this blew up in 
ways that those reading this list can probably easily imagine, the XML 
geeks were required to make it work for all those situations.
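
For anyone who wants to check the rule in question, here is a small sketch in Python using the standard library's xml.etree.ElementTree. It is not the SOAP processor described above; the namespace URI and the element and attribute names are invented for illustration. A default namespace declaration applies to unprefixed element names but not to unprefixed attribute names:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: default namespace declared on the root element.
doc = '<e xmlns="urn:example:ns" attr="v"><child/></e>'
root = ET.fromstring(doc)

# ElementTree reports namespaced names as "{uri}local".
element_name = root.tag                 # the default namespace applies
child_name = root[0].tag                # inherited by unprefixed children
attribute_names = list(root.attrib)     # unprefixed attribute: no namespace

print(element_name)     # {urn:example:ns}e
print(child_name)       # {urn:example:ns}child
print(attribute_names)  # ['attr']
```

The attribute key comes back as plain 'attr', with no namespace attached, which is the behavior the specification requires and the behavior the designer refused to believe.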

Even deprecating this enormous pile of pigs' lips as our first activity 
did not save us from the succeeding *two infinite years* of writing 
increasingly baroque and fragile code to catch the output from this ... 
desperate hack ... and turn it into something that was both well-formed 
and valid. It had shipped as production code. Our later ships of the 
production code could *not* say "we fucked up; we can't handle this 
horse pucky," whatever our competitors did with it. We were finally 
able to drop support for the versions of shipping products that used 
this nightmare, and instead rely on well-vetted parsing code (like, the 
original SAX before it got filtered) that Did the Right Thing, and to 
throw out something over 20K lines of specialized "fix the problem that 
we generated by failing to actually train up on the real problem rather 
than our desperate-hackish conception of what it ought to be" code.

I haven't any patience for it. XML 1.0 namespaces are a disaster, XML 
Schema a living nightmare. Trying to cope with incoming XML that 
*could* contain these things *without understanding those 
specifications*, even if the plan is that the incoming stuff *won't* 
contain them, is asking for problems. Because then you find you have to 
cope with them. And you can't throw out all that beautiful work you've 
done! And when you've moved on, and someone else is trying to deal with 
the new inputs for the code that you wrote that worked so well ... 
perhaps that's brilliant, rather than stupid, but it's not something 
that's going to make your successor bless your name. Or the name of XML.

And that's a problem of training. Like the developer/designer/architect 
who simply *could not believe* that the specification required that 
elements and attributes respond differently to the declaration of a 
binding to the default prefix: insufficient willingness to believe that 
the specification writers could specify something boneheaded. Like the 
DPH-s who wrote piles of regexes because the spec writers said "we're 
making it work for you!" without looking at the specification and 
discovering it's type-2 in the Chomsky hierarchy, not type-3.

> Use of regular expressions does not need to be evidence of stupidity,
> nor of poor training. 

In general? Absolutely not. In dealing with a grammar that is 
context-free, but not regular? It's a sign of poor training at least.

If the expressions operate over something that's known to safely 
conform to a regular grammar (necessarily a special case in XML 
processing), then it's fine. Alas, anyone who succeeds with this is 
going to keep going with it until the [^>] bites. That's an absolute 
certainty if the code is used by more than one person, especially if 
it's hand-me-down.
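
One way the [^>] bites, sketched in Python. The sample element is made up for illustration and is not from any code discussed in this thread. A literal ">" is legal inside an attribute value in XML 1.0, so a tag-matching regex built on [^>] ends the tag too early:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: well-formed XML with '>' inside an attribute value.
doc = '<a title="x > y">text</a>'

# Naive tag stripper: treats the first '>' as the end of the start tag.
stripped = re.sub(r"<[^>]*>", "", doc)

# A parser knows it is still inside a quoted attribute value.
elem = ET.fromstring(doc)

print(repr(stripped))                # ' y">text'
print(elem.get("title"), elem.text)  # x > y text
```

The regex is fine for as long as nobody sends such an attribute, which is exactly the situation where it gets handed down to someone else.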

> I admit to using regular expressions to process
> XML at times myself, although I also suppose that since I haven't
> received a whole lot of introductory XML training I'm poorly trained in
> XML...

I'm probably supposed to be intimidated, considering history and 
authorship and such.

Sorry. 

I think that if you turn over your aggregation of regexes to someone 
else, then Bad Things Will Happen. I think that if you don't expect 
that, then perhaps it's an indication of poor training or experience. 
Naivete? Something. Perhaps you'd be one of the ones offering strong, 
understandable, and written (so that they can be passed on) warnings on 
the limitations of the bits that you're turning over to others, and 
none of this applies.

> Absence of carefulness is a problem, but that can be a problem with any
> tool.

Hammers and screws are an inappropriate combination, as a general rule. 
It has nothing to do with how careful one is, pounding the damned 
things in.

However, let me provide another anecdote, on why this particular 
analogy occurred to me. When I was young (and ... still not pretty, 
alas), I was heavily involved in theater. Community theater, college 
theater. Since I was notably *terrible* on stage, I ended up as part of 
the supporting staff. We did things like building the sets. Our 
director (who, in this environment, is probably better described as 
BossAndGod), handed out lumber, fabric, screws, and ... yes, hammers. 
To build the scrims for the backdrops. On purpose. Because they could 
be hammered in, quickly, and later, when we tore it all down, a 
screwdriver generally got the things back out. We weren't *allowed* to 
use screwdrivers (no power tools in that era, mind; circumstances have 
certainly changed since then) because it *took too long*. We were 
always short on time when a show was coming up.

In other words, this was sensible behavior, for the circumstances. Not 
that we could convince anyone involved with carpentry of it, mind. We 
generally ended up with at least one person each year who had been a 
carpenter's assistant, or who did carpentry of some sort for fun, who 
*insisted* that we could be just as fast doing it the right way. They 
may have even been right. Our way worked, though, and we knew how to 
use our regexen^Whammers.

Mind you, when I tried to build my loft in my first dorm room at 
college, I decided that perhaps I'd been misled. YMMV.

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
About the use of language: it is impossible to sharpen a pencil with a 
blunt axe. It is equally vain to try to do it with ten blunt axes 
instead.
                -- Edsger Dijkstra
