Re: [xml-dev] Parsing XML with anything but
- From: Amelia A Lewis <amyzing@talsever.com>
- To: liam@w3.org
- Date: Mon, 9 Dec 2013 23:10:04 -0500
Hey, Liam!
On Mon, 09 Dec 2013 22:07:09 -0500, Liam R E Quin wrote:
> The "desperate perl hacker" was a significant and much-discussed use
> case during XML development, and was part of why we chose a self-evident
> empty element syntax.
Mmmmm. I suggest that you didn't succeed. XML, in the general case,
cannot be reliably handled with regular expressions. This is
unsurprising; the problem of parity is literally a textbook case for
the limitations of regular expressions (regular languages, regular
grammars, finite state automata) in parsing. XML's reliance on parity
both for tag delimiters (<>) and for start/end semantics (<></>) is
fairly unquestionable.
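A minimal sketch of the parity problem, in Python (the element name `a`, the sample documents, and the helper `regex_outer_content` are made up for illustration, not taken from any real code base). Neither the greedy nor the non-greedy form of a "start tag ... end tag" pattern gets both nested and sibling elements right, while a parser that tracks depth handles both:

```python
import re
import xml.etree.ElementTree as ET

# Two tiny, well-formed documents (hypothetical examples).
nested = "<a><a>inner</a>tail</a>"
siblings = "<r><a>x</a><a>y</a></r>"


def regex_outer_content(doc, greedy):
    """Return what a naive regex believes is the content of the first <a>."""
    pattern = r"<a>(.*)</a>" if greedy else r"<a>(.*?)</a>"
    return re.search(pattern, doc).group(1)


# Non-greedy stops at the first </a>, which closes the *inner* element.
print(regex_outer_content(nested, greedy=False))    # <a>inner
# Greedy runs on to the last </a>, swallowing the second sibling.
print(regex_outer_content(siblings, greedy=True))   # x</a><a>y

# A parser keeps track of which end tag closes which start tag.
outer = ET.fromstring(nested)
print(outer[0].text, outer[0].tail)                 # inner tail
print([e.text for e in ET.fromstring(siblings)])    # ['x', 'y']
```

Each pattern is right on one input and wrong on the other; picking the right one requires already knowing the nesting, which is the thing the pattern was supposed to find.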
Developing a library of regular expressions that handles a series of
special cases in XML is a good way of falling prey to the classic Perl
programmer's virtue of hubris. That code may be safe in your own
(desperate hackish) hands; it isn't safe in someone else's.
One of my earliest experiences of this, around 2001, had to do with a
processor for handling SOAP (probably 1.1). The designer, a developer
who is *significantly* smarter, better-trained (in computer sciences in
general, though not in XML or markup in particular), and more
experienced than I am (or was), decided that a namespace declaration
binding the default prefix *necessarily* changed the prefix of
attributes-without-prefix. Gentle (and less-gentle) remonstration based
on specifications failed to change his mind. Since SAX wasn't doing the
right thing, he implemented code that caught the events, changed the
prefixes appropriately, and passed it on. And on output, it
did-the-right-thing for generating attributes. Since this blew up in
ways that those reading this list can probably easily imagine, the XML
geeks were required to make it work for all those situations.
Even deprecating this enormous pile of pigs' lips as our first activity
did not save us from the succeeding *two infinite years* of writing
increasingly baroque and fragile code to catch the output from this ...
desperate hack ... and turn it into something that was both well-formed
and valid. It had shipped as production code. Our later ships of the
production code could *not* say "we fucked up; we can't handle this
horse pucky," whatever our competitors did with it. We were finally
able to drop support for the versions of shipping products that used
this nightmare, and instead rely on well-vetted parsing code (like, the
original SAX before it got filtered) that Did the Right Thing, and to
throw out something over 20K lines of specialized "fix the problem that
we generated by failing to actually train up on the real problem rather
than our desperate-hackish conception of what it ought to be" code.
I haven't any patience for it. XML 1.0 namespaces are a disaster, XML
schema a living nightmare. Trying to cope with incoming XML that
*could* contain these things *without understanding those
specifications*, even if the plan is that the incoming stuff *won't*
contain them, is asking for problems. Because then you find you have to
cope with them. And you can't throw out all that beautiful work you've
done! And when you've moved on, and someone else is trying to deal with
the new inputs for the code that you wrote that worked so well ...
perhaps that's brilliant, rather than stupid, but it's not something
that's going to make your successor bless your name. Or the name of XML.
And that's a problem of training. Like the developer/designer/architect
who simply *could not believe* that the specification required that
elements and attributes respond differently to the declaration of a
binding to the default prefix: insufficient willingness to believe that
the specification writers could specify something boneheaded. Like the
DPH-s who wrote piles of regexes because the spec writers said "we're
making it work for you!" without looking at the specification and
discovering it's type-2 in the Chomsky hierarchy, not type-3.
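For anyone who shares that disbelief, a small sketch using Python's standard xml.etree.ElementTree (the namespace URI `urn:example:shop` and the element and attribute names are invented for the example). ElementTree reports names in `{uri}local` form, so the difference between elements and unprefixed attributes is directly visible:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one default namespace declaration, no prefixes.
doc = '<order xmlns="urn:example:shop" id="42"><item sku="a1"/></order>'

order = ET.fromstring(doc)
item = order[0]

# Unprefixed *elements* pick up the default namespace...
print(order.tag)     # {urn:example:shop}order
print(item.tag)      # {urn:example:shop}item

# ...but unprefixed *attributes* do not; they stay in no namespace.
print(order.attrib)  # {'id': '42'}
print(item.attrib)   # {'sku': 'a1'}
```

Code that "corrects" the attribute names to match the element namespace is therefore producing names that a conforming parser never reports.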
> Use of regular expressions does not need to be evidence of stupidity,
> nor of poor training.
In general? Absolutely not. In dealing with a grammar that is
context-free, but not regular? It's a sign of poor training at least.
If the expressions operate over something that's known to safely
conform to a regular grammar (necessarily a special case in XML
processing), then it's fine. Alas, anyone who succeeds with this is
going to keep going with it until the [^>] bites. That's an absolute
certainty if the code is used by more than one person, especially if
it's hand-me-down.
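A sketch of one way the [^>] bites, again in Python with made-up input (the `rule` element and its `test` attribute are hypothetical). A literal ">" is legal inside an attribute value, so the usual "everything up to the next >" tag pattern cuts the start tag short, while a parser returns the attribute intact:

```python
import re
import xml.etree.ElementTree as ET

# Well-formed XML: '>' may appear unescaped in an attribute value.
doc = '<rule test="a > b">body</rule>'

# The classic hand-rolled tag matcher.
tag = re.compile(r"<[^>]*>")

# The match ends at the '>' inside the quoted value, not at the end of the tag.
print(tag.match(doc).group(0))   # <rule test="a >

# A parser knows it is inside a quoted attribute value.
rule = ET.fromstring(doc)
print(rule.get("test"))          # a > b
print(rule.text)                 # body
```

The pattern works on every input whose author happened to escape ">" as "&gt;", which is exactly why it survives long enough to be handed down.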
> I admit to using regular expressions to process
> XML at times myself, although I also suppose that since I haven't
> received a whole lot of introductory XML training I'm poorly trained in
> XML...
I'm probably supposed to be intimidated, considering history and
authorship and such.
Sorry.
I think that if you turn over your aggregation of regexes to someone
else, then Bad Things Will Happen. I think that if you don't expect
that, then perhaps it's an indication of poor training or experience.
Naivete? Something. Perhaps you'd be one of the ones offering strong,
understandable, and written (so that they can be passed on) warnings on
the limitations of the bits that you're turning over to others, and
none of this applies.
> Absence of carefulness is a problem, but that can be a problem with any
> tool.
Hammers and screws are an inappropriate combination, as a general rule.
It has nothing to do with how careful one is, pounding the damned
things in.
However, let me provide another anecdote, on why this particular
analogy occurred to me. When I was young (and ... still not pretty,
alas), I was heavily involved in theater. Community theater, college
theater. Since I was notably *terrible* on stage, I ended up as part of
the supporting staff. We did things like building the sets. Our
director (who, in this environment, is probably better described as
BossAndGod), handed out lumber, fabric, screws, and ... yes, hammers.
To build the scrims for the backdrops. On purpose. Because they could
be hammered in, quickly, and later, when we tore it all down, a
screwdriver generally got the things back out. We weren't *allowed* to
use screwdrivers (no power tools in that era, mind; circumstances have
certainly changed since then) because it *took too long*. We were
always short on time when a show was coming up.
In other words, this was sensible behavior, for the circumstances. Not
that we could convince anyone involved with carpentry of it, mind. We
generally ended up with at least one person each year who had been a
carpenter's assistant, or who did carpentry of some sort for fun, who
*insisted* that we could be just as fast doing it the right way. They
may have even been right. Our way worked, though, and we knew how to
use our regexen^Whammers.
Mind you, when I tried to build my loft in my first dorm room at
college, I decided that perhaps I'd been misled. YMMV.
Amy!
--
Amelia A. Lewis amyzing {at} talsever.com
About the use of language: it is impossible to sharpen a pencil with a
blunt axe. It is equally vain to try to do it with ten blunt axes
instead.
-- Edsger Dijkstra