Re: [xml-dev] Parsing XML with anything but

On Mon, Dec 9, 2013 at 3:16 PM, Uche Ogbuji <uche@ogbuji.net> wrote:

I'd rather ask: why on earth would anyone suffer through XSLT if they had Haskell? HaXml is actually a very respectable parser and toolkit.

Also I think you're mixing up two issues. TagSoup and BeautifulSoup were both designed for parsing what passes for HTML on the Web, and not for XML. With all respect to Mike Kay, his tools won't directly help you there.

But thanks to TagSoup, his tools help me indirectly, at a price that I consider as good as free. I believe the same mechanism is behind the saxon:parse-html extension.

Compared against the hoops people are prepared to jump through to make data SQLizable, filtering HTML through TagSoup sits firmly in the column labelled trivial. So much so that I regard HTML as, for all intents, parseable by X(SLT|Query).
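To make the "filter first, then query" idea concrete, here is a minimal sketch in Python. It is not TagSoup and does not reproduce its repair rules; the SoupToTree class and parse_soup helper are hypothetical names, built on the standard library's lenient html.parser as a stand-in. It only shows that once sloppy markup has been coerced into a well-formed tree, ordinary XML tooling can take over.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content; a deliberately short, illustrative list.
VOID = {"br", "hr", "img", "input", "meta", "link"}

class SoupToTree(HTMLParser):
    """Hypothetical stand-in for a TagSoup-style filter (not TagSoup itself)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):
            # Text after a child element belongs to that child's tail.
            last = cur[-1]
            last.tail = (last.tail or "") + data
        else:
            cur.text = (cur.text or "") + data

def parse_soup(html):
    parser = SoupToTree()
    parser.feed(html)
    parser.close()
    return parser.root

# Unclosed <p> and <b>, an unquoted attribute, a bare <br>: not XML by any measure.
soup = "<p>Hello <b>world<p>Second <a href=x.html>link</a><br>tail"
tree = parse_soup(soup)
links = [a.get("href") for a in tree.iter("a")]
print(links)
```

The resulting tree is well formed, so it can be serialized and handed to an XSLT or XQuery processor. A real filter does far more (implied end tags, namespaces, encoding sniffing), which is exactly the work I am happy to get for free.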

Sure people do use them for processing XML sometimes, and not just XHTML. Well, people also use regexen for doing so. I personally use grep on XML almost as much as I do the compliant parsers that I myself have worked on.

My concern with regexp solutions would be robustness, extensibility and readability, so I would never do it. On the one occasion I used regexps within an XSLT conversion to up-convert text and then had to amend it 8 months later, I spent days eyeballing it to try to get a handle on it so I could amend it. Why did I have to amend it? Because it was only extensible enough to handle the variations of input it was tested with.
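A small illustration of the robustness point, as a sketch in Python rather than anything from the conversion I mentioned. The two documents, the pattern and the function names are made up for the example. Both documents carry the same information, but the regexp was written against only the first serialization.

```python
import re
import xml.etree.ElementTree as ET

# Same element, same attributes; only the serialization differs.
doc_a = '<item id="1" price="9.50">Tea</item>'
doc_b = "<item price='9.50'\n      id='1'>Tea</item>"

# Written (and tested) against doc_a only: assumes attribute order,
# double quotes and single-space separators.
PRICE_RE = re.compile(r'<item id="\d+" price="([^"]+)">')

def price_by_regex(xml):
    m = PRICE_RE.search(xml)
    return m.group(1) if m else None

def price_by_parser(xml):
    # The parser does not care about attribute order, quote style or whitespace.
    return ET.fromstring(xml).get("price")

for doc in (doc_a, doc_b):
    print(price_by_regex(doc), price_by_parser(doc))
```

The regexp finds the price in the first document and silently returns nothing for the second, while the parser returns the same value for both. The pattern can of course be patched for this variation, and then for the next one, which is how you end up eyeballing it for days.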

<snip>

Well, that comes back to what I always come back to on this list. XML is too complicated, and that complication necessarily manifests itself in fully compliant tools. Developers have come to hate XML and just want to crowbar, chisel and scrape it as quickly as possible into a structure that they can actually understand, using a tool that seems to hate XML as much as they do.

I wanted to know whether there was anything more to it than this. The same set of people who will decry parentheses in Scheme or angle-bracketed markup will happily type hieroglyphics at a MongoDB shell prompt without a murmur of protest. So it's not based on rationality, then.

We can complain all we want about the poor professional state of developers as we understand it, but it's the raw reality, and we can either make it easier for them to do the right thing, or keep on tsk tsking at them from here.

Give a high-performance car an interface that makes it easier to drive... then place an unqualified driver at the wheel...