> The standard answer is to use tidy (http://tidy.sourceforge.net/) to
> convert to XHTML, and then parse it with an ordinary XML parser.
I wake up some nights dreaming that I'm working in a sweatshop writing HTML
parsing code and they won't let me go to the bathroom until it's 100% (:-)
I've got two nightmare HTML parsing stories...
The first was back in '96 when we were writing a web browser from scratch.
There was so much bad HTML out there already that the guy writing the parser
basically had to violate all the rules of *HTML* to make pages come
out the way other browsers showed them. Both Netscape and IE let badly broken
HTML through (but then again, most people already know that).
A couple of years ago, we tried to write a single-pass combo XML/HTML parser
for a product we were working on. Again, it was a total *nightmare* with
daily 'exception' reports. The engineer working on it wasn't too thrilled
about having to rewrite the YACC grammar on a weekly basis--the W3C HTML
specs were practically useless in real life. There were things being done at
popular websites (like AOL) that would set your teeth on edge. And visual
editors like Dreamweaver weren't helping any. It became an exercise in
futility. After about six months of this, we finally threw our hands up in
the air and ripped it all out and went with tidy and Xerces.
It still doesn't do a 100% job (tidy sometimes generates bad output, i.e.
XHTML that doesn't look anything like the original). But it's better than
anything else out there. Most other HTML parsing toolkits (including the
ones in Java) just give up on malformed input.
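For anyone who wants to try the tidy-then-XML-parser route, here is a rough sketch of the pipeline in Python. It is not what we shipped (that was tidy plus Xerces). It assumes the tidy command-line tool is installed and on your PATH, and it uses Python's built-in ElementTree as a stand-in for the XML parser. The function names tidy_to_xhtml and extract_links are made up for this example.

```python
import subprocess
import xml.etree.ElementTree as ET

# tidy puts its XHTML output in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_to_xhtml(html):
    """Pipe HTML text through the tidy CLI (assumed to be on PATH).

    Returns the XHTML as UTF-8 bytes. -numeric asks tidy for numeric
    entities, since a plain XML parser does not know names like &nbsp;.
    """
    result = subprocess.run(
        ["tidy", "-quiet", "-asxhtml", "-numeric", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings only) or 2 (errors).
    if result.returncode >= 2:
        raise ValueError("tidy could not repair the input")
    return result.stdout


def extract_links(xhtml):
    """Parse XHTML with an ordinary XML parser and collect href values."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
```

Usage would be something like extract_links(tidy_to_xhtml(page_text)). Everything after the tidy step is plain XML processing, which is the whole appeal. The catch is the one mentioned above: if tidy's repair doesn't resemble the original page, the XML side has no way of knowing.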
If somebody hasn't done so already, they should extract the Mozilla HTML
parser/DOM-builder and graft it onto a standard XML parser... I know it's
against XML rules, but it would have a lot of practical uses (like that of
the original poster).
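To make the grafting idea concrete, here is a toy version of it in Python. This is emphatically not the Mozilla parser: it swaps in the standard library's forgiving html.parser tokenizer and feeds its events into an ElementTree TreeBuilder, so that sloppy HTML comes out as a well-formed XML tree. The class name LenientTreeBuilder, the wrapper element "root", and the short VOID list are all assumptions made up for the sketch, and the recovery rules are far cruder than what a real browser does.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# A few elements that never take a closing tag (illustrative, not complete).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class LenientTreeBuilder(HTMLParser):
    """Feed forgiving HTML tokenizer events into an XML tree builder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open element names
        self._builder.start("root", {})  # hypothetical wrapper element

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) get "".
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self._builder.end(tag)  # close void elements right away
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray close tag: ignore it
        # Close everything left open inside the element being closed.
        while self._open:
            top = self._open.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        """Flush input, close whatever is still open, return the root."""
        self.close()
        while self._open:
            self._builder.end(self._open.pop())
        self._builder.end("root")
        return self._builder.close()


def parse_sloppy_html(text):
    parser = LenientTreeBuilder()
    parser.feed(text)
    return parser.finish()
```

Calling parse_sloppy_html("<p>one<br><b>two<p>three</i>") gives back an Element that ET.tostring can serialize as well-formed XML, after which any normal XML tooling applies. The tree it builds will often differ from what a browser would show, which is exactly why a real browser's parser would be the thing to graft on.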
Ramin Firoozye - Wizen Software
- multum in parvo -