Without wishing in any way to vitiate John Cowan's plea for attendees at
XML2002, I should like to announce that it is possible to see John for
free at the next meeting of the XML SIG, Tuesday 12 November, 7 - 9 p.m.,
at 125 Broad Street in downtown Manhattan. John will be unveiling his
project "TagSoup, A SAX Parser For Nasty, Ugly HTML" (abstract below).
It is, however, necessary to register in advance in order to reserve a
place for this presentation. You may do that by emailing a request to me
at mailto:email@example.com. You will receive a confirmation by return
email.
XML SIG Presentation 12 November 2002
John Cowan: "TagSoup, A SAX Parser For Nasty, Ugly HTML"
For the last year I have been working on a new parser written in Java
that, instead of parsing well-formed or valid XML, parses HTML as it is
found in the wild: nasty and brutish, though quite often far from short.
TagSoup is designed for people who have to process this stuff using some
semblance of a rational application design. By providing a SAX
interface, it allows standard XML tools to be applied to even the worst
HTML. TagSoup is now very close to being ready for its first public Open
Source release under the Academic Free License, a cleaned-up and
patent-safe BSD-style license which allows proprietary re-use.
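To make the SAX point concrete, here is a minimal sketch of a handler that counts element names. It is written against the standard SAX interfaces only, so it should work with any XMLReader. The JDK's built-in XML parser stands in for TagSoup here (TagSoup's own class names are not given in this abstract), which means the sample input has to be well-formed; the class name ElementCounter is purely illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative handler: counts start-tag events by element name. */
final class ElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    /** Parses the document with whichever XMLReader the caller supplies. */
    static Map<String, Integer> count(XMLReader reader, String document)
            throws IOException, SAXException {
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(document)));
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK parser is a stand-in; any other SAX parser
        // could be passed to count() in its place.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // prints {body=1, html=1, p=2}
        System.out.println(count(reader, "<html><body><p>one</p><p>two</p></body></html>"));
    }
}
```

The handler never learns which parser produced the events, which is the property that lets ordinary SAX tooling sit downstream of an HTML parser.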
TagSoup is a parser, not a whole application; it isn't intended to
permanently clean up bad HTML, as HTML Tidy does, only to parse it on
the fly. Therefore, it does not convert presentation HTML to CSS or
anything similar. It does guarantee well-structured results: tags will
wind up properly nested, default attributes will appear appropriately,
and so on.
The semantics of TagSoup are, as far as practical, those of actual HTML
browsers. In particular, never, never will it throw any sort of syntax
error: the TagSoup motto is "Just Keep On Truckin'". But there's much,
much more. For example, if the first tag is LI, it will supply the
application with enclosing HTML, BODY, and UL tags. Why UL? Because
that's what browsers assume in this situation. For the same reason,
overlapping tags are correctly restarted whenever possible. Text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text
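The restart rule in the bold/italic example can be sketched with a stack of open elements. This is a toy illustration, not TagSoup's actual algorithm: the class TagRestarter and its tokenizer are hypothetical, tags carry no attributes, and every element is treated as restartable, whereas a real parser would consult a schema.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy sketch: re-nests overlapping tags by closing and restarting them. */
final class TagRestarter {
    // Either a bare start/end tag (no attributes) or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|[^<]+");

    static String renest(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost open element first
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(2) == null) {            // plain text
                out.append(m.group());
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {          // start tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {    // end tag with a matching open element
                List<String> restart = new ArrayList<>();
                // Close everything opened after the target element.
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);       // keep original opening order
                }
                open.pop();
                out.append("</").append(name).append('>');
                // Restart the elements that were cut short.
                for (String inner : restart) {
                    open.push(inner);
                    out.append('<').append(inner).append('>');
                }
            }
            // A stray end tag is dropped rather than reported as an error.
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(renest(
                "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Run on the input above, the sketch reproduces the rewritten line shown in the example; it never throws, it just closes or drops whatever does not fit.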
By intention, TagSoup is small and fast. After release, I will spend
some time making it faster if it turns out to be too slow. It does not
depend on the existence of any framework other than SAX, and should be
able to work with any framework that can accept SAX parsers.
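As one illustration of a framework that accepts SAX parsers, the sketch below hands an XMLReader to the standard JAXP identity transformer through SAXSource. Again the JDK parser is only a stand-in for TagSoup, and the class name SaxPipeline is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Illustrative pipeline: any SAX parser in, serialized XML out. */
final class SaxPipeline {
    static String serialize(XMLReader reader, String document) throws TransformerException {
        // A transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // SAXSource lets the transformer drive the supplied parser itself.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(document)));
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Assumption: stand-in parser; a different XMLReader could be used here.
        XMLReader reader = factory.newSAXParser().getXMLReader();
        System.out.println(serialize(reader, "<ul><li>one</li><li>two</li></ul>"));
    }
}
```

Replacing the identity transformer with one built from a stylesheet would give an XSLT pipeline with no other changes to the code.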
If your tag soup is not HTML, TagSoup can use a custom schema (written
in Tag Soup Schema Language, a subset of RELAX NG compact syntax)
instead of using the default HTML schema. You can also replace the
low-level HTML scanner with one based on Sean McGrath's PYX format (very
close to James Clark's ESIS format). In addition, you can supply an
AutoDetector that peeks at the incoming byte stream and guesses a
character encoding for it. (Otherwise, the platform default is used. If
someone supplies a good AutoDetector, I may package it with later
releases.)
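To give a feel for what an encoding detector might do, here is a rough sketch that peeks at the first bytes of a stream, recognizes a Unicode byte order mark if one is present, and otherwise falls back to the platform default. The class BomSniffer and its methods are hypothetical and are not the AutoDetector interface itself; a good detector would also look at things like META tags. The sketch assumes Java 9 or later for readNBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical detector: guesses an encoding from a byte order mark. */
final class BomSniffer {
    /** Wraps the stream in a Reader using the guessed encoding. */
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream peekable = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = peekable.readNBytes(head, 0, 3);

        Charset charset = Charset.defaultCharset(); // fallback: platform default
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Put back everything that was peeked except the mark itself.
        peekable.unread(head, bomLength, n - bomLength);
        return new InputStreamReader(peekable, charset);
    }

    /** Convenience for trying the detector on an in-memory byte array. */
    static String decode(byte[] bytes) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        System.out.println(decode(utf16le)); // prints hi
    }
}
```

Returning a Reader keeps the detector independent of whatever consumes the characters afterwards.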
The presentation will focus on practical results: you will learn how to
use TagSoup in its simple HTML mode, and get an idea of which features
can be customized and how.