   In praise of tidy [was: Re: Conversion of existing web pages from HTML]

  • From: Peter Murray-Rust <peter@ursus.demon.co.uk>
  • To: ",XML-Dev List" <xml-dev@xml.org>
  • Date: Fri, 28 Apr 2000 10:06:51 +0100

At 10:12 PM 4/27/00 +0800, Rick JELLIFFE wrote:
>Kiat Soh wrote:
>> I am wondering if there's anyone who tries converting
>> the existing HTML pages to XML and XSL.

Many thanks for your question, Kiat - it has stimulated some very valuable

>The place to start is to use Dave Raggett's tool "tidy" which
>can clean up HTML, create CSS, and generate XHTML pretty well.
>I recommend running the data twice through it to really get the
>funnies removed.

May I also take this opportunity to congratulate Dave Raggett on his tidy
program. I believe it has done more to promote the idea of re-use than
almost anything else. Some additional points:

tidy is freely available on a wide range of platforms. It has been
integrated with other tools to provide GUI-based systems including editors

tidy has a virtual community which keeps it up-to-date - there are very
regular releases and "all bugs are shallow".

tidy not only produces well-formed HTML but does its best to produce HTML
conformant with one of the myriad HTML DTDs. If the DTD is specified tidy
will use that as a starting point - if not it will guess the most likely

tidy will delete or modify elements and attributes that are inconsistent
with the assumed DTD.

tidy will deliberately throw errors or warnings about bad style. These include:
	(a) hardcoding formatting markup (FONT, color, etc.) and will instead add
class attributes for use by a stylesheet 
	(b) failure to include accessibility attributes (img@alt, table@summary). 

tidy can output HTML as XML (the -asxml option), and produces formatted
empty elements that will parse as XML but not break browsers.
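A minimal sketch of why the empty-element form matters, assuming Python's
standard xml.etree.ElementTree as the XML parser (the helper name
is_well_formed and the two sample strings are illustrative, not part of tidy):

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Classic HTML style: <br> is never closed, so an XML parser rejects it.
html_style = "<p>line one<br>line two</p>"

# XHTML style: the empty element is closed, with a space before the slash
# so that older browsers still read it as a plain <br>.
xhtml_style = "<p>line one<br />line two</p>"

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

The same check can be pointed at any page that has been through tidy to
confirm that the output really is parseable as XML.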

Therefore in a relatively painless manner, tidy ensures that an HTML page
can be read by others without information loss. This - after all - is what
XML is all about. So IF we urge everyone to make their HTML pages
tidy-compatible we shall have increased the re-use of the information on
the WWW by terabytes. As simple as that.

tidy is also an excellent way of learning about XML if you know HTML. Henry
Rzepa and I are doing exactly that and have made tidy one of the key
approaches in our VirtualXML ConCourse (to be announced RSN).

The main issues in converting "HTML" to XML would seem to be:
	- can I produce *my* HTML so that *I* can re-use it? Obviously you should
create XHTML.
	- can I create *my* HTML so that someone else (with whom there is no prior
agreement) can re-use it? 
	- can I re-use some other person's HTML?

Note that in many sectors, publishers of HTML *want* the world to re-use
their published material. HTML per se is rather weak - the use of H1-6 as
"structuring" components makes it extremely difficult to extract the
document structure. However if the documents are XHTML there is still a lot
that can be done. Here's a simple but powerful example:
	"Find images of molecules on the WWW".
With bad HTML there is nothing that can be done. But with tidy-ed XHTML you
could reasonably expect that all images had an alt attribute and those *might*
include the substring "molecule", e.g.:
	<img src="fig32" alt="picture of aspirin molecule" />
Retrieve the alt attribute values and search for the substring "molecule".
Or even search the content of the containing element. In this way it would
be trivial to identify all sites which contained pictures of molecules. 
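The search described above might look roughly like this. It is a sketch
only, assuming the page is already well-formed XHTML and using Python's
standard xml.etree.ElementTree; the function name find_images and the
second img element (the logo) are made up for illustration:

```python
import xml.etree.ElementTree as ET

def find_images(xhtml: str, keyword: str):
    """Return (src, alt) pairs for img elements whose alt text contains keyword."""
    root = ET.fromstring(xhtml)
    hits = []
    for el in root.iter():
        # Drop any "{namespace}" prefix so the check works with or
        # without the XHTML namespace declared on the root element.
        local_name = el.tag.rsplit("}", 1)[-1]
        alt = el.get("alt", "")
        if local_name == "img" and keyword.lower() in alt.lower():
            hits.append((el.get("src"), alt))
    return hits

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><img src="fig32" alt="picture of aspirin molecule" /></p>
    <p><img src="logo" alt="department logo" /></p>
  </body>
</html>"""

print(find_images(page, "molecule"))
# [('fig32', 'picture of aspirin molecule')]
```

Searching the content of the containing element instead would only need
the loop to look at the parent's text rather than the alt attribute.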

Chemists responded very rapidly to the idea of chemical/MIME that Henry, I
and Ben Whittaker promoted. I am confident that if we published this idea
tomorrow, a number of sites would start using it profitably.

Another very powerful way that HTML can be used is through the <div>
elements. I have now taken to writing HTML like:

<div class="chapter" title="Drugs">
  <div class="section" title="beta-lactams">...

With the use of CSS only (i.e. not even XSLT) this can be made to display
as if it contained H2 and H3 elements, but moreover contains complete
structuring information which can be searched, re-used, transformed, etc. I
suppose the reason it isn't commoner is because you have to do some
document design (i.e. thinking) before writing. But presumably authoring
packages could be made to output this if there was demand. Again a
virtually cost-free solution.
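To show how the structuring information can be recovered, here is a
small sketch that pulls an outline out of nested div elements. It assumes
the class/title convention shown above, no namespace on the elements, and
Python's standard xml.etree.ElementTree; the function name outline and the
"macrolides" section are hypothetical additions for the example:

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Collect (depth, class, title) for every structural div below element."""
    rows = []
    for child in element:
        if child.tag == "div" and child.get("class") and child.get("title"):
            rows.append((depth, child.get("class"), child.get("title")))
            # Structural divs nested inside this one sit one level deeper.
            rows.extend(outline(child, depth + 1))
        else:
            # Non-structural elements are transparent: keep the same depth.
            rows.extend(outline(child, depth))
    return rows

doc = """<body>
<div class="chapter" title="Drugs">
  <div class="section" title="beta-lactams"><p>...</p></div>
  <div class="section" title="macrolides"><p>...</p></div>
</div>
</body>"""

for depth, kind, title in outline(ET.fromstring(doc)):
    print("  " * depth + kind + ": " + title)
# chapter: Drugs
#   section: beta-lactams
#   section: macrolides
```

With H2 and H3 alone the nesting would have to be guessed from the order
of the headings; here it falls straight out of the element tree.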

This is an example of the human factor in XML - often forgotten in the
Schema discussions. It's a simple, cost-free exercise that would
dramatically increase the value of published information for those who
believe that XML is a tool, not a weapon.


This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/


