xml-dev - RE: In praise of tidy [was: Re: Conversion of existing web pages from HTML]

From: "Bruce, Ian" <ian.bruce@theso.co.uk>
To: "'Peter Murray-Rust'" <peter@ursus.demon.co.uk>, ",XML-Dev List" <xml-dev@xml.org>
Date: Fri, 28 Apr 2000 11:23:18 +0100
If anyone is interested, I have written a Perl/Tk wrapper around tidy which
lets you select a start directory and then "tidies" all files from that
point down.
If anyone wants a copy, email me.
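
For anyone who only needs the batch behaviour, here is a minimal sketch in
Python. It is not the Perl/Tk wrapper described above, just an illustration of
the same idea. It assumes a `tidy` executable on the PATH and uses its `-q`
(quiet), `-m` (modify in place) and `-asxml` options; the function names
`find_html_files` and `tidy_tree` are made up for this example.

```python
import subprocess
from pathlib import Path

# Assumption: only these extensions count as HTML pages.
HTML_SUFFIXES = {".html", ".htm"}


def find_html_files(start_dir):
    """Return every .html/.htm file at or below start_dir, sorted by path."""
    root = Path(start_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )


def tidy_tree(start_dir, tidy_cmd=("tidy", "-q", "-m", "-asxml")):
    """Run tidy in place on each HTML file; return {path: exit code}."""
    results = {}
    for path in find_html_files(start_dir):
        # -m rewrites the file, so work on a copy if the originals matter.
        proc = subprocess.run(
            [*tidy_cmd, str(path)], capture_output=True, text=True
        )
        results[path] = proc.returncode
    return results
```

Calling `tidy_tree("some/start/dir")` returns one exit code per file; a
non-zero code means tidy reported warnings or errors for that file.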

> Ian Bruce
> Electronic Publishing
> The Stationery Office Ltd
> Tel: 01603 695045
> Fax: 01603 696501
> http://www.tso-online.co.uk
> 
> -----Original Message-----
> From:	Peter Murray-Rust [SMTP:peter@ursus.demon.co.uk]
> Sent:	Friday, April 28, 2000 10:07 AM
> To:	,XML-Dev List
> Cc:	h.rzepa@ic.ac.uk
> Subject:	In praise of tidy [was: Re: Conversion of existing web pages
> from HTML]
> 
> At 10:12 PM 4/27/00 +0800, Rick JELLIFFE wrote:
> >Kiat Soh wrote:
> >> 
> >> I am wondering if there's anyone who tries converting
> >> the existing HTML pages to XML and XSL.
> 
> Many thanks for your question, Kiat - it has stimulated some very valuable
> discussion.
> 
> >
> >The place to start is to use Dave Raggett's tool "tidy" which
> >can clean up HTML, create CSS, and generate XHTML pretty well.
> >I recommend running the data twice through it to really get the
> >funnies removed.
> 
> May I also take this opportunity to congratulate Dave Raggett on his tidy
> program. I believe it has done more to promote the idea of re-use than
> almost anything else. Some additional points:
> 
> tidy is freely available on a wide range of platforms. It has been
> integrated with other tools to provide GUI-based systems, including editors.
> 
> tidy has a virtual community which keeps it up-to-date - there are very
> regular releases and "all bugs are shallow".
> 
> tidy not only produces well-formed HTML but does its best to produce HTML
> conformant with one of the myriad HTML DTDs. If the DTD is specified tidy
> will use that as a starting point - if not it will guess the most likely
> HTML DTD.
> 
> tidy will delete or modify elements and attributes that are inconsistent
> with the assumed DTD.
> 
> tidy will deliberately throw errors or warnings about bad style. These
> include:
> 	(a) hardcoded formatting markup (FONT, color, etc.), for which it
> 	    will instead add class attributes for use by a stylesheet;
> 	(b) failure to include accessibility attributes (img@alt,
> 	    table@summary).
> 
> tidy can output HTML as XML (the "asxml" option), and formats empty
> elements so that they parse as XML but do not break browsers.
> 
> 
> Therefore in a relatively painless manner, tidy ensures that an HTML page
> can be read by others without information loss. This - after all - is what
> XML is all about. So IF we urge everyone to make their HTML pages
> tidy-compatible we shall have increased the re-use of the information on
> the WWW by terabytes. As simple as that.
> 
> tidy is also an excellent way of learning about XML if you know HTML.
> Henry Rzepa and I are doing exactly that and have made tidy one of the
> key approaches in our VirtualXML ConCourse (to be announced RSN).
> 
> The main issues in converting "HTML" to XML would seem to be:
> 	- can I produce *my* HTML so that *I* can re-use it? Obviously you
> 	  should create XHTML.
> 	- can I create *my* HTML so that someone else (with whom there is no
> 	  prior agreement) can re-use it?
> 	- can I re-use some other person's HTML?
> 
> Note that in many sectors, publishers of HTML *want* the world to re-use
> their published material. HTML per se is rather weak - the use of H1-6 as
> "structuring" components makes it extremely difficult to extract the
> document structure. However, if the documents are XHTML there is still a
> lot that can be done. Here's a simple but powerful example:
> 	"Find images of molecules on the WWW".
> With bad HTML there is nothing that can be done. But with tidy-ed XHTML
> you could reasonably expect that all images had an alt attribute and
> those *might* include the substring "molecule", e.g.:
> 	<img src="fig32" alt="picture of aspirin molecule" />
> Retrieve the alt attribute values and search for the substring "molecule".
> Or even search the content of the containing element. In this way it would
> be trivial to identify all sites which contained pictures of molecules. 
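
A minimal sketch of that search in Python, using only the standard library.
The class and function names (`AltCollector`, `images_mentioning`) and the
second sample image are invented for illustration; fetching pages from the
web is left out, so the markup is passed in as a string.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute value of every img element."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style empty elements such as <img ... />.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alts.append(alt)


def images_mentioning(markup, word="molecule"):
    """Return the alt texts in markup that contain word (case-insensitive)."""
    collector = AltCollector()
    collector.feed(markup)
    collector.close()
    return [alt for alt in collector.alts if word.lower() in alt.lower()]


page = (
    '<p><img src="fig32" alt="picture of aspirin molecule" />'
    '<img src="logo" alt="site logo" /></p>'
)
print(images_mentioning(page))  # ['picture of aspirin molecule']
```

The same collector could be extended to record the text of the containing
element, which is the second search suggested above.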
> 
> Chemists responded very rapidly to the idea of chemical/MIME that Henry, I
> and Ben Whittaker promoted. I am confident that if we published this idea
> tomorrow, a number of sites would start using it profitably.
> 
> Another very powerful way that HTML can be used is through the <div>
> elements. I have now taken to writing HTML like:
> 
> <div class="chapter" title="Drugs">
>   <div class="section" title="beta-lactams">...
>   </div>
> </div>
> 
> With the use of CSS only (i.e. not even XSLT) this can be made to display
> as if it contained H2 and H3 elements, but moreover it contains complete
> structuring information which can be searched, re-used, transformed, etc.
> I suppose the reason it isn't more common is that you have to do some
> document design (i.e. thinking) before writing. But presumably authoring
> packages could be made to output this if there was demand. Again a
> virtually cost-free solution.
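
To illustrate how such structuring information could be re-used, here is a
small Python sketch that recovers an outline from nested div elements. It
assumes the page is well-formed XHTML (for example, after a pass through
tidy); the `outline` function and the `<body>` wrapper around the sample are
made up for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<div class="chapter" title="Drugs">
  <div class="section" title="beta-lactams"><p>...</p></div>
</div>
</body>"""


def local_name(tag):
    """Strip a namespace prefix such as {http://www.w3.org/1999/xhtml}."""
    return tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Yield (depth, class, title) for each nested div that has a title."""
    for child in element:
        if local_name(child.tag) == "div" and "title" in child.attrib:
            yield depth, child.get("class", ""), child.get("title")
            yield from outline(child, depth + 1)
        else:
            # Not a structural div: keep looking inside it at the same depth.
            yield from outline(child, depth)


entries = list(outline(ET.fromstring(SAMPLE)))
for depth, kind, title in entries:
    print("  " * depth + f"{kind}: {title}")
```

For the sample this prints "chapter: Drugs" followed by an indented
"section: beta-lactams", which is the outline that H1-6 markup makes so hard
to recover.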
> 
> This is an example of the human factor in XML - often forgotten in the
> Schema discussions. It's a simple, cost-free, exercise that would
> dramatically increase the value of published information for those who
> believe that XML is a tool, not a weapon.
> 
> 	P.
> 
> 
> **************************************************************************
> This is xml-dev, the mailing list for XML developers.
> To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
> List archives are available at http://xml.org/archives/xml-dev/
> **************************************************************************

