OASIS Mailing List Archives

 


   RE: Using Tidy for XML correction

  • From: Aleksi Niemelä <aleksi.niemela@cinnober.com>
  • To: 'xml-dev' <xml-dev@xml.org>
  • Date: Thu, 05 Oct 2000 17:30:53 +0200

Linda asked what to do to get documents like the following example to
vaguely resemble XML:

> <p>
> <list>
> <listitem>
> <courier>
> Some text
> </courier>
> </p>

If the problems in the "XML" files are really like this one, I'd write a
small program (or a few) to fix things, and move on.

For this, I might consider taking an HTML parser, since those usually accept
somewhat broken text (I guess at least Perl has something like that
already), and using it to read in the text, process it and write it back
out. The large number of files seems to indicate that they're quite small,
so you can load each one into memory in one piece, which eases processing
even more.
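As a rough sketch of the lenient-parser idea (in Python rather than Perl, using the standard library's html.parser), something like the following could re-emit the input with every element closed. The names TagBalancer and repair are made up for illustration, and the repair policy (close whatever is still open when an ancestor's end tag arrives, drop stray end tags) is only an assumption about what the files need. Note that html.parser lowercases tag names, which matters if the real documents use mixed case.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit tag soup with every element closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of element names that are still open
        self.out = []        # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (value is None) get their own name as value.
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape for XML.
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Close whatever is still open at end of input.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def repair(text: str) -> str:
    """Return text with unclosed elements closed (a sketch, not a validator)."""
    parser = TagBalancer()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    broken = "<p>\n<list>\n<listitem>\n<courier>\nSome text\n</courier>\n</p>\n"
    print(repair(broken))
```

Since each file is small, the whole thing can be read into one string and passed to repair() in a single call.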

And when you output what's been parsed, just delete or add tags, or do
whatever is needed. Or, for very simple cases, maybe you'd like to go an
even more brutal and effective way and apply some regular expressions or
other neat small hacks to get around the problem.
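For the brutal route, a single substitution might be enough. The sketch below assumes (purely from the one quoted example) that the generator never closes <listitem> and <list> before </p>, and that there is at most one unclosed <listitem> per paragraph. The names MISSING_CLOSE and close_list_before_p are hypothetical, and the rule would have to be adjusted to whatever the real breakage turns out to be.

```python
import re

# Assumed breakage: a <listitem> is opened and the paragraph ends with </p>
# before any </listitem> appears. The lookahead keeps the match from running
# across an item that is already closed.
MISSING_CLOSE = re.compile(
    r"(<listitem>(?:(?!</listitem>).)*?)(</p>)",
    re.DOTALL,
)


def close_list_before_p(text: str) -> str:
    """Insert the missing </listitem> and </list> in front of </p>."""
    return MISSING_CLOSE.sub(r"\1</listitem>\n</list>\n\2", text)


if __name__ == "__main__":
    broken = "<p>\n<list>\n<listitem>\n<courier>\nSome text\n</courier>\n</p>\n"
    print(close_list_before_p(broken))
```

Running the function a second time on its own output changes nothing, so it is safe to apply to a directory that has been partly fixed already.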

The nice thing is that the broken files are produced by a program, so
they're probably systematically broken, not insanely broken the way
human-written files tend to be.

	- Aleksi




 


Copyright 2001 XML.org. This site is hosted by OASIS