OASIS Mailing List Archives

Re: [xml-dev] sets of parsing rules


Nathan Young -X (natyoung - Artizen at Cisco) wrote:
> Hi.
> I have seen parts of this question addressed but I think it's worth
> asking the whole question anyway, since I'm sure others have run into
> this problem but I haven't been able to dig up any best practices in my
> searching so far.  I may just need to search with the right terminology,
> in which case this should be an easy one for someone who already
> knows...
> I have an application that parses a large number of HTML pages.  A few
> of them are well formed XHTML but that's the exception rather than the
> rule.  By grabbing pages, manipulating them a bit (regexps have been
> sufficient here so far), then tidying them I can get them to a state
> where they are parsable XML.

TagSoup and NekoHTML are tools that do this job.

NekoHTML is bundled in RefleX, so getting a DOM tree from ill-formed 
HTML sources is straightforward:

<xcl:parse-html name="myHtml" source="file:///path/to/file.html"/>

Then you can use XPath on it: $myHtml//div

(Beware of the namespace that CyberNeko might set on the HTML elements. I 
don't remember what the default is, but you can of course change it.)
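
For readers without RefleX at hand, the same idea (turn tag soup into a
tree, then query it) can be sketched with Python's standard library. This
is only an illustration under stated assumptions: it is not TagSoup,
NekoHTML, or the RefleX implementation, the names SoupToTree and
parse_html are hypothetical, and the repair strategy is far cruder than
what those tools do (it does not apply HTML's implied end-tag rules, so
the second div below ends up nested inside the p).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Builds an ElementTree from ill-formed HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")   # synthetic wrapper element
        self.stack = [self.root]         # elements that are still open

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, never push it.
        ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                  # text after a child is that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    parser = SoupToTree()
    parser.feed(text)
    parser.close()
    return parser.root


# Unquoted attributes, an unclosed <p>, a bare <br>, missing end tags.
soup = "<html><body><div id=a><p>one<br>two<div id=b>three</body>"
tree = parse_html(soup)
div_ids = [d.get("id") for d in tree.findall(".//div")]   # like //div
```

ElementTree only supports a subset of XPath, which is enough for a
descendant query like the one above. No namespace is set on the elements
in this sketch, unlike what NekoHTML may do.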

> From there I can use XSL to get them the
> rest of the way (although I have a process that allows me to run regexps
> here too, supplementing XSLT 1.0).
> The wrinkle is that I have several kinds of pages, each one requiring a
> distinct set of steps in order to parse it.  I'm starting down the road
> of modularizing the transforms so that I can handle more page types over
> time in a way that's transparent to the rest of my application.
> I've been exposed to XML-only pipelines; are there pipeline tools that
> allow for non-XML steps?
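
To make the question concrete, a pipeline that mixes non-XML and XML
steps per page type could look roughly like the Python sketch below. It
is an assumption-laden illustration, not the poster's application and
not a RefleX pipeline: the page types "news" and "catalog", the step
functions, and the PIPELINES registry are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Text steps map str -> str; tree steps map Element -> Element.
# Every name below is hypothetical and only serves the illustration.

def strip_comments(text):
    return re.sub(r"<!--.*?-->", "", text, flags=re.S)

def close_br(text):
    return re.sub(r"<br\s*>", "<br/>", text)

def to_tree(text):
    # Bridge step: from here on the data is XML, not text.
    return ET.fromstring(text)

def drop_scripts(root):
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "script":
                parent.remove(child)
    return root

# One ordered list of steps per page type.
PIPELINES = {
    "news": [strip_comments, close_br, to_tree, drop_scripts],
    "catalog": [strip_comments, to_tree],
}

def run(page_type, text):
    data = text
    for step in PIPELINES[page_type]:
        data = step(data)
    return data

page = "<html><!-- ad --><body><p>a<br>b</p><script>x()</script></body></html>"
result = run("news", page)
```

Adding a new page type then means adding one entry to the registry,
which keeps the rest of the application unaware of the individual steps.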

See the section "dealing with non-XML data source":

There are also tutorials that show you how to convert a plain-text source 
to XML:

or how to parse a multipart SOAP message with a regular expression:

Another useful example shows how to use XPath patterns to filter a very 
big XML source that would cause an OutOfMemoryError if you were using 
XSLT or DOM-based processing:
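
Independently of the RefleX example referred to above, the general idea
of filtering a large XML source without building the whole tree can be
sketched in Python with ElementTree's iterparse. This is a swapped-in
technique, not XPath pattern filtering as RefleX does it: the function
stream_select, the sample data, and the predicate are hypothetical, and
the sketch assumes the records of interest are direct children of the
root element.

```python
import io
import xml.etree.ElementTree as ET


def stream_select(source, tag, predicate):
    """Yield matching elements without keeping the whole document."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # first event is the root's start
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            if predicate(elem):
                yield elem
            # Detach the subtrees already handled so memory stays bounded.
            root.clear()


# A tiny in-memory stand-in for a very large file.
doc = io.BytesIO(
    b"<log>"
    b"<e level='info'>a</e>"
    b"<e level='error'>b</e>"
    b"<e level='error'>c</e>"
    b"</log>"
)
errors = [e.text for e in
          stream_select(doc, "e", lambda e: e.get("level") == "error")]
```

Because each record is dropped from the root once its end tag has been
seen, memory use depends on the size of one record rather than on the
size of the file.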



              (. .)
|      Philippe Poulard       |
        Have the RefleX !


Copyright 1993-2007 XML.org. This site is hosted by OASIS