Re: [xml-dev] sets of parsing rules

From: Philippe Poulard <Philippe.Poulard@sophia.inria.fr>
To: "Nathan Young -X (natyoung - Artizen at Cisco)" <natyoung@cisco.com>
Date: Thu, 08 Feb 2007 09:41:43 +0100

Hi,

Nathan Young -X (natyoung - Artizen at Cisco) wrote:
> Hi.
> 
> I have seen parts of this question addressed but I think it's worth
> asking the whole question anyway, since I'm sure others have run into
> this problem but I haven't been able to dig up any best practices in my
> searching so far.  I may just need to search with the right terminology,
> in which case this should be an easy one for someone who already
> knows...
> 
> I have an application that parses a large number of HTML pages.  A few
> of them are well formed XHTML but that's the exception rather than the
> rule.  By grabbing pages, manipulating them a bit (regexps have been
> sufficient here so far), then tidying them I can get them to a state
> where they are parsable XML.

TagSoup and NekoHTML are tools that do this job.

NekoHTML is bundled with RefleX, so getting a DOM tree from ill-formed 
HTML sources is straightforward:

<xcl:parse-html name="myHtml" source="file:///path/to/file.html"/>

Then you can use XPath on it: $myHtml//div

(Beware of the namespace that CyberNeko may set on the HTML elements. I 
don't remember what the default is, but you can of course change it.)
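
For readers without RefleX at hand, the same idea (ill-formed HTML in, queryable tree out) can be sketched with the Python standard library. This is only an illustration of the approach, not how NekoHTML or TagSoup work internally: the class SoupToTree, the function parse_html, the synthetic "root" wrapper element and the sample markup are all made up for this example, and the resulting elements are in no namespace.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: build an ElementTree from tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")      # synthetic wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tag such as <br/>: add it, but never leave it open.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly
        # closing anything opened after it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                     # text after a child is its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    parser = SoupToTree()
    parser.feed(text)
    parser.close()
    return parser.root


# Unquoted attribute, unclosed <br> and <div>, missing </html>.
html = "<html><body><div class=a>one<br><div>two</body>"
tree = parse_html(html)
divs = tree.findall(".//div")               # rough analogue of $myHtml//div
print([d.get("class") for d in divs])
```

A real tag-soup parser also applies the HTML rules for implied elements and nesting, which this sketch does not attempt.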

> From there I can use XSL to get them the
> rest of the way (although I have a process that allows me to run regexps
> here too, supplementing XSLT 1.0).
> 
> The wrinkle is that I have several kinds of pages, each one requiring a
> distinct set of steps in order to parse it.  I'm starting down the road
> of modularizing the transforms so that I can handle more page types over
> time in a way that's transparent to the rest of my application.
> 
> I've been exposed to XML-only pipelines. Are there pipeline tools that
> allow for non-XML steps?
> 

See the section "dealing with non-XML data source":
http://reflex.gforge.inria.fr/tips.html
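
To make the idea of a pipeline that mixes non-XML and XML steps concrete, here is a minimal sketch in Python rather than in RefleX. It assumes each page type maps to an ordered list of plain functions: the first ones work on text with regexps, one step crosses over to a tree, and the rest work on the tree. The step names, the page types "news" and "plain", and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def strip_scripts(text):
    """Text step: drop script blocks, whose content is rarely well-formed."""
    return re.sub(r"(?is)<script\b.*?</script>", "", text)


def close_br(text):
    """Text step: rewrite <br> as <br/> so the page can be parsed as XML."""
    return re.sub(r"(?i)<br\s*>", "<br/>", text)


def to_tree(text):
    """Boundary step: from here on the pipeline carries an XML tree."""
    return ET.fromstring(text)


def titles(tree):
    """XML step: collect the text of every h1 element."""
    return [h.text for h in tree.iter("h1")]


# One ordered list of steps per page type. Adding a page type means adding
# an entry here; the caller of run() does not change.
PIPELINES = {
    "news": [strip_scripts, close_br, to_tree, titles],
    "plain": [to_tree, titles],
}


def run(page_type, source):
    value = source
    for step in PIPELINES[page_type]:
        value = step(value)
    return value


page = "<html><script>if (a < b) {}</script><h1>Hello</h1><br></html>"
print(run("news", page))
```

Keeping the step lists in one table is what makes new page types transparent to the rest of the application: only the table grows.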

There are also tutorials that show you how to convert a plain-text 
source to XML:
http://reflex.gforge.inria.fr/tutorial.html#textToXML

or how to parse a multipart SOAP message with a regular expression:
http://reflex.gforge.inria.fr/tutorial.html#N801BD1
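
As a rough illustration of the regular-expression approach (in Python, not the RefleX tutorial's own code), the sketch below splits a multipart body on its boundary and separates each part's headers from its payload. The boundary name, the headers and the payloads are invented sample data, and split_parts is a hypothetical helper. The XML part it returns could then be handed to an XML parser.

```python
import re

# Made-up multipart body: one XML part and one plain-text attachment.
message = (
    "--MIME_boundary\r\n"
    "Content-Type: text/xml\r\n"
    "\r\n"
    "<Envelope><Body/></Envelope>\r\n"
    "--MIME_boundary\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "attachment data\r\n"
    "--MIME_boundary--\r\n"
)


def split_parts(body, boundary):
    """Return a list of (headers, payload) pairs, one per part."""
    # A delimiter is "--boundary" at the start of the body or after a line
    # break; the closing delimiter carries a trailing "--".
    delimiter = re.compile(
        r"(?:^|\r\n)--" + re.escape(boundary) + r"(?:--)?\r\n")
    parts = []
    for chunk in delimiter.split(body):
        if not chunk:
            continue                        # empty text around delimiters
        # A blank line separates the part headers from the payload.
        head, _, payload = chunk.partition("\r\n\r\n")
        headers = dict(re.findall(r"(?m)^([\w-]+):\s*(.*?)\r?$", head))
        parts.append((headers, payload))
    return parts


parts = split_parts(message, "MIME_boundary")
for headers, payload in parts:
    print(headers.get("Content-Type"), len(payload))
```

For production use, a dedicated MIME parser such as Python's email package is more robust than a hand-written expression.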

Another useful example shows how to use XPath patterns to filter a very 
large XML source that would cause an OutOfMemoryError if you were using 
XSLT or DOM-based processing:
http://reflex.gforge.inria.fr/tutorial.html#N801C30
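
The principle behind such streaming filters can be sketched in Python with the standard library's incremental parser. Note the swap: instead of XPath patterns, this example matches on an element name plus a predicate function, which is a simpler stand-in. The function stream_matches and the sample item data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_matches(source, tag, predicate):
    """Yield (id, name) for each matching element, one at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            if predicate(elem):
                yield elem.get("id"), elem.findtext("name")
            # Drop the element's children, text and attributes once it has
            # been examined. The emptied element itself stays attached to
            # its parent, so memory use is much lower but not constant.
            elem.clear()


# Small in-memory stand-in for a very large file.
data = io.BytesIO(
    b"<items>"
    b"<item id='1'><name>alpha</name><price>5</price></item>"
    b"<item id='2'><name>beta</name><price>50</price></item>"
    b"<item id='3'><name>gamma</name><price>500</price></item>"
    b"</items>"
)

expensive = list(
    stream_matches(data, "item",
                   lambda e: float(e.findtext("price")) > 10))
print(expensive)
```

Because each item is inspected and discarded as soon as its end tag is read, the whole document is never held as one fully populated tree.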

etc.

-- 
Cordialement,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !

References:
- sets of parsing rules
  - From: "Nathan Young -X (natyoung - Artizen at Cisco)" <natyoung@cisco.com>
