Re: [xml-dev] sets of parsing rules
- From: Philippe Poulard <Philippe.Poulard@sophia.inria.fr>
- To: "Nathan Young -X (natyoung - Artizen at Cisco)" <natyoung@cisco.com>
- Date: Thu, 08 Feb 2007 09:41:43 +0100
Hi,
Nathan Young -X (natyoung - Artizen at Cisco) wrote:
> Hi.
>
> I have seen parts of this question addressed but I think it's worth
> asking the whole question anyway, since I'm sure others have run into
> this problem but I haven't been able to dig up any best practices in my
> searching so far. I may just need to search with the right terminology,
> in which case this should be an easy one for someone who already
> knows...
>
> I have an application that parses a large number of HTML pages. A few
> of them are well formed XHTML but that's the exception rather than the
> rule. By grabbing pages, manipulating them a bit (regexps have been
> sufficient here so far), then tidying them I can get them to a state
> where they are parsable XML.
TagSoup and NekoHTML are tools that do this job.
NekoHTML is bundled in RefleX, so getting a DOM tree from ill-formed
HTML sources is straightforward:
<xcl:parse-html name="myHtml" source="file:///path/to/file.html"/>
Then you can use XPath on it: $myHtml//div
(Beware of the namespaces that CyberNeko might set on HTML. I don't
remember what the default is, but you can of course change it.)
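For readers without RefleX at hand, the same idea (build a tree from
tag soup, then query it with a path expression) can be sketched with
the Python standard library alone. This is only an illustration under
stated assumptions: it is not how NekoHTML or TagSoup work internally,
its recovery rules are far cruder than theirs, and the names
SoupTreeBuilder, parse_html and the synthetic "document" root are
hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupTreeBuilder(HTMLParser):
    """Build an ElementTree from possibly ill-formed HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root, an assumption
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") become empty strings.
        el = ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags with no open counterpart are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    builder = SoupTreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


if __name__ == "__main__":
    soup = "<html><body><div class=a>one<p>para<div>two</div><br>tail</body>"
    tree = parse_html(soup)
    # ElementTree supports a limited XPath subset; ".//div" is the
    # counterpart of $myHtml//div above.
    for div in tree.findall(".//div"):
        print(div.tag, div.attrib, div.text)
```

No namespace is set on the elements here, so the namespace caveat
above does not apply to this sketch.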
> From there I can use XSL to get them the
> rest of the way (although I have a process that allows me to run regexps
> here too, supplementing XSLT 1.0).
>
> The wrinkle is that I have several kinds of pages, each one requiring a
> distinct set of steps in order to parse it. I'm starting down the road
> of modularizing the transforms so that I can handle more page types over
> time in a way that's transparent to the rest of my application.
>
> I've been exposed to XML-only pipelines; are there pipeline tools that
> allow for non-XML steps?
>
See the section "dealing with non-XML data source":
http://reflex.gforge.inria.fr/tips.html
There are also tutorials that show you how to convert a plain-text
source to XML:
http://reflex.gforge.inria.fr/tutorial.html#textToXML
or how to parse a multipart SOAP message with a regular expression:
http://reflex.gforge.inria.fr/tutorial.html#N801BD1
Another useful example shows how to use XPath patterns to filter a
very big XML source that would cause an OutOfMemoryError if you were
using XSLT or DOM-based processing:
http://reflex.gforge.inria.fr/tutorial.html#N801C30
etc.
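To make the streaming idea concrete, here is a rough sketch in Python
of filtering a large XML source without ever holding the whole tree in
memory. It is not the RefleX filter from the tutorial: it uses
xml.etree.ElementTree.iterparse from the standard library, it matches
on a plain tag name plus a predicate instead of a real XPath pattern,
and it assumes the records of interest are direct children of the
root element. The function name stream_filter and the sample data are
hypothetical.

```python
import copy
import io
import xml.etree.ElementTree as ET


def stream_filter(source, tag, keep=lambda el: True):
    """Yield detached copies of <tag> elements for which keep(el) is true.

    Assumption: matching elements are direct children of the root, so the
    root can be emptied after each one to keep memory usage flat.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, el in context:
        if event == "end" and el.tag == tag:
            if keep(el):
                match = copy.deepcopy(el)  # survives the clear() below
                match.tail = None
                yield match
            root.clear()  # drop the subtrees that were already processed


if __name__ == "__main__":
    data = (b"<log>"
            b"<entry level='info'>started</entry>"
            b"<entry level='error'>disk full</entry>"
            b"<other/>"
            b"<entry level='error'>timeout</entry>"
            b"</log>")
    errors = stream_filter(io.BytesIO(data), "entry",
                           lambda e: e.get("level") == "error")
    for entry in errors:
        print(entry.text)
```

The source argument can also be a file name, which is the usual case
for a document that is too big for a DOM.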
--
Cordialement,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !