XML.org
OASIS Mailing List Archives

Re: [xml-dev] sets of parsing rules

Hi,

Nathan Young -X (natyoung - Artizen at Cisco) wrote:
> Hi.
> 
> I have seen parts of this question addressed but I think it's worth
> asking the whole question anyway, since I'm sure others have run into
> this problem but I haven't been able to dig up any best practices in my
> searching so far.  I may just need to search with the right terminology,
> in which case this should be an easy one for someone who already
> knows...
> 
> I have an application that parses a large number of HTML pages.  A few
> of them are well formed XHTML but that's the exception rather than the
> rule.  By grabbing pages, manipulating them a bit (regexps have been
> sufficient here so far), then tidying them I can get them to a state
> where they are parsable XML.

TagSoup and NekoHTML are tools that do this job.

NekoHTML is bundled in RefleX, so getting a DOM tree from ill-formed 
HTML sources is straightforward:

<xcl:parse-html name="myHtml" source="file:///path/to/file.html"/>

Then you can use XPath on it: $myHtml//div

(Beware of the namespace that CyberNeko might set on the HTML elements. 
I don't remember what the default is, but you can of course change it.)
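
To illustrate the namespace caveat outside of RefleX, here is a minimal 
Python sketch that uses only the standard library's 
xml.etree.ElementTree (not NekoHTML). It assumes the parser placed every 
element in the XHTML namespace; the sample document and the "h" prefix 
are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the HTML parser put every element in the XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>first</div><div>second</div></body>"
    "</html>"
)

# An unprefixed path looks for elements in no namespace, so it finds nothing.
unqualified = doc.findall(".//div")

# Binding a prefix to the namespace makes the same path match.
qualified = doc.findall(".//h:div", {"h": XHTML_NS})

print(len(unqualified), len(qualified))
```

If a path such as $myHtml//div unexpectedly returns nothing, a namespace 
mismatch of this kind is the first thing worth checking.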

> From there I can use XSL to get them the
> rest of the way (although I have a process that allows me to run regexps
> here too, supplementing XSLT 1.0).
> 
> The wrinkle is that I have several kinds of pages, each one requiring a
> distinct set of steps in order to parse it.  I'm starting down the road
> of modularizing the transforms so that I can handle more page types over
> time in a way that's transparent to the rest of my application.
> 
> I've been exposed to XML-only pipelines. Are there pipeline tools that
> allow for non-XML steps?
> 

See the section "dealing with non-XML data source":
http://reflex.gforge.inria.fr/tips.html
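
Independently of any particular tool, the idea of a pipeline that mixes 
text steps and XML steps, with one step list per page type, can be 
sketched in plain Python. This is only an illustration built on the 
standard library, not the RefleX approach; the step functions, the 
PIPELINES registry and the page-type names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]

def strip_comments(text: str) -> str:
    """Text step: drop comments with a regexp before any XML parsing."""
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

def close_br(text: str) -> str:
    """Text step: turn <br> into <br/> so the markup becomes well formed."""
    return re.sub(r"<br\s*>", "<br/>", text, flags=re.IGNORECASE)

def parse_xml(text: str) -> ET.Element:
    """Bridge step: from here on the pipeline carries an XML tree."""
    return ET.fromstring(text)

def paragraph_texts(root: ET.Element) -> List[str]:
    """XML step: collect the text of every <p> element."""
    return ["".join(p.itertext()).strip() for p in root.iter("p")]

# Hypothetical registry: one ordered list of steps per page type.
PIPELINES: Dict[str, List[Step]] = {
    "news": [strip_comments, close_br, parse_xml, paragraph_texts],
    "plain": [parse_xml, paragraph_texts],
}

def run(page_type: str, source: str) -> Any:
    """Feed the source through every step registered for the page type."""
    data: Any = source
    for step in PIPELINES[page_type]:
        data = step(data)
    return data

page = "<html><body><!-- ad --><p>Hello<br> world</p></body></html>"
print(run("news", page))
```

Supporting a new page type then means adding one entry to the registry, 
which keeps the rest of the application unaware of the individual steps.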

There are also tutorials that show you how to convert a plain-text source 
to XML:
http://reflex.gforge.inria.fr/tutorial.html#textToXML

or how to parse a multipart SOAP message with a regular expression:
http://reflex.gforge.inria.fr/tutorial.html#N801BD1
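
As a rough illustration of the plain-text-to-XML conversion mentioned 
above, here is a small Python sketch using the standard library only. It 
is not taken from the RefleX tutorial; the "key: value" input format and 
the element names (entries, entry) are assumptions chosen for the 
example.

```python
import xml.etree.ElementTree as ET

def text_to_xml(text: str) -> ET.Element:
    """Wrap each non-blank line in an <entry> element under <entries>."""
    root = ET.Element("entries")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines, but keep the original numbering
        key, sep, value = line.partition(":")
        entry = ET.SubElement(root, "entry", {"line": str(number)})
        if sep:
            # "key: value" line: the key becomes an attribute
            entry.set("key", key.strip())
            entry.text = value.strip()
        else:
            # free-form line: keep it as plain text content
            entry.text = line.strip()
    return root

sample = "title: sets of parsing rules\n\nlist: xml-dev\n"
print(ET.tostring(text_to_xml(sample), encoding="unicode"))
```

Once the text is wrapped this way, the result can be handed to the same 
XPath or XSLT steps as any other XML source.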

Another useful example shows how to use XPath patterns to filter a very 
big XML source that would cause an OutOfMemoryError if you were using 
XSLT or DOM-based processing:
http://reflex.gforge.inria.fr/tutorial.html#N801C30
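
The general technique behind this kind of filtering is streaming: 
elements are examined as they are parsed and discarded right away. A 
minimal Python sketch with the standard library's iterparse follows. It 
is not the RefleX implementation, and a plain tag/attribute test stands 
in for a real XPath pattern; the function name and the sample data are 
hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def stream_matching(source, tag, attr, value):
    """Yield the text of each <tag> whose attribute equals value,
    examining elements one at a time as the parser completes them."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            if elem.get(attr) == value:
                yield elem.text
            # Release the element's content once it has been handled.
            # Note: an empty shell stays attached to the root element.
            elem.clear()

sample = io.BytesIO(
    b"<log>"
    b"<item level='error'>disk full</item>"
    b"<item level='info'>started</item>"
    b"<item level='error'>timeout</item>"
    b"</log>"
)
errors = list(stream_matching(sample, "item", "level", "error"))
print(errors)
```

Because each element is cleared as soon as it has been tested, memory use 
stays far below that of loading the whole document into a tree first.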

etc

-- 
Regards,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !



Copyright 1993-2007 XML.org. This site is hosted by OASIS