   Re: parsing XML using regular expressions

  • From: Steve Muench <smuench@us.oracle.com>
  • To: "Simon St.Laurent" <simonstl@simonstl.com>
  • Date: Wed, 09 Aug 2000 09:58:01 -0700

Simon,

| Has anyone written a generic XML parser, even a somewhat broken one, that's
| built on regular expressions?  I remember hearing of something a long while
| ago, but I can't find it.
|
| I'm not concerned with the efficiency/viability/profitability/wisdom of
| such a solution, just whether or not it's been done - especially if it's
| available open source.

Regarding the technique, this is the note I bookmarked from a long time ago:

http://www.cs.sfu.ca/~cameron/REX.html

There's an interactive demo at the end of the page.

At the time I was playing with it I wrote the Java class
below (no guarantee how well it works) to see if I could
apply the technique using the OROMatcher RegExp library.
(http://www.savarese.org/oro/software/OROMatcher.html)

If nothing else, it will save you the time of typing in
the regular expressions, so I include it below.

Have fun.

______________________________________________________________
Steve Muench, Lead XML Evangelist & Consulting Product Manager
BC4J & XSQL Servlet Development Teams, Oracle Rep to XSL WG
Author "Building Oracle XML Applications", O'Reilly
http://www.oreilly.com/catalog/orxmlapp/


import java.io.*;
import com.oroinc.text.regex.*;

public final class XMLParser {

  public static final void main(String args[]) {

    // XML_SPE Regular Expressions from http://www.cs.sfu.ca/~cameron/REX.html

     String TextSE = "[^<]+";
     String UntilHyphen = "[^-]*-";
     String Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
     String CommentCE = Until2Hyphens + ">?";
     String UntilRSBs = "[^]]*]([^]]+])*]+";
     String CDATA_CE = UntilRSBs + "([^]>]" + UntilRSBs + ")*>";
     String S = "[ \\n\\t\\r]+";
     String NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
     String NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
     String Name = "(" + NameStrt + ")(" + NameChar + ")*";
     String QuoteSE = "\"[^\"]" + "*" + "\"" + "|'[^']*'";
     String DT_IdentSE = S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
     String MarkupDeclCE = "([^]\"'><]+|" + QuoteSE + ")*>";
     String S1 = "[\\n\\r\\t ]";
     String UntilQMs = "[^?]*\\?+";
     String PI_Tail = "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
     String DT_ItemSE = "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name + "("
                      + PI_Tail + "))|%" + Name + ";|" + S;
     String DocTypeCE = DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*](" + S + ")?)?>?";
     String DeclCE = "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE("
                   + DocTypeCE + ")?";
     String PI_CE = Name + "(" + PI_Tail + ")?";
     String EndTagCE = Name + "(" + S + ")?>?";
     String AttValSE = "\"[^<\"]" + "*" + "\"" + "|'[^<']*'";
     String ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE + "))*("
                      + S + ")?/?>?";
     String MarkupSPE = "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|("
                      + ElemTagCE + ")?)";
     String XML_SPE = TextSE + "|" + MarkupSPE;


    Perl5Matcher matcher;
    Perl5Compiler compiler;
    Perl5Pattern pattern = null;
    Perl5StreamInput input;
    MatchResult result;
    InputStream file = null;

    // Create Perl5Compiler and Perl5Matcher instances.
    compiler = new Perl5Compiler();
    matcher  = new Perl5Matcher();

    // Attempt to compile the pattern.  If the pattern is not valid,
    // report the error and exit.
    try {
      pattern = (Perl5Pattern)compiler.compile(XML_SPE);

    } catch(MalformedPatternException e) {
      System.err.println("Bad pattern.");
      System.err.println(e.getMessage());
      System.exit(1);
    }


    // Open input file.
    try {
      file = new FileInputStream("C:\\javadev\\OROMatcher-1.0.7\\examples\\oracle.xml");
    } catch(IOException e) {
      System.err.println("Error opening oracle.xml.");
      System.err.println(e.getMessage());
      System.exit(1);
    }

    // Create a Perl5StreamInput instance to search the input stream.
    input   = new Perl5StreamInput(file);

    // We need to put the search loop in a try block because when searching
    // a Perl5StreamInput instance, an IOException may occur, and it must be
    // caught.
    long time = System.currentTimeMillis();

    try {
      // Loop until there are no more matches left.
      while(matcher.contains(input, pattern)) {
        // Since we're still in the loop, fetch the match that was found.
        result = matcher.getMatch();
      }
    } catch(IOException e) {
      System.err.println("Error occurred while reading file.");
      System.err.println(e.getMessage());
      System.exit(1);
    }
    time = System.currentTimeMillis() - time;
    System.out.println("Parsed the file in " + time + " milliseconds.");
  }
}
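
For readers without OROMatcher installed, here is a rough sketch of the same shallow-parsing idea using java.util.regex. That library is swapped in as an assumption and is not what the class above uses. The class name ShallowTokens and the method tokenize are hypothetical. To stay short, the sketch keeps only the REX expressions for text, comments, end tags and element tags; CDATA sections, DOCTYPE declarations and processing instructions are left out, so it is not the full XML_SPE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: a reduced REX tokenizer built on java.util.regex.
final class ShallowTokens {

  // Subset of the REX expressions, copied from the class above.
  static final String TextSE = "[^<]+";
  static final String UntilHyphen = "[^-]*-";
  static final String Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
  static final String CommentCE = Until2Hyphens + ">?";
  static final String S = "[ \\n\\t\\r]+";
  static final String NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
  static final String NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
  static final String Name = "(" + NameStrt + ")(" + NameChar + ")*";
  static final String EndTagCE = Name + "(" + S + ")?>?";
  static final String AttValSE = "\"[^<\"]*\"|'[^<']*'";
  static final String ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?("
                                + AttValSE + "))*(" + S + ")?/?>?";

  // Reduced markup expression: comments, end tags and element tags only
  // (an assumption made for brevity, not the REX MarkupSPE).
  static final String MarkupSPE = "<(!(--(" + CommentCE + ")?)?|/(" + EndTagCE + ")?|("
                                + ElemTagCE + ")?)";

  static final Pattern XML_SPE = Pattern.compile(TextSE + "|" + MarkupSPE);

  // Splits the input into consecutive text and markup tokens.
  static List<String> tokenize(String xml) {
    List<String> tokens = new ArrayList<>();
    Matcher m = XML_SPE.matcher(xml);
    while (m.find()) {
      tokens.add(m.group());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Print each token on its own line, bracketed so whitespace is visible.
    for (String t : tokenize("<a x=\"1\">hi<!-- c --><b/></a>")) {
      System.out.println("[" + t + "]");
    }
  }
}
```

Each call to find() resumes where the previous match ended, which plays the same role as the matcher.contains(input, pattern) loop above: the document is split into a flat sequence of text and markup tokens with no tree built and no well-formedness checking.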
