- From: Steve Muench <smuench@us.oracle.com>
- To: "Simon St.Laurent" <simonstl@simonstl.com>
- Date: Wed, 09 Aug 2000 09:58:01 -0700
Simon,
| Has anyone written a generic XML parser, even a somewhat broken one, that's
| built on regular expressions? I remember hearing of something a long while
| ago, but I can't find it.
|
| I'm not concerned with the efficiency/viability/profitability/wisdom of
| such a solution, just whether or not it's been done - especially if it's
| available open source.
Regarding the technique, this is the note I bookmarked a long time ago:
http://www.cs.sfu.ca/~cameron/REX.html
There's an interactive demo at the end of the page.
When I was playing with it, I wrote the Java class
below (no guarantee how well it works) to see if I could
apply the technique using the OROMatcher regular expression library
(http://www.savarese.org/oro/software/OROMatcher.html).
I include it here, if for nothing else than to save you the
time of typing in the regular expressions.
Have fun.
______________________________________________________________
Steve Muench, Lead XML Evangelist & Consulting Product Manager
BC4J & XSQL Servlet Development Teams, Oracle Rep to XSL WG
Author "Building Oracle XML Applications", O'Reilly
http://www.oreilly.com/catalog/orxmlapp/
import java.io.*;
import com.oroinc.text.regex.*;

public final class XMLParser {
  public static void main(String[] args) {
    // XML_SPE regular expressions from http://www.cs.sfu.ca/~cameron/REX.html
    String TextSE = "[^<]+";
    String UntilHyphen = "[^-]*-";
    String Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
    String CommentCE = Until2Hyphens + ">?";
    String UntilRSBs = "[^]]*]([^]]+])*]+";
    String CDATA_CE = UntilRSBs + "([^]>]" + UntilRSBs + ")*>";
    String S = "[ \\n\\t\\r]+";
    String NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
    String NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
    String Name = "(" + NameStrt + ")(" + NameChar + ")*";
    String QuoteSE = "\"[^\"]*\"|'[^']*'";
    String DT_IdentSE = S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
    String MarkupDeclCE = "([^]\"'><]+|" + QuoteSE + ")*>";
    String S1 = "[\\n\\r\\t ]";
    String UntilQMs = "[^?]*\\?+";
    String PI_Tail = "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
    String DT_ItemSE = "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name
        + "(" + PI_Tail + "))|%" + Name + ";|" + S;
    String DocTypeCE = DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*](" + S + ")?)?>?";
    String DeclCE = "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE("
        + DocTypeCE + ")?";
    String PI_CE = Name + "(" + PI_Tail + ")?";
    String EndTagCE = Name + "(" + S + ")?>?";
    String AttValSE = "\"[^<\"]*\"|'[^<']*'";
    String ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE
        + "))*(" + S + ")?/?>?";
    String MarkupSPE = "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|("
        + ElemTagCE + ")?)";
    // The complete shallow-parsing expression: a run of text or one markup item.
    String XML_SPE = TextSE + "|" + MarkupSPE;

    Perl5Matcher matcher;
    Perl5Compiler compiler;
    Perl5Pattern pattern = null;
    Perl5StreamInput input;
    MatchResult result;
    InputStream file = null;

    // Create Perl5Compiler and Perl5Matcher instances.
    compiler = new Perl5Compiler();
    matcher = new Perl5Matcher();

    // Attempt to compile the pattern. If the pattern is not valid,
    // report the error and exit.
    try {
      pattern = (Perl5Pattern)compiler.compile(XML_SPE);
    } catch(MalformedPatternException e) {
      System.err.println("Bad pattern.");
      System.err.println(e.getMessage());
      System.exit(1);
    }

    // Open input file (hard-coded path; change it to point at your own XML file).
    try {
      file = new FileInputStream("C:\\javadev\\OROMatcher-1.0.7\\examples\\oracle.xml");
    } catch(IOException e) {
      System.err.println("Error opening input file.");
      System.err.println(e.getMessage());
      System.exit(1);
    }

    // Create a Perl5StreamInput instance to search the input stream.
    input = new Perl5StreamInput(file);

    // Time the whole tokenizing pass.
    long time = System.currentTimeMillis();

    // We need to put the search loop in a try block because when searching
    // a Perl5StreamInput instance, an IOException may occur, and it must be
    // caught.
    try {
      // Loop until there are no more matches left.
      while(matcher.contains(input, pattern)) {
        // Since we're still in the loop, fetch the match that was found.
        // The token is fetched but otherwise ignored; this run only times the scan.
        result = matcher.getMatch();
      }
    } catch(IOException e) {
      System.err.println("Error occurred while reading file.");
      System.err.println(e.getMessage());
      System.exit(1);
    }

    time = System.currentTimeMillis() - time;
    System.out.println("Parsed the file in " + time + " milliseconds.");
  }
}
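
For anyone without OROMatcher installed, the same shallow-parsing idea can be tried with the JDK's own java.util.regex package. What follows is only a minimal sketch under stated assumptions, not the class above: the class name ShallowXmlDemo, the tokenize helper, and the sample string are all made up for illustration. The expressions are the REX ones with non-capturing groups, and every literal "]" is escaped to stay on the safe side with java.util.regex. It collects each token (a run of text or one markup item) into a list instead of reading from a stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: REX shallow parsing on an in-memory string.
final class ShallowXmlDemo {
    static final String TEXT_SE = "[^<]+";
    static final String UNTIL_HYPHEN = "[^-]*-";
    static final String UNTIL_2_HYPHENS = UNTIL_HYPHEN + "(?:[^-]" + UNTIL_HYPHEN + ")*-";
    static final String COMMENT_CE = UNTIL_2_HYPHENS + ">?";
    static final String UNTIL_RSBS = "[^\\]]*\\](?:[^\\]]+\\])*\\]+";
    static final String CDATA_CE = UNTIL_RSBS + "(?:[^\\]>]" + UNTIL_RSBS + ")*>";
    static final String S = "[ \\n\\t\\r]+";
    static final String NAME_START = "[A-Za-z_:]|[^\\x00-\\x7F]";
    static final String NAME_CHAR = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
    static final String NAME = "(?:" + NAME_START + ")(?:" + NAME_CHAR + ")*";
    static final String QUOTE_SE = "\"[^\"]*\"|'[^']*'";
    static final String DT_IDENT_SE =
        S + NAME + "(?:" + S + "(?:" + NAME + "|" + QUOTE_SE + "))*";
    static final String MARKUP_DECL_CE = "(?:[^\\]\"'><]+|" + QUOTE_SE + ")*>";
    static final String S1 = "[\\n\\r\\t ]";
    static final String UNTIL_QMS = "[^?]*\\?+";
    static final String PI_TAIL =
        "\\?>|" + S1 + UNTIL_QMS + "(?:[^>?]" + UNTIL_QMS + ")*>";
    static final String DT_ITEM_SE =
        "<(?:!(?:--" + UNTIL_2_HYPHENS + ">|[^-]" + MARKUP_DECL_CE + ")|\\?" + NAME
        + "(?:" + PI_TAIL + "))|%" + NAME + ";|" + S;
    static final String DOCTYPE_CE =
        DT_IDENT_SE + "(?:" + S + ")?(?:\\[(?:" + DT_ITEM_SE + ")*\\](?:" + S + ")?)?>?";
    static final String DECL_CE =
        "--(?:" + COMMENT_CE + ")?|\\[CDATA\\[(?:" + CDATA_CE + ")?|DOCTYPE(?:"
        + DOCTYPE_CE + ")?";
    static final String PI_CE = NAME + "(?:" + PI_TAIL + ")?";
    static final String END_TAG_CE = NAME + "(?:" + S + ")?>?";
    static final String ATT_VAL_SE = "\"[^<\"]*\"|'[^<']*'";
    static final String ELEM_TAG_CE =
        NAME + "(?:" + S + NAME + "(?:" + S + ")?=(?:" + S + ")?(?:" + ATT_VAL_SE
        + "))*(?:" + S + ")?/?>?";
    static final String MARKUP_SPE =
        "<(?:!(?:" + DECL_CE + ")?|\\?(?:" + PI_CE + ")?|/(?:" + END_TAG_CE + ")?|(?:"
        + ELEM_TAG_CE + ")?)";
    static final String XML_SPE = TEXT_SE + "|" + MARKUP_SPE;

    static final Pattern PATTERN = Pattern.compile(XML_SPE);

    // Split the input into consecutive tokens: text runs and markup items.
    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PATTERN.matcher(xml);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample input; print one token per line.
        String sample = "<a x='1'>hi<!-- c --></a>";
        for (String token : tokenize(sample)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every character is either a "<" (which the markup branch always matches) or part of a text run, joining the tokens back together reproduces the input. This is a tokenizer only: it makes no well-formedness checks.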