Re: [xml-dev] [OT] bugs in JDK regex engine ?
- From: Amelia A Lewis <amyzing@talsever.com>
- To: xml-dev@lists.xml.org
- Date: Sun, 03 Feb 2008 23:17:36 -0500
On 2008-02-03 23:26:58 -0500 "Mukul Gandhi" <gandhi.mukul@gmail.com>
wrote:
> String str = "<root><abc x='1'>text1</abc>"
>     + "<pqr y='1'>text2</pqr></root>";
>
> // anything from '<' to '>', and not having '/'
> Pattern pattern = Pattern.compile("<[^/]+>");
> Matcher matcher = pattern.matcher(str);
>
> while (matcher.find()) {
>     String group = matcher.group();
>     System.out.println(group);
> }
>
> 'str' is a String representation of a XML fragment.
>
> I want to extract all pieces from the string (the tokens), which form
> a start tag (including attribute parts).
>
> I am expecting output:
> <root>
> <abc x='1'>
> <pqr y='1'>
But that's not what you asked for. You said "longest string starting
with '<' and ending with '>' that doesn't contain '/'."
> But the output produced by the above program is:
> <root><abc x='1'>
> <pqr y='1'>
Yup. Exactly matches the regex. No / in either one, is there?
Specifically, even though you think you asked for "just the start
tag," you have <abc> nested inside <root>, and [^/] is perfectly happy
to match '>' and '<' too; there's no / anywhere around to prevent the
regex from matching to the end of <abc>.
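To see the greedy match in isolation, here is a minimal sketch (the class
and method names are mine, purely for illustration). It checks that the
class [^/] accepts '>' and '<', then collects what the original pattern
finds in the original string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyMatchDemo {
    // Collect every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc>"
            + "<pqr y='1'>text2</pqr></root>";

        // The negated class only excludes '/', so both brackets are fair game.
        System.out.println(Pattern.matches("[^/]", ">")); // true
        System.out.println(Pattern.matches("[^/]", "<")); // true

        // Prints two lines: <root><abc x='1'>  and  <pqr y='1'>
        for (String match : findAll("<[^/]+>", str)) {
            System.out.println(match);
        }
    }
}
```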
The problem with using regular expressions to parse any grammar with
paired tokens (XML for example, but also most programming languages
with paired braces of any sort, or comments in a language that permits
comment nesting) is that regular expressions can't handle parity: they
have no way to keep count of how many openers are still waiting for
their close.
You need something more powerful than regex.
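One candidate for "something more powerful" is an actual XML parser. The
sketch below assumes the StAX API (javax.xml.stream) that ships with the
JDK; the class and method names are made up for illustration. Note that
it rebuilds each start tag from the element name and attributes, always
using single quotes and no escaping, so it is not a byte-for-byte copy of
the input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxStartTags {
    // Return a reconstructed start tag for every element the parser reports.
    static List<String> startTags(String xml) {
        List<String> tags = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder tag = new StringBuilder("<");
                    tag.append(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        tag.append(' ')
                           .append(reader.getAttributeLocalName(i))
                           .append("='")
                           .append(reader.getAttributeValue(i))
                           .append('\'');
                    }
                    tags.add(tag.append('>').toString());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Input that is not well-formed ends up here.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc>"
            + "<pqr y='1'>text2</pqr></root>";
        // Prints <root>, <abc x='1'> and <pqr y='1'>, one per line.
        for (String tag : startTags(str)) {
            System.out.println(tag);
        }
    }
}
```

The parser knows about comments, processing instructions and CDATA, so
none of those are reported as elements.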
If you're determined to find the next layer of problems associated
with using a too-weak tool to do the job, you should find it shortly
after making this change:
Pattern.compile("<[^/<]+>");
That prevents it from picking up a nested element tag. Most of the
time.
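A quick way to check the narrowed pattern against the original string
(a sketch only; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NarrowedPatternDemo {
    // Excluding '<' as well as '/' stops a match from running into the next tag.
    static final Pattern START_TAG = Pattern.compile("<[^/<]+>");

    static List<String> startTags(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = START_TAG.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc>"
            + "<pqr y='1'>text2</pqr></root>";
        // Prints <root>, <abc x='1'> and <pqr y='1'>, one per line.
        for (String tag : startTags(str)) {
            System.out.println(tag);
        }
    }
}
```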
For giggles:
<root><?my-pi wotsit ?><abc x='1'><![CDATA[<?xml version="1.0?>
<root><abc x='1'>text1]]></abc>
</root>
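Feeding that document to the narrowed pattern shows where "most of the
time" runs out. This is a sketch with illustrative names; the input is
the three lines above joined with newlines, with the inner double quote
escaped for Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrickyInputDemo {
    static final String TRICKY =
        "<root><?my-pi wotsit ?><abc x='1'><![CDATA[<?xml version=\"1.0?>\n"
        + "<root><abc x='1'>text1]]></abc>\n"
        + "</root>";

    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile("<[^/<]+>").matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Six matches, in order:
        //   <root>
        //   <?my-pi wotsit ?>       (a processing instruction)
        //   <abc x='1'>
        //   <?xml version="1.0?>    (text inside the CDATA section)
        //   <root>                  (text inside the CDATA section)
        //   <abc x='1'>text1]]>     (CDATA text plus the CDATA terminator)
        for (String match : matches(TRICKY)) {
            System.out.println(match);
        }
    }
}
```

Only the first <root> and the first <abc x='1'> are real start tags; the
other four matches are a processing instruction and character data.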
HTH.
Amy!
--
Amelia A. Lewis amyzing {at} talsever.com
Confidence: a feeling peculiar to the stage just before full
comprehension of the problem.