Re: [xml-dev] [OT] bugs in JDK regex engine ?

From: "Mukul Gandhi" <gandhi.mukul@gmail.com>
To: xml-dev@lists.xml.org
Date: Mon, 4 Feb 2008 10:00:36 +0530

Thanks for your reply and help.

My present problem is resolved.

I'll try a few slightly more complex use cases, and post my questions ...

On Feb 4, 2008 9:47 AM, Amelia A Lewis <amyzing@talsever.com> wrote:
> On 2008-02-03 23:26:58 -0500 "Mukul Gandhi" <gandhi.mukul@gmail.com>
> wrote:
> > String str = "<root><abc x='1'>text1</abc>"
> >     + "<pqr y='1'>text2</pqr></root>";
> >
> > // anything from '<' to '>', and not having '/'
> > Pattern pattern = Pattern.compile("<[^/]+>");
> > Matcher matcher = pattern.matcher(str);
> >
> > while (matcher.find()) {
> >    String group = matcher.group();
> >    System.out.println(group);
> > }
> >
> > 'str' is a String representation of an XML fragment.
> >
> > I want to extract all pieces from the string (the tokens), which form
> > a start tag (including attribute parts).
> >
> > I am expecting output:
> > <root>
> > <abc x='1'>
> > <pqr y='1'>
>
> But that's not what you asked for.  You said "longest string starting
> with '<' and ending with '>' that doesn't contain '/'."
>
> > But the output produced by the above program is:
> > <root><abc x='1'>
> > <pqr y='1'>
>
> Yup.  Exactly matches the regex.  No / in either one, is there?
> Specifically, even though you think you asked for "just the start
> tag," you have <abc> nested inside <root>; there's no / anywhere
> around to prevent the regex from matching to the end of <abc>.
>
> The problem with using regular expressions to parse any grammar with
> paired tokens (XML for example, but also most programming languages
> with paired braces of any sort, or comments in a language that permits
> comment nesting) is that regular expressions can't handle parity.
>
> You need something more powerful than regex.
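
To illustrate the parity point: checking that paired tokens nest correctly needs a counter (or a stack), which a classic regular expression does not have. The following is only a minimal sketch; the class name NestingDepthDemo and the method isBalanced are made up for illustration, and it looks at '{' and '}' only.

```java
class NestingDepthDemo {
    // Returns true if every '{' is closed by a later '}' and no '}' comes
    // before its opening '{'. Other characters are ignored.
    static boolean isBalanced(String s) {
        int depth = 0; // number of currently open braces
        for (char c : s.toCharArray()) {
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    return false; // a close with no matching open
                }
            }
        }
        return depth == 0; // anything still open means unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("{a{b}c}")); // true
        System.out.println(isBalanced("{a{b}c"));  // false
    }
}
```

The depth variable can grow without bound, which is exactly the state a finite automaton cannot keep.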
>
> If you're determined to find the next layer of problems associated
> with using a too-weak tool to do the job, you should find it shortly
> after making this change:
>
> Pattern.compile("<[^/<]+>");
>
> That prevents it from picking up a nested element tag.  Most of the
> time.
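
For comparison, here is a small sketch that runs both patterns from this thread against the original sample string. The class name StartTagRegexDemo and the helper findAll are assumptions made for the example; only java.util.regex from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartTagRegexDemo {
    // Collect every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "<root><abc x='1'>text1</abc>"
                + "<pqr y='1'>text2</pqr></root>";

        // Original pattern: '<', then one or more non-'/' characters, then '>'.
        // The character class also accepts '<' and '>', so a match can run
        // across two adjacent start tags.
        System.out.println(findAll("<[^/]+>", str));

        // Adjusted pattern: '<' is excluded as well, so a match cannot
        // continue into the next tag.
        System.out.println(findAll("<[^/<]+>", str));
    }
}
```

The first line printed is [<root><abc x='1'>, <pqr y='1'>] and the second is [<root>, <abc x='1'>, <pqr y='1'>].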
>
> For giggles:
>
> <root><?my-pi wotsit ?><abc x='1'><![CDATA[<?xml version="1.0?>
> <root><abc x='1'>text1]]></abc>
> </root>
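
On input like this, the adjusted regex also reports the processing instruction and the tag-like text inside the CDATA section. One possible "more powerful" tool is a real XML parser. Below is a hedged sketch using the JDK's StAX API (javax.xml.stream); the class name StartTagStaxDemo and the method startTags are invented for the example. Note that it rebuilds each start tag from the parsed name and attributes rather than copying the original text, so quoting style and namespace prefixes are not preserved, and attribute values containing a single quote would need escaping.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StartTagStaxDemo {
    // Returns a reconstructed start tag for every element in the document.
    static List<String> startTags(String xml) {
        List<String> tags = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Only START_ELEMENT events are of interest; processing
                // instructions, CDATA text and end tags are skipped.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder tag = new StringBuilder("<");
                    tag.append(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        tag.append(' ')
                           .append(reader.getAttributeLocalName(i))
                           .append("='")
                           .append(reader.getAttributeValue(i))
                           .append('\'');
                    }
                    tags.add(tag.append('>').toString());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String xml = "<root><?my-pi wotsit ?><abc x='1'>"
                + "<![CDATA[<?xml version=\"1.0?>\n<root><abc x='1'>text1]]></abc>\n"
                + "</root>";
        System.out.println(startTags(xml));
    }
}
```

For the document above this prints [<root>, <abc x='1'>], because the parser treats the CDATA content as plain character data.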
>
> HTH.
>
> Amy!
> --
> Amelia A. Lewis                    amyzing {at} talsever.com
> Confidence: a feeling peculiar to the stage just before full
> comprehension of the problem.


-- 
Regards,
Mukul Gandhi
