Re: [xml-dev] newline/form feed valid as attribute value?
- From: Mike Sokolov <sokolov@ifactory.com>
- To: David Carlisle <davidc@nag.co.uk>
- Date: Mon, 02 Jul 2012 17:06:25 -0400
Can anybody explain to me why the comment subexpression wouldn't just
swallow an entire document like this:
<!-- -->
<a />
-->
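To make the question concrete, here is a minimal Python sketch of the same comment subexpression. It assumes that "." is allowed to match a newline, which I have not verified for ksh93's ~(E...) patterns; Python's re.DOTALL stands in for that assumption.

```python
import re

# The three-line document from the question above.
doc = "<!-- -->\n<a />\n-->"

# Greedy ".+" with "." matching newlines (assumption: this mirrors ksh93).
greedy = re.search(r"<!--.+-->", doc, re.DOTALL)

# Lazy ".+?" stops at the first "-->" it can reach.
lazy = re.search(r"<!--.+?-->", doc, re.DOTALL)

# Without DOTALL, "." cannot cross a line break, so the match stays on line 1.
single_line = re.search(r"<!--.+-->", doc)

print(repr(greedy.group(0)))
print(repr(lazy.group(0)))
print(repr(single_line.group(0)))
```

Under that assumption the greedy form runs to the last "-->" in the input, so whether the whole document is swallowed depends on how the regex engine treats "." and newlines.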
On 07/02/2012 04:57 PM, David Carlisle wrote:
> On 02/07/2012 21:43, Dan Shelton wrote:
>> On 2 July 2012 22:17, Michael Kay <mike@saxonica.com> wrote:
>>> It's theoretically impossible to write an XML parser using regular
>>> expressions alone, because XML is not a regular language.
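A rough Python illustration of that point: a regex can pick out individual tags, but checking that they nest correctly needs a stack (or some equivalent unbounded memory). The names TAG and is_balanced are hypothetical, the name pattern is an ASCII-only simplification, and the tag pattern ignores comments, CDATA and ">" inside attribute values.

```python
import re

# Simplified tag tokenizer: optional "/", an ASCII-only name, anything up to ">",
# and an optional "/" just before ">" for empty-element tags.
TAG = re.compile(r"<(/?)([A-Za-z_:][A-Za-z0-9_:.-]*)[^<>]*?(/?)>")

def is_balanced(text):
    """Return True if start and end tags in text nest properly."""
    stack = []
    for m in TAG.finditer(text):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if self_closing:
            continue  # <foo/> opens and closes at once
        if closing:
            # An end tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # every opened element must have been closed

print(is_balanced("<a><b></b></a>"))
print(is_balanced("<a><b></a></b>"))
```

The regex does the tokenizing; the stack does the part that a regular expression alone cannot.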
>>
>> So what's wrong with the following regex pattern? It was passed around
>> by Roland Mainz in David Korn's ksh93 mailing list a few weeks ago and
>> is used as a *core* (there's more prep and postprocess code, but the
>> parsing alone is done by repeatedly applying the regex to a character
>> stream) for an XML fragment parser (brackets not postfixed with ?:
>> capture data and are stored in the 2D array .sh.match):
>> ---------------
>> dummy="${xmltext//~(Ex-p)(?:
>> (<!--.+-->)+?| # xml comments
>> (<[:_[:alnum:]-]+
>> (?: # attributes
>> [[:space:]]+
>> (?: # four different types of name=value syntax
>> (?:[:_[:alnum:]-]+=[^\"\'[:space:]]+?)| #x='foo=bar huz=123'
>> (?:[:_[:alnum:]-]+=\"[^\"]*?\")| #x='foo="ba=r o" huz=123'
>> (?:[:_[:alnum:]-]+=\'[^\']*?\')| #x="foox huz=123"
>> (?:[:_[:alnum:]-]+) #x="foox huz=123"
>> )
>> )*
>> [[:space:]]*
>> \/? # start tags which are end tags, too (like <foo\/>)
>> >)+?| # xml start tags
>> (<\/[:_[:alnum:]-]+>)+?| # xml end tags
>> ([^><]+) # xml text
>> )/D}"
>> ---------------
>
> The comment regexp also does not match the XML (or HTML) syntax. The
> .+ in the middle means it won't accept <!----> which is a well formed
> XML comment and it will accept <!-----> which is not a well formed
> comment.
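The two cases above can be checked with a short Python sketch. ORIGINAL is a Python rendering of the comment subexpression from the ksh93 pattern; STRICT is my own transcription of the XML 1.0 Comment production (no "--" inside the body, and the body may not end in "-"), restricted to what a character class can express, so it does not check the Char range.

```python
import re

# Python rendering of the comment subexpression from the quoted pattern.
ORIGINAL = re.compile(r"<!--.+-->", re.DOTALL)

# Stricter sketch: the body is a run of non-hyphens, or a hyphen followed
# by a non-hyphen, so "--" never appears before the closing "-->".
STRICT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

for sample in ["<!---->", "<!----->", "<!-- a - b -->", "<!-- a -- b -->"]:
    print(sample,
          bool(ORIGINAL.fullmatch(sample)),
          bool(STRICT.fullmatch(sample)))
```

Note that a negated class such as [^-] matches newlines in Python, so STRICT handles multi-line comments without needing re.DOTALL.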
>
> The regexp for element and attribute names does not match that of XML
> (4th edition), XML (5th edition) or HTML. The details vary, but in all
> cases the set of characters allowed for the first letter is more
> restricted than the set of characters allowed for following letters.
> <a1> is well formed but <1a> is not.
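A small Python sketch of the name-start restriction. ORIGINAL_NAME approximates the [:_[:alnum:]-]+ class from the quoted pattern, assuming [:alnum:] means ASCII letters and digits (it is locale dependent in practice). ASCII_NAME is a deliberately simplified, ASCII-only subset of the XML Name production, not the full Unicode ranges from either edition.

```python
import re

# Approximation of the quoted class: any mix of ":", "_", "-", letters, digits.
ORIGINAL_NAME = re.compile(r"[:_A-Za-z0-9-]+")

# ASCII-only sketch of an XML Name: the first character must be a letter,
# "_" or ":"; digits, "-" and "." are only allowed after that.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.-]*")

for name in ["a1", "1a", "-a", "a.b"]:
    print(name,
          bool(ORIGINAL_NAME.fullmatch(name)),
          bool(ASCII_NAME.fullmatch(name)))
```

The original class accepts "1a" and "-a", and rejects "a.b" because "." is missing from it; the two-part pattern gets all four of these cases right within its ASCII limits.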
>
> David
>