OASIS Mailing List Archives

Re: [xml-dev] newline/form feed valid as attribute value?

Can anybody explain to me why the comment subexpression wouldn't just 
swallow an entire document like this:

<!-- -->
<a />
  -->
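
A quick way to check this is to run the comment branch against that document. The sketch below is not ksh93: it transcribes the `(<!--.+-->)+?` branch into Python's `re` module, and it assumes `re.DOTALL` is the right stand-in for POSIX-style matching, where `.` also matches a newline. The name `LAZY_COMMENT` is a hypothetical variant added for comparison, not part of the quoted pattern.

```python
import re

# Comment branch of the quoted pattern, transcribed for Python's re module.
# Assumption: re.DOTALL mimics POSIX-style matching, where '.' also
# matches a newline.
COMMENT = re.compile(r"(<!--.+-->)+?", re.DOTALL)

# Hypothetical variant for comparison: the '?' is applied to '.+' itself,
# so the match stops at the first '-->'.
LAZY_COMMENT = re.compile(r"<!--.+?-->", re.DOTALL)

doc = "<!-- -->\n<a />\n  -->"

greedy = COMMENT.match(doc).group(0)
lazy = LAZY_COMMENT.match(doc).group(0)

print(greedy == doc)  # the greedy '.+' runs to the last '-->'
print(repr(lazy))     # the lazy form stops at the first '-->'
```

Under these assumptions the trailing `+?` only makes the repetition of the whole group lazy. The `.+` inside the group stays greedy, so the first branch does consume everything up to the final `-->`.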

On 07/02/2012 04:57 PM, David Carlisle wrote:
> On 02/07/2012 21:43, Dan Shelton wrote:
>> On 2 July 2012 22:17, Michael Kay <mike@saxonica.com> wrote:
>>> It's theoretically impossible to write an XML parser using regular
>>> expressions alone, because XML is not a regular language.
>>
>> So what's wrong with the following regex pattern? It was passed around
>> by Roland Mainz in David Korn's ksh93 mailing list a few weeks ago and
>> is used as a *core* (there's more prep and postprocess code, but the
>> parsing alone is done by repeatedly applying the regex to a character
>> stream) for an XML fragment parser (parentheses not followed by ?:
>> capture data, which is stored in the 2D array .sh.match):
>> ---------------
>> dummy="${xmltext//~(Ex-p)(?:
>>     (<!--.+-->)+?|    # xml comments
>>     (<[:_[:alnum:]-]+
>>         (?: # attributes
>>             [[:space:]]+
>>             (?: # four different types of name=value syntax
>>                 (?:[:_[:alnum:]-]+=[^\"\'[:space:]]+?)|    #x='foo=bar huz=123'
>>                 (?:[:_[:alnum:]-]+=\"[^\"]*?\")|        #x='foo="ba=r o" huz=123'
>>                 (?:[:_[:alnum:]-]+=\'[^\']*?\')|        #x="foox huz=123"
>>                 (?:[:_[:alnum:]-]+)                #x="foox huz=123"
>>             )
>>         )*
>>         [[:space:]]*
>>         \/?    # start tags which are end tags, too (like <foo\/>)
>> >)+?|                # xml start tags
>>     (<\/[:_[:alnum:]-]+>)+?|    # xml end tags
>>     ([^><]+)            # xml text
>>     )/D}"
>> ---------------
>
> The comment regexp also does not match the XML (or HTML) syntax. The
> .+ in the middle means it won't accept <!----> which is a well-formed
> XML comment, and it will accept <!-----> which is not a well-formed
> comment.
>
> The regexp for element and attribute names does not match that of XML
> (4th edition), XML (5th edition), or HTML. The details vary, but in all
> cases the set of characters allowed for the first letter is more
> restricted than the set of characters allowed for following letters.
> <a1> is well formed but <1a> is not.
>
> David
>
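
David's point about comments can be reproduced with a small sketch. It uses Python's `re` rather than ksh93, so the quoted branch is a transcription, and `re.DOTALL` is an assumption about how `.` behaves there. `SPEC_LIKE` is a hypothetical pattern written from the XML 1.0 Comment production, `'<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'`, simplified by letting any character stand in for `Char`.

```python
import re

# The comment branch of the quoted pattern, without the outer group.
# Assumption: re.DOTALL stands in for POSIX-style '.' matching newlines.
QUOTED = re.compile(r"<!--.+-->", re.DOTALL)

# Sketch of the XML 1.0 Comment production:
#   '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Simplification: any character is treated as a legal Char.
SPEC_LIKE = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def accepts(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None


for sample in ("<!---->", "<!----->", "<!-- a-b -->"):
    print(sample, accepts(QUOTED, sample), accepts(SPEC_LIKE, sample))
```

The two patterns disagree on exactly the two cases named above: the empty comment `<!---->` and the malformed `<!----->`.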

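The point about names can be checked the same way. This is again a Python sketch, not the ksh93 code: `QUOTED_NAME` transcribes `[:_[:alnum:]-]+` under the assumption that `[:alnum:]` means ASCII letters and digits, and `ASCII_XML_NAME` is a hypothetical pattern covering only the ASCII part of the XML `Name` production (the real production also allows many non-ASCII characters).

```python
import re

# Name class from the quoted pattern.
# Assumption: [:alnum:] is limited to ASCII letters and digits.
QUOTED_NAME = re.compile(r"[:_0-9A-Za-z-]+")

# ASCII-only subset of the XML Name production: the first character must
# be a letter, '_' or ':'; digits, '-' and '.' may only follow it.
ASCII_XML_NAME = re.compile(r"[:A-Z_a-z][:A-Z_a-z0-9.-]*")


def is_name(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None


for name in ("a1", "1a", "a.b"):
    print(name, is_name(QUOTED_NAME, name), is_name(ASCII_XML_NAME, name))
```

With a single character class for every position, the quoted pattern cannot tell `<a1>` from `<1a>`. Splitting the name into a first-character class and a following-character class is what separates them.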


Copyright 1993-2007 XML.org. This site is hosted by OASIS