XML.orgXML.org
FOCUS AREAS |XML-DEV |XML.org DAILY NEWSLINK |REGISTRY |RESOURCES |ABOUT
OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
Napkin grammar

In case anyone is interested, I made a little grammar up to show the kind of thing that I was thinking of as a start point not an end poit, based on recent posts. Maybe having something concrete helps. 

So it is two parts: 

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean |  year |  |  symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == (""-")? DECIMAL+ ("."   CHARACTER+)?    

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHAR*  EOM

    piname = PI-TAG.TOKEN  E)M


Each lexical pass can be thread-parallelized by section.  And the pass execution can be a parallelized by e.g. queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ((?=[^!/?])  START-TAG  |  COMPLEX-TAG

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TAG  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: numeric character reference is hex numeric character reference to unicode number a la XML. No decimal reference. I didnt bother to put the  production in, but it looks for &. 

   Note:


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*

      





[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 1993-2007 XML.org. This site is hosted by OASIS