XML.org
OASIS Mailing List Archives

Re: Napkin grammar

Here is an updated grammar and examples. Added are
   Clark names  {URL}:name
   Link tags  <:  :> 
   scoped IDREFs    rootid:myid
   short tags

As before, it is in two parts: a grammar, then the lexical processing as a series of passes.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document = (link | comment | pi )*  element (element | comment | pi )*

    Comment: a document can have multiple branches, not just a single root


    link = prefix  attribute EOM

    Comment: a link is a kind of element that is scoped by namespace prefix or branch id:
         it declares property values for every element/attribute with the same namespace
         or branch id.  A branch id is the id of the branch root.

    prefix = LINK-START.TOKEN 

           == TOKEN (could be empty for default)

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.BI_TOKEN

        --> clark-name

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = BI_TOKEN

        --> clark-name

    typeable-token = boolean |  year |  number |  symbol  | id | prefixed-name

    prefixed-name = BI_TOKEN | clark-name

      == contains ":"

       --> clark-name

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?

       --> $integer or $decimal

    id = TOKEN 

        --> ID

        // iff lexer knows that this is a branch root and attribute name is "id", it can do this

    symbol = TOKEN
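
The typed-token rules above can be sketched in Python roughly as follows. This is only an illustration under assumptions of mine: DECIMAL is taken to be an ASCII digit, the function name classify is made up, prefixed-name is tested before the catch-all symbol, id is left out because it needs context from the lexer, and a token that passes the number test but does not convert falls back to symbol.

```python
import re
from decimal import Decimal, InvalidOperation

# Patterns mirror the "==" conditions; DECIMAL is assumed to be an ASCII digit.
YEAR = re.compile(r"[0-9]+-.*")
NUMBER = re.compile(r"-?[0-9]+(\..+)?")


def classify(token):
    """Return (type_name, value) for an undelimited attribute token."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if YEAR.fullmatch(token):
        # Kept as text here; turning it into a real date type is out of scope.
        return ("yearDate", token)
    if NUMBER.fullmatch(token):
        if "." not in token:
            return ("integer", int(token))
        try:
            return ("decimal", Decimal(token))
        except InvalidOperation:
            # e.g. "1.x" passes the "==" test but is not a decimal (assumption).
            return ("symbol", token)
    if ":" in token:
        return ("prefixed-name", token)
    return ("symbol", token)
```

Note that an unquoted token such as 872-AA would pass the year test as written, which is one reason to quote such values.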

 

    end-tag = END-TAG.BI_TOKEN  EOM

        --> clark-name EOM

      Comment: the name in an end tag does not require a prefix or {} url


    comment = COMMENT-TAG.CHARACTER*  EOM

        --> clark-name EOM

    pi = piname  CHARACTER*  EOM

    piname = PI-TAG.BI_TOKEN  EOM

        --> clark-name  EOM

    clark-name =  ( "{" .* "}" ":" )? TOKEN
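
A minimal Python sketch of splitting a Clark name of the {URL}:name form into its parts. The function name split_clark is hypothetical, and I assume the URL part stops at the first "}" (the production only says ".*").

```python
import re

# Optional "{url}:" part, then the rest of the name.
# Assumption: the URL itself contains no "}" character.
CLARK = re.compile(r"(?:\{(?P<url>[^}]*)\}:)?(?P<local>.+)", re.DOTALL)


def split_clark(name):
    """Return (url, local); url is None when there is no {..}: part."""
    m = CLARK.fullmatch(name)
    if m is None:
        raise ValueError("not a clark-name: %r" % name)
    return (m.group("url"), m.group("local"))
```

For example, split_clark("{http://www.example.com/link}:somelink") gives the URL and "somelink", while a plain prefixed name such as "svg:circle" comes back with no URL and still needs its prefix resolved through a link tag.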


Each lexical pass can be thread-parallelized by section.  The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.
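
The "queue the results of one thread into another" idea might look like this in Python. It is a sketch of the structure only, not a claim about speed: run_pipeline and the stage functions are hypothetical names, each stage gets one thread, and a stage that raises an exception is not handled.

```python
import queue
import threading


def run_pipeline(sections, stages):
    """Run each stage in its own thread, handing results on through queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    done = object()  # sentinel marking the end of the input

    def worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)  # pass the end marker downstream
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for section in sections:
        queues[0].put(section)
    queues[0].put(done)

    results = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each stage is a single thread reading a FIFO queue, the sections come out in the order they went in.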

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
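
Pass 1 could be sketched in Python like this. Assumptions of mine: ">" never occurs inside markup (so the first ">" is always EOM), the function name demarcate and the tuple labels are made up, and a whitespace-only trailing data section is relabelled as ws per the note above.

```python
import re

# "<" MARKUP ">" followed by optional DATA running up to the next "<".
TAG = re.compile(r"<([^>]*)>([^<]*)")


def demarcate(text):
    """Split text into ("MARKUP", s) and ("DATA", s) pieces in order."""
    pos = len(text) - len(text.lstrip())  # skip the leading ws*
    out = []
    while pos < len(text):
        m = TAG.match(text, pos)
        if m is None:
            raise ValueError("expected '<' at offset %d" % pos)
        out.append(("MARKUP", m.group(1)))
        if m.group(2):
            out.append(("DATA", m.group(2)))
        pos = m.end()
    # A terminating data section that is only whitespace is marked as ws.
    if out and out[-1][0] == "DATA" and out[-1][1].isspace():
        out[-1] = ("WS", out[-1][1])
    return out
```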


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?:])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG | LINK-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: a numeric character reference is a hex numeric character reference to a Unicode code point, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.
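
A rough Python sketch of pass 3 over one piece of text. Assumptions: Python's html.entities.html5 table stands in for the standard entity set, the function name substitute is hypothetical, and unknown names and decimal references are simply left untouched.

```python
import re
from html.entities import html5  # keys look like "mdash;"

# Either a hex numeric character reference or a named entity reference.
REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")


def substitute(text):
    """Replace hex NCRs and known entity references with their characters."""
    def repl(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))
        # Unknown entity names are left exactly as written.
        return html5.get(m.group(2) + ";", m.group(0))
    return REF.sub(repl, text)
```

So an end-tag name written as sv&#x67; becomes svg before tokenization, which is what lets references appear inside tags.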


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | BI_TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG | LINK-TAG
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" BI_TOKEN ws* CHARACTER* "?"

     END-TAG = "/" BI_TOKEN ws*

     LINK-TAG = ":" TOKEN? ws* (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) + ":"

     BI_TOKEN = [^\s<"=]+
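
Tokenizing a TAG-TEXT section might be sketched in Python as below. I am reading BI_TOKEN as "one or more characters that are not whitespace, <, a double quote or =", and the function name and the EQ label are my own.

```python
import re

# One piece of TAG-TEXT: whitespace, "=", or a BI_TOKEN.
PIECE = re.compile(r'(\s+)|(=)|([^\s<"=]+)')


def tokenize_tag_text(tag_text):
    """Return the "=" signs and BI_TOKENs of a TAG-TEXT section, dropping ws."""
    tokens = []
    pos = 0
    while pos < len(tag_text):
        m = PIECE.match(tag_text, pos)
        if m is None:
            # A "<" or a double quote should not reach this pass.
            raise ValueError("unexpected character at offset %d" % pos)
        if m.group(2):
            tokens.append(("EQ", "="))
        elif m.group(3):
            tokens.append(("BI_TOKEN", m.group(3)))
        pos = m.end()
    return tokens
```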


So an example: the Purchase order example could come in without change, but here I have some typed recognition of numbers, dates and tokens in attributes.

   <?hello abcd ?>

  <!-- comment -->

<PurchaseOrder PurchaseOrderNumber=99503 OrderDate=1999-10-20>
  <Address Type=Shipping>
    <Name>Ellen Adams</Name>
    <Street>123 Maple Street</Street>
    <City>Mill Valley</City>
    <State>CA</State>
    <Zip>10999</Zip>
    <Country>USA</Country>
  </Address>
  <Address Type=Billing>
    <Name>Tai Yee</Name>
    <Street>8 Oak Avenue</Street>
    <City>Old Town</City>
    <State>PA</State>
    <Zip>95819</Zip>
    <Country>USA</Country>
  </Address>
  <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
  <Items>
    <Item PartNumber="872-AA">
      <ProductName>Lawnmower</ProductName>
      <Quantity>1</Quantity>
      <USPrice>148.95</USPrice>
      <Comment>Confirm this is electric</Comment>
    </Item>
    <Item PartNumber="926-AA">
      <ProductName>Baby Monitor</ProductName>
      <Quantity>2</Quantity>
      <USPrice>39.98</USPrice>
      <ShipDate>1999-05-21</ShipDate>
    </Item>
  </Items>
</PurchaseOrder>

A more wild example:

   <?hello  References can go everywhere &#xAB; &mdash; but only standard entities ?>
   <!-- same with comments &#xAB; &mdash; -->

 <!-- a link tag for the whole document -->
  <:"/" 
Content-Type="text/
plain"
:>


<!-- Link tag for svg prefix. -->
   <:svg 
xmlns="http://www.w3.org/2000/svg"
         version ="1.1"
         schema="svg.rlx" :> 
 

           <svg:svg height=100 width=100  id=ABC>
              <svg:circle cx=50 cy=50 r=40 stroke=black stroke-width=3 fill=red   id=XYZ />
            </sv&#x67;>


             <!--  Below we have examples of a full QName used, a scoped link, and a dropped-prefix end-tag -->

             <svg:svg width=400 height=110>
                   <svg:rect width=300 height=100 id=XYZ />

                  <{http://www.example.com/link}:somelink    to=ABC:XYZ ></somelink>

             </svg>

   <!-- note: end of document -->


On Thu, Jul 22, 2021 at 8:06 PM Rick Jelliffe <rjelliffe@allette.com.au> wrote:
In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of, as a start point not an end point, based on recent posts. Maybe having something concrete helps. 

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind.  The capitalized names in the grammar are the non-terminals determined by the lexical processing.  (The sub-rules for recognizing the types of undelimited data values are given in the grammar not the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline or some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent. 

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean |  year |  number |  symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHARACTER*  EOM

    piname = PI-TAG.TOKEN  EOM


Each lexical pass can be thread-parallelized by section.  The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: a numeric character reference is a hex numeric character reference to a Unicode code point, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.

   Note:

  • I didn't bother to put the reference production: just & is start. Lazy.  
  • Hex NCR only?
  • Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
  •   In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
  •   There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*

      








Copyright 1993-2007 XML.org. This site is hosted by OASIS