Re: Napkin grammar

Here is an updated grammar with examples. Added are:

Clark names {URL}:name

Link tags <: :>

scoped IDREFs rootid:myid

short tags

So it is two parts:

First, a grammar which was not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easier to read if you are unfamiliar with it.)
Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline, or as some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
== means "if"

--> $something means a data type conversion

-> means a substitution (handling references)

. means a look-up in the lexical context, just a shorthand.

GRAMMAR:

document = (link | comment | pi )* element (element | comment | pi )*

Comment: a document can have multiple branches not a single root

link = prefix attribute EOM

    Comment: a link is a kind of element that is scoped by namespace prefix or branch id:
         it declares property values for every element/attribute with the same namespace
         or branch id. A branch id is the id of the branch root.

prefix = LINK-START.TOKEN

== TOKEN (could be empty for default)

element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag

start-tag = name attribute* EOM

name = START-TAG.BI_TOKEN

--> clark-name

attribute = attname ( typeable-token | ATTRIBUTE-TEXT)

attname = BI_TOKEN

--> clark-name

prefixed-name = BI_TOKEN | clark-name

== contains ":"

--> clark-name

boolean = TOKEN

== ("true" | "false" )

--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )

--> $yearDate

number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?

--> $integer or $decimal

id = TOKEN

--> ID

// iff lexer knows that this is a branch root and attribute name is "id", it can do this

symbol = TOKEN
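
A rough sketch in Python of how the typed-token rules above might be applied to one undelimited TOKEN. The function name `type_token`, the order in which the rules are tried, and the fallback to symbol when a conversion fails are my assumptions, not part of the grammar; the $yearDate conversion is left as the raw string.

```python
import re
from decimal import Decimal, InvalidOperation

# Patterns mirror the == conditions in the grammar; CHARACTER is read as "any character".
YEAR = re.compile(r"[0-9]+-.*", re.DOTALL)
NUMBER = re.compile(r"-?[0-9]+(\..+)?", re.DOTALL)

def type_token(token):
    """Return (type_name, value) for an undelimited TOKEN (hypothetical helper)."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if YEAR.fullmatch(token):
        # Conversion to a real date type is omitted; the text is kept as is.
        return ("yearDate", token)
    if NUMBER.fullmatch(token):
        if "." not in token:
            return ("integer", int(token))
        try:
            return ("decimal", Decimal(token))
        except InvalidOperation:
            pass  # e.g. "1.x" passes the loose pattern but is not a decimal
    return ("symbol", token)
```

With the purchase order attributes, `type_token("99503")` gives an integer, `type_token("1999-10-20")` a yearDate and `type_token("Shipping")` a symbol. Note that under the year rule as written, an unquoted value like 872-AA would also be taken as a yearDate, which is one reason to quote it.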

end-tag = END-TAG.BI_TOKEN EOM

--> clark-name EOM

Comment: the name in an end tag does not require a prefix or {} url

comment = COMMENT-TAG.CHARACTER* EOM

--> clark-name EOM

pi = piname CHAR* EOM

piname = PI-TAG.BI_TOKEN EOM

--> clark-name EOM

clark-name = ("{" .* "}" ":")? TOKEN
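
A minimal sketch in Python of splitting a name written in the {URL}:name form. The function name `parse_clark_name` and the character class used for the local part are assumptions for illustration; resolving a plain prefix such as svg: through a link tag is not shown.

```python
import re

# Optional "{" URL "}" ":" followed by the local token.
CLARK = re.compile(r"(?:\{(?P<url>.*)\}:)?(?P<local>[^\s{}]+)")

def parse_clark_name(text):
    """Return (url, local); url is None when no {URL}: part is present."""
    m = CLARK.fullmatch(text)
    if m is None:
        raise ValueError(f"not a clark-name: {text!r}")
    return (m.group("url"), m.group("local"))
```

For the start tag in the wilder example further down, `parse_clark_name("{http://www.example.com/link}:somelink")` returns the URL and `somelink` separately, while a bare `Name` comes back with `None` for the URL.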

Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed, and the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

TEXT = ws* ("<" MARKUP EOM==">" DATA? )+

Note: A terminating "data" section should be marked as ws.

Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
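
A sequential sketch in Python of what pass 1 does, assuming that every ">" ends markup (so a literal ">" inside a tag would have to be written as a reference). The event names and the function `demarcate` are mine; the trailing data section is relabelled as ws, following the first note above.

```python
import re

# "<" MARKUP ">" DATA?  with MARKUP taken as everything up to the next ">".
SEGMENT = re.compile(r"<([^>]*)>([^<]*)")

def demarcate(text):
    """Split TEXT into MARKUP, EOM and DATA events (hypothetical event names)."""
    events = []
    pos = len(text) - len(text.lstrip())  # skip the leading ws*
    while pos < len(text):
        m = SEGMENT.match(text, pos)
        if m is None:
            raise ValueError(f"expected '<' at offset {pos}")
        events.append(("MARKUP", m.group(1)))
        events.append(("EOM", ">"))
        if m.group(2):
            events.append(("DATA", m.group(2)))
        pos = m.end()
    # A terminating data section is marked as ws rather than data.
    if events and events[-1][0] == "DATA":
        events[-1] = ("WS", events[-1][1])
    return events
```

Because each segment starts at a "<", a parallel version could cut the input at any "<" and run the same loop on each section.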

LEXICAL PASS 2: ATTRIBUTE DEMARCATION

MARKUP = ((?=[^!/?:]) START-TAG) | COMPLEX-TAG

START-TAG = (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) +

Note: apos not supported as attribute delimiter here.

LEXICAL PASS 3: REFERENCE SUBSTITUTION

( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG | LINK-TAG )

-> (CHARACTER

| NUMERIC-CHARACTER-REFERENCE -> CHARACTER

| ENTITY-REFERENCE -> CHARACTER+)*

Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.

Note:

I didn't bother to put the reference production: just & is start. Lazy.
Hex NCR only?
Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
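
A minimal sketch in Python of the substitution pass, under some loud assumptions: a reference is taken to be & ... ; as in XML, only hex numeric references are recognized, and Python's standard `html.entities.html5` table stands in for the ISO/W3C/MathML entity set. The `overrides` parameter is a hypothetical way to model an implementation overriding the mappings.

```python
import re
from html.entities import html5

# "&#x" HEX+ ";" or "&" NAME ";" ; decimal references are deliberately not matched.
REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def substitute(text, overrides=None):
    """Replace hex character references and named entity references in text."""
    overrides = overrides or {}

    def replace(m):
        body = m.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        if body in overrides:
            return overrides[body]
        try:
            return html5[body + ";"]  # keys in this table end with ";"
        except KeyError:
            raise ValueError(f"unknown entity reference: &{body};") from None

    return REFERENCE.sub(replace, text)
```

The replacement text is never rescanned, which matches the rule that no markup or references are allowed in entity references. A decimal reference such as `&#171;` is simply left as text here; whether it should be an error instead is an open choice.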

LEXICAL PASS 4: TOKENIZATION

TAG-TEXT = ( ws | "=" | BI_TOKEN )+

COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG | LINK-TAG
COMMENT-TAG = "!--" CHARACTER* "--"

PI-TAG = "?" BI_TOKEN ws* CHARACTER* "?"

END-TAG = "/" BI_TOKEN ws*

LINK-TAG = ":" TOKEN? ws* (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) + ":"

BI_TOKEN = [^\s<"=]+
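
A small sketch in Python of tokenizing TAG-TEXT, reading the BI_TOKEN character class as excluding whitespace, "<", the double quote and "=". The function name `tokenize_tag_text` and the choice to drop ws tokens from the output are assumptions.

```python
import re

# TAG-TEXT = ( ws | "=" | BI_TOKEN )+
TAG_TEXT = re.compile(r'(?P<WS>\s+)|(?P<EQ>=)|(?P<BI_TOKEN>[^\s<"=]+)')

def tokenize_tag_text(tag_text):
    """Return the "=" and BI_TOKEN tokens of an unquoted run of tag text."""
    tokens = []
    pos = 0
    while pos < len(tag_text):
        m = TAG_TEXT.match(tag_text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":  # whitespace only separates tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Applied to the text of the PurchaseOrder start tag in the example below, this yields the element name followed by name, "=", value triples, which the grammar then types as numbers, dates or symbols.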

An example: the purchase order example could come in without change, but here I have added typed recognition of numbers, dates and tokens in attributes.

<?hello abcd ?>

<PurchaseOrder PurchaseOrderNumber=99503 OrderDate=1999-10-20>
  <Address Type=Shipping>
    <Name>Ellen Adams</Name>
    <Street>123 Maple Street</Street>
    <City>Mill Valley</City>
    <State>CA</State>
    <Zip>10999</Zip>
    <Country>USA</Country>
  </Address>
  <Address Type=Billing>
    <Name>Tai Yee</Name>
    <Street>8 Oak Avenue</Street>
    <City>Old Town</City>
    <State>PA</State>
    <Zip>95819</Zip>
    <Country>USA</Country>
  </Address>
  <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
  <Items>
    <Item PartNumber="872-AA">
      <ProductName>Lawnmower</ProductName>
      <Quantity>1</Quantity>
      <USPrice>148.95</USPrice>
      <Comment>Confirm this is electric</Comment>
    </Item>
    <Item PartNumber="926-AA">
      <ProductName>Baby Monitor</ProductName>
      <Quantity>2</Quantity>
      <USPrice>39.98</USPrice>
      <ShipDate>1999-05-21</ShipDate>
    </Item>
  </Items>
</PurchaseOrder>

A wilder example:

   <?hello  References can go everywhere &#xAB; &mdash; but only standard entities ?>

   <!-- same with comments &#xAB; &mdash; -->

 <!-- a link tag for the whole document -->

  <:"/" 
           Content-Type="text/plain"
  :>


  <!-- Link tag for svg prefix. -->

   <:svg 
         xmlns="http://www.w3.org/2000/svg"

         version ="1.1"

         schema="svg.rlx" :>

<{http://www.example.com/link}:somelink to=ABC:XYZ ></somelink>

</svg>

On Thu, Jul 22, 2021 at 8:06 PM Rick Jelliffe <rjelliffe@allette.com.au> wrote:

In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of, as a starting point not an end point, based on recent posts. Maybe having something concrete helps.

So it is two parts:
First, a grammar which was not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easier to read if you are unfamiliar with it.)
Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline, or as some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
== means "if"
--> $something means a data type conversion
-> means a substitution (handling references)
. means a look-up in the lexical context, just a shorthand.

GRAMMAR:

document = (element | comment | pi )+

element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag

start-tag = name attribute* EOM

name = START-TAG.TOKEN

attribute = attname ( typeable-token | ATTRIBUTE-TEXT)

attname = TOKEN

typeable-token = boolean | year | number | symbol

boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate

number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?
--> $integer or $decimal

symbol = TOKEN
end-tag = END-TAG.TOKEN EOM

comment = COMMENT-TAG.CHARACTER* EOM

pi = piname CHAR* EOM

piname = PI-TAG.TOKEN EOM

Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed, and the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

TEXT = ws* ("<" MARKUP EOM==">" DATA? )+

Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.

LEXICAL PASS 2: ATTRIBUTE DEMARCATION

MARKUP = ((?=[^!/?]) START-TAG) | COMPLEX-TAG

START-TAG = (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) +

Note: apos not supported as attribute delimiter here.

LEXICAL PASS 3: REFERENCE SUBSTITUTION

( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )

-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE  -> CHARACTER+)*

    Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.

Note:
I didn't bother to put the reference production: just & is start. Lazy.
Hex NCR only?
Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.

LEXICAL PASS 4: TOKENIZATION

TAG-TEXT = ( ws | "=" | TOKEN )+

COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"

PI-TAG = "?" TOKEN ws* CHARACTER* "?"

END-TAG = "/" TOKEN ws*