Napkin grammar

In case anyone is interested, I made up a little grammar to show the kind of thing I was thinking of, as a starting point rather than an end point, based on recent posts. Maybe having something concrete helps.

It comes in two parts:

First, a grammar, which was not written with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easiest to read if you are unfamiliar with it.)
Second, the lexical processing, specified as a series of logical passes. Each pass is amenable to being divided and run in parallel, or as a pipeline, or in some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange the passes with the same intent.

This uses some extensions:
== means "if"

--> $something means a data type conversion

-> means a substitution (handling references)

. means a look-up in the lexical context, just a shorthand.

GRAMMAR:

document = (element | comment | pi )+

element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag

start-tag = name attribute* EOM

name = START-TAG.TOKEN

attribute = attname ( typeable-token | ATTRIBUTE-TEXT)

attname = TOKEN

typeable-token = boolean | year | number | symbol

boolean = TOKEN

== ("true" | "false" )

--> $boolean

year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )

--> $yearDate

number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?

--> $integer or $decimal

symbol = TOKEN

end-tag = END-TAG.TOKEN EOM

comment = COMMENT-TAG.CHARACTER* EOM

pi = piname CHARACTER* EOM

piname = PI-TAG.TOKEN EOM
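To make the typeable-token rules concrete, here is a minimal Python sketch of how the "==" conditions and "-->" conversions might be applied to an undelimited attribute value. It is only an illustration under some assumptions of mine: DECIMAL is taken to be an ASCII digit, CHARACTER any character, $yearDate is left as text because the grammar does not say how it is represented, and the name classify_token is hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation

# Patterns mirror the "==" conditions in the grammar. DECIMAL is assumed
# to mean an ASCII digit, and CHARACTER any character at all.
BOOLEAN_RE = re.compile(r"true|false")
YEAR_RE = re.compile(r"[0-9]+-.*", re.DOTALL)
NUMBER_RE = re.compile(r"-?[0-9]+(\..+)?", re.DOTALL)


def classify_token(token):
    """Return (type_name, value) for an undelimited attribute value."""
    # The alternatives are tried in the order boolean, year, number, symbol.
    if BOOLEAN_RE.fullmatch(token):
        return ("boolean", token == "true")
    if YEAR_RE.fullmatch(token):
        # Kept as text: the representation of $yearDate is not specified.
        return ("yearDate", token)
    if NUMBER_RE.fullmatch(token):
        if "." not in token:
            return ("integer", int(token))
        try:
            return ("decimal", Decimal(token))
        except InvalidOperation:
            # "." CHARACTER+ admits non-digits such as "1.x"; treat as symbol.
            return ("symbol", token)
    return ("symbol", token)
```

Anything that fails the earlier tests falls through to symbol, so the classification never rejects a token.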

Each lexical pass can be thread-parallelized by section. The execution of the passes can itself be parallelized, e.g. by queuing the results of one thread into another as needed. And the recognition can be parallelized using SIMD.
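As a rough illustration of "thread-parallelized by section", the Python sketch below cuts the input at "<" characters and hands each section to a worker on a thread pool. It assumes that a literal "<" only ever starts markup, so a cut at "<" never splits a tag; split_sections and run_pass are hypothetical names, and the worker is whatever function implements one pass and returns a list. In CPython, threads will not speed up pure-Python work much, so this shows only the shape of the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sections(text, parts):
    """Cut text into roughly equal sections, each cut landing on a "<"."""
    cuts = [0]
    step = max(1, len(text) // parts)
    for target in range(step, len(text), step):
        # Move the cut forward to the next "<" so no tag is split in two.
        cut = text.find("<", max(target, cuts[-1] + 1))
        if cut == -1:
            break
        cuts.append(cut)
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def run_pass(text, worker, parts=4):
    """Apply one lexical pass to every section and join results in order."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = pool.map(worker, split_sections(text, parts))
        return [item for section in results for item in section]
```

For example, run_pass(text, worker, parts=2) calls the worker once per section and concatenates the lists it returns in document order.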

LEXICAL PASS 1: TAG DEMARCATION

TEXT = ws* ("<" MARKUP EOM==">" DATA? )+

Note: A terminating "data" section should be marked as ws.

Note: EOM is the only delimiter signal the lexer needs to pass up to the grammar, but it is only actually needed for start-tags, and it would not be part of an infoset.
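Pass 1 could be sketched in Python roughly as follows. This is an illustration only: it assumes MARKUP contains no ">" and DATA contains no "<", it marks any trailing data as ws without checking its content, and the name demarcate_tags and the (kind, value) event shape are my own inventions.

```python
import re

# "<" MARKUP ">" DATA?  with MARKUP assumed free of ">" and DATA free of "<".
SEGMENT_RE = re.compile(r"<([^>]*)>([^<]*)")


def demarcate_tags(text):
    """Split TEXT into (kind, value) events: MARKUP, EOM, DATA and ws."""
    events = []
    body = text.lstrip()  # the leading ws* is dropped
    pos = 0
    while pos < len(body):
        match = SEGMENT_RE.match(body, pos)
        if match is None:
            raise ValueError("expected a complete '<' ... '>' tag")
        markup, data = match.groups()
        events.append(("MARKUP", markup))
        events.append(("EOM", ">"))
        if data:
            # A terminating data section is marked as ws, per the note above.
            kind = "ws" if match.end() == len(body) else "DATA"
            events.append((kind, data))
        pos = match.end()
    return events
```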

LEXICAL PASS 2: ATTRIBUTE DEMARCATION

MARKUP = ( (?=[^!/?]) START-TAG | COMPLEX-TAG )

START-TAG = (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) +

Note: apos not supported as attribute delimiter here.
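A minimal Python sketch of pass 2, under the same caveats: the first character of MARKUP decides between a start-tag and a complex tag, and a start-tag is then split on the double quote so that unquoted and quoted runs alternate. The quoted runs are labelled ATTRIBUTE-TEXT as in the grammar; demarcate_attributes is a hypothetical name.

```python
def demarcate_attributes(markup):
    """Classify MARKUP and, for start-tags, separate quoted attribute text."""
    # Same test as the (?=[^!/?]) lookahead, seen from the other side.
    if markup[:1] in ("!", "/", "?"):
        return [("COMPLEX-TAG", markup)]
    parts = []
    # Splitting on the quote alternates unquoted (even) and quoted (odd) runs.
    for index, chunk in enumerate(markup.split('"')):
        if index % 2 == 1:
            parts.append(("ATTRIBUTE-TEXT", chunk))
        elif chunk:
            parts.append(("TAG-TEXT", chunk))
    return parts
```

An unterminated final quote is tolerated, which matches the optional closing quote in the START-TAG rule.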

LEXICAL PASS 3: REFERENCE SUBSTITUTION

( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )

-> (CHARACTER

| NUMERIC-CHARACTER-REFERENCE -> CHARACTER

| ENTITY-REFERENCE -> CHARACTER+)*

Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. (Hex NCR only?) I didn't bother to put the reference production in: & is just the start delimiter. Lazy.

Note:

Entity references are to all the ISO/SGML/W3C/MathML entities, with the W3C (MathML) mappings. An implementation can override these, which may be good for some publishers?
In SGML terms, all entities are CDATA: no markup or references are allowed inside an entity, and an entity must not expand to more characters than its reference.
There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, because the bold cannot be carried through the substitution.
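Here is a small Python sketch of pass 3. The reference syntax is my assumption (XML-style &#x...; and &name;), since only the & start delimiter is stated above, and the entity table is a tiny stand-in for the real ISO/SGML/W3C/MathML set. The name substitute_references is hypothetical.

```python
import re

# Tiny stand-in table; the real set would be the ISO/SGML/W3C/MathML entities.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "alpha": "\u03b1"}

# Assumed XML-like syntax: &#xHEX; or &name; (the post only fixes "&").
REFERENCE_RE = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")


def substitute_references(text, entities=ENTITIES):
    """Replace hex character references and entity references in one scan."""

    def expand(match):
        hex_digits, name = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        replacement = entities[name]  # KeyError for an unknown entity
        # CDATA-style rule: never longer than the reference it replaces.
        if len(replacement) > len(match.group(0)):
            raise ValueError("entity %r expands beyond its reference" % name)
        return replacement

    # Replacement text is not rescanned, so entities cannot nest.
    return REFERENCE_RE.sub(expand, text)
```

Because a replacement is never longer than its reference and is never rescanned, each section of text can be substituted independently of the others.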

LEXICAL PASS 4: TOKENIZATION

TAG-TEXT = ( ws | "=" | TOKEN )+

COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"

PI-TAG = "?" TOKEN ws* CHARACTER* "?"

END-TAG = "/" TOKEN ws*
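The TAG-TEXT rule of pass 4 can be sketched in Python with one regular expression. This assumes a TOKEN is any run of characters containing neither whitespace nor "=", which the rules above do not actually pin down; tokenize_tag_text is a hypothetical name.

```python
import re

# ws, "=" or TOKEN; a TOKEN is assumed to be any run without ws or "=".
TAG_TEXT_RE = re.compile(r"(?P<ws>\s+)|(?P<eq>=)|(?P<TOKEN>[^\s=]+)")


def tokenize_tag_text(tag_text):
    """Split TAG-TEXT into (kind, value) pairs covering the whole input."""
    # Every character matches exactly one alternative, so nothing is skipped.
    return [(m.lastgroup, m.group()) for m in TAG_TEXT_RE.finditer(tag_text)]
```

The grammar can then pick the first TOKEN as the name and pair each following TOKEN with the value after its "=".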