In case anyone is interested, I made up a little grammar to show the kind of thing I was thinking of as a starting point, not an end point, based on recent posts. Maybe having something concrete helps. It is in two parts:
- First, a grammar which was not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easiest to read if you are unfamiliar with them.)
- Second, the lexical processing, specified as a series of logical passes. Each pass is amenable to being divided and run in parallel, as a pipeline, as an event system, or folded into the grammar; of course a real implementation might coalesce or rearrange the passes with the same intent.
This uses some notation extensions:
- == means "if"
- --> $something means a data type conversion
- -> means a substitution (handling references)
- . means a look-up in the lexical context, just a shorthand
GRAMMAR:
document = (element | comment | pi )+
element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag
start-tag = name attribute* EOM
name = START-TAG.TOKEN
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = TOKEN
typeable-token = boolean | year | number | symbol
boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate
number = TOKEN
== ( ("-")? DECIMAL+ ("." CHARACTER+)? )
--> $integer or $decimal
symbol = TOKEN
end-tag = END-TAG.TOKEN EOM
comment = COMMENT-TAG.CHARACTER* EOM
pi = piname CHARACTER* EOM
piname = PI-TAG.TOKEN
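As a sketch of the typeable-token rules above, here is how an attribute token might be classified and converted. The function name and result tags are hypothetical, and the fraction part of a number is assumed to be digits (the grammar allows CHARACTER+):

```python
import re

def type_token(token: str):
    """Classify an attribute token per the typeable-token rules.
    Hypothetical helper; tag names are illustrative only."""
    if token in ("true", "false"):                       # boolean
        return ("boolean", token == "true")
    if re.fullmatch(r"[0-9]+-.*", token):                # year: DECIMAL+ "-" CHARACTER*
        return ("yearDate", token)
    m = re.fullmatch(r"(-?)([0-9]+)(\.[0-9]+)?", token)  # number (fraction assumed numeric)
    if m:
        return ("decimal", float(token)) if m.group(3) else ("integer", int(token))
    return ("symbol", token)                             # fallback: plain symbol
```

The untyped fallback to symbol means every token gets some classification, so the grammar's alternation never fails on data values.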
Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized by e.g. queuing the results of one thread into another as needed. And the recognition itself can be parallelized using SIMD.
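The queue-between-threads form can be sketched as follows. This is a minimal illustration, assuming one thread per pass and a None sentinel for shutdown; names are hypothetical:

```python
import queue
import threading

def run_pipeline(chunks, passes):
    """Run each lexical pass in its own thread, feeding the next via a
    queue so passes overlap on different chunks. Hypothetical sketch."""
    stages = [queue.Queue() for _ in range(len(passes) + 1)]

    def worker(fn, inq, outq):
        while True:
            item = inq.get()
            if item is None:          # sentinel: propagate shutdown downstream
                outq.put(None)
                return
            outq.put(fn(item))

    threads = [threading.Thread(target=worker, args=(fn, stages[i], stages[i + 1]))
               for i, fn in enumerate(passes)]
    for t in threads:
        t.start()
    for chunk in chunks:
        stages[0].put(chunk)
    stages[0].put(None)
    out = []
    while (item := stages[-1].get()) is not None:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Because each stage is a single thread reading a FIFO queue, chunk order is preserved end to end, which matters if the final pass stitches sections back together.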
LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ("<" MARKUP EOM == ">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to pass up, but it is only actually needed for start-tags, and would not be part of an infoset.
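A minimal sketch of this pass, assuming well-formed input where every "<" has a matching ">" (the helper name is mine, not from the grammar):

```python
def demarcate(text: str):
    """Pass 1 sketch: split input into ("MARKUP", ...) spans between
    "<" and ">" and ("DATA", ...) spans between tags. Here ">" plays
    the role of EOM. Assumes every "<" has a matching ">"."""
    spans, i = [], 0
    while i < len(text):
        if text[i] == "<":
            end = text.index(">", i)            # EOM == ">"
            spans.append(("MARKUP", text[i + 1:end]))
            i = end + 1
        else:
            end = text.find("<", i)
            end = len(text) if end == -1 else end
            spans.append(("DATA", text[i:end]))
            i = end
    return spans
```

Because the pass only looks for two byte values, it is exactly the kind of scan that SIMD or chunk-parallel search handles well.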
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = (?=[^!/?]) START-TAG | COMPLEX-TAG
START-TAG = ( TAG-TEXT \" ATTRIBUTE-TEXT \"? )+
Note: apos not supported as attribute delimiter here.
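Since only the double quote delimits attribute text here, this pass reduces to splitting on one character; the segments then simply alternate. A hypothetical sketch:

```python
def demarcate_attributes(markup: str):
    """Pass 2 sketch: split a start-tag's content on '"'. Even-indexed
    segments are TAG-TEXT (names, "=" signs, whitespace); odd-indexed
    segments are ATTRIBUTE-TEXT. Hypothetical helper name."""
    parts = markup.split('"')
    return [("ATTRIBUTE-TEXT" if i % 2 else "TAG-TEXT", part)
            for i, part in enumerate(parts) if part != ""]
```

The alternation trick is why disallowing the apostrophe as a delimiter simplifies parallelism: a splitter can cut anywhere and still know a segment's kind from its quote count.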
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )
-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE -> CHARACTER+)*
Note:
- A numeric character reference is a hex reference to a Unicode code point, a la XML. No decimal references.
- I didn't bother to put in the reference production; references simply start with "&".
- Entity references cover all ISO/SGML/W3C/MathML entities with the W3C (MathML) mappings. An implementation can override these, which may be good for some publishers.
- In SGML terms, all entities are CDATA: no markup or references are allowed in entity replacement text, and an entity must not expand to more characters than the reference itself.
- There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, since the bold cannot be carried by the entity.
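The substitution pass might look like this. The entity table below is a hypothetical stand-in; a real implementation would load the full ISO/SGML/W3C/MathML sets mentioned above:

```python
import re

# Hypothetical minimal entity table, standing in for the full
# ISO/SGML/W3C/MathML mappings.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}

def substitute_references(text: str) -> str:
    """Pass 3 sketch: replace hex NCRs (&#xHHHH;) and named entity
    references with their characters. No decimal NCRs, per the note;
    expansions are plain characters (CDATA), never re-scanned."""
    def repl(m):
        if m.group(1):                        # hex numeric character reference
            return chr(int(m.group(1), 16))
        return ENTITIES[m.group(2)]           # named entity
    return re.sub(r"&#x([0-9A-Fa-f]+);|&(\w+);", repl, text)
```

The CDATA rule pays off here: since an expansion can never be longer than its reference or introduce new markup, substitution can be done in place, per section, without re-running earlier passes.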
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"
PI-TAG = "?" TOKEN ws* CHARACTER* "?"
END-TAG = "/" TOKEN ws*
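The TAG-TEXT production above can be sketched as one regex scan; the helper and tag labels are hypothetical:

```python
import re

def tokenize_tag_text(tag_text: str):
    """Pass 4 sketch: break TAG-TEXT into whitespace runs, "=" signs,
    and TOKENs (any other run of characters). Hypothetical helper."""
    return [("WS" if t[0].isspace() else "EQUALS" if t == "=" else "TOKEN", t)
            for t in re.findall(r"\s+|=|[^\s=]+", tag_text)]
```

Since the three alternatives partition the character set, every input character lands in exactly one token, so sections tokenized independently can be concatenated without a fix-up pass.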