GRAMMAR:
document = (element | comment | pi )+
element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag
start-tag = name attribute* EOM
name = START-TAG.TOKEN
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = TOKEN
typeable-token = boolean | year | number | symbol
boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate
number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?
--> $integer or $decimal
symbol = TOKEN
end-tag = END-TAG.TOKEN EOM
comment = COMMENT-TAG.CHARACTER* EOM
pi = piname CHARACTER* EOM
piname = PI-TAG.TOKEN EOM
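The typeable-token alternatives can be sketched as a small classifier. This is only an illustration, assuming that the alternatives are tried in the order listed (boolean, year, number, symbol) and that DECIMAL means an ASCII digit. The function name classify_token and the returned tuples are hypothetical.

```python
import re

# year: DECIMAL+ "-" CHARACTER*
_YEAR = re.compile(r"[0-9]+-.*", re.DOTALL)
# number: ("-")? DECIMAL+ ("." CHARACTER+)?
_NUMBER = re.compile(r"-?[0-9]+(\..+)?", re.DOTALL)


def classify_token(token: str):
    """Return (type_name, value) for an unquoted attribute value.

    Alternatives are tried in grammar order; symbol is the fallback.
    """
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if _YEAR.fullmatch(token):
        return ("yearDate", token)
    match = _NUMBER.fullmatch(token)
    if match:
        if match.group(1) is None:
            return ("integer", int(token))
        # The grammar allows any CHARACTER+ after ".", so the lexical
        # form is kept rather than converted (an assumption of this sketch).
        return ("decimal", token)
    return ("symbol", token)
```

For example, classify_token("2020-01") is typed as a yearDate, while classify_token("2020") falls through to integer because there is no "-" after the digits.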
Each lexical pass can be thread-parallelized by section. The passes themselves can also be pipelined, e.g. by queuing the results of one thread into the next as needed. And the recognition within a pass can be parallelized using SIMD.
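A rough sketch of the per-section threading idea, assuming that a section may start at any "<" and that the pass returns a list per section. The names split_sections and run_pass_parallel are hypothetical. A "<" that occurs inside an attribute value or comment would be a wrong cut point, which this sketch does not handle. In CPython the threads mainly illustrate the structure; real speedup would need a runtime without a global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sections(text: str, n: int):
    """Cut text into roughly n sections, each later one starting at a '<'."""
    cuts = [0]
    step = max(1, len(text) // n)
    target = step
    while target < len(text):
        cut = text.find("<", target)
        if cut == -1:
            break
        cuts.append(cut)
        target = cut + step
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def run_pass_parallel(lex_pass, text: str, workers: int = 4):
    """Run lex_pass on each section in its own thread, keeping section order."""
    sections = split_sections(text, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lex_pass, sections))
    # Concatenate the per-section result lists in document order.
    return [item for section_result in results for item in section_result]
```

Because pool.map preserves input order, the concatenated result is in document order even though the sections are processed concurrently.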
LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ("<" MARKUP EOM==">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to pass up to the parser, but it is only actually needed for start-tags, and would not be part of an infoset.
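Pass 1 might look something like the following. It is a minimal sketch, assuming the first ">" after a "<" is always the EOM (so a ">" inside markup is not supported) and that a whitespace-only trailing section is the one marked as ws. The function name demarcate_tags and the item labels are hypothetical.

```python
def demarcate_tags(text: str):
    """Pass 1 sketch: split TEXT into ("MARKUP", s), ("DATA", s), ("WS", s)."""
    items = []
    pos = text.find("<")
    if pos == -1 or text[:pos].strip():
        raise ValueError("TEXT must be optional whitespace followed by markup")
    while True:
        end = text.find(">", pos)
        if end == -1:
            raise ValueError("markup without closing '>' (EOM)")
        # MARKUP excludes the "<" and the ">" delimiters.
        items.append(("MARKUP", text[pos + 1:end]))
        nxt = text.find("<", end + 1)
        if nxt == -1:
            trailing = text[end + 1:]
            if trailing:
                # Terminating section: marked as ws when it is only whitespace.
                kind = "DATA" if trailing.strip() else "WS"
                items.append((kind, trailing))
            return items
        if nxt > end + 1:
            items.append(("DATA", text[end + 1:nxt]))
        pos = nxt
```

The EOM itself is implied by the end of each MARKUP item, so the sketch does not emit a separate item for it.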
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = ( (?=[^!/?]) START-TAG ) | COMPLEX-TAG
START-TAG = (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) +
Note: the apostrophe (') is not supported as an attribute delimiter here.
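A sketch of pass 2, assuming the quoted parts of a start-tag are the ATTRIBUTE-TEXT sections referred to by the grammar and that splitting on the double quote is enough to separate them. The function name demarcate_attributes is hypothetical.

```python
def demarcate_attributes(markup: str):
    """Pass 2 sketch: label a MARKUP string from pass 1.

    Markup starting with "!", "/" or "?" is passed on whole as COMPLEX-TAG.
    Anything else is a START-TAG, split on '"' into alternating
    TAG-TEXT and ATTRIBUTE-TEXT pieces.
    """
    if not markup:
        raise ValueError("empty markup")
    if markup[0] in "!/?":
        return [("COMPLEX-TAG", markup)]
    items = []
    for index, part in enumerate(markup.split('"')):
        if index % 2 == 0:
            # Outside quotes; skip the empty piece after a final closing quote.
            if part:
                items.append(("TAG-TEXT", part))
        else:
            # Inside quotes; an empty attribute value is still kept.
            items.append(("ATTRIBUTE-TEXT", part))
    return items
```

A missing final closing quote is tolerated here, since the last odd-numbered piece is simply treated as attribute text.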
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )
-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE -> CHARACTER+)*
Note: a numeric character reference is a hex numeric character reference to a Unicode code point, à la XML. There is no decimal reference. I didn't bother to put the production in, but it looks for &.
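The substitution could be sketched as below, assuming the XML reference syntax (&#xHEX; and &name;) and, purely for illustration, an entity table holding the five XML predefined entities. The notes do not say which entities are defined, so ENTITIES and substitute_references are hypothetical.

```python
import re

# Assumed entity table: the five XML predefined entities.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# "&" followed by either "#x" + hex digits, or a name, then ";".
_REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([^;&\s]+));")


def substitute_references(text: str, entities=ENTITIES) -> str:
    """Pass 3 sketch: replace hex character references and entity references."""
    def replace(match):
        if match.group(1) is not None:
            # NUMERIC-CHARACTER-REFERENCE -> CHARACTER
            return chr(int(match.group(1), 16))
        name = match.group(2)
        if name not in entities:
            # This also rejects decimal references such as "&#65;".
            raise ValueError("unknown reference: &" + name + ";")
        # ENTITY-REFERENCE -> CHARACTER+
        return entities[name]

    return _REF.sub(replace, text)
```

The substitution is a single left-to-right pass, so the "&" produced by &amp; is not scanned again for further references.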
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"
PI-TAG = "?" TOKEN ws* CHARACTER* "?"
END-TAG = "/" TOKEN ws*
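A sketch of pass 4. TOKEN is not defined in these notes, so the sketch assumes it is a run of characters that are neither whitespace nor "=". The function names and the shape of the returned tuples are hypothetical.

```python
import re

# TAG-TEXT = ( ws | "=" | TOKEN )+, with TOKEN assumed to be [^\s=]+
_TAG_TEXT = re.compile(r"\s+|=|[^\s=]+")


def tokenize_tag_text(tag_text: str):
    """Split TAG-TEXT into ("TOKEN", s) and ("EQ", "=") items, dropping ws."""
    tokens = []
    for match in _TAG_TEXT.finditer(tag_text):
        piece = match.group()
        if piece == "=":
            tokens.append(("EQ", "="))
        elif not piece.isspace():
            tokens.append(("TOKEN", piece))
    return tokens


def classify_complex_tag(markup: str):
    """Classify COMPLEX-TAG markup as END-TAG, COMMENT-TAG or PI-TAG."""
    if markup.startswith("!--") and markup.endswith("--") and len(markup) >= 5:
        return ("COMMENT-TAG", markup[3:-2])
    if markup.startswith("?") and markup.endswith("?") and len(markup) >= 2:
        parts = markup[1:-1].split(None, 1)
        if not parts:
            raise ValueError("processing instruction without a name")
        # (kind, TOKEN, remaining CHARACTER* with leading ws removed)
        return ("PI-TAG", parts[0], parts[1] if len(parts) > 1 else "")
    if markup.startswith("/"):
        return ("END-TAG", markup[1:].rstrip())
    raise ValueError("not a COMPLEX-TAG: " + repr(markup))
```

For a start-tag, the first TOKEN produced by tokenize_tag_text is the element name and the rest are attribute names and unquoted values.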