GRAMMAR:
document = (link | comment | pi )* element (element | comment | pi )*
Comment: a document can have multiple branches not a single root
link = prefix attribute EOM
Comment: a link is a kind of element that is scoped by namespace prefix or branch id:
it declares property values for every element/attribute with the same namespace
or branch id. A branch id is the id of the branch root.
prefix = LINK-START.TOKEN
== TOKEN (could be empty for default)
element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag
start-tag = name attribute* EOM
name = START-TAG.BI_TOKEN
--> clark-name
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = BI_TOKEN
--> clark-name
typeable-token = boolean | year | number | symbol | id | prefixed-name
prefixed-name = BI_TOKEN | clark-name
== contains ":"
--> clark-name
boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate
number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?
--> $integer or $decimal
id = TOKEN
--> ID
// iff lexer knows that this is a branch root and attribute name is "id", it can do this
symbol = TOKEN
end-tag = END-TAG.BI_TOKEN EOM
--> clark-name EOM
Comment: the name in an end tag does not require a prefix or {} url
comment = COMMENT-TAG.CHARACTER* EOM
--> clark-name EOM
pi = piname CHAR* EOM
piname = PI-TAG.BI_TOKEN EOM
--> clark-name EOM
clark-name = ("{" .* "}" ":")? TOKEN
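The typeable-token rules above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function name `classify_token` is made up, the "==" conditions are assumed to apply to the whole token, the alternatives are tried in the order the grammar lists them, and a decimal is left as its lexical string rather than converted.

```python
import re

# Patterns transcribed from the "==" conditions in the grammar.
# Assumption: each condition must match the whole token.
_YEAR = re.compile(r"[0-9]+-.*")            # DECIMAL+ "-" CHARACTER*
_NUMBER = re.compile(r"-?[0-9]+(\..+)?")    # ("-")? DECIMAL+ ("." CHARACTER+)?


def classify_token(token):
    """Return (type_name, value) for an undelimited attribute value."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if _YEAR.fullmatch(token):
        return ("yearDate", token)
    if _NUMBER.fullmatch(token):
        if "." in token:
            return ("decimal", token)       # kept as text in this sketch
        return ("integer", int(token))
    if ":" in token:
        return ("prefixed-name", token)
    return ("symbol", token)                # fallback: plain symbol
```

For instance, `classify_token("99503")` gives an integer and `classify_token("1999-10-20")` gives a yearDate, which is the distinction the purchase order example relies on. The id case is left out because, as noted in the grammar, it needs context from the lexer.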
Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed, and the recognition can be parallelized using SIMD.
LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ("<" MARKUP EOM==">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to provide up,
but it is only actually needed for start-tags, and would not be part of
an infoset.
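As a rough illustration of pass 1, here is a minimal Python sketch. The function name `demarcate_tags` and the event labels are made up for the example, and it assumes the first ">" after a "<" always ends the markup, so a ">" inside a comment or attribute value is not handled.

```python
def demarcate_tags(text):
    """Split text into WS, MARKUP, EOM and DATA events (pass 1 only)."""
    events = []
    pos = 0
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            break
        if lt > pos:
            # Text before the first tag is ws; text between tags is data.
            kind = "DATA" if events else "WS"
            events.append((kind, text[pos:lt]))
        gt = text.find(">", lt + 1)
        if gt == -1:
            raise ValueError("markup opened at offset %d has no closing '>'" % lt)
        events.append(("MARKUP", text[lt + 1:gt]))
        events.append(("EOM", ">"))
        pos = gt + 1
    if pos < len(text):
        # A terminating "data" section is marked as ws, per the note above.
        events.append(("WS", text[pos:]))
    return events
```

Each MARKUP string is what the later passes would then split further. Because the scan only looks for two delimiter characters, it is the kind of loop that could be run per section or vectorized.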
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = ((?=[^!/?:]) START-TAG) | COMPLEX-TAG
START-TAG = (TAG-TEXT \" ATTRIBUTE-TAG \"? ) +
Note: apos not supported as attribute delimiter here.
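A minimal Python sketch of pass 2 follows. It assumes the quoted run in the START-TAG rule is what the grammar consumes as ATTRIBUTE-TEXT, and that splitting on the double quote is enough, since apos is not a delimiter here. The function name `demarcate_attributes` is made up for the example.

```python
def demarcate_attributes(start_tag):
    """Split start-tag markup into alternating TAG-TEXT and ATTRIBUTE-TEXT runs."""
    runs = []
    for i, part in enumerate(start_tag.split('"')):
        # Even-numbered pieces are outside quotes, odd-numbered ones inside.
        kind = "TAG-TEXT" if i % 2 == 0 else "ATTRIBUTE-TEXT"
        # Keep empty attribute values, drop empty tag text.
        if part or kind == "ATTRIBUTE-TEXT":
            runs.append((kind, part))
    return runs
```

A start-tag with only undelimited values comes back as a single TAG-TEXT run, and a missing closing quote is tolerated, which matches the optional closing quote in the rule.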
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG | LINK-TAG )
-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE -> CHARACTER+)*
Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.
Note:
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | BI_TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG | LINK-TAG
COMMENT-TAG = "!--" CHARACTER* "--"
PI-TAG = "?" BI_TOKEN ws* CHARACTER* "?"
END-TAG = "/" BI_TOKEN ws*
LINK-TAG = ":" TOKEN? ws* (TAG-TEXT \" ATTRIBUTE-TAG \"? ) + ":"
BI_TOKEN = [^\s<"=]+
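To make the TAG-TEXT rule concrete, here is a small Python sketch. It assumes BI_TOKEN is meant to be a run of characters other than whitespace, "<", the double quote and "=", and the function name `tokenize_tag_text` is made up for the example.

```python
import re

# One alternative per TAG-TEXT component: ws, "=", BI_TOKEN.
_TAG_TEXT = re.compile(r'(?P<ws>\s+)|(?P<eq>=)|(?P<tok>[^\s<"=]+)')


def tokenize_tag_text(tag_text):
    """Return the "=" and BI_TOKEN tokens of a TAG-TEXT run, dropping ws."""
    tokens = []
    for match in _TAG_TEXT.finditer(tag_text):
        if match.lastgroup == "ws":
            continue
        kind = "=" if match.lastgroup == "eq" else "BI_TOKEN"
        tokens.append((kind, match.group()))
    return tokens
```

On the start-tag text `Address Type=Shipping` this yields the name token, the attribute name, the "=" and the undelimited value, which is the stream the grammar's start-tag rule expects.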
So, an example: the purchase order example could come in without change, but here I have used some typed recognition of numbers, dates and tokens in attributes.
<?hello abcd ?>
<!-- comment -->
<PurchaseOrder PurchaseOrderNumber=99503 OrderDate=1999-10-20>
<Address Type=Shipping>
<Name>Ellen Adams</Name>
<Street>123 Maple Street</Street>
<City>Mill Valley</City>
<State>CA</State>
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
<Address Type=Billing>
<Name>Tai Yee</Name>
<Street>8 Oak Avenue</Street>
<City>Old Town</City>
<State>PA</State>
<Zip>95819</Zip>
<Country>USA</Country>
</Address>
<DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
<Items>
<Item PartNumber="872-AA">
<ProductName>Lawnmower</ProductName>
<Quantity>1</Quantity>
<USPrice>148.95</USPrice>
<Comment>Confirm this is electric</Comment>
</Item>
<Item PartNumber="926-AA">
<ProductName>Baby Monitor</ProductName>
<Quantity>2</Quantity>
<USPrice>39.98</USPrice>
<ShipDate>1999-05-21</ShipDate>
</Item>
</Items>
</PurchaseOrder>
A more wild example:
<?hello References can go everywhere « &#mdash; but only standard entities ?>
<!-- same with comments « &#mdash; -->
<!-- a link tag for the whole document -->
<:"/"
Content-Type="text/plain"
:>
<!-- Link tag for svg prefix. -->
<:svg
xmlns="http://www.w3.org/2000/svg"
version ="1.1"
schema="svg.rlx" :>
<svg:svg height=100 width=100 id=ABC>
<svg:circle cx=50 cy=50 r=40 stroke=black stroke-width=3 fill=red id=XYZ />
</svg>
<!-- Below we have examples of a full QName used, a scoped link, and a dropped-prefix end-tag -->
<svg:svg width=400 height=110>
<svg:rect width=300 height=100 id=XYZ />
<{http://www.example.com/link}:somelink to=ABC:XYZ ></somelink>
</svg>
<!-- note: end of document -->
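How the names in this example might resolve can be sketched in Python as below. This is a guess at the intent, not a definition: the function names and the `namespaces` mapping (prefix to the xmlns value declared by a link tag) are made up, and the clark-name is written with the colon after "}" as in the grammar above.

```python
def to_clark_name(name, namespaces):
    """Expand prefix:local to {url}:local; leave full and bare names alone."""
    if name.startswith("{"):
        return name                      # already a clark-name
    prefix, sep, local = name.partition(":")
    if not sep:
        prefix, local = "", name         # no prefix: try the default ("")
    url = namespaces.get(prefix)
    return "{%s}:%s" % (url, local) if url else local


def local_part(clark_name):
    """Return the part of a clark-name after the {url}: qualifier, if any."""
    if clark_name.startswith("{"):
        return clark_name[clark_name.index("}") + 1:].lstrip(":")
    return clark_name


def end_tag_matches(start_clark, end_name, namespaces):
    """An end-tag may repeat the full name or drop the prefix / {} url."""
    if end_name.startswith("{") or ":" in end_name:
        return to_clark_name(end_name, namespaces) == start_clark
    return end_name == local_part(start_clark)
```

With `namespaces = {"svg": "http://www.w3.org/2000/svg"}`, the start-tag name `svg:svg` expands to a clark-name and the dropped-prefix end-tag `svg` still matches it.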
In case anyone is interested, I made up a little grammar to show the kind of thing I was thinking of, as a starting point not an end point, based on recent posts. Maybe having something concrete helps. So it is in two parts:
- First, a grammar which is not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar, not the lexer, which I think is easiest to read if unfamiliar.)
- Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline, or as some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.
This uses some extensions:
- == means "if"
- --> $something means a data type conversion
- -> means a substitution (handling references)
- . means a look-up in the lexical context, just a shorthand
GRAMMAR:
document = (element | comment | pi )+
element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag
start-tag = name attribute* EOM
name = START-TAG.TOKEN
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = TOKEN
typeable-token = boolean | year | number | symbol
boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate
number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?
--> $integer or $decimal
symbol = TOKEN
end-tag = END-TAG.TOKEN EOM
comment = COMMENT-TAG.CHARACTER* EOM
pi = piname CHAR* EOM
piname = PI-TAG.TOKEN EOM
Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed, and the recognition can be parallelized using SIMD.
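The "queuing the results of one thread into another" idea can be sketched with standard-library threads and queues in Python. This only shows the shape of such a pipeline: the function name `run_pipeline` and the stage functions passed to it are placeholders, and in CPython the interpreter lock limits how much real parallelism threads give for CPU-bound work.

```python
import queue
import threading

_DONE = object()  # sentinel that tells a stage there is no more input


def run_pipeline(items, stages):
    """Run each stage in its own thread, chaining them with FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)        # pass the end marker downstream
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:                   # e.g. one item per document section
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

Each stage here would stand for one lexical pass, and each item for one section of the input. Order is preserved because every stage is a single thread reading from a FIFO queue.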
LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ("<" MARKUP EOM==">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = ((?=[^!/?]) START-TAG) | COMPLEX-TAG
START-TAG = (TAG-TEXT \" ATTRIBUTE-TAG \"? ) +
Note: apos not supported as attribute delimiter here.
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )
-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE -> CHARACTER+)*
Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.
Note:
- I didn't bother to put the reference production: just & is start. Lazy.
- Hex NCR only?
- Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
- In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
- There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
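The substitution pass described in these notes could look roughly like the Python sketch below. It uses the standard library's HTML5 entity table (`html.entities.html5`) as a stand-in for the standard entity set, which is an assumption for the example, as is the function name `substitute_references`. Only hex numeric references are expanded, and unknown references are left untouched.

```python
import re
from html.entities import html5  # stand-in table: "mdash;" -> "\u2014", etc.

# "&" starts a reference: either a hex NCR (&#x...;) or a named entity.
_REF = re.compile(r"&(#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def substitute_references(text):
    """Replace hex character references and known entity references."""
    def expand(match):
        body = match.group(1)
        if body.startswith("#x"):
            code = int(body[2:], 16)
            # Leave out-of-range numbers as they were written.
            return chr(code) if code <= 0x10FFFF else match.group(0)
        # Unknown entity names are left as they were written.
        return html5.get(body + ";", match.group(0))
    return _REF.sub(expand, text)
```

Because the replacement text is never rescanned, a reference cannot expand into further markup or references, in line with the CDATA note above. A decimal reference such as `&#65;` is not matched at all and passes through unchanged.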
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"
PI-TAG = "?" TOKEN ws* CHARACTER* "?"
END-TAG = "/" TOKEN ws*