Re: [xml-dev] Napkin grammar

(Hi Tim!)

This is the kind of thing, at this premature stage:

<?whoami answer a PI: but even here 1 < 2 ?>

<tom age=3 birthday=1980-04-26

eyes=blue olympian=false

favorite-identity=" 1 < 2">some mixed content<x />.

</tom>

<dick note="second root!" xml:id=d123

X&x#36;YZ="references anywhere!">

element content here, but no CDATA marked sections

</dick> <and:harry xmlns:and="http://www.whatever.com/" />

So perhaps a less nannying, more consistent, more HTML-ish thing. (My productions may not support the empty tags)

The attribute values with no double quote are parsed/transduced into integer, gDate, name and boolean values. xml:id is parsed into an id, unique per document not per root (I guess.)
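A rough Python sketch of that typing step, purely to illustrate. The function name type_token is made up, and restricting dates to full yyyy-mm-dd values is an assumption (the grammar's year rule is looser than that).

```python
import re
from datetime import date

def type_token(token):
    """Classify an unquoted attribute value (hypothetical helper)."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if re.fullmatch(r"-?[0-9]+", token):
        return ("integer", int(token))
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", token):
        try:
            return ("date", date.fromisoformat(token))
        except ValueError:
            pass  # looks like a date but is not a real one: treat as a name
    return ("name", token)

# The unquoted attributes of <tom> in the example above.
for raw in ("3", "1980-04-26", "blue", "false"):
    print(raw, "->", type_token(raw))
```

Anything that matches none of the typed forms simply stays a name, so the typing never rejects a document by itself.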

The element name tom is in no namespace, not because there is no default declaration in scope, but because no prefix means no namespace: we don't need to look for a declaration.

Another idea (was this an old James Clark suggestion?) would be to also have string literals as names. That does not create lexical modes, but does allow much more idiomatic names for humans that might make for better messages too.

<"person of interest" "given name" = "Mary Grace" age=13 />

Cheers,

Rick

On Fri, 23 Jul. 2021, 00:40 Tim Bray, <tbray@textuality.com> wrote:

Example of what a document looks like?

On Thu., Jul. 22, 2021, 3:06 a.m. Rick Jelliffe, <rjelliffe@allette.com.au> wrote:
In case anyone is interested, I made a little grammar up to show the kind of thing that I was thinking of as a start point, not an end point, based on recent posts. Maybe having something concrete helps.

So it is two parts:
First, a grammar which is not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar, not the lexer, which I think is easiest to read if unfamiliar.)
Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline or some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
== means "if"
--> $something means a data type conversion
-> means a substitution (handling references)
. means a look-up in the lexical context, just a shorthand.

GRAMMAR:

document = (element | comment | pi )+

element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag

start-tag = name attribute* EOM

name = START-TAG.TOKEN

attribute = attname ( typeable-token | ATTRIBUTE-TEXT)

attname = TOKEN

typeable-token = boolean | year | number | symbol

boolean = TOKEN
== ("true" | "false" )
--> $boolean
year = TOKEN
== ( DECIMAL+ "-" CHARACTER* )
--> $yearDate

number = TOKEN
== ("-")? DECIMAL+ ("." CHARACTER+)?
--> $integer or $decimal

symbol = TOKEN
end-tag = END-TAG.TOKEN EOM

comment = COMMENT-TAG.CHARACTER* EOM

pi = piname CHAR* EOM

piname = PI-TAG.TOKEN EOM

Each lexical pass can be thread-parallelized by section. And the pass execution can be parallelized by e.g. queuing the results of one thread into another as needed. And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

TEXT = ws* ("<" MARKUP EOM==">" DATA? )+

Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
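A minimal Python sketch of this first pass, only as an illustration. The names SEGMENT and demarcate are made up, and it assumes that ">" never appears literally inside markup (it would have to be written as a reference), while "<" inside markup is allowed, as in the PI and attribute value of the example document.

```python
import re

# "<" MARKUP ">" DATA? : markup runs to the first ">", data to the next "<".
SEGMENT = re.compile(r"<([^>]*)>([^<]*)")

def demarcate(text):
    """Split a document into MARKUP / DATA events (hypothetical helper)."""
    pos = len(text) - len(text.lstrip())  # skip the leading ws*
    events = []
    while pos < len(text):
        m = SEGMENT.match(text, pos)
        if m is None:
            raise ValueError(f"expected '<' ... '>' at offset {pos}")
        events.append(("MARKUP", m.group(1)))
        if m.group(2):
            events.append(("DATA", m.group(2)))
        pos = m.end()
    # Per the note above: a terminating whitespace-only data section is ws.
    if events and events[-1][0] == "DATA" and events[-1][1].isspace():
        events[-1] = ("WS", events[-1][1])
    return events

print(demarcate('<a x=1>hi<b />.</a>\n'))
```

Because every segment starts at a "<", a long input could be cut at any "<" that is known to open markup and the pieces handed to separate workers. Finding such cut points safely is not shown here.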

LEXICAL PASS 2: ATTRIBUTE DEMARCATION

MARKUP = ((?=[^!/?]) START-TAG) | COMPLEX-TAG

START-TAG = (TAG-TEXT \" ATTRIBUTE-TEXT \"? ) +

Note: apos not supported as attribute delimiter here.
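One way to picture this pass in Python, as a sketch only. The helper name split_start_tag is made up. It leans on the fact that only the double quote delimits attribute text, so splitting on it gives alternating tag text and ATTRIBUTE-TEXT (as the grammar calls it), and it tolerates a missing final quote the way the optional closing quote in the production does.

```python
def split_start_tag(markup):
    """Split start-tag markup into TAG-TEXT / ATTRIBUTE-TEXT runs (hypothetical helper)."""
    runs = []
    for i, part in enumerate(markup.split('"')):
        if i % 2 == 1:
            runs.append(("ATTRIBUTE-TEXT", part))  # inside quotes, may be empty
        elif part:
            runs.append(("TAG-TEXT", part))        # outside quotes, skip empties
    return runs

print(split_start_tag('dick note="second root!" xml:id=d123'))
```

Note that this assumes a literal double quote never appears inside an attribute value; it would have to be written as a reference and resolved in the next pass.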

LEXICAL PASS 3: REFERENCE SUBSTITUTION

( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )

-> (CHARACTER
| NUMERIC-CHARACTER-REFERENCE -> CHARACTER
| ENTITY-REFERENCE  -> CHARACTER+)*

Note: numeric character reference is hex numeric character reference to Unicode number a la XML. No decimal reference. I didn't bother to put the production in, but it looks for &.

Note:
I didn't bother to put the reference production: just & is start. Lazy.
Hex NCR only?
Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
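A small Python sketch of the substitution, with loud assumptions: since no reference production is given, it assumes the XML-like spellings &#xHEX; and &name;, and the ENTITIES table is a tiny made-up stand-in for the full ISO/W3C/MathML set. The names REFERENCE and substitute are also made up.

```python
import re

# Stand-in table: a real one would carry the full W3C entity mappings.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "eacute": "\u00e9"}

# Assumed syntax: hex numeric reference or named entity reference.
REFERENCE = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")

def substitute(text, entities=ENTITIES):
    """Replace references in one left-to-right pass (hypothetical helper)."""
    def expand(m):
        hex_digits, name = m.group(1), m.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if name not in entities:
            raise KeyError(f"unknown entity reference: &{name};")
        return entities[name]
    return REFERENCE.sub(expand, text)

print(substitute("X&#x24;YZ caf&eacute; 1 &lt; 2"))
```

The replacement text is never re-scanned, which matches the "all entities are CDATA" rule: for instance &amp;lt; comes out as the four characters &lt; and is not expanded again.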

LEXICAL PASS 4: TOKENIZATION

TAG-TEXT = ( ws | "=" | TOKEN )+

COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"

PI-TAG = "?" TOKEN ws* CHARACTER* "?"

END-TAG = "/" TOKEN ws*
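To make the last pass concrete, here is a hedged Python sketch. The names classify and tokenize are made up, and because TOKEN is not defined above, it is assumed here to be any run of characters that is neither whitespace nor "=".

```python
import re

# TAG-TEXT = ( ws | "=" | TOKEN )+ with TOKEN assumed as described above.
TAG_TEXT = re.compile(r"\s+|=|[^\s=]+")

def classify(markup):
    """Pick the tag kind from the leading characters (hypothetical helper)."""
    if markup.startswith("!--") and markup.endswith("--") and len(markup) >= 5:
        return "COMMENT-TAG"
    if markup.startswith("?") and markup.endswith("?") and len(markup) >= 2:
        return "PI-TAG"
    if markup.startswith("/"):
        return "END-TAG"
    return "START-TAG"

def tokenize(tag_text):
    """Turn tag text into TOKEN and EQ events, dropping whitespace."""
    tokens = []
    for m in TAG_TEXT.finditer(tag_text):
        piece = m.group()
        if piece == "=":
            tokens.append(("EQ", "="))
        elif not piece.isspace():
            tokens.append(("TOKEN", piece))
    return tokens

print(classify("/tom"), tokenize("tom age=3 eyes=blue"))
```

With this reading of TOKEN, the trailing "/" of an empty tag would just come out as one more token, so empty tags would need their own rule.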