XML.org
OASIS Mailing List Archives

Re: [xml-dev] Napkin grammar

(Hi Tim!)

This is the kind of thing, at this premature stage: 

<!-- I am a comment: 1 &lt; 2 -->
<?whoami  answer a PI: but even here 1 &lt; 2 ?>

<tom age=3 birthday=1980-04-26 
       eyes=blue olympian=false
      favorite-identity=" 1 &lt; 2">some mixed content<x />.
 </tom>
<dick note="second root!"  xml:id=d123
      X&x#36;YZ="references anywhere!">
    element content here, but no CDATA marked sections
</dick> <and:harry xmlns:and="http://www.whatever.com/" />

So perhaps a less nannying, more consistent, more HTML-ish thing. (My productions may not support the empty tags.)

Attribute values without double quotes are parsed/transduced into integer, gDate, name and boolean values. xml:id is parsed into an id, unique per document rather than per root (I guess).
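
A minimal Python sketch of that typing step, for illustration only. The function name `type_unquoted_value`, the regular expressions, and the order of the tests are assumptions made here, not part of the posted grammar; the date test assumes a full ISO calendar date such as the `birthday` value in the example.

```python
import re
from datetime import date
from decimal import Decimal

def type_unquoted_value(token: str):
    """Convert an unquoted attribute token to a typed value (hypothetical helper)."""
    if token in ("true", "false"):
        return token == "true"
    # Assumed date shape: YYYY-MM-DD, as in birthday=1980-04-26.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", token):
        try:
            return date.fromisoformat(token)
        except ValueError:
            return token  # looks like a date but is not a valid one
    if re.fullmatch(r"-?\d+", token):
        return int(token)
    if re.fullmatch(r"-?\d+\.\d+", token):
        return Decimal(token)
    # Anything else is kept as a name/symbol.
    return token
```

Applied to the example, `age=3` would become an integer, `eyes=blue` would stay a name, and `olympian=false` would become a boolean.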

The element name tom is in no namespace, not because there is no default declaration in scope, but because no prefix means no namespace: we don't need to look for a declaration.
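
The rule can be sketched in a few lines of Python. The name `resolve_name` and the `declarations` mapping (prefix to namespace URI) are hypothetical; treating the `xml` prefix as predeclared is also an assumption, borrowed from XML Namespaces.

```python
# Assumption: the xml prefix is predeclared, as in XML Namespaces.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def resolve_name(qname: str, declarations: dict):
    """Return (namespace_uri_or_None, local_name) for a possibly prefixed name."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No prefix means no namespace: no lookup of declarations is needed.
        return (None, qname)
    if prefix == "xml":
        return (XML_NS, local)
    if prefix not in declarations:
        raise ValueError(f"undeclared prefix: {prefix}")
    return (declarations[prefix], local)
```

Only prefixed names ever consult the declarations in scope, so `tom` resolves without looking at any context.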

Another idea (was this an old James Clark suggestion?) would be to also have string literals as names. That does not create lexical modes, but does allow much more idiomatic names for humans, which might make for better messages too.

  <"person of interest"  "given name" = "Mary Grace" age=13  />

Cheers,
Rick

On Fri, 23 Jul. 2021, 00:40 Tim Bray, <tbray@textuality.com> wrote:
Example of what a document looks like?

On Thu., Jul. 22, 2021, 3:06 a.m. Rick Jelliffe, <rjelliffe@allette.com.au> wrote:
In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of, as a start point not an end point, based on recent posts. Maybe having something concrete helps.

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind.  The capitalized names in the grammar are the non-terminals determined by the lexical processing.  (The sub-rules for recognizing the types of undelimited data values are given in the grammar, not the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline or some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean |  year |  number |  symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?    

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHARACTER*  EOM

    piname = PI-TAG.TOKEN  EOM


Each lexical pass can be thread-parallelized by section.  And the pass execution can be parallelized by e.g. queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
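
A minimal Python sketch of this pass, for illustration. It assumes that a literal "<" or ">" only ever appears as a tag delimiter (as in the example document, where even comments write `&lt;`), so markup can be found by scanning for those two characters alone. The function name `demarcate` and the `("MARKUP", …)` / `("DATA", …)` tuples are hypothetical.

```python
def demarcate(text: str):
    """Pass 1: split text into ("MARKUP", s) and ("DATA", s) segments."""
    segments = []
    # Skip the leading ws* allowed before the first tag.
    pos = len(text) - len(text.lstrip())
    while pos < len(text):
        if text[pos] != "<":
            raise ValueError(f"expected '<' at offset {pos}")
        end = text.find(">", pos)  # EOM
        if end == -1:
            raise ValueError("unterminated tag")
        segments.append(("MARKUP", text[pos + 1:end]))
        # Everything up to the next "<" (or end of input) is DATA.
        nxt = text.find("<", end + 1)
        if nxt == -1:
            nxt = len(text)
        if nxt > end + 1:
            segments.append(("DATA", text[end + 1:nxt]))
        pos = nxt
    return segments
```

Because no quoting or mode has to be tracked to find the delimiters, a section of the input could be handed to a separate worker and the results stitched together afterwards.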


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?])  START-TAG )  |  COMPLEX-TAG

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: numeric character reference is hex numeric character reference to Unicode number a la XML. No decimal reference. I didn't bother to put the production in, but it looks for &. 

   Note:

  • I didn't bother to put the reference production: just & is start. Lazy.  
  • Hex NCR only?
  • Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
  •   In SGML terms, all entities are CDATA: no markup or references are allowed in entity references, and an entity must not expand to more characters than the reference.
  •   There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
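
A rough Python sketch of the substitution pass. The reference syntax (`&#x…;` for hex numeric character references, `&name;` for entity references) is assumed to follow XML, since the production is not given above. Python's `html.entities.html5` table is used purely as a stand-in for the ISO/SGML/W3C/MathML entity set, and the name `substitute_references` is hypothetical.

```python
import re
from html.entities import html5  # stand-in entity table (assumption)

# Hex numeric character reference or named entity reference, XML-style.
_REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9.]*));")

def substitute_references(text: str) -> str:
    """Pass 3: replace references with the characters they stand for."""
    def expand(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        name = match.group(2) + ";"
        if name not in html5:
            raise ValueError(f"unknown entity: {match.group(0)}")
        return html5[name]
    # A single substitution pass: replacement text is never rescanned,
    # which matches entities being CDATA with no nested references.
    return _REF.sub(expand, text)
```

Since the replacement is never longer than the reference in this sketch's common cases and is not rescanned, the pass can run over independent sections of the input.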


LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*
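
The TAG-TEXT rule can be sketched in Python as below. What counts as a TOKEN is not spelled out above, so this sketch assumes a TOKEN is any run of characters other than whitespace and "="; the name `tokenize_tag_text` is hypothetical, and whitespace is dropped from the output.

```python
import re

# One alternative per TAG-TEXT item: ws, "=", or TOKEN (assumed shape).
_ITEM = re.compile(r"\s+|=|[^\s=]+")

def tokenize_tag_text(tag_text: str):
    """Pass 4: split TAG-TEXT into "=" signs and TOKENs, discarding whitespace."""
    return [item for item in _ITEM.findall(tag_text) if not item.isspace()]
```

For the start-tag text `tom age=3 birthday=1980-04-26` this yields the tokens `tom`, `age`, `=`, `3`, `birthday`, `=`, `1980-04-26`, which is the stream the grammar's `name` and `attribute` rules would consume.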
