XML.orgXML.org
FOCUS AREAS |XML-DEV |XML.org DAILY NEWSLINK |REGISTRY |RESOURCES |ABOUT
OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
Re: [xml-dev] Napkin grammar

Ameila writes:
> This also meets the stated purpose more nearly: start parsing anywhere.
> Well, any time you've got namespace bindings (whether to a non-default
> prefix or to the default prefix), you can't start parsing anywhere; you
> have to know what the bindings are. On the other hand, however you got
> to that position in the document, you do know what the context is.

I think Pete's idea requires that the (:whatever ... :)  tag comes at the top of
the document, before the first root. So it can reliably be available before
any further parsing.  

Default namespaces?   Scoped IDs?

About using the default namespace, as a simplification, It is an idea with
looking at. But I think the problem (for parsing from some random point
in a document) is not the prefixes but locating the in-scope declarations.

All we strictly need is that the namespace binding be known by the time
someone looks at that name. Now if we assume some SAX style stream and
non-lazt parsing, then  that means we indeed will have access to the context of ancestor elements.
I think that is the scenario behind Amelia's comment "On the other hand, however 
you got to that position in the document, you do know what the context is."

However, if our scenario is that we are building a lazy tree, and avoiding as far as 
possible any lexing or parsing on any text until the info is needed. In that case,
we really do need to have the declarations at the start (or at the end, or have some
index info to allow random access...)    Therefore we cannot merely redefine the 
default namespace at any element we need, I think.

What kinds of other things might be needed to make this practical in some sitations?

One is this:  the language allows multiple branches (like a WF external general entity
considered in isolation); lets say each branch root has an @id  ID identifier: A1, A2, A3, etc. 
Now lets say that IDs only need to be unique within each branch, but that 
IDs can be referenced between branches by prefixing them with the @id of that
branch. E.g. if there is some descented element of A1 with an id  ABC then it
can be referenced as  idref ABC (meaning the ABC in the current branch,
otherwise, say, the first that appears in the document) at but also A1:ABC.
I'll call that a scoped ID.

So If I want to open a document, and only look at the sub-branch of A1:ABC,
the minimal work is to lexically scan all <> (e.g with SIMD) to find each branch root, for which
parse the root start-tag to find the @id of A1, then under that scan all <> (we may have
memoized the previous scan, of course) and find any attribute literals
and resolve any reference in them, find any with "ABC" and lex.parse that
element as needed.  

So this hierarchical id is one opportunity for parallelism made relatively easy
 by having the multiple branches in the document.  But if this then requires us to locate
in-scope namespace declaration the only
 feasible place for them that is constant-time is at/on the element that uses the
prefix or at/on the current branch root, or at the document start, it seems to me.

So this means that we could allow different branches to have different prefix
bindings and even different default namespaces (if they are allowed) without
it having a performance impact.  It is declaration or redeclaration within a branch 
that  can blow out (in the lazy scenario above, at least). So the issue of having
default namespaces or not, is not one of performance (except for the cost
of tokenizing prefixes.)   

(I still think that default namespaces suck.  IIRC the rationale for
the default namespace was so that mooted XHTML documents could be
in a namespace but look like HTML, and be edited by HTML editors. I don't
think that is remotely a requirement, on the lines both that we don't
need to, and that it has not taken off.)

Regards
Rick


Background thoughts on modes: probably wrong post

A very conventional way to generalize computer language parsers used to be
as two independent co-routines 1.lexical analysis  goes from text to tokens, 
2.  the parser proper go from tokens to the parse tree.   

I think it should be a technical gaol to allow lexing (finding tokens and delimiters) 
from  any milestone point in parallel,  and as much  parsing as possible.

(Please forgive if I flip between automata and grammar terminology below,
or make up my own sometimes: it is not an academic paper.)

(We can say a language is modal if the grammar co-routine does anything 
to set the state of the lexer co-routine. SGML is modal at this level. )

We might distinguish different kinds of lexers from the kind of grammar/automata
 they can have:

1.  Single-state 
      You can have only one state and all allowed symbols do not change it.
       e.g.    document =  ( A | B | ... all other allowed characters )*

2.  Unique-entry point state machine  (what is the actual name, I wonder?)
        This is a state machine where for any input symbol, all transitions go to 
       a single state. This is again one where you can start at any point in the input
       and know where you are. (We can call this non-modal at the lexical level.)

3.  (Deterministic) State machine where at least one input symbol does not always
      transition to the same state.
       This kind of state machine, you cannot know which state you are in to start
       parsing if you happen to start from any non-unique-entry symbol. You have
       to progress forward until you do find a unique-entry symbol. Lets call it
      a milestone.

  4.  Unique-entry-point non-recursive Stack machine
       This is a stack machine where for any input symbol there is only one transition 
       destination (and push/pop) and transitions form a directed graph, so that
       the states allowed at each stack level are unique (so there are no state transitions
       between the states of different levels, only push/pop transitions)
       This means the possible stack paths are finite. For any point in the input, you can 
       establish  both the state and the necessary call stack: you know where you are.
       (We can call this non-modal at the lexical level)

    5.  (Deterministic) Stack Machine with at least one transition  that does not match the requirements of 4
        e.g. because of multiple possible destination states for a symbol  may exhibit recursion (direct or indirect) .
        This kind of stack machine you can only know where you are when starting from any 
       arbitrary point in the input stream if you start with a symbol that is only reachable from the start state, 
       i.e, any non-modal part of the machine.
       Otherwise, you don't know  the stack depth or which transition was taken. 
       So you have to scan forward until you find a unique-entry-point character which, if non-recursive, allows
       you to recreate the state.

Now lets imagine versions of these four which allow parsing backwards as well as forwards. So from any point
in the input, we can  parse backwards from our milestone.  (I *think* it is not necessary that the type of machine
 that lets you   scan forwards is the same as the one that lets you scan backwards.)  Providing we can scan back
enough  

But some issues repeat at the grammar level.  So if we want to be able to process  from any point in the input,
we have to find both lexer and parser milestones, and we need to know that lexer state could not have been 
influenced by signals from the parser co-routine. 

Bottom line: The more that the lexer and grammar can be non-modal,  the more of the work that can be 
performed in parallel, and less the amount of work that may need to be done completing the parse-tree fragments 
or token lists and stitching them together.
         
       
     
       
        
        
       





On Fri, Jul 23, 2021 at 1:07 PM Amelia A Lewis <amyzing@talsever.com> wrote:
On Thu, 22 Jul 2021 23:20:37 +0100, Pete Cordell wrote:
> If you're not interested in the new syntax being a subset of XML and
> you still want namespaces, you'll want to consider an alternative way
> of mapping namespace prefixes to namespaces so that the mapping is
> available BEFORE it is required.  Currently the mapping mechanism
> requires a fair bit of pre-fetching and caching which is sub-optimal.
>
> Something like the following might work:
>
> <:and http://www.whatever.com/:>
> <and:harry />

Why? That still separates :and from and:, so if you start parsing on
the <and:harry /> line, you're busted. And why use a non-default prefix
for an element?

<harry xmlns="http://www.whatever.com">

It is almost never *necessary* to bind a namespace to a prefix for use
with elements. It can be verbose to repeatedly re-declare with a bad
schema design like this:

<root>
  <othernschild xmlns="http://other.ns.example.com" />
  <othernschild xmlns="http://other.ns.example.com" />
  <othernschild xmlns="http://other.ns.example.com" />
  <othernschild xmlns="http://other.ns.example.com" />
  <!-- ... -->
</root>

This also meets the stated purpose more nearly: start parsing anywhere.
Well, any time you've got namespace bindings (whether to a non-default
prefix or to the default prefix), you can't start parsing anywhere; you
have to know what the bindings are. On the other hand, however you got
to that position in the document, you do know what the context is.

The only times that you *need* to use a non-default prefix are: (1)
global attributes (so use xmlns:attr-prefix="xyzzy" for
attr-prefix:attr="nothing happens") and (2) QNames in content (a
layering violation, so do not use QNames in content in new
specifications, and deal with the breakage for XML Schema, XPath/XSLT,
and WSDL (and other currently-defind specifications that followed their
lead toward using QNames in content) (wherever QNames in content
appear, a better specification would provide namespace-name pairs, or
if it has to use abbreviations for namespaces, then it should define
its own syntax entirely in content, so that it passes through without
namespace mangling by an XML processor). Note that XPath expressions
also contain QNames in content as steps in path expressions, but that's
just an extended case of QNames in content. For elements, the XML is
almost always easier to read if you rebind each namespace to the
default prefix at the point that it's needed (it keeps the declaration
more local).

And you don't have to redefine any current specs (well, it would be
nice to take QNames in content out of XML Schema and XPath/XSLT and
others, but ain't gonna happen, so accept that they've got that wart
and deal with it). It's just a best (or merely a "better") practice for
the handling of elements in different namespaes in a single document.

Amy!
--
Amelia A. Lewis                    amyzing {at} talsever.org
  "You go on.  You just go on.  There's nothing more to it, and there's
no trick to make it easier.  You just go on."
  "What do you find on the other side?  When you go on?"
  She shrugged.  "Your life again.  What else?"
                -- Harra Csurik and Miles Vorkosigan

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 1993-2007 XML.org. This site is hosted by OASIS