Amelia writes:
> This also meets the stated purpose more nearly: start parsing anywhere.
> Well, any time you've got namespace bindings (whether to a non-default
> prefix or to the default prefix), you can't start parsing anywhere; you
> have to know what the bindings are. On the other hand, however you got
> to that position in the document, you do know what the context is.
I think Pete's idea requires that the (:whatever ... :) tag come at the top of
the document, before the first root, so that it is reliably available before
any further parsing.
Default namespaces? Scoped IDs?
About using the default namespace as a simplification: it is an idea worth
looking at. But I think the problem (for parsing from some random point
in a document) is not the prefixes but locating the in-scope declarations.
All we strictly need is that the namespace binding be known by the time
someone looks at that name. Now if we assume some SAX-style stream and
non-lazy parsing, then we will indeed have access to the context of ancestor elements.
I think that is the scenario behind Amelia's comment "On the other hand, however
you got to that position in the document, you do know what the context is."
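In that streaming scenario, the in-scope bindings fall out of a simple stack.
A minimal Python sketch of the idea (the event tuples here are invented for
illustration, not any real SAX API):

    def resolve(events):
        scopes = [{"": None}]               # stack of {prefix: uri} frames
        for kind, data in events:
            if kind == "start":             # data = (name, declarations)
                name, decls = data
                frame = dict(scopes[-1])    # inherit ancestor bindings
                frame.update(decls)         # redeclarations shadow outer ones
                scopes.append(frame)
                prefix, _, local = name.rpartition(":")
                yield (frame.get(prefix), local)
            elif kind == "end":
                scopes.pop()

    events = [("start", ("root", {"": "urn:a"})),
              ("start", ("x:item", {"x": "urn:b"})),
              ("end", None), ("end", None)]
    print(list(resolve(events)))   # [('urn:a', 'root'), ('urn:b', 'item')]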
However, suppose our scenario is that we are building a lazy tree, avoiding as far as
possible any lexing or parsing of any text until the info is needed. In that case,
we really do need to have the declarations at the start (or at the end, or have some
index info to allow random access...). Therefore we cannot merely redefine the
default namespace at any element we need, I think.
What kinds of other things might be needed to make this practical in some situations?
One is this: the language allows multiple branches (like a WF external general entity
considered in isolation); let's say each branch root has an @id ID attribute: A1, A2, A3, etc.
Now let's say that IDs only need to be unique within each branch, but that
IDs can be referenced between branches by prefixing them with the @id of that
branch. E.g. if there is some descendant element of A1 with an id ABC, then it
can be referenced as idref ABC (meaning the ABC in the current branch,
otherwise, say, the first that appears in the document) but also as A1:ABC.
I'll call that a scoped ID.
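The lookup rule, as a minimal Python sketch. The pre-built index
{branch id: {local id: node}} is hypothetical; in the lazy scenario it would be
discovered on demand rather than built up front:

    def resolve_idref(idref, current_branch, index, branch_order):
        if ":" in idref:                          # qualified: e.g. A1:ABC
            branch, local = idref.split(":", 1)
            return index.get(branch, {}).get(local)
        here = index[current_branch].get(idref)   # current branch wins
        if here is not None:
            return here
        for b in branch_order:                    # else first in document order
            if idref in index[b]:
                return index[b][idref]
        return None

    index = {"A1": {"ABC": "<node in A1>"}, "A2": {"ABC": "<node in A2>"}}
    print(resolve_idref("A2:ABC", "A1", index, ["A1", "A2"]))   # <node in A2>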
So if I want to open a document and only look at the sub-branch of A1:ABC,
the minimal work is to lexically scan all <> (e.g. with SIMD) to find each branch root,
parse each root start-tag to find the one whose @id is A1, then under that branch
scan all <> again (we may have memoized the previous scan, of course), find the
attribute literals, resolve any reference in them, find any with "ABC", and lex/parse
that element as needed.
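Roughly, with plain Python standing in for the SIMD scan (tag and attribute
syntax heavily simplified; comments, CDATA, and quoting subtleties ignored):

    import re

    def find_branch_root(text, branch_id):
        """Scan every tag, but parse only the depth-0 start-tags,
        looking for the branch root whose @id matches."""
        depth = 0
        for m in re.finditer(r"<(/?)([^\s>/]+)([^>]*?)(/?)>", text):
            closing, name, attrs, selfclose = m.groups()
            if depth == 0 and not closing:
                found = re.search(r'id="([^"]*)"', attrs)
                if found and found.group(1) == branch_id:
                    return m.start()      # parse under this root only
            if closing:
                depth -= 1
            elif not selfclose:
                depth += 1
        return None

    doc = '<A1 id="A1"><x id="ABC"/></A1><A2 id="A2"/>'
    print(find_branch_root(doc, "A2"))    # 30

The same scan, repeated under that root, would locate the attribute literal
with ABC.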
So this hierarchical ID is one opportunity for parallelism made relatively easy
by having the multiple branches in the document. But if this then requires us to locate
in-scope namespace declarations, the only feasible places for them that allow
constant-time lookup are at/on the element that uses the prefix, at/on the current
branch root, or at the document start, it seems to me.
So this means that we could allow different branches to have different prefix
bindings and even different default namespaces (if they are allowed) without
it having a performance impact. It is declaration or redeclaration within a branch
that can blow out (in the lazy scenario above, at least). So the issue of having
default namespaces or not is not one of performance (except for the cost
of tokenizing prefixes).
(I still think that default namespaces suck. IIRC the rationale for
the default namespace was so that mooted XHTML documents could be
in a namespace but look like HTML, and be edited by HTML editors. I don't
think that is remotely a requirement, on the grounds both that we don't
need it and that it has not taken off.)
Regards
Rick
Background thoughts on modes: probably wrong post
A very conventional way to generalize computer language parsers used to be
as two independent co-routines: 1. lexical analysis goes from text to tokens;
2. the parser proper goes from tokens to the parse tree.
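Python generators give the flavour of that arrangement (the toy
parenthesized-word language here is invented for illustration):

    import re

    def lex(text):                     # co-routine 1: text -> tokens
        for tok in re.findall(r"[()]|\w+", text):
            yield tok

    def parse(tokens):                 # co-routine 2: tokens -> tree
        def node():
            children = []
            for tok in tokens:
                if tok == "(":
                    children.append(node())
                elif tok == ")":
                    return children
                else:
                    children.append(tok)
            return children
        return node()

    print(parse(lex("(a (b c) d)")))   # [['a', ['b', 'c'], 'd']]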
I think it should be a technical goal to allow lexing (finding tokens and delimiters)
from any milestone point in parallel, and as much parsing as possible.
(Please forgive if I flip between automata and grammar terminology below,
or make up my own sometimes: it is not an academic paper.)
(We can say a language is modal if the grammar co-routine does anything
to set the state of the lexer co-routine. SGML is modal at this level.)
We might distinguish different kinds of lexers from the kind of grammar/automata
they can have:
1. Single-state
There is only one state, and no allowed symbol changes it.
e.g. document = ( A | B | ... all other allowed characters )*
2. Unique-entry-point state machine (what is the actual name, I wonder?)
This is a state machine where for any input symbol, all transitions go to
a single state. This is again one where you can start at any point in the input
and know where you are. (We can call this non-modal at the lexical level.)
3. (Deterministic) State machine where at least one input symbol does not always
transition to the same state.
With this kind of state machine, you cannot know which state you are in to start
parsing if you happen to start from a non-unique-entry symbol. You have
to progress forward until you do find a unique-entry symbol. Let's call it
a milestone. (Kinds 2 and 3 are sketched in code after this list.)
4. Unique-entry-point non-recursive stack machine
This is a stack machine where for any input symbol there is only one transition
destination (and push/pop), and the transitions form a directed graph, so that
the states allowed at each stack level are unique (there are no state transitions
between the states of different levels, only push/pop transitions).
This means the possible stack paths are finite. For any point in the input, you can
establish both the state and the necessary call stack: you know where you are.
(We can call this non-modal at the lexical level.)
5. (Deterministic) Stack machine with at least one transition that does not meet
the requirements of 4, e.g. because multiple possible destination states for a
symbol may exhibit recursion (direct or indirect).
With this kind of stack machine, starting from an arbitrary point in the input stream,
you can only know where you are if you start with a symbol that is only reachable
from the start state, i.e., in some non-modal part of the machine.
Otherwise, you don't know the stack depth or which transition was taken.
So you have to scan forward until you find a unique-entry-point character which,
if non-recursive, allows you to recreate the state. (Kinds 4 and 5 are contrasted
in the second sketch after this list.)
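To make kinds 2 and 3 concrete, a minimal Python sketch. The toy alphabet and
language rules are invented for illustration:

    # Kind 2: every symbol has one destination regardless of the current
    # state, so the state is just a function of the last symbol read.
    DEST = {"a": "A", "b": "B", "c": "C"}

    def state_at(text, offset):
        return DEST[text[offset - 1]]  # start anywhere: look at one symbol

    # Kind 3: '"' toggles between CODE and STRING (its destination depends
    # on the current state), but if the language forbids newlines inside
    # strings, '\n' always lands in CODE: a unique-entry symbol, a milestone.
    def sync(text, offset):
        i = text.find("\n", offset)    # scan forward to the milestone
        if i == -1:
            return None, None          # no milestone before end of input
        return "CODE", i + 1           # provably in CODE after a newline

    print(state_at("abcab", 3))        # 'C' (no scanning needed)
    print(sync('x = "a<b"\ny = 1', 6)) # ('CODE', 10)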
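And a sketch of why kind 4 lets you reconstruct the stack where kind 5 does not
(the bracket-to-level assignment is invented for illustration):

    # Kind 4: each stack level has its own delimiters, so any single
    # symbol pins down both the state and the entire stack below it.
    LEVEL = {"{": 1, "}": 0, "[": 2, "]": 1, "(": 3, ")": 2}
    STACK_AT = {0: [], 1: ["{"], 2: ["{", "["], 3: ["{", "[", "("]}

    def where_am_i(symbol):
        level = LEVEL[symbol]          # one destination per symbol
        return level, STACK_AT[level]  # the stack is implied by the level

    print(where_am_i("["))             # (2, ['{', '['])

    # Kind 5, by contrast: if "(" may nest inside "(" (recursion), then
    # seeing "(" tells you neither the depth nor the stack, and you must
    # scan for a symbol reachable only from the start state.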
Now let's imagine versions of these five which allow parsing backwards as well as
forwards, so that from any point in the input we can parse backwards from our milestone.
(I *think* it is not necessary that the type of machine that lets you scan forwards
is the same as the one that lets you scan backwards.) Provided we can scan back
far enough, we can recover the context that the milestone alone does not supply.
But some issues repeat at the grammar level. So if we want to be able to process from any point in the input,
we have to find both lexer and parser milestones, and we need to know that the lexer state could not have been
influenced by signals from the parser co-routine.
Bottom line: the more that the lexer and grammar can be non-modal, the more of the work can be
performed in parallel, and the less work needs to be done completing the parse-tree fragments
or token lists and stitching them together.
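As a closing sketch of that payoff, in Python (threads standing in for real
parallelism here; the token syntax and the '\n' milestone are invented, as above):
split the input at milestones, lex each chunk independently, and stitch the token
lists together.

    import re
    from concurrent.futures import ThreadPoolExecutor

    def lex_chunk(chunk):
        # Tokens: string literals, words, single punctuation characters.
        return re.findall(r'"[^"]*"|\w+|[^\s\w]', chunk)

    def parallel_lex(text):
        chunks = text.split("\n")              # every chunk begins at a
        with ThreadPoolExecutor() as pool:     # milestone, so each can be
            token_lists = pool.map(lex_chunk, chunks)  # lexed independently
        return [tok for toks in token_lists for tok in toks]

    print(parallel_lex('x = "a<b"\ny = 1'))
    # ['x', '=', '"a<b"', 'y', '=', '1']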