Amelia writes:
> This also meets the stated purpose more nearly: start parsing anywhere.
> Well, any time you've got namespace bindings (whether to a non-default
> prefix or to the default prefix), you can't start parsing anywhere; you
> have to know what the bindings are. On the other hand, however you got
> to that position in the document, you do know what the context is.
I think Pete's idea requires that the (:whatever ... :) tag come at the top of
the document, before the first root, so that it is reliably available before
any further parsing.
Default namespaces? Scoped IDs?
About using the default namespace as a simplification: it is an idea worth
looking at. But I think the problem (for parsing from some random point
in a document) is not the prefixes but locating the in-scope declarations.
All we strictly need is that the namespace binding be known by the time
someone looks at that name. Now if we assume some SAX-style stream and
non-lazy parsing, then we will indeed have access to the context of ancestor elements.
I think that is the scenario behind Amelia's comment "On the other hand, however
you got to that position in the document, you do know what the context is."
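In that streaming scenario, the in-scope bindings fall out of a simple stack.
A minimal Python sketch of the idea (the event tuples here are invented for
illustration, not any real SAX API):

    def resolve(events):
        scopes = [{"": None}]               # stack of {prefix: uri} frames
        for kind, data in events:
            if kind == "start":             # data = (name, declarations)
                name, decls = data
                frame = dict(scopes[-1])    # inherit ancestor bindings
                frame.update(decls)         # redeclarations shadow outer ones
                scopes.append(frame)
                prefix, _, local = name.rpartition(":")
                yield (frame.get(prefix), local)
            elif kind == "end":
                scopes.pop()

    events = [("start", ("root", {"": "urn:a"})),
              ("start", ("x:item", {"x": "urn:b"})),
              ("end", None), ("end", None)]
    print(list(resolve(events)))   # [('urn:a', 'root'), ('urn:b', 'item')]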
However, suppose our scenario is that we are building a lazy tree, avoiding as far as
possible any lexing or parsing of any text until the info is needed. In that case,
we really do need to have the declarations at the start (or at the end, or have some
index info to allow random access...). Therefore we cannot merely redefine the
default namespace at any element we need, I think.
What kinds of other things might be needed to make this practical in some situations?
One is this: the language allows multiple branches (like a WF external general entity
considered in isolation); let's say each branch root has an @id ID attribute: A1, A2, A3, etc.
Now let's say that IDs only need to be unique within each branch, but that
IDs can be referenced between branches by prefixing them with the @id of that
branch. E.g. if there is some descendant element of A1 with an id ABC, then it
can be referenced as idref ABC (meaning the ABC in the current branch,
otherwise, say, the first that appears in the document) but also as A1:ABC.
I'll call that a scoped ID.
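The lookup rule, as a minimal Python sketch. The pre-built index
{branch id: {local id: node}} is hypothetical; in the lazy scenario it would be
discovered on demand rather than built up front:

    def resolve_idref(idref, current_branch, index, branch_order):
        if ":" in idref:                          # qualified: e.g. A1:ABC
            branch, local = idref.split(":", 1)
            return index.get(branch, {}).get(local)
        here = index[current_branch].get(idref)   # current branch wins
        if here is not None:
            return here
        for b in branch_order:                    # else first in document order
            if idref in index[b]:
                return index[b][idref]
        return None

    index = {"A1": {"ABC": "<node in A1>"}, "A2": {"ABC": "<node in A2>"}}
    print(resolve_idref("A2:ABC", "A1", index, ["A1", "A2"]))   # <node in A2>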
So if I want to open a document and only look at the sub-branch of A1:ABC,
the minimal work is to lexically scan all <> (e.g. with SIMD) to find each branch root,
parse each root start-tag to find the one whose @id is A1, then under that branch
scan all <> again (we may have memoized the previous scan, of course), find the
attribute literals, resolve any reference in them, find any with "ABC", and lex/parse
that element as needed.
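Roughly, with plain Python standing in for the SIMD scan (tag and attribute
syntax heavily simplified; comments, CDATA, and quoting subtleties ignored):

    import re

    def find_branch_root(text, branch_id):
        """Scan every tag, but parse only the depth-0 start-tags,
        looking for the branch root whose @id matches."""
        depth = 0
        for m in re.finditer(r"<(/?)([^\s>/]+)([^>]*?)(/?)>", text):
            closing, name, attrs, selfclose = m.groups()
            if depth == 0 and not closing:
                found = re.search(r'id="([^"]*)"', attrs)
                if found and found.group(1) == branch_id:
                    return m.start()      # parse under this root only
            if closing:
                depth -= 1
            elif not selfclose:
                depth += 1
        return None

    doc = '<A1 id="A1"><x id="ABC"/></A1><A2 id="A2"/>'
    print(find_branch_root(doc, "A2"))    # 30

The same scan, repeated under that root, would locate the attribute literal
with ABC.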
So this hierarchical ID is one opportunity for parallelism made relatively easy
by having the multiple branches in the document. But if this then requires us to locate
in-scope namespace declarations, the only feasible places for them that allow
constant-time lookup are at/on the element that uses the prefix, at/on the current
branch root, or at the document start, it seems to me.
So this means that we could allow different branches to have different prefix
bindings and even different default namespaces (if they are allowed) without
it having a performance impact. It is declaration or redeclaration within a branch
that can blow out (in the lazy scenario above, at least). So the issue of having
default namespaces or not is not one of performance (except for the cost
of tokenizing prefixes).
(I still think that default namespaces suck. IIRC the rationale for
the default namespace was so that mooted XHTML documents could be
in a namespace but look like HTML, and be edited by HTML editors. I don't
think that is remotely a requirement, on the grounds both that we don't
need it and that it has not taken off.)
Regards
Rick
Background thoughts on modes: probably wrong post
A very conventional way to generalize computer language parsers used to be
as two independent co-routines: 1. lexical analysis goes from text to tokens;
2. the parser proper goes from tokens to the parse tree.
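Python generators give the flavour of that arrangement (the toy
parenthesized-word language here is invented for illustration):

    import re

    def lex(text):                     # co-routine 1: text -> tokens
        for tok in re.findall(r"[()]|\w+", text):
            yield tok

    def parse(tokens):                 # co-routine 2: tokens -> tree
        def node():
            children = []
            for tok in tokens:
                if tok == "(":
                    children.append(node())
                elif tok == ")":
                    return children
                else:
                    children.append(tok)
            return children
        return node()

    print(parse(lex("(a (b c) d)")))   # [['a', ['b', 'c'], 'd']]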
I think it should be a technical goal to allow lexing (finding tokens and delimiters)
from any milestone point in parallel, and as much parsing as possible.
(Please forgive if I flip between automata and grammar terminology below,
or make up my own sometimes: it is not an academic paper.)
(We can say a language is modal if the grammar co-routine does anything
to set the state of the lexer co-routine. SGML is modal at this level.)
We might distinguish different kinds of lexers from the kind of grammar/automata
they can have:
1. Single-state
There is only one state, and no allowed symbol changes it.
e.g. document = ( A | B | ... all other allowed characters )*
2. Unique-entry-point state machine (what is the actual name, I wonder?)
This is a state machine where for any input symbol, all transitions go to
a single state. This is again one where you can start at any point in the input
and know where you are. (We can call this non-modal at the lexical level.)
3. (Deterministic) State machine where at least one input symbol does not always
transition to the same state.
With this kind of state machine, you cannot know which state you are in to start
parsing if you happen to start from a non-unique-entry symbol. You have
to progress forward until you do find a unique-entry symbol. Let's call it
a milestone. (Kinds 2 and 3 are sketched in code after this list.)
4. Unique-entry-point non-recursive stack machine
This is a stack machine where for any input symbol there is only one transition
destination (and push/pop), and the transitions form a directed graph, so that
the states allowed at each stack level are unique (there are no state transitions
between the states of different levels, only push/pop transitions).
This means the possible stack paths are finite. For any point in the input, you can
establish both the state and the necessary call stack: you know where you are.
(We can call this non-modal at the lexical level.)
5. (Deterministic) Stack machine with at least one transition that does not meet
the requirements of 4, e.g. because multiple possible destination states for a
symbol may exhibit recursion (direct or indirect).
With this kind of stack machine, starting from an arbitrary point in the input stream,
you can only know where you are if you start with a symbol that is only reachable
from the start state, i.e., in some non-modal part of the machine.
Otherwise, you don't know the stack depth or which transition was taken.
So you have to scan forward until you find a unique-entry-point character which,
if non-recursive, allows you to recreate the state. (Kinds 4 and 5 are contrasted
in the second sketch after this list.)
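To make kinds 2 and 3 concrete, a minimal Python sketch. The toy alphabet and
language rules are invented for illustration:

    # Kind 2: every symbol has one destination regardless of the current
    # state, so the state is just a function of the last symbol read.
    DEST = {"a": "A", "b": "B", "c": "C"}

    def state_at(text, offset):
        return DEST[text[offset - 1]]  # start anywhere: look at one symbol

    # Kind 3: '"' toggles between CODE and STRING (its destination depends
    # on the current state), but if the language forbids newlines inside
    # strings, '\n' always lands in CODE: a unique-entry symbol, a milestone.
    def sync(text, offset):
        i = text.find("\n", offset)    # scan forward to the milestone
        if i == -1:
            return None, None          # no milestone before end of input
        return "CODE", i + 1           # provably in CODE after a newline

    print(state_at("abcab", 3))        # 'C' (no scanning needed)
    print(sync('x = "a<b"\ny = 1', 6)) # ('CODE', 10)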
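And a sketch of why kind 4 lets you reconstruct the stack where kind 5 does not
(the bracket-to-level assignment is invented for illustration):

    # Kind 4: each stack level has its own delimiters, so any single
    # symbol pins down both the state and the entire stack below it.
    LEVEL = {"{": 1, "}": 0, "[": 2, "]": 1, "(": 3, ")": 2}
    STACK_AT = {0: [], 1: ["{"], 2: ["{", "["], 3: ["{", "[", "("]}

    def where_am_i(symbol):
        level = LEVEL[symbol]          # one destination per symbol
        return level, STACK_AT[level]  # the stack is implied by the level

    print(where_am_i("["))             # (2, ['{', '['])

    # Kind 5, by contrast: if "(" may nest inside "(" (recursion), then
    # seeing "(" tells you neither the depth nor the stack, and you must
    # scan for a symbol reachable only from the start state.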
Now let's imagine versions of these five which allow parsing backwards as well as
forwards, so that from any point in the input we can parse backwards from our milestone.
(I *think* it is not necessary that the type of machine that lets you scan forwards
is the same as the one that lets you scan backwards.) Provided we can scan back
far enough, we can recover the context that the milestone alone does not supply.
But some issues repeat at the grammar level. So if we want to be able to process from any point in the input,
we have to find both lexer and parser milestones, and we need to know that the lexer state could not have been
influenced by signals from the parser co-routine.
Bottom line: the more that the lexer and grammar can be non-modal, the more of the work can be
performed in parallel, and the less work needs to be done completing the parse-tree fragments
or token lists and stitching them together.
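As a closing sketch of that payoff, in Python (threads standing in for real
parallelism here; the token syntax and the '\n' milestone are invented, as above):
split the input at milestones, lex each chunk independently, and stitch the token
lists together.

    import re
    from concurrent.futures import ThreadPoolExecutor

    def lex_chunk(chunk):
        # Tokens: string literals, words, single punctuation characters.
        return re.findall(r'"[^"]*"|\w+|[^\s\w]', chunk)

    def parallel_lex(text):
        chunks = text.split("\n")              # every chunk begins at a
        with ThreadPoolExecutor() as pool:     # milestone, so each can be
            token_lists = pool.map(lex_chunk, chunks)  # lexed independently
        return [tok for toks in token_lists for tok in toks]

    print(parallel_lex('x = "a<b"\ny = 1'))
    # ['x', '=', '"a<b"', 'y', '=', '1']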