Hi Rick,
reading through your spec just now: in pursuit of compact XML, I'd
suggest maybe changing the end-element requirements. As I understand
the spec, end-element tags (other than tags with blank
element names, a concept I haven't yet come to appreciate, it seems :)
must contain either ns:name (as in XML) or just the name (even if
the matching start-element has a ns), like this
<foo:bar>...</foo:bar>
or
<foo:bar>...</bar>.
And the end-tag can contain any text after it, treated as a comment.
</bar whatever this is supposed to be>
In RAN, tags can start lexically with any of the tokens: a string in "",
a name, a number, a date, a url, true or false. (If any of these are to
be disallowed, that takes place at the parser level, not the token level.)
So this allows an element with an empty string as its name e.g.
<""> Hello <x>W</x>orld </"" >
And this in turn allows some lexical sugar
<[ Hello <x>W</x>orld ]>
So what use is an element with an empty string for a name? An array, for example.
Now I'm not entirely sure why end-element tags as they are even made
it into XML (maybe for improved diagnostics/feedback in editors?),
For two reasons, at least:
First, because it makes it explicit what is actually ending at some point:
remembering that in marked-up documents, the end-tag may be tens of
thousands of tags away from its start-tag, for large documents. This makes
it easier for humans.
Second, because it scopes where an error is detected. There is always
an issue that missing-end-tag errors are reported where they are detected,
not where they occur. Having an explicit name means that the place
where the missing end-tag could have been is confined to the parent of the start-tag,
not beyond it (as would be the case with </>).
So I think explicit end-tags do serve a good purpose.
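To make the second point concrete, here is a rough sketch in Python (not anything from the RAN spec). The function name unclosed and the simplified tag pattern are just my assumptions for illustration: it only knows <name>, </name> and </>, and ignores attributes, comments and the other RAN token types.

```python
import re

# Simplified tag pattern (an assumption for this sketch): <name>, </name> or </>.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)?>")

def unclosed(text):
    """Return (name, opened_at, detected_at) for each end-tag problem found."""
    stack = []   # (name, offset) of the elements currently open
    errors = []
    for m in TAG.finditer(text):
        closing, name = m.group(1), m.group(2)
        if not closing:
            if name is not None:          # a bare "<>" is ignored in this sketch
                stack.append((name, m.start()))
        elif name is None:
            # </> closes whatever is innermost, so a mismatch cannot be seen here
            if stack:
                stack.pop()
            else:
                errors.append((None, None, m.start()))
        elif any(n == name for n, _ in stack):
            # Named end-tag: everything opened inside it was left unclosed
            while stack[-1][0] != name:
                open_name, at = stack.pop()
                errors.append((open_name, at, m.start()))
            stack.pop()
        else:
            errors.append((name, None, m.start()))   # stray end-tag
    # Whatever is still open is only noticed at the end of the document
    for open_name, at in reversed(stack):
        errors.append((open_name, at, len(text)))
    return errors
```

With named end-tags, <doc><sec><p>one<p>two</p></sec><sec><p>three</p></sec></doc> reports the first p as unclosed at the first </sec>, i.e. inside its parent. With every end-tag written as </>, the same mistake is only noticed at the end of the document, and it is doc that gets the blame.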
but
XML has for a long time been criticized for lacking simple empty
end-tags as supported in SGML:
<bar>...</>
where </> terminates the right-most open element.
Yes, I guess my method of allowing everything that is simple and does not compromise
lexing leans in favour of </>, doesn't it? Hmmm. On the other hand, there is no point in
being able to start from any point and be lexing as soon as you find a < or >, yet not be
able to parse from there: if we find a </> we don't know what it is, and we have to
go back to find it (back from where we are, or forwards from the start of the last milestone,
i.e. the last fragment start, or potentially the start of the document).
So I am really not sure about it.
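A rough Python sketch of what that lookup costs. The names resolve_empty_end_tag and fragment_start are made up for illustration, and the tag pattern is the same simplification as before (no attributes, and start-tags are assumed to have names). It collects the tags between the last fragment start and the </>, then walks them in reverse, skipping elements that are already closed.

```python
import re

# Simplified tag pattern (an assumption for this sketch): <name>, </name> or </>.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)?>")

def resolve_empty_end_tag(text, pos, fragment_start=0):
    """Name of the element that the </> at offset pos closes, or None.

    None means the start-tag is not inside the scanned span, so the
    caller would have to go back to an earlier fragment.
    """
    depth = 0   # end-tags seen (walking backwards) whose start-tag is still to come
    tags = list(TAG.finditer(text, fragment_start, pos))
    for m in reversed(tags):
        closing, name = m.group(1), m.group(2)
        if closing:
            depth += 1       # an element that was closed before pos
        elif depth:
            depth -= 1       # the start-tag of one of those closed elements
        else:
            return name      # first start-tag that is still open
    return None
```

The point is that the work is proportional to the distance back to the start-tag (or to the fragment start), whereas a named end-tag needs no lookup at all.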
There is also a compromise: it can only be used for leaf elements, i.e. the lexer could fill
in the name with only a stack of 1, i.e. a variable, or only have to scan back to the last
< to find it.
so <p>hello <b><i>W</>orld</b> !!! </p>
I am not sure that it would be good enough: it might be frustrating that, if you want to add a
subelement of i, you have to close that off before adding the new one.
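For what it is worth, the leaf-only compromise really does need only one variable. A minimal Python sketch, assuming the same simplified tag pattern (no attributes); expand_leaf_end_tags is a made-up name, and the sketch does not check that named end-tags match their start-tags.

```python
import re

# Simplified tag pattern (an assumption for this sketch): <name>, </name> or </>.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)?>")

def expand_leaf_end_tags(text):
    """Rewrite each </> as a named end-tag, allowing </> on leaf elements only."""
    last_start = None   # the "stack of 1": name of the most recent start-tag

    def fill(m):
        nonlocal last_start
        closing, name = m.group(1), m.group(2)
        if not closing:
            last_start = name        # remember the latest start-tag
            return m.group(0)
        if name is None:
            if last_start is None:
                # Some end-tag came in between, so this element is not a leaf
                raise ValueError(f"</> at offset {m.start()} does not close a leaf")
            name = last_start
        last_start = None            # any end-tag clears the variable
        return f"</{name}>"

    return TAG.sub(fill, text)
```

Given <p>hello <b><i>W</>orld</b> !!! </p> this fills in </i>. Given <p><b>x</></> it rejects the second </>, because by then b has already been closed and p is no longer a leaf.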
Lack of empty end-tags is particularly verbose as XML doesn't support
SGML CONCUR, i.e. where tagging can overlap:
<(foo)x>...<(bar)y>...</(foo)x>...</(bar)y>
or even
<(foo|baz)x>...<(bar|baz)y>...</(foo|baz)x>...</(bar|baz)y>
provided foo, bar, baz are the names of declaration sets in the document prolog.
IIRC, CONCUR was only implemented two or three times in general SGML parsers, and
there was some incompatibility arising from disputed interpretation of the standards. (Do I recall
correctly that OmniMark did it one way, and ARC did it another, and James Clark thought it was
a coin toss (or that IS8879 needed to be clarified): maybe to do with "whitespace attributable to markup"
differing...? I never used CONCUR, but I get the impression it didn't quite do what it was supposed
to, and that milestones using PIs ended up being more flexible?)
Regards
Rick