OASIS Mailing List Archives
Re: [xml-dev] Re: The Goals of XML at 25, and the one thing that XML now needs

(Follow on)

On Thu, Jul 22, 2021 at 7:42 AM Liam R. E. Quin <liam@fromoldbooks.org> wrote:
On Wed, 2021-07-21 at 14:29 +1000, Rick Jelliffe wrote:

> *NON-GOALS*
>
> 1. The language *MUST NOT* be lexically identical to or a subset of
> XML.
>
So, deliberately incompatible. Or do you mean, the process of
developing the language must not be constrained to be, forced to be,
lexically identical to.. etc etc?

Neither. Deliberately not a subset or identical.   But no gratuitous differences. 

> 2. The language *MUST NOT* have an identical or subset infoset to the
> XML Infoset.
Strictly speaking, the XML Information Set is a vocabulary of terms. In
particular, it is emphatically not a data model.

Good point. But no change is needed. Because the new language MUST do something
different, it cannot have an identical or subset vocabulary: it must have something else.
 
>
> 3. The language *MUST NOT* be characterizable by WebSGML

I doubt  many people care about WebSGML today.

Indeed. But 100% of the people who do are probably on this list. Anyway, the point is
exclusion again: if it is limited to what SGML can do (without the stable door of SEEALSO)
then there is no point.  
 
> 4. The language *MUST NOT* be, for every possible document,
> completely
> interconvertible with JSON.

As John Cowan pointed out, this is a nonsense.

I don't see it.  What standard method do you have of converting a JSON document with
type information to XML, using only mechanisms in XML and no schema?  You end
up with a bag of names, and nothing in the XML rules lets you infer a relationship between some
value and a JSON storage type.
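To make the "bag of names" point concrete, here is a minimal Python sketch (my own illustration, not any standard mapping): two JSON documents that differ only in storage type collapse into identical schema-less XML, and nothing in the XML itself lets a receiver recover the distinction.

```python
import json
import xml.etree.ElementTree as ET

# Two JSON documents that differ only in storage type: number vs string.
doc_number = json.loads('{"price": 1}')
doc_string = json.loads('{"price": "1"}')

def to_xml(obj):
    """Naive, schema-less JSON-to-XML mapping: every value becomes text."""
    root = ET.Element("root")
    for name, value in obj.items():
        child = ET.SubElement(root, name)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Both serialize to the same XML: the number/string distinction is gone,
# and without an external schema there is no way to round-trip it back.
print(to_xml(doc_number))  # <root><price>1</price></root>
print(to_xml(doc_string))  # <root><price>1</price></root>
```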

Please note "inter-convertible with", not merely "convertible to", JSON.  That you can add an extra
processing layer is irrelevant, since we are talking about language features, not subsequent processing;
or I am, at least.

Are we perhaps meaning something different by "every" here?  I mean no more than the other requirements do.
If there is some JSON document that cannot be directly represented in this language, that is no problem, and
quite likely.  As for the vice versa: without knowing what the features of the language are, there is
no need to assert that every document could be round-tripped into JSON and back; even though that is
certainly likely, it is not a goal.
 
>
> 5. The language *MUST NOT* support all declarative possibilities of
> XML
> Namespaces. 

So it must be a subset of a spec that doesn't do very much...

Made me laugh.  Every non-goal does nothing in the final result, in a sense... :-)      

We already have XML Namespaces: what is the point in merely replicating its virtues and flaws? 
 
> It *MUST* be possible to know that a name has a namespace from
> its lexical form. 

So, no default namespaces. This removes support for some of the use cases
we had, of course.

Yes indeed. The road to hell is paved with good intentions. 

One of the reasons people like Schematron is that it uses that regime.

I don't think it is inconceivable that, apart from human writers/readers, there may be
some processing and developer benefit if, when we see a name, we immediately know
whether we have to look up a namespace; and if no redeclaration is allowed, then when
we compare two names we can do it merely using the prefixes, not the URLs.  In an XML parser, I expect
that there would be code and data arranged for efficiency, but that is otiose to the requirement of knowing whether
two names are the same and binding them to some other process that is free to use its own prefixes.
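A small Python sketch of the comparison cost (hypothetical code, for illustration only): under XML Namespaces, comparing two names means resolving each prefix against its in-scope bindings; under a fixed-binding regime with no redeclaration, the comparison is a plain string test on the qualified name.

```python
# XML-style: each scope may bind a different prefix to the same URI.
scope_a = {"s": "http://purl.oclc.org/dsdl/schematron"}
scope_b = {"sch": "http://purl.oclc.org/dsdl/schematron"}

def same_name_xml(n1, scope1, n2, scope2):
    """Compare two prefixed names the XML Namespaces way: via URI lookup."""
    p1, _, local1 = n1.partition(":")
    p2, _, local2 = n2.partition(":")
    return scope1[p1] == scope2[p2] and local1 == local2

def same_name_fixed_prefix(n1, n2):
    """One global binding per prefix, no redeclaration: no lookup needed."""
    return n1 == n2

print(same_name_xml("s:rule", scope_a, "sch:rule", scope_b))  # True
print(same_name_fixed_prefix("s:rule", "sch:rule"))           # False
```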
 
>  It *MUST* be possible to determine a namespace URL by
> scanning back far enough in the document to find the lexically most
> recent
> xmlns:XX declaration for that value.

This is not the case in XML today, since attributes using a prefix can
appear lexically before the declaration.

Yes.
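Liam's point can be checked with Python's ElementTree: the attribute's prefix is bound by a declaration that appears later in the same start-tag, yet the document is well-formed, because namespace scope covers the whole element rather than "everything after the declaration".

```python
import xml.etree.ElementTree as ET

# The attribute a:b appears lexically BEFORE the xmlns:a declaration
# that binds its prefix, and this is still well-formed namespaced XML.
doc = '<x a:b="c" xmlns:a="urn:example"/>'
root = ET.fromstring(doc)
print(root.attrib)  # {'{urn:example}b': 'c'}
```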
 
>
> 6. Language design choices *MUST NOT* be made which compromise the
> potential efficiency of parsing, 

So, developers are more important than users

Developers are humans too, and just as worthy of a standards-maker's consideration
as Joe Public, surely?  Isn't that the basis of all RFC-based technologies?  And (oops, this
sentence has probably regressed into trolling) we already have XML 1.n --as it has turned out, with XSD etc--
capably fulfilling the niche of something that is way too difficult for non-corporate developers
to implement, yes?

More seriously, isn't that a false opposition?  Why isn't it possible to have both: a language that users
and developers will find convenient enough, though not as user-friendly in some cases as XML, or as
developer-friendly in probably all cases as JSON, yet one that can commend itself by also supporting
something they don't?
 
Pfooey.
 
Duck!

>
> 0. The language is a markup language. It should support mixed
> content.  It
> should support humans.

Doesn't this contradict must-not goal 6?

No.  6 tempers this.

I thought there should be some very vague scoping statement, to say it is not an EXI
or JSON substitute, but in a similar family to XML and HTML.   (On the rationale that
making it easy to tart up existing (XML) parsers is a proven method of bootstrapping.)
But this is just my opinion.
 
>
> 1. The language should support non-modal parsing: at every point in a
> document, the parsing mode can be re-established by scanning forward
> without knowledge of prior context until a milestone is found.

The second sentence does not expound upon the first. The use of tags
implies  a modal parser - in-tag or outside-tag.

If it reads clearer to have a ";" rather than the ":", please consider that.

But I think the second sentence does expound on the first: it gives an intended
consequence of the non-modal parsing, though it does not define it.
Let's consider a non-modal parser as one which does not need to know
the current state in order to parse.

I think it is kinda the difference between these:
   modal = B* ( "<" B* ">" B* )*
   non-modal = ( B | "<" | ">" )*
Say we generate a string using "modal". Not every substring of it will also match "modal".
But every substring of that same generated string should match "non-modal" (if I have it right.)
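A quick Python check of that property (my own sketch, reading the non-modal grammar as accepting any interleaving of B, "<" and ">", with B taken to be a lowercase letter):

```python
import re

# The two grammars above, rendered as regular expressions.
modal = re.compile(r"[a-z]*(?:<[a-z]*>[a-z]*)*")
non_modal = re.compile(r"[a-z<>]*")

s = "ab<cd>ef"  # a string generated by the "modal" grammar
assert modal.fullmatch(s)

# A substring that cuts into a tag no longer matches "modal"...
assert not modal.fullmatch("b<c")

# ...but every substring still matches "non-modal", so a parser dropped
# at an arbitrary offset can proceed without knowing the prior state.
for i in range(len(s)):
    for j in range(i, len(s) + 1):
        assert non_modal.fullmatch(s[i:j])
print("every substring matches non-modal")
```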

More background might make my comment less confusing.  It is often possible to unwrap a grammar
that we might think should use a simple left-to-right state or stack machine as, instead, a series of simpler passes.
Indeed, where some productions in a grammar are to be interpreted as longest-match-first (greedy) but others as
shortest-match-first, it may be the most straightforward method. A good example of this is tokenizing: we may
find the end of the token using one rule (e.g. whitespace), which then simplifies our parsing/lexing inside that
token.

I'll post a little example grammar separately to be more concrete, and seek out better terminology, perhaps.
 
>    In other words, [ "<" and ">"] must only ever be delimiters or
> part of delimiter strings.

It's true that unquoted > makes some parsing techniques difficult -
when I added XML support to lq-text I used backwards parsing, and > in
text content confuses it. The answer was to ignore > and look only for
<, though.

Yes, only "<" is strictly necessary. But the more that a parallel process has to look outside the block it is allocated to
parse (or the more that it has to assign initial strings as unknown, to be reconciled by a stitch process), the less useful
it is to have had parallel parsing in the first place. So the sooner we know where data content begins (and, say, whether our
beginning state is as part of an attribute), the better.
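A minimal sketch of milestone-based resynchronization (illustrative only; a real parallel XML parser must also reconcile each skipped prefix in a stitch pass): each worker is handed an arbitrary chunk boundary and scans forward to the next "<", the first position at which its parse state is known.

```python
doc = "aa<b>cc</b>dd<e>ff</e>gg"

def resync(text, start):
    """Return the first milestone ('<') at or after start, or len(text)."""
    pos = text.find("<", start)
    return len(text) if pos == -1 else pos

# Arbitrary split points; each worker resynchronizes independently,
# with no knowledge of the parse state at its starting offset.
chunk_starts = [0, 7, 14]
workers = [resync(doc, s) for s in chunk_starts]
print(workers)  # [2, 7, 18]
```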
 
>
> 2. The language should support straightforward right-to-left parsing
> with
> the same ultimate result as left-to-right parsing.

oops see above.
>
> 3.  The language should support arbitrary streams of elements, 

the Jabber folks would have loved this.

>
> 4.  The language must support some significant extra features to XML,

This, i think, is the crux of the matter - "we must add a killer
feature so that people want our system, even in a world in which data
transfer formats are not considered exciting."

If you are saying that there is nothing new under the sun, I cannot agree.

If you are saying that you don't expect any new markup language to have nearly
the same hype curve as XML and JSON, then I completely agree. 
 
>  It should attempt to do this by assigning meaning to existing
> lexical characteristics: these alternatives include the empty-end
> tag
> versus a matched pair, or attribute values with no delimiter, or
> double
> quotes or apostrophes.

Simon's xmlents did this two decades ago. Since the XML stack has
irregular escaping, you end up with problems when e.g. you want to have
a double quote inside a string in an XPath expression in an attribute.

Yes, XML compatibility is a rock that many a good idea has foundered on.

But 20 years ago, it made sense to make sure XML did not fragment;
and a big selling point for it was that developers would be more productive
if they didn't have to invent new syntaxes for essentially the same thing.

But now XML is well established, and JSON has
relieved XML of the need to do that kind of datatyping.
But haven't JSON and time also shown that though "terseness is of minimal importance"
was a fine rule-of-thumb for figleafing the amputation of SGML's extra
limbs and carbuncles, it is not actually a good principle in itself?
 
I think if I had to redo XML without backward compatibility constraints
I'd want to have a reliable escaping mechanism (even though, like XML
text entity references, you end up with yet another parsing mode).
  
Yes, modes are great! Scala is good for that, and now Java's """ text blocks,
and I guess the poster boy for modes is RTF, where you could even have
chunks using different character encodings inside the same file.   

But I suggest that modes preclude many and perhaps most parallel implementation methods,
which has helped XML adroitly avoid most of the improvements to CPUs in the 
last 25 years. And a technology in that situation is not in a healthy place. 

Cheers
Rick




