Re: [xml-dev] The illusion of simplicity and low cost in data designand

On Mon, 15 Aug 2022, 11:06 pm C. M. Sperberg-McQueen, <cmsmcq@blackmesatech.com> wrote:

I'm not sure what that "Even" is doing in that sentence -- are you
seeking to suggest that failure to understand and use NOTATION
is more surprising in the XSD working group than in any other group

"Even" was not meant to indicate a failure; it was a botched version of "Even outside XML, in XML Schemas..."

As you will doubtless remember (since you were a member of the working
group), there were a lot of people in the XML Schema WG with what we
might call a limited appreciation of anything present in DTDs. And it
was already clear that very few people understood how to use NOTATION
declarations usefully (whether in XML or in SGML).

XML and SGML did not go far enough: replicating them is as useless as subsetting them.

My point (the same one I presented in mid-1999 as "Notation Schemas") is that it would be better to be able to parse non-XML text (whether in attribute values, element content, internal or external entities, links, or any other location) as XML and have the schema mechanism extend to this.

So my suggestion was some extension mechanism where there could be a few different kinds of transducers: I think InvisibleXML fits the bill for what I was hoping for (for BNF-ish grammars); it would go alongside some version of regex that produced an XML element with the captured groups as subelements.
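
To make the regex kind of transducer concrete, here is a minimal sketch in Python. It is only an illustration, not anything from the 1999 proposal or from InvisibleXML: the function name `regex_to_element`, the `DATE` pattern and the element names are all invented for the example. Each named capture group becomes a child element of the result.

```python
import re
import xml.etree.ElementTree as ET


def regex_to_element(pattern, text, tag):
    """Hypothetical transducer: match `text` against `pattern` and return
    an element named `tag` with one child per named capture group.
    Returns None when the text does not match the notation."""
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    element = ET.Element(tag)
    for name, value in match.groupdict().items():
        if value is not None:  # skip optional groups that did not take part
            child = ET.SubElement(element, name)
            child.text = value
    return element


# Assumed notation for the example: a YYYY-MM-DD date in an attribute value.
DATE = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

date_element = regex_to_element(DATE, "1999-06-15", "date")
date_xml = ET.tostring(date_element, encoding="unicode")
print(date_xml)
```

Once the text has been turned into a small tree like this, the ordinary XML machinery (paths, schema validation) could in principle be applied to it, which is the point of the suggestion.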

As I have long said, I think it is natural to humans that a change in category of information is accompanied by a change in notation. Hence URLs, XPaths, embedded SQL, CSS, Northerly/Easterly/Elevation number triplets for geographic data, maths notation, and so on.

Schema languages that do not support co-occurrence constraints (DTD, RELAX NG, XSD 1.0) are pretty crippled; ones that support co-occurrence constraints only within the document (XSD 1.1) are less so; ones that support co-occurrence constraints between information in multiple documents (Schematron) are even better (allowing, e.g., UBL Code Lists). But IIRC there is still no standard language for validating co-occurrence constraints that use information in embedded notations in attributes, elements, etc. in the same document.

It is a big gap.

(Now, you can pull in an external XSLT or Java function or Web call in Schematron and use that to parse the text into a variable. And InvisibleXML may make this more trivial for the developer. But the notation used is still not given inline: it is considered a schema thing, which goes against the principle of self-advertisement.)

> In any case, even if embedded NOTATIONS were supported, you need to look
> in a schema or markup declarations so it is not inline. (By embedded I
> mean the content is in an element, by inline I mean that the
> markup to say what the notation is, is at the same location.)

If you have no objection to placing information of this kind inline,
instead of factoring it out to avoid duplication, then why not just
place the schema inline? There is nothing in the XSD spec that prevents
inline schemas: the term 'schema document' is used because most people
will want the schema out of line, not inline, but its extension is
effectively any xsd:schema element. Spreading it out so that the
declaration of an element is repeated for every instance of that element
type may, I admit, be a bit trickier.

Hmmm. JSON etc. seem to be doing fine without resorting to schemas (inline or external) to say that some token is a number, a boolean, or that there is an array.
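
A small Python sketch of that difference, using only the standard library. The two sample documents are made up for the illustration: the JSON parser recovers numbers, booleans and arrays from the delimiters alone, while the XML parser, with no schema to consult, hands back strings.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample data carrying the same information in both syntaxes.
json_doc = json.loads('{"count": 3, "active": true, "tags": ["a", "b"]}')
xml_doc = ET.fromstring(
    '<item count="3" active="true"><tag>a</tag><tag>b</tag></item>'
)

# Record the Python type each parser produced for each field.
json_types = {key: type(value).__name__ for key, value in json_doc.items()}
xml_types = {key: type(value).__name__ for key, value in xml_doc.attrib.items()}

print(json_types)
print(xml_types)
```

The typing on the XML side has to come from somewhere else, whether a schema or application code.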

So I see two flaws in XML:

First, that its impoverished use of delimiters (e.g. to express even minimal data types) forces the use of schemas (of any type) and so makes XML quite heavyweight (unless your data is only plain text and tokens). ...Hence the rise of JSON.

Second, that XML is impoverished because it does not have a standard mechanism (and neither do schema languages) for specifying embedded notations, nor, once they are specified, for doing anything useful with them: for example, giving some transducer name and information (e.g. a grammar) that would allow the validator to parse the information, report lexical or parsing errors, produce XML, validate that XML, and access information in it for co-occurrence constraints between the main document and the embedded information.

(This is something I often see in Schematron schemas: attempts to hack a parse of some attribute etc using XPath and variables. For example: is someone's birth year greater than their death year?)
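
As a hedged illustration of what such a check involves when done outside the schema language, here is a Python sketch. Everything in it is assumed for the example: the `person` element, the `born` and `died` attributes, and the YYYY-MM-DD notation. It reports lexical errors in the embedded notation separately from the co-occurrence error between the two parsed values.

```python
import re
import xml.etree.ElementTree as ET

# Assumed embedded notation: YYYY-MM-DD dates inside attribute values.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_date(text):
    """Parse the embedded notation into a (year, month, day) tuple."""
    match = DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a YYYY-MM-DD date: {text!r}")
    return tuple(int(group) for group in match.groups())


def check_lifespan(person):
    """Return a list of error messages for one hypothetical <person>."""
    errors = []
    parsed = {}
    for attr in ("born", "died"):
        try:
            parsed[attr] = parse_date(person.get(attr, ""))
        except ValueError as exc:
            errors.append(f"@{attr}: {exc}")  # lexical error in the notation
    # Co-occurrence constraint, checked only if both values parsed.
    if len(parsed) == 2 and parsed["born"] > parsed["died"]:
        errors.append("@born is later than @died")
    return errors


ok = ET.fromstring('<person born="1815-12-10" died="1852-11-27"/>')
bad = ET.fromstring('<person born="1952-01-01" died="1915-01-01"/>')
garbled = ET.fromstring('<person born="c. 1815" died="1852-11-27"/>')

for person in (ok, bad, garbled):
    print(check_lifespan(person))
```

None of this is visible to the validator, and nothing in the document itself says which notation the attribute values use.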

The success of XML can be judged by how well it allows representation, validation and use of the kind of documents it is used for. I think in the case of embedded notations, the answer is: not at all. Documents have numerous embedded notations, but there has been, to some extent, a blind spot about this: circular thinking that if someone wants structure, they should use tags.

I see this issue as underlying the occasional calls for "structured attributes", where proponents say XML 2.0 should provide a standard tree markup inside start tags. I think that is circular thinking because, as I mentioned above, I think that humans find embedded notations congenial: just as XML tags are a great notation for expressing semi-structured data with simple attributes, embedded notations are great for structured information put in an attribute value.

Cheers