Re: [xml-dev] Quiz: is this XML well-formed?

From: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
To: Marcus Reichardt <u123724@gmail.com>
Date: Thu, 4 Feb 2021 14:55:58 -0700
Many thanks to Marcus Reichardt for attempting to ground this
discussion a little more firmly in actual facts.  There are some
oddities in XML which are there for no better reason than that SGML
had them and we needed to retain them for compatibility.  But I think
this is not one of them.  First of all, as MR has pointed out, the
rule in question did not come from SGML at all.  And second, I don't
think it's an oddity.

But the construct may nevertheless best be viewed in the light of
SGML.

The passage quoted is one example of the meaning of the grammar in ISO
8879 being modified by prose.  Anyone who transcribes the grammar from
the standard and tries to build a parser for it faces a long series of
frustrating problems.  First, oddities in the grammar must be
interpreted in order to build the parser and get it to pass
rudimentary type checking.  Then example after example shows that the
parser must be modified, because the language one wants it to accept
(SGML, or some subset thereof) is not fully defined by the grammar,
but only by the grammar together with the entire body of rather dry,
legalistic prose that makes up the standard and in some cases (as
here) bluntly contradicts the grammar.

I do not remember any specific discussions on this topic (which does
not mean we didn’t have them), but I do remember that at least one of
the editors of the XML spec wanted (a) to limit the number of cases in
which the meaning of the grammar was modified by normative prose, and
(b) to mark those places explicitly *in the grammar* so that people
turning to the grammar for guidance would be made aware that they
needed to look at the prose as well.

The named constraints on validity and well-formedness included in the
comments in the productions of the grammar in the XML spec serve this
purpose.  They can be thought of as informal signals that some
additional machinery is needed.  (E.g. an attribute, if one happens to
be working in an attribute-grammar system.)  They are thus by nature
signals of some complication or other in the spec.  Part of the goal
of the XML effort was to reduce complications, and thus part of the
goal was to reduce the number of places where well-formedness and
validity constraints had to be described in prose, instead of
following naturally from the grammar.

In ISO 8879 a prose note is necessary in order to say that the first
’s’ nonterminal in production rule 32 is not in fact always optional,
even though it is marked optional in the grammar.  The result is that
anyone who generates a parser from the grammar in the usual way is
required to go in and jigger things by hand to make the parser agree
with the spec.  That by itself would have made some members of the XML
working group view this particular grammatical construct with a
jaundiced eye.  But I think there are other reasons the XML WG did not
preserve it.

Since the condition set by 8879 for omitting the whitespace is always
met in XML, no prose note would have been necessary and the XML spec
could, I suppose, have written the rule for start-tag as

    STag ::= '<' Name (S (Attribute S?)*)? '>' [WFC: Unique Att Spec]

instead of as it is currently written:

    STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]

If that change would have made XML easier to parse, or easier to
handle with the kind of ad hoc tools we imagined the Desperate Perl
Hacker to possess, or easier to read for humans, then it might have
been worth considering, despite the small but measurable increase in
complexity in the expression on the right-hand side.

But I don’t think it makes start-tags easier to parse.  I don’t think
it makes them easier to process without a full XML parser.  And I
don’t think it makes start-tags easier to read for humans.
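
As a rough illustration (not part of the XML spec), the two
productions above can be approximated with regular expressions and
compared on a few start-tags.  This is only a sketch under stated
assumptions: NAME is restricted to ASCII, ATT_VALUE ignores entity and
character references, and the names CURRENT_STAG and ALT_STAG are
invented here for the example.

```python
import re

# Simplified building blocks (assumptions, not the spec's full definitions).
S = r"[ \t\r\n]+"                            # S ::= (#x20 | #x9 | #xD | #xA)+
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"        # ASCII-only approximation of Name
ATT_VALUE = r"""(?:"[^<&"]*"|'[^<&']*')"""   # quoted literal, no references
ATTRIBUTE = rf"{NAME}(?:{S})?=(?:{S})?{ATT_VALUE}"  # Name Eq AttValue

# STag ::= '<' Name (S Attribute)* S? '>'      (as currently written)
CURRENT_STAG = re.compile(rf"<{NAME}(?:{S}{ATTRIBUTE})*(?:{S})?>")

# STag ::= '<' Name (S (Attribute S?)*)? '>'   (the hypothetical alternative)
ALT_STAG = re.compile(rf"<{NAME}(?:{S}(?:{ATTRIBUTE}(?:{S})?)*)?>")

samples = [
    '<test>',
    '<test >',
    '<test att="x" otheratt="y">',
    '<test att="x"otheratt="y">',   # no space between the attributes
]

for tag in samples:
    current = CURRENT_STAG.fullmatch(tag) is not None
    alternative = ALT_STAG.fullmatch(tag) is not None
    print(f"{tag!r}: current={current} alternative={alternative}")
```

Under these simplifications the two patterns differ only on the last
sample, which the alternative accepts and the current rule rejects.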

So my response to Roger Costello, who started this thread, would be an
answer and a question:

    No, of course the example you give is not well-formed XML.
    
    Why on earth would anyone want it to be?
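
For anyone who wants to check this against a real parser, here is a
minimal sketch using Python's standard-library ElementTree (which is
backed by expat).  It assumes the example in question has the same
shape as the SGML test quoted below, with no space between the two
attributes; the helper name is_well_formed is invented for this
illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# With whitespace between the attributes: accepted.
print(is_well_formed('<test att="x" otheratt="y"></test>'))  # True

# Without whitespace between the attributes: rejected.
print(is_well_formed('<test att="x"otheratt="y"></test>'))   # False
```') is True
assert is_well_formed('<test att="x"otheratt="y"></test>') is False
assert is_well_formed('<test></test>') is True
</test>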


My two cents.  Use only as directed.  In case of dizziness or nausea,
cut power to your nearest network router.

Michael Sperberg-McQueen



> On 4,Feb2021, at 1:45 PM, Marcus Reichardt <u123724@gmail.com> wrote:
> 
> Actually, the space character is *not* required by SGML.
> 
> From ISO 8879:
> 
> "The leading [space] can only be omitted from an attribute specification
>  that follows a delimiter" (clause 7.9)
> 
> meaning that spaces are optional as long as the attribute value
> literal is enclosed in single or double quotes; which, in SGML (and
> HTML), they don't have to be as long as the attribute value
> 
> "contains nothing but name characters and either [it's declared] or
> [SHORTTAG YES is effective]." (clause 7.9.3.1)
> 
> Try it out yourself, using either osgmlnorm (OpenSP) or sgmlproc (sgmljs.net):
> 
>    $ cat test.sgm
>    <!DOCTYPE test [
>      <!ELEMENT test - - ANY>
>      <!ATTLIST test
>        att CDATA #IMPLIED
>        otheratt CDATA #IMPLIED>
>    ]>
>    <test att="x"otheratt="y"></test>
> 
>    $ osgmlnorm test.sgm
>    <TEST ATT="x" OTHERATT="y"></TEST>
> 
>    $ sgmlproc test.sgm
>    <!DOCTYPE test [
>      <!ELEMENT test - - ANY>
>      <!ATTLIST test
>        att CDATA #IMPLIED
> 	otheratt CDATA #IMPLIED>
>    ]>
>    <test att="x" otheratt="y"></test>
> 
> So, once again, XML heads can't blame it on SGML :)
> 
> Best,
> M. Reichardt
> sgmljs.net
> 
> On 2/4/21, Liam R. E. Quin <liam@fromoldbooks.org> wrote:
>> On Thu, 2021-02-04 at 18:18 +0000, Roger L Costello wrote:
>>> But, but, but, ... Why is space required between attributes? Surely a
>>> parser can recognize the start of the next attribute given the end-
>>> delimiter of the previous attribute's value, yes?
>> 
>> In SGML the rules for attributes were (are) much more complex, because
>> of minimization. At SoftQuad, minimization and related features
>> represented approximately 80% of our support costs for Author/Editor
>> (an SGML editor).
>> 
>> I won't say you couldn't set up an SGML declaration to permit those
>> spaces to be elided, but they were generally needed because in SGML the
>> quotation marks were optional if the attribute value was declared as a
>> list,
>> <!ATTLIST boy
>>  socks (grey|grubby|torn|lost) #REQUIRED
>>  weapon (conker|plushie|manipulative-cry|none) "none"
>> >
>> 
>> let you write <boy grey plushie>...
>> 
>> Note it's the attribute names that can be omitted, so that e.g. in
>> HTML, strictly speaking when you write <table border>you are not
>> setting an attribute called border, you are supplying the value of
>> border and the parser has to determine to which attribute it belongs
>> (in practice you are setting border="border"). Unfortunately,
>> implementers of HTML didn't have access to the actual SGML spec (i sent
>> a copy of the SGML Handbook to one of them!) because at ₤300 or more,
>> it was well over the budget of a graduate student, so they guessed, and
>> it looked like <table border> was the same as <table border="true">, so
>> that's more like how HTML works. But XML came out of SGML.
>> 
>> Now, as Ken pointed out, we wanted every valid XML document to be an
>> SGML document. After a huge fight, in the end, SGML was actually
>> modified to make this possible, and if we had not had the fight but had
>> been able to work together, we could have made a number of decisions
>> differently; wanting asymmetric comment tokens was another for example,
>> where i'd proposed <!--* .... *--> for comments. <!- ... -> would have
>> been possible too, if SGML could have been modified to allow it.
>> 
>> But above all else we wanted a language easy to process and parse - the
>> "desperate Perl hacker" should be able to change all occurrences of
>> 2021 in a part number without affecting dates. So we required the full
>> attribute="value" with no minimization, but didn't revisit the space
>> there.
>> 
>> It is not, i think, too onerous - a space is in any case needed after
>> the element name even though <boytornconker> might not be ambiguous in
>> a given DTD, but it can be difficult for a human to determine this.
>> 
>> Liam
>> 
>> 
>> --
>> Liam Quin, https://www.delightfulcomputing.com/
>> Available for XML/Document/Information Architecture/XSLT/
>> XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
>> Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
>> 
>> 
>> _______________________________________________________________________
>> 
>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>> to support XML implementation and development. To minimize
>> spam in the archives, you must subscribe before posting.
>> 
>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>> subscribe: xml-dev-subscribe@lists.xml.org
>> List archive: http://lists.xml.org/archives/xml-dev/
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>> 
>> 
> 

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Follow-Ups:
- Re: [xml-dev] Quiz: is this XML well-formed?
  - From: Marcus Reichardt <u123724@gmail.com>
References:
- Quiz: is this XML well-formed?
  - From: Roger L Costello <costello@mitre.org>
- Re: [xml-dev] Quiz: is this XML well-formed?
  - From: "Liam R. E. Quin" <liam@fromoldbooks.org>
- Re: [xml-dev] Quiz: is this XML well-formed?
  - From: Marcus Reichardt <u123724@gmail.com>