[Date Prev]
| [Thread Prev]
| [Thread Next]
| [Date Next]
--
[Date Index]
| [Thread Index]
Re: SGML's deterinism rules and SHORTREF (was: Re: [xml-dev] RE: Theremarkable similarities between XSLT and Flex/Lex)
- From: Marcus Reichardt <u123724@gmail.com>
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Date: Sat, 9 Jul 2022 16:23:00 +0200
So I hope you're not expecting me to close this case of all people ;)
but FWIW the SGML handbook itself ties the syntactical presentation of
content model expressions to its notion of ambiguity and tag inference in
Annex H "Theoretical basis for the SGML content model" (a heading
that would be disappointing if taken at face value). From p. 560:
> Minimization would be permitted for "b" in
> (a?, b)
> while it would not be permitted in
> ((a,b)|b)
> even though both models reduce to equivalent
> regular expressions.
What this supposedly means is that the start-element
tag for b in the 2nd variant must be specified (can't be
omitted) if no element a is preceding it in an instance,
which is what both Open SP and sgmljs.net SGML (the sgmlproc
command-line tool) are complaining about:
$ cat test.sgm
<!DOCTYPE doc [
<!ELEMENT doc - - ((a, b)|b)>
<!ELEMENT a - - (#PCDATA)>
<!ELEMENT b O - (#PCDATA)>
]>
<doc>
blah</b>
</doc>
$ osgmlnorm test.sgm
osgmlnorm:test.sgm:7:0:E: character data is not allowed here
osgmlnorm:test.sgm:7:7:E: end tag for element "B" which is not open
osgmlnorm:test.sgm:8:5:E: end tag for "DOC" which is not finished
...
$ sgmlproc test.sgm
"test.sgm": line 7: fatal: character data not accepted here
...
Note though that neither of the examples are cases of 1-ambiguous
content model expressions according to Brueggeman-Klein, which
as you of course know was adopted later as the canonical decision
procedure by James Clark. So this still leaves open the question of
"arbitraryness" of the ambigousness criterion, nor can any other
argument be made in the absence of a formal specification by ISO 8879
itself I'm afraid, including for SHORTREFs.
My reading of Annex H, and in particular the following paragraphs
on p. 559:
> Another problem lies with the construction of the DFA. One method
> is first to construct a nondeterministic finite automaton (NFA)
> directly from the regular expression, then to derive a deterministic
> equivalent. Allthough this and other algorithms for DFA construction
> are non-polynomial and hardly intractable for the human-readable
> models envisaged by SGML, they may be too resource-intensive [...]
> This International Standard avoids these these probelms by eliminating
> the need to construct a DFA. This is does by prohibiting models that
> are ambiguous or require "look-ahead" [...]
is that the intent was simply to provide a "natural" limited interpretation
of content model expressions.
Thanks for an interesting discussion!
Best,
Marcus Reichardt
sgml.io
On 7/6/22, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>
> Marcus Reichardt <u123724@gmail.com> writes:
>
>>> Am 26.06.2022 um 15:01 schrieb Roger L Costello <costello@mitre.org>:
>>>
>>> There was also a strange
>>> unnecessary constraint on these expressions called "ambiguity", which
>>> *everybody* who wrote SGML software needed to understand, and so the idea
>>> of
>>> applying formal language techniques to SGML was inevitable.
>
>> Hmm, your anonymous source should've expanded on that; without proof
>> or anything the claim of "unnecessary-ness" is void.
>
>> SGML has tag inference (can infer arbitrarily many start- and
>> end-element tags) and moreover can expand short reference delimiters
>> (arbitrary tokens not used a markup delimiters, including newlines and
>> tabs/spaces) by context-dependendant replacement text (start- and
>> end-elements typically). Determinism in content models helps a lot to
>> make this even work, ...
>
> That's an interesting account; I think I would not be the only reader of
> xml-dev who would be interested to see a more concrete argument that the
> content-model determinism (confusingly called "ambiguity") rules of SGML
> help at all with tag inference or with the management of SHORTREF
> mappings.
>
> Can you expound? I'm sure a lot of us would be glad to learn more.
>
> One reason I ask is that I have been curious for twenty or thirty years
> about the rationale for those rules, and none of the members of WG 8
> whom I have had the opportunity to ask about it has mentioned anything
> to do with markup inference or short references. Now, it's entirely
> possible for some members of a working group to be unaware of technical
> arguments important for other WG members, so perhaps I have merely been
> unlucky in asking only people who didn't know or care about this
> particular piece of technical background.
>
> Another reason is that the essence of the determinism rules is that each
> element in the input match against an identifiable identifier in the
> content model. But since SHORTREF mappings are associated with element
> types and not with content-model tokens, I don't see a way for it to
> matter, for SHORTREF recognition, which token in a content model is
> being matched, or whether there is more than one.
>
>> ... and as I hear, the same thing is being
>> re-introduced as "Invisible XML", giving you a facility that was
>> already there in 1986 ;)
>
> I think it's fair to say that invisible XML and the SHORTREF and DATATAG
> (and markup inference) constructs of ISO 8879 have a certain rough
> similarity. But "same thing" sounds to me like an exaggeration. It has
> been a long time since I read 8879, but I do not remember anything like
> an explanation of how to take an arbitrary context-free grammar in EBNF
> form and formulate an equivalent set of declarations in a DTD. I don't
> remember any discussion of the expressive power of SHORTREF, let alone a
> claim that it will handle arbitrary context-free grammars.
>
>> Of course, with XML-style fully tagged markup, more relaxed models
>> could be used, hence RelaxNG.
>
>> What is the argument here? That a Thompson construction for finite
>> automata (as opposed to Glushkov/Antimirov derivatives) is more
>> convenient for programmers since an off-the-shelve regexp lib can be
>> used?
>
> No, I don't think that is the argument Roger Costello was making or
> reporting, nor is it one that I have heard anyone make.
>
> One argument I have heard people make, and have made myself, is that the
> determinism rules seem ad hoc and unconnected to the basics of automata
> theory. It was not until well after ISO 8879 became an international
> standard that there was a coherent statement of what they meant in terms
> of automata theory. The fact that routine techniques from automata
> theory could not always be applied (unless they could -- figuring out
> which is which is not easy, working from the text of 8879) certainly did
> make it harder for people to build conforming SGML processors, and
> certainly did affect the number of SGML processors available.
>
> The determinism rules reduce the expressive power of content model
> notation vis-a-vis regular expressions, so that although they look like
> regular expressions content models are strictly less expressive than
> regular expressions.
>
> Any design involves tradeoffs, but for costs like these I would want to
> see some substantial advantages. I first started looking for them 25 or
> 30 years ago, and in that time I have not yet been persuaded that they
> have any compensating advantages at all except that they simplify life
> for programmers who can't figure out how to make backtracking work and
> cannot be bothered to learn how to determinize a finite state automaton.
>
> The claim that the determinism rules make SHORTREF work would probably
> count as a coherent advantage, if it can be substantiated, though it
> would not at this date be a very compelling one for those who gave up on
> SHORTREF some time in the early 90s.
>
> --
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> http://blackmesatech.com
>
[Date Prev]
| [Thread Prev]
| [Thread Next]
| [Date Next]
--
[Date Index]
| [Thread Index]