Re: [xml-dev] SGML default attributes.

Eliot,

Like you, I'm not really wedded to the notion of parser-mediated transclusion. On the other hand, I'm not really convinced we can 100% jettison it, either, or preach that the very concept is somehow evil. It's a hack, that's all. (Frankly, hacks are what get us through the day.)

What you've said is packed, as usual, with terrific insights. I guess I just have trouble with the rhetoric. It wouldn't bother me so much if I didn't think your words are (quite deservedly) influential. If I didn't already know you so well, I might gather that it is your opinion that either:

(1) entities have no identity

or

(2) entities may have identity but it doesn't make any difference,

because

...entities have no purpose other than content-level reuse in the context of parsing operations.

Assuming I'm right, then I'm going to guess that it's your opinion that the *only* reason why we have the "lt" (less-than) general entity is to bypass the parser's natural inclination to recognize a STAGO (start tag open character). As a purely practical matter, I must admit that I don't think I've *ever* used the "lt" general entity name for any other purpose. And, truly, that purpose is a hack, pure and simple! BUT: there's a vital principle here, and I don't want it to be trampled and lost.
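The hack in question can be seen in one line with any stock parser; here is a minimal Python sketch (stdlib ElementTree, not part of the original discussion) showing the parser expanding the reference back into the bare character:

```python
# The "lt" entity exists (in practice) so that "<" can appear in content
# without being read as STAGO; the parser expands it back to the character.
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>a &lt; b</p>")
print(p.text)  # a < b
```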

The principle at work here, at least for me, is that when I'm invoking an entity by name, I'm using a defined name to refer to an abstract thing, namely that abstraction which is shared by all "less-than" characters in all character sets, fonts, encodings, and whatnot. The fact that I'm invoking the notion of "the less-than character" in the context of parsed character data is irrelevant. I might instead use the entity name "lt" as the value of an ENTITY attribute, for example, or in any of the many ways that HyTime, for example, exploited the notion of entity identity.

In SGML, every aspect of the use of names to identify things is founded on the notions inherent in DTD syntax. And DTDs can be fully or partially shared among many documents that invoke those element-type names, attribute names, and entity names, so that they are all (presumably, *cough*) invoking the same things whenever they utter the same names. And the DTDs themselves can also have "universal" names by invoking the universes (somehow) identified in PUBLIC identifiers. So I would argue that you err in portraying entities and document types as different things. Instead, I think they are in fact best understood as different perspectives on one and the same organic whole, a single "grounding tree", if you like.

Entity identity is the invisible root of the grounding tree. In my view, the names declared in document types and invoked in document instances are merely the visible, above-ground parts of it. Now, one may claim that we don't need entity identity for that purpose, just as we don't need gold to back up the U.S. dollar. Hmmmm. But there's still identity, even there, and in the case of U.S. dollars -- even the huge majority of them that don't have individual identity -- their root-existence and root-nature is arguably testable in the form of U.S. military power.

Where's the power of URIs, if there's no testable "there" there, and they don't even necessarily resolve? Where's the identity of a document type, if not in an entity of some kind that is testably somewhere and ideally has properties that are useful for testing instances that claim to be of the type?

I don't see how your explanation of DITA's approach resolves the problem. When you say:

...stop caring about the grammar as an artifact and care only
about the set of (abstract) vocabulary modules the document says it (may)
use. That is, actually declare the abstract document type in an
unambiguous way and worry about validation details separately.

...you don't say how to resolve the problem, other than, implicitly anyway, via entity identity: the identities of the DITA modules, wherever they are. Right? You just don't admit it up front in ENTITY declarations. It's just understood by everybody, more or less intuitively, I guess.

How is that better?

Steve

On 05/04/2016 01:12 PM, Eliot Kimber wrote:
These are really two different subject domains: entities (content-level
reuse) and document types (defining and determining correctness of
instances against some understood set of rules).

On general entities:

General entities are absolute evil. They should never be used under any
circumstances. Fortunately, the practical reality of XML is that they
almost never are used. I only see them in XML applications that reflect
recent migration from legacy SGML systems.

The alternative is link-based reuse, that is, reuse at the application
processing level, not at the serialization parser level. Or more
precisely: reuse is an application concern, not a serialization concern.
Entities in SGML and XML are string macros. To the degree that string
macros are useful then they have value and in the context of DTD
declarations parameter entities have obvious value and utility. Parameter
entities are not evil.

But in the context of content, that is, the domain of the elements
themselves, string macros are a big problem, not because they aren't
useful, but because people think they do something they don't, namely
provide a way to do reliable reuse. The set of use cases where string
macros are useful is so small, relative to the cases where they are
actively dangerous, that their value is not at all worth the cost of
their certain misuse.

Even for apparently-simple use cases like string value parameterization in
content (e.g., product names or whatever), string macros fail because they
cannot be related to specific use contexts. When you push on the
requirements for reuse you quickly realize that only application-level
processing gives you the flexibility and opportunities required to
properly implement re-use requirements, in particular, providing the
correct resolution for a given use in a given use context.

The solution was in HyTime, namely the content reference link type, which
was a link with the base semantic of use by reference. Because it is a
link it is handled in the application domain, not the parsing domain. This
is transclusion as envisioned by Ted Nelson.

You see this in DITA through DITA's content reference facility and the
map-and-topic architecture, both of which use hyperlinks to establish
reuse relationships. With DITA 1.3 the addressing mechanism is
sufficiently complete to satisfy most of the requirements (the only
missing feature is indirection for references to elements within topics,
but I defined a potential solution that does not require any architectural
changes to DITA, just additional processing applied to specific
specializations).

I'm not aware of any other documentation XML application that has the
equivalent use-by-reference features, but DITA is unusual in being
driven primarily by re-use requirements, which is not the case for older
specifications like DocBook, NLM/JATS, and TEI. Of course, there's no
barrier to adding similar features to any application. However, there are
complications and policy considerations that have to be carefully worked
out, such as what are the rules for consistency between referencing and
referenced elements? DITA has one policy, but it may not be the best
policy for all use cases.

On DTDs and grammars in general:

I do not say that DTDs (or grammars in general) are evil.

I only say that the way people applied them was (and is) misguided because
they misunderstood (or willfully ignored in the face of no better
alternative) their limitations as a way to associate documents with their
abstract document types. Of course DTDs and grammars in general have great
value as a way of imposing some order on data as it flows through its
communication channels and goes through its life cycle.

But grammars do not define document types.

At the time namespaces were being defined I tried to suggest some standard
way to identify abstract document types separate from any particular
implementation of them: basically a formal document that says "This is
what I mean by abstract document type 'X'". You give it a URI so it can be
referred to unambiguously and you can connect whatever additional
governing or augmenting artifacts to it you want. By such a mechanism you
could have as complete a definition of a given abstract document type as
you wanted, including prose definitions as well as any number of
implementing artifacts (grammars, Schematrons, validation applications,
phone numbers to call for usage advice, etc.).

But of course that was too heavy for the time (or for now). Either people
simply didn't need that level of definitional precision or they used the
workaround of pointing in the other direction, that is, by having
specifications that say "I define what abstract document type 'X' is".

This was in the context of the problem that namespace names don't point
to anything: people had the idea that namespace names told you something
but we were always clear that they did not--they were simply magic strings
that used the mechanics of URIs to ensure that you have a
universally-unique name.

But the namespace tells you nothing about the names in the space (that is,
what is the set of allowed names, where are their semantics and rules
defined, etc.). The namespace spec specifically says "You should not
expect to find anything at the end of the namespace URI and you should not
try to resolve it".
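That a namespace name is just a magic string is visible with any stock parser; a small Python illustration (stdlib ElementTree, with a URI chosen to obviously not resolve; the example is mine, not from the thread):

```python
# A namespace URI is an opaque string: the parser folds it into element
# names (Clark notation, {uri}local) and never attempts to fetch it.
import xml.etree.ElementTree as ET

doc = '<d:doc xmlns:d="http://example.org/no-such-resource"/>'
root = ET.fromstring(doc)
print(root.tag)  # {http://example.org/no-such-resource}doc
```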

So if the namespace name is not the name of the document type, what is? I
wanted there to be one because I like definitional completeness.

But in fact it's clear now that that level of completeness is either not
practical or is not sufficiently desired to make it worth trying to
implement it.

So we're where we were 30 years ago: we have grammar definitions for
documents but we don't have a general way to talk about abstract document
types as distinct from their implementing artifacts (grammars, validation
processors, output processors, prose definitions, etc.).

But experience has shown that it's not that big of a deal in practice. In
practice, having standards or standards-like documents is sufficient for
those cases where it is important.

As far as addressing the problem that the reference from a document
instance to a grammar in fact tells you nothing reliable, a solution is what
DITA does: stop caring about the grammar as an artifact and care only
about the set of (abstract) vocabulary modules the document says it (may)
use. That is, actually declare the abstract document type in an
unambiguous way and worry about validation details separately.

DITA does this as follows:

1. Defines an architecture for layered vocabulary.

The DITA standard defines an invariant and mandatory set of base element
types and a mechanism for the definition of new element types in terms of
the base types. All conforming DITA element types and attributes MUST be
based on one of the base types (directly or indirectly) and must be at
least as constrained as the base type (that is, you can't relax
constraints). This is DITA specialization. It ensures that all DITA
documents are minimally processable in terms of the base types (or any
known intermediate types). It allows for reliable interoperation and
interchange of all conforming DITA documents. Because the definitional
mechanism uses attributes it is not dependent on any particular grammar
feature in the way that HyTime is. Any normal XML processor (including CSS
selectors) can get access to the definitional base of any element and thus
do what it can with it. The definitional details of an element are
specified on the required @class attribute, e.g. class="- topic/p
mydomain/my-para ", which reflects a specialization of the base type "p"
in the module "topic" by the module "mydomain" with the name "my-para".
Any general DITA-aware processor can thus process "my-para" elements using
the rules for "p" or, through extension, can have "mydomain/my-para"
processing, which might be different. But in either case you'll get
something reasonable as a result.
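The fallback behavior described above can be sketched in a few lines of Python; the handler table and its values here are hypothetical illustrations, not part of DITA itself:

```python
# Sketch: resolve a DITA @class value to the most specific handler a
# processor knows about. "- topic/p mydomain/my-para " declares my-para
# as a specialization of topic/p, so an unaware processor falls back to p.

def class_tokens(class_value):
    """Split a @class value into module/name tokens, most general first."""
    # The first token is the "-" (structural) or "+" (domain) marker; skip it.
    return class_value.split()[1:]

def resolve_handler(class_value, handlers):
    """Pick the handler for the most specific ancestor type we know."""
    for token in reversed(class_tokens(class_value)):
        if token in handlers:
            return handlers[token]
    return None

handlers = {"topic/p": "render-paragraph"}  # hypothetical handler table
print(resolve_handler("- topic/p mydomain/my-para ", handlers))
# falls back to the topic/p handler: render-paragraph
```

A processor that registers a "mydomain/my-para" handler would win instead, since the most specific known token is tried first.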

2. Defines a modular architecture for vocabulary such that each kind of
vocabulary definition (map types, topic types, or mix-in "domains")
follows a regular pattern. There is no sense of "a" DITA DTD, only
collections of modules that can be combined into document types (both in
the abstract sense of "DITA document type" and in the implementation sense
of "a working grammar file that governs document instances that use a
given set of modules").

DITA requires that a given version in time of a module is invariant,
meaning that every copy of the module should be identical to every other
(basically, you never directly modify a vocabulary module's grammar
implementation). Each module is given a name that should be globally
unique, or at least unique within its expected scope of use. Experience
has shown us that it's actually pretty easy to ensure practical uniqueness
just by judicious use of name prefixes and general respect for people's
namespaces. No need to step up to full GUID-style uniquification a la XML
namespaces.

In addition to vocabulary modules, which define element types or
attributes, you can have "constraint modules", which impose constraints on
vocabulary defined in other modules. Constraint modules let you further
constrain the vocabulary without the need to directly modify a given
module's grammar definition. Again, the rule is that you can only
constrain, you can't relax.

3. Defines a "DITA document type" as a unique set of modules, identified
by module name. If two DITA documents declare the use of the same set of
modules then by definition they have the same DITA document type. This
works because of rule (2): all copies of a given module must be identical.
So it is sufficient to simply identify the modules. In theory one could go
from the module names to some set of implementations of the modules
although I don't know of any tools that do that because in practice most
DITA documents have associated DTDs that already integrate the grammars
for the modules being used. But it is possible. The DITA document type is
declared on the @domains attribute, which is required on DITA root
elements (maps and topics).
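The "same modules, same document type" rule reduces to a set comparison; a minimal sketch (the @domains values are illustrative, and the parsing is deliberately naive):

```python
# Sketch: two DITA documents have the same DITA document type iff they
# declare the same set of modules, regardless of declaration order.
import re

def modules_from_domains(domains_value):
    """Collect every module name mentioned in an @domains-style value."""
    # @domains lists parenthesized module ancestries, e.g. "(topic hi-d)".
    return {name
            for group in re.findall(r"\(([^)]*)\)", domains_value)
            for name in group.split()}

doc_a = "(topic hi-d) (topic ut-d)"
doc_b = "(topic ut-d) (topic hi-d)"
print(modules_from_domains(doc_a) == modules_from_domains(doc_b))  # True
```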

Note that you could have a conforming DITA vocabulary module that is only
ever defined in prose. As long as documents reflect the types correctly
in the @class attributes and reflect the module name in the @domains
attribute, the DITA definitional requirements are met. It would be up to
tool implementors to do whatever was appropriate for your domain (which
might be nothing if your vocabulary exists only to provide distinguishing
names and doesn't require any processing different from the base). Nobody
would do this *but they could*.

Thus DITA completely divorces the notion of "document type" from any
implementation details of grammar, validation, or processing, with the
clear implication that there better be clear documentation of what a given
vocabulary module is.

Cheers,

E.
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 5/4/16, 11:06 AM, "Steve Newcomb" <srn@coolheads.com> wrote:

Eliot,

In order to avoid potential misunderstandings, I think it might be worth
clarifying your position on the following points:

(1) Resolved: the whole idea of entity identity was a mistake, is
worthless, and is evil.

(2) Resolved: the whole idea of document type identity was a mistake, is
worthless, and is evil.

I have deliberately made these statements extreme and obviously silly in
order to dramatize the fact that, even though there are problems with
SGML's and/or XML's operational approaches to them, we cannot discard
these ideas altogether.  The ideas themselves remain profound and
necessary.  They will always be needed.  The usefulness of their various
operational prostheses will always be limited to certain cultural
contexts.  Even within their specific contexts, those prostheses will
always be imperfect.  They will always require occasional repair and
replacement, in order that they remain available for use even as that
context's notions of "entity", "document", and "identity" continue to
evolve and diversify.

The operational prostheses with which these ideas were fitted at SGML's
birth are things of their time.  That was then, this is now, and "time
makes ancient good uncouth".  Their goodness in their earlier context is
a matter of record; they were used, a lot, for a lot of reasons and in a
lot of ways.  At the time, it was not stupid or evil to make the notion
of document type identity depend on the notion of entity identity, nor
was it stupid or evil to make the notion of entity identity dependent on
PUBLIC identifiers.  And in many ways, it still isn't.  What is your
proposed alternative, and why is it better?

Steve

On 05/04/2016 11:23 AM, Eliot Kimber wrote:
SGML requires the use of a DTD--there was no notion of a "default" DTD.
This requirement was, I'll argue, the result of a fundamental conceptual
mistake--understandable at the time but a mistake nevertheless.

The conceptual mistake that SGML made was conflating the notion of an
abstract "document type" with the grammar definition for (partially)
validating documents against that document type. That is, SGML saw the
DTD
as being equal to the definition of the "document type" as an
abstraction.
But of course that is nonsense. There was (remains today) the misguided
notion that a reference to an external DTD subset somehow told you
something actionable about the document you had. But of course it tells
you nothing reliable because the document could define its "real" DTD in
the internal subset or the local environment could put whatever it wants
at the end of the public ID the document is referencing.

Consider this SGML document:

<!DOCTYPE notdocbook PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" [
    <!ELEMENT notdocbook ANY >
    <!ELEMENT bogus ANY >
]>
<notdocbook>
    <bogus><para>This is not a DocBook document</para></bogus>
</notdocbook>

This document will be taken as a DocBook document by any tool that
thinks
the public ID means something. But obviously it is not a DocBook
document.
It is, however, 100% DTD valid. QED DTDs are useless as tools of
document
type definition. The only reason the SGML (and now XML world) didn't
collapse under this fact is that the vast majority of SGML and XML
authoring and management tools simply refused to preserve internal
subsets
(going back to the discussion about DynaBase's problems with entity
preservation).

Standoff grammars like XSD and RELAX NG at least avoid the problem of
internal DTD subsets but they still fail to serve as reliable
definitions
of document types in abstract because they are still only defining the
grammar rules for a subset of all possible conforming documents in a
document type.

Because of features like tag omission, inclusion exceptions, and short
references, it was simply impossible to parse an SGML document without
having both its DTD and its SGML declaration (which defined the lexical
syntax details). There is a default SGML declaration, but not a default
DTD.

A lot of what we did in XML was remove this dependency by having a fixed
syntax and removing all markup minimization except attribute defaults.

XML does retain one markup minimization feature, attribute defaults.
Fortunately, both XSD and RELAX NG provide alternatives to DTDs for
getting default attribute values.
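That one surviving minimization feature is easy to demonstrate; here is a small Python sketch (stdlib ElementTree, which sits on a non-validating expat parser) showing a default declared in the internal subset being supplied even though the instance never writes the attribute:

```python
# An attribute default declared in the internal DTD subset is applied
# by the parser; the instance itself never mentions status="draft".
import xml.etree.ElementTree as ET

doc = '''<!DOCTYPE note [
  <!ATTLIST note status CDATA "draft">
]>
<note>hello</note>'''

root = ET.fromstring(doc)
print(root.get("status"))  # draft
```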

Cheers,

Eliot
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 5/4/16, 6:16 AM, "Norman Gray" <norman@astro.gla.ac.uk> wrote:

Greetings.

(catching up ...)

On 29 Apr 2016, at 17:58, John Cowan wrote:

On Fri, Apr 29, 2016 at 8:54 AM, Norman Gray <norman@astro.gla.ac.uk>
wrote:

In the XML world, the DTD is just for validation


That turns out not to be the case.  There are a number of XML DTD
features
which affect the infoset returned by a compliant parser.  If they are
in
the internal subset, the parser MUST respect them;

I stand corrected; I was sloppy.  I think this doesn't change my
original point, however, which was that in SGML the DTD was integral to
the document, and to the parse of the document, and that it's easy to
forget this after one has got used to two decades of XML[1].  I can't
remember if there was a trivial or default DTD which was assumed in the
absence of a declared one, in the same way that there was a default
SGML
Declaration, but taking advantage of that would probably have been
regarded as a curiosity, rather than normal practice.

In XML, in contrast, the DTD has a more auxiliary role, and at a first
conceptual look, that role is validation (even though -- footnote! --
it
may change other things about the parse as well).  Thus _omitting_ an
XML DTD (or XSchema) is neither perverse nor curious.

Practical aspect: When I'm writing XML, I use a DTD (in whatever
syntax)
to help Emacs tell me if the document is valid, but I don't even know
whether the XML parsers I use are capable of using a DTD external
subset.  That careless ignorance would be impossible with SGML.

The rational extension of that attitude, of course, is MicroXML, which
(as you of course know) doesn't use any external resources at all, and
doesn't care about validation.

Best wishes,

Norman


[1] Hang on, _two_ decades?!  I've just checked and ... 1996 doesn't
seem that long ago.


--
Norman Gray  :  https://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
