Re: [xml-dev] SGML default attributes.

These are really two different subject domains: entities (content-level
reuse) and document types (defining and determining correctness of
instances against some understood set of rules).

On general entities:

General entities are absolute evil. They should never be used under any
circumstances. Fortunately, the practical reality of XML is that they
almost never are used. I only see them in XML applications that reflect
recent migration from legacy SGML systems.

The alternative is link-based reuse, that is, reuse at the application
processing level, not at the serialization parser level. Or more
precisely: reuse is an application concern, not a serialization concern.
Entities in SGML and XML are string macros. To the degree that string
macros are useful, they have value, and in the context of DTD declarations
parameter entities have obvious value and utility. Parameter entities are
not evil.
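To make the distinction concrete, here's a minimal sketch (the entity and
element names are invented for illustration):

<!-- Parameter entity: a string macro used only within the DTD itself -->
<!ENTITY % common.atts "id ID #IMPLIED">
<!ELEMENT para (#PCDATA)>
<!ATTLIST para %common.atts;>

<!-- General entity: a string macro expanded in document content -->
<!ENTITY prodname "Widget Pro">

The first kind parameterizes declarations and never touches content; the
second splices replacement text into the document itself, which is where
the trouble starts.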

But in the context of content, that is, the domain of the elements
themselves, string macros are a big problem, not because they aren't
useful, but because people think they do something they don't, namely
provide a way to do reliable reuse. The set of use cases where string
macros are useful, relative to those where they are actively dangerous, is
so small that their value is not at all worth the cost of their certain
misuse.

Even for apparently-simple use cases like string value parameterization in
content (e.g., product names or whatever), string macros fail because they
cannot be related to specific use contexts. When you push on the
requirements for reuse you quickly realize that only application-level
processing gives you the flexibility and opportunities required to
properly implement reuse requirements, in particular, providing the
correct resolution for a given use in a given use context.
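For example (a hypothetical product-name entity):

<!ENTITY prodname "Widget Pro">
...
<p>Download &prodname; from the support portal.</p>

Every reference to &prodname; gets the same replacement text at parse
time, before any application sees the document, so there is no hook for
resolving it differently per deliverable, audience, or use context.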

The solution was in HyTime, namely the content reference link type, which
was a link with the base semantic of use by reference. Because it is a
link it is handled in the application domain, not the parsing domain. This
is transclusion as envisioned by Ted Nelson.

You see this in DITA through DITA's content reference facility and the
map-and-topic architecture, both of which use hyperlinks to establish
reuse relationships. With DITA 1.3 the addressing mechanism is
sufficiently complete to satisfy most of the requirements (the only
missing feature is indirection for references to elements within topics,
but I defined a potential solution that does not require any architectural
changes to DITA, just additional processing applied to specific
specializations). 
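As a sketch of what link-based reuse looks like in practice, here is a
DITA content reference (the file name and IDs are invented for
illustration):

<p>Install <ph conref="common-text.dita#common-text/prodname"/> as
usual.</p>

Because this is an address, not a macro, resolution happens in the
application domain, where context-specific rules (including DITA's
key-based indirection) can be applied.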

I'm not aware of any other documentation XML application that has
equivalent use-by-reference features, but DITA is somewhat unique in being
driven primarily by reuse requirements, which is not the case for older
specifications like DocBook, NLM/JATS, and TEI. Of course, there's no
barrier to adding similar features to any application. However, there are
complications and policy considerations that have to be carefully worked
out, such as: what are the rules for consistency between referencing and
referenced elements? DITA has one policy, but it may not be the best
policy for all use cases.

On DTDs and grammars in general:

I do not say that DTDs (or grammars in general) are evil.

I only say that the way people applied them was (and is) misguided because
they misunderstood (or willfully ignored in the face of no better
alternative) their limitations as a way to associate documents with their
abstract document types. Of course DTDs and grammars in general have great
value as a way of imposing some order on data as it flows through its
communication channels and goes through its life cycle.

But grammars do not define document types.

At the time namespaces were being defined I tried to suggest some standard
way to identify abstract document types separate from any particular
implementation of them: basically a formal document that says "This is
what I mean by abstract document type 'X'". You give it a URI so it can be
referred to unambiguously and you can connect whatever additional
governing or augmenting artifacts to it you want. By such a mechanism you
could have as complete a definition of a given abstract document type as
you wanted, including prose definitions as well as any number of
implementing artifacts (grammars, Schematrons, validation applications,
phone numbers to call for usage advice, etc.).
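Such a descriptor might have looked something like this (entirely
hypothetical; no such standard vocabulary was ever defined):

<document-type uri="http://example.com/doctypes/report">
  <definition>Prose statement of what a "report" is ...</definition>
  <grammar href="report.rng"/>
  <rules href="report-checks.sch"/>
  <contact>whom to call for usage advice</contact>
</document-type>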

But of course that was too heavy for the time (or for now). Either people
simply didn't need that level of definitional precision or they used the
workaround of pointing in the other direction, that is, having
specifications that say "I define what abstract document type 'X' is".

This was in the context of the problem that namespace names don't point
to anything: people had the idea that namespace names told you something,
but we were always clear that they did not--they were simply magic strings
that used the mechanics of URIs to ensure that you have a
universally-unique name.

But the namespace tells you nothing about the names in the space (that is,
what is the set of allowed names, where are their semantics and rules
defined, etc.). The namespace spec specifically says "You should not
expect to find anything at the end of the namespace URI and you should not
try to resolve it".

So if the namespace name is not the name of the document type, what is? I
wanted there to be one because I like definitional completeness.

But in fact it's clear now that that level of completeness is either not
practical or is not sufficiently desired to make it worth trying to
implement it.

So we're where we were 30 years ago: we have grammar definitions for
documents but we don't have a general way to talk about abstract document
types as distinct from their implementing artifacts (grammars, validation
processors, output processors, prose definitions, etc.).

But experience has shown that it's not that big of a deal in practice. In
practice, having standards or standards-like documents is sufficient for
those cases where it is important.

As far as addressing the problem that the reference from a document
instance to a grammar in fact tells you nothing reliable, a solution is
what DITA does: stop caring about the grammar as an artifact and care only
about the set of (abstract) vocabulary modules the document says it (may)
use. That is, actually declare the abstract document type in an
unambiguous way and worry about validation details separately.

DITA does this as follows:

1. Defines an architecture for layered vocabulary.

The DITA standard defines an invariant and mandatory set of base element
types and a mechanism for the definition of new element types in terms of
the base types. All conforming DITA element types and attributes MUST be
based on one of the base types (directly or indirectly) and must be at
least as constrained as the base type (that is, you can't relax
constraints). This is DITA specialization. It ensures that all DITA
documents are minimally processable in terms of the base types (or any
known intermediate types). It allows for reliable interoperation and
interchange of all conforming DITA documents.

Because the definitional mechanism uses attributes, it is not dependent on
any particular grammar feature in the way that HyTime is. Any normal XML
processor (including CSS selectors) can get access to the definitional
base of any element and thus do what it can with it. The definitional
details of an element are specified on the required @class attribute,
e.g., class="- topic/p mydomain/my-para ", which reflects a specialization
of the base type "p" in the module "topic" by the module "mydomain" with
the name "my-para". Any general DITA-aware processor can thus process
"my-para" elements using the rules for "p" or, through extension, can have
"mydomain/my-para" processing, which might be different. But in either
case you'll get something reasonable as a result.
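For example, given a specialized element (names invented):

<my-para class="- topic/p mydomain/my-para ">Some content</my-para>

a CSS rule with the selector *[class~="topic/p"] matches it regardless of
its tag name, because @class values are whitespace-separated tokens. That
is the fallback-to-base processing described above.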

2. Defines a modular architecture for vocabulary such that each kind of
vocabulary definition (map types, topic types, or mix-in "domains")
follows a regular pattern. There is no sense of "a" DITA DTD, only
collections of modules that can be combined into document types (both in
the abstract sense of "DITA document type" and in the implementation sense
of a working grammar file that governs document instances that use a given
set of modules).

DITA requires that a given version of a module be invariant, meaning that
every copy of the module should be identical to every other (basically,
you never directly modify a vocabulary module's grammar implementation).
Each module is given a name that should be globally unique, or at least
unique within its expected scope of use. Experience has shown us that it's
actually pretty easy to ensure practical uniqueness just by judicious use
of name prefixes and general respect for people's namespaces. There is no
need to step up to full GUID-style uniquification a la XML namespaces.

In addition to vocabulary modules, which define element types or
attributes, you can have "constraint modules", which impose constraints on
vocabulary defined in other modules. Constraint modules let you further
constrain the vocabulary without the need to directly modify a given
module's grammar definition. Again, the rule is that you can only
constrain, you can't relax.
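A constraint module follows the same pattern: it overrides content-model
parameter entities before the base module is included. A sketch (the
entity names follow the DITA DTD coding conventions; the restricted model
is invented):

<!-- Constrain <section> to titles, paragraphs, and notes only -->
<!ENTITY % section.content "((%title;)?, (%p; | %note;)*)">

When the shell includes this declaration ahead of the base topic module,
the stricter model wins, because in DTDs the first declaration of a
parameter entity is the one that takes effect.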

3. Defines a "DITA document type" as a unique set of modules, identified
by module name. If two DITA documents declare the use of the same set of
modules then by definition they have the same DITA document type. This
works because of rule (2): all copies of a given module must be identical.
So it is sufficient to simply identify the modules. In theory one could
go from the module names to some set of implementations of those modules,
although I don't know of any tools that do that, because in practice most
DITA documents have associated DTDs that already integrate the grammars
for the modules being used. But it is possible. The DITA document type is
declared on the @domains attribute, which is required on DITA root
elements (maps and topics).
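For example, a topic that uses the base topic module plus the highlighting
and programming domains would carry something like this (representative
values):

<topic id="example" class="- topic/topic "
       domains="(topic hi-d) (topic pr-d)">

Two documents whose @domains values name the same set of modules have, by
definition, the same DITA document type.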

Note that you could have a conforming DITA vocabulary module that is only
ever defined in prose. As long as documents reflect the types correctly in
their @class attributes and reflect the module name in the @domains
attribute, the DITA definitional requirements are met. It would be up to
tool implementors to do whatever was appropriate for the domain (which
might be nothing if the vocabulary exists only to provide distinguishing
names and doesn't require any processing different from the base). Nobody
would do this *but they could*.

Thus DITA completely divorces the notion of "document type" from any
implementation details of grammar, validation, or processing, with the
obvious implication that there had better be clear documentation of what a
given vocabulary module is.

Cheers,

E.
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 5/4/16, 11:06 AM, "Steve Newcomb" <srn@coolheads.com> wrote:

>Eliot,
>
>In order to avoid potential misunderstandings, I think it might be worth
>clarifying your position on the following points:
>
>(1) Resolved: the whole idea of entity identity was a mistake, is
>worthless, and is evil.
>
>(2) Resolved: the whole idea of document type identity was a mistake, is
>worthless, and is evil.
>
>I have deliberately made these statements extreme and obviously silly in
>order to dramatize the fact that, even though there are problems with
>SGML's and/or XML's operational approaches to them, we cannot discard
>these ideas altogether.  The ideas themselves remain profound and
>necessary.  They will always be needed.  The usefulness of their various
>operational prostheses will always be limited to certain cultural
>contexts.  Even within their specific contexts, those prostheses will
>always be imperfect.  They will always require occasional repair and
>replacement, in order that they remain available for use even as that
>context's notions of "entity", "document", and "identity" continue to
>evolve and diversify.
>
>The operational prostheses with which these ideas were fitted at SGML's
>birth are things of their time.  That was then, this is now, and "time
>makes ancient good uncouth".  Their goodness in their earlier context is
>a matter of record; they were used, a lot, for a lot of reasons and in a
>lot of ways.  At the time, it was not stupid or evil to make the notion
>of document type identity depend on the notion of entity identity, nor
>was it stupid or evil to make the notion of entity identity dependent on
>PUBLIC identifiers.  And in many ways, it still isn't.  What is your
>proposed alternative, and why is it better?
>
>Steve
>
>On 05/04/2016 11:23 AM, Eliot Kimber wrote:
>> SGML requires the use of a DTD--there was no notion of a "default" DTD.
>> This requirement was, I'll argue, the result of a fundamental conceptual
>> mistake--understandable at the time but a mistake nevertheless.
>>
>> The conceptual mistake that SGML made was conflating the notion of an
>> abstract "document type" with the grammar definition for (partially)
>> validating documents against that document type. That is, SGML saw the
>> DTD as being equal to the definition of the "document type" as an
>> abstraction.
>> But of course that is nonsense. There was (remains today) the misguided
>> notion that a reference to an external DTD subset somehow told you
>> something actionable about the document you had. But of course it tells
>> you nothing reliable because the document could define its "real" DTD in
>> the internal subset or the local environment could put whatever it wants
>> at the end of the public ID the document is referencing.
>>
>> Consider this SGML document:
>>
>> <!DOCTYPE notdocbook PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" [
>>    <!ELEMENT notdocbook ANY >
>>    <!ELEMENT bogus ANY >
>> ]>
>> <notdocbook>
>>    <bogus><para>This is not a DocBook document</para></bogus>
>> </notdocbook>
>>
>> This document will be taken as a DocBook document by any tool that
>> thinks the public ID means something. But obviously it is not a DocBook
>> document. It is, however, 100% DTD valid. QED: DTDs are useless as tools
>> of document type definition. The only reason the SGML (and now XML)
>> world didn't collapse under this fact is that the vast majority of SGML
>> and XML authoring and management tools simply refused to preserve
>> internal subsets (going back to the discussion about DynaBase's problems
>> with entity preservation).
>>
>> Standoff grammars like XSD and RELAX NG at least avoid the problem of
>> internal DTD subsets, but they still fail to serve as reliable
>> definitions of document types in the abstract because they are still
>> only defining the grammar rules for a subset of all possible conforming
>> documents in a document type.
>>
>> Because of features like tag omission, inclusion exceptions, and short
>> references, it was simply impossible to parse an SGML document without
>> having both its DTD and its SGML declaration (which defined the lexical
>> syntax details). There is a default SGML declaration, but not a default
>> DTD.
>>
>> A lot of what we did in XML was remove this dependency by having a fixed
>> syntax and removing all markup minimization except attribute defaults.
>>
>> XML does retain one markup minimization feature, attribute defaults.
>> Fortunately, both XSD and RELAX NG provide alternatives to DTDs for
>> getting default attribute values.
>>
>> Cheers,
>>
>> Eliot
>> ----
>> Eliot Kimber, Owner
>> Contrext, LLC
>> http://contrext.com
>>
>>
>>
>>
>> On 5/4/16, 6:16 AM, "Norman Gray" <norman@astro.gla.ac.uk> wrote:
>>
>>> Greetings.
>>>
>>> (catching up ...)
>>>
>>> On 29 Apr 2016, at 17:58, John Cowan wrote:
>>>
>>>> On Fri, Apr 29, 2016 at 8:54 AM, Norman Gray <norman@astro.gla.ac.uk>
>>>> wrote:
>>>>
>>>> In the XML world, the DTD is just for validation
>>>>
>>>>
>>>> That turns out not to be the case.  There are a number of XML DTD
>>>> features
>>>> which affect the infoset returned by a compliant parser.  If they are
>>>> in
>>>> the internal subset, the parser MUST respect them;
>>> I stand corrected; I was sloppy.  I think this doesn't change my
>>> original point, however, which was that in SGML the DTD was integral to
>>> the document, and to the parse of the document, and that it's easy to
>>> forget this after one has got used to two decades of XML[1].  I can't
>>> remember if there was a trivial or default DTD which was assumed in the
>>> absence of a declared one, in the same way that there was a default
>>> SGML Declaration, but taking advantage of that would probably have been
>>> regarded as a curiosity, rather than normal practice.
>>>
>>> In XML, in contrast, the DTD has a more auxiliary role, and at a first
>>> conceptual look, that role is validation (even though -- footnote! --
>>> it may change other things about the parse as well).  Thus _omitting_ an
>>> XML DTD (or XSchema) is neither perverse nor curious.
>>>
>>> Practical aspect: When I'm writing XML, I use a DTD (in whatever
>>> syntax) to help Emacs tell me if the document is valid, but I don't
>>> even know
>>> whether the XML parsers I use are capable of using a DTD external
>>> subset.  That careless ignorance would be impossible with SGML.
>>>
>>> The rational extension of that attitude, of course, is MicroXML, which
>>> (as you of course know) doesn't use any external resources at all, and
>>> doesn't care about validation.
>>>
>>> Best wishes,
>>>
>>> Norman
>>>
>>>
>>> [1] Hang on, _two_ decades?!  I've just checked and ... 1996 doesn't
>>> seem that long ago.
>>>
>>>
>>> -- 
>>> Norman Gray  :  https://nxg.me.uk
>>> SUPA School of Physics and Astronomy, University of Glasgow, UK
>>>



