Re: [xml-dev] SGML default attributes.

I should clarify: it's not just that entities don't have identity but that
they are not *objects* in the way that elements are. That is, they do not
have identity in the parsed result, certainly not in any commonly-used XML
processing environment (e.g., XSLT, XQuery, DOM 1 or 2, etc.).

Entities are string macros.
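
For example, a minimal sketch: once this is parsed, nothing downstream can
tell that the text came from an entity rather than having been typed in
place.

  <!DOCTYPE p [
    <!ENTITY prod "Widget Pro">
  ]>
  <p>Install &prod; before you configure &prod;.</p>

To XSLT, XQuery, or the DOM that is indistinguishable from

  <p>Install Widget Pro before you configure Widget Pro.</p>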

Cheers,

E.

----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 5/4/16, 3:15 PM, "Eliot Kimber" <ekimber@contrext.com> wrote:

>General entities are evil because they offer a false solution that leads
>to failure and pain.
>
>Note that in XML the 5 special escapes are defined as escapes, not as
>more-general text entities. They are built into the language. They are
>necessary, as you say, to be able to escape literal markup characters
>without resorting to numeric character references.
>
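>(For reference, the five: &lt; for <, &gt; for >, &amp; for &, &apos; for
>', and &quot; for ".)
>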
>In the context of general text entities there are two classes: internal
>general entities and external general entities. Internal entities are less
>bad only because their declaration must be part of the DOCTYPE declaration
>and thus it's more obvious that they are string macros and their value is
>clearer to the author. External general entities are the more insidious
>feature because they look and feel like real re-use when in fact they are
>not.
>
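>To make the two forms concrete, a sketch of both declarations:
>
>  <!DOCTYPE chapter [
>    <!-- internal: the replacement text is right here in the DOCTYPE -->
>    <!ENTITY product "Widget Pro">
>    <!-- external: the replacement text lives in another file -->
>    <!ENTITY legal SYSTEM "boilerplate/legal.xml">
>  ]>
>  <chapter>&product; &legal;</chapter>
>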
>The issue is entirely one of identity: In normal SGML and XML processing
>contexts (that is, any processor that reflects the results of parsing and
>that does not go to extremes to preserve in some way knowledge of the
>original entity boundaries), then not only do entities not have identity,
>they do not have existence at all.
>
>Parameter entities have the same problem: they do not have identity. The
>difference is the processing context: parameter entities are only of
>interest to the parser itself for the task of composing the grammar used to
>validate the instance. [And I'll observe that efforts to add more
>application-level reuse features to XSD and, to a lesser degree, RELAX NG,
>have led to serious problems, such as the hideous redefine feature in XSD
>1.0. So to the degree that DTDs avoided those problems by limiting
>themselves to simple string macros with a basic positional configuration
>mechanism, it did the right thing.]
>
>That is, from the point of view of the processor operating on the parsed
>documents, the entities *never existed*. That means the processor has no
>opportunity to do the things that are always necessary when implementing
>re-use, such as validating the correctness of the reference, rewriting IDs
>and addresses to reflect the re-use, etc.
>
>The problem with things like public IDs and URNs for grammars is that we
>*want* them to have some meaning but in fact they do not have any reliable
>meaning because they are, fundamentally, just pointers to storage objects
>("files"). As I showed in my response to Roger Costello, with DTDs you can
>completely lie. With a catalog file you can completely lie. With a
>modified copy of a file you can completely lie. With a parameter entity
>redefined in the internal subset you can completely lie. Of course the lie
>can be detected by doing other validation but the point is that the
>ability to lie is inherent in the mechanism and cannot be detected by DTD
>validation itself.
>
>The solution is to decouple the identifier of the abstract document type
>from any references to any implementation expressions of the document
>type. 
>
>DITA does this decoupling by saying "There is a (potentially unbounded but
>finite) set of uniquely-named vocabulary modules that are, for a given
>version in time, invariant. For any given DITA document there is a set of
>modules that constitute that document's 'DITA document type'. By
>definition any two documents with the same set of modules have the same
>DITA document type."
>
>Because the DITA document type is defined in terms of the module *names*,
>which are simply names for the modules *as abstractions*, the definition
>is entirely in terms of the abstract document type.
>
>The modules must be defined somewhere--there has to be some
>definition--prose, DTD, XSD, Schematron, RNG, running code, whatever--but
>it doesn't matter *for conformance* what form it takes. The practical
>requirement is that the agents operating on the documents be able to make
>sense of the module definition *as needed*. But since every DITA element
>ultimately maps back to a base type defined in the DITA standard, it's
>not even necessary to understand anything about the modules. Even when
>there is no formal grammar defined for a module (and thus nothing
>available to validate instances that use that module), the document can
>still be validated against the grammars for the base types, because those
>are known (provided as part of the DITA standard). So you can always know,
>for any DITA document, that it at least conforms to the grammar rules for
>the base element types. Because relaxation of constraints is not allowed,
>you also know that any specializations don't add anything you didn't expect
>(because addition is not allowed).
>
>Note the constraint: specializations must not relax constraints. This is
>the big DITA constraint, but it's necessary to make the mechanism work,
>because otherwise you'd have unconstrained madness (that is, you'd have
>DocBook, JATS, TEI, every other standard XML application that allows
>unconstrained extension). The solution is to ensure that the base types
>allow all reasonable options. One of the things you can see in the
>evolution of the DITA standard is the removal of inappropriate constraints
>in the base types. It's certainly not perfect but it's close enough for
>DITA 1.x. XML applications in other domains could, of course, choose
>different sets of starting constraints--"reasonable" is of course context
>dependent. 
>
>For the purposes of imposing DITA's content reference constraints, which
>are defined in terms of "document type compatibility," it is sufficient to
>know the DITA document types of the two documents involved and the @class
>values of the elements involved. There is no need to know anything about
>the actual grammar rules *because those rules are already defined in the
>DITA standard*. That is, because no specialization can be less constrained
>than the base, if you know about the base you know the minimum you need to
>know. Because constraints are defined through separate modules you also
>know if a given document type is more constrained than another if they
>otherwise use the same modules. You don't necessarily know *how* it's more
>constrained, just that it is. That is sufficient to know that you should
>not reuse elements from the less-constrained document in the
>more-constrained document if you do not want to risk including something
>the more-constrained document has chosen to disallow.
>
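>As an illustration (the constraint module name is invented), compare two
>declarations:
>
>  domains="(topic hi-d) (topic ut-d)"
>  domains="(topic hi-d) (topic ut-d) (topic myConstraint-c)"
>
>The second document uses the same modules plus a constraint module, so you
>know it is at least as constrained as the first without ever looking at a
>grammar file.
>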
>The DITA standard says conforming DITA documents do not need to have any
>reference to any grammar because DITA doesn't depend on a particular
>grammar file or form of reference to determine that a given document is or
>is not a DITA document. DITA is an XML standard and XML does not require
>the use of grammar references from document instances. (Remember that
>there are many ways to associate validation with documents other than
>pointing from the document to the grammar--if you know the document's
>abstract type you can always provide a way to do the appropriate
>validation, whatever form it might take. Conversely, if you don't know the
>document type there's not much you can reliably do other than check that
>it's well formed. And remember that the DOCTYPE declaration (or schema
>reference or RELAX NG reference) does not reliably tell you the document's
>document type, for the reasons I've given.)
>
>DITA depends on three things in document instances that are independent of
>any grammar use in order to determine if a document is (A) a DITA document
>and (B) what its DITA document type is:
>
>1. The presence of the @DITAArchVersion attribute, which is in a
>DITA-defined namespace (and is the only use of namespaces in DITA other
>than for non-DITA elements).
>2. The presence of the @domains attribute on map and topic elements.
>3. The presence of the @class attribute with the DITA-defined syntax on at
>least the map or topic element (but ideally on all elements not within a
>DITA "foreign" element).
>
>If all three conditions are met the document is almost certainly a DITA
>document and can be reasonably validated against DITA requirements and
>processed as a DITA document. Other vocabularies could have attributes
>named @domains or @class and even have the same syntax, but none should
>ever have the DITAArchVersion attribute.
>
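>As a sketch, the three markers on a topic root element (the version number
>and domains value are illustrative only):
>
>  <topic id="example"
>         xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/"
>         ditaarch:DITAArchVersion="1.3"
>         domains="(topic hi-d) (topic ut-d)"
>         class="- topic/topic ">
>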
>My main issue with DTDs, in particular, is not that they weren't
>valuable--obviously they were a very important innovation and have
>tremendous practical value even today (the vast majority of DITA documents
>are DTD-based, for various reasons)--but that they were misunderstood as
>being THE primary or only definition of "document type" when they never
>were. This led to a lot of misplaced effort, inappropriate and unrealistic
>expectations, etc.
>
>Cheers,
>
>E.
>
>
>----
>Eliot Kimber, Owner
>Contrext, LLC
>http://contrext.com
>
>
>
>
>On 5/4/16, 2:08 PM, "Steve Newcomb" <srn@coolheads.com> wrote:
>
>>Eliot,
>>
>>Like you, I'm not really wedded to the notion of parser-mediated
>>transclusion.   On the other hand, I'm not really convinced we can 100%
>>jettison it, either, or preach that the very concept is somehow evil.
>>It's a hack, that's all.  (Frankly, hacks are what get us through the
>>day.)
>>
>>What you've said is packed, as usual, with terrific insights.  I guess I
>>just have trouble with the rhetoric.  It wouldn't bother me so much if I
>>didn't think your words are (quite deservedly) influential.  If I didn't
>>already know you so well, I might gather that it is your opinion that
>>either:
>>
>>(1) entities have no identity
>>
>>or
>>
>>(2) entities may have identity but it doesn't make any difference,
>>
>>because
>>
>>...entities have no purpose other than content-level reuse in the
>>context of parsing operations.
>>
>>Assuming I'm right, then I'm going to guess that it's your opinion that
>>the *only* reason why we have the "lt" (less-than) general entity is to
>>bypass the parser's natural inclination to recognize a STAGO (start tag
>>open character).  As a purely practical matter, I must admit that I
>>don't think I've *ever* used the "lt" general entity name for any other
>>purpose.  And, truly, that purpose is a hack, pure and simple!  BUT:
>>there's a vital principle here, and I don't want it to be trampled and
>>lost.
>>
>>The principle at work here, at least for me, is that when I'm invoking
>>an entity by name, I'm using a defined name to refer to an abstract
>>thing, namely that abstraction which is shared by all "less-than"
>>characters in all character sets, fonts, encodings, and whatnot.  The
>>fact that I'm invoking the notion of "the less-than character" in the
>>context of parsed character data is irrelevant. I might instead use the
>>entity name "lt" as the value of an ENTITY attribute, for example, or in
>>any of the many ways that HyTime, for example, exploited the notion of
>>entity identity.
>>
>>In SGML, every aspect of the use of names to identify things is founded
>>on the notions inherent in DTD syntax.  And DTDs can be fully or
>>partially shared among many documents that invoke those element-type
>>names, attribute names, and entity names, so that they are all
>>(presumably, *cough*) invoking the same things whenever they utter the
>>same names.  And the DTDs themselves can also have "universal" names by
>>invoking the universes (somehow) identified in PUBLIC identifiers.  So I
>>would argue that you err in portraying entities and document types as
>>different things.  Instead, I think they are in fact best understood as
>>different perspectives on one and the same organic whole, a single
>>"grounding tree", if you like.
>>
>>Entity identity is the invisible root of the grounding tree.  In my
>>view, the names declared in document types and invoked in document
>>instances are merely the visible, above-ground parts of it.  Now, one
>>may claim that we don't need entity identity for that purpose, just as
>>we don't need gold to back up the U.S. dollar.  Hmmmm.  But there's
>>still identity, even there, and in the case of U.S. dollars -- even the
>>huge majority of them that don't have individual identity -- their
>>root-existence and root-nature is arguably testable in the form of U.S.
>>military power.
>>
>>Where's the power of URIs, if there's no testable "there" there, and they
>>don't even necessarily resolve?  Where's the identity of a document
>>type, if not in an entity of some kind that is testably somewhere and
>>ideally has properties that are useful for testing instances that claim
>>to be of the type?
>>
>>I don't see how your explanation of DITA's approach resolves the
>>problem.  When you say:
>>
>>> ...stop caring about the grammar as an artifact and care only
>>> about the set of (abstract) vocabulary modules the document says it
>>>(may)
>>> use. That is, actually declare the abstract document type in an
>>> unambiguous way and worry about validation details separately.
>>
>>...you don't say how to resolve the problem, other than, implicitly,
>>anyway, via entity identity: the identities of the DITA modules,
>>wherever they are.  Right?  You just don't admit it up front in ENTITY
>>declarations.  It's just understood by everybody, more or less
>>intuitively, I guess.
>>
>>How is that better?
>>
>>Steve
>>
>>On 05/04/2016 01:12 PM, Eliot Kimber wrote:
>>> These are really two different subject domains: entities (content-level
>>> reuse) and document types (defining and determining correctness of
>>> instances against some understood set of rules).
>>>
>>> On general entities:
>>>
>>> General entities are absolute evil. They should never be used under any
>>> circumstances. Fortunately, the practical reality of XML is that they
>>> almost never are used. I only see them in XML applications that reflect
>>> recent migration from legacy SGML systems.
>>>
>>> The alternative is link-based reuse, that is, reuse at the application
>>> processing level, not at the serialization parser level. Or more
>>> precisely: reuse is an application concern, not a serialization
>>>concern.
>>> Entities in SGML and XML are string macros. To the degree that string
>>> macros are useful they have value, and in the context of DTD
>>> declarations parameter entities have obvious value and utility.
>>>Parameter
>>> entities are not evil.
>>>
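>>> A sketch of the kind of string-macro duty they do well (a content model
>>> shared across two element declarations):
>>>
>>>   <!ENTITY % inline "em | strong | code">
>>>   <!ELEMENT p    (#PCDATA | %inline;)*>
>>>   <!ELEMENT note (#PCDATA | %inline;)*>
>>>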
>>> But in the context of content, that is, the domain of the elements
>>> themselves, string macros are a big problem, not because they aren't
>>> useful, but because people think they do something they don't, namely
>>> provide a way to do reliable reuse. The set of use cases where string
>>> macros are useful, relative to those where they are actively dangerous,
>>> is so small as to make their value not at all worth the cost of their
>>> certain misuse.
>>>
>>> Even for apparently-simple use cases like string value parameterization
>>>in
>>> content (e.g., product names or whatever), string macros fail because
>>>they
>>> cannot be related to specific use contexts. When you push on the
>>> requirements for reuse you quickly realize that only application-level
>>> processing gives you the flexibility and opportunities required to
>>> properly implement re-use requirements, in particular, providing the
>>> correct resolution for a given use in a given use context.
>>>
>>> The solution was in HyTime, namely the content reference link type,
>>>which
>>> was a link with the base semantic of use by reference. Because it is a
>>> link it is handled in the application domain, not the parsing domain.
>>>This
>>> is transclusion as envisioned by Ted Nelson.
>>>
>>> You see this in DITA through DITA's content reference facility and the
>>> map-and-topic architecture, both of which use hyperlinks to establish
>>> reuse relationships. With DITA 1.3 the addressing mechanism is
>>> sufficiently complete to satisfy most of the requirements (the only
>>> missing feature is indirection for references to elements within
>>>topics,
>>> but I defined a potential solution that does not require any
>>>architectural
>>> changes to DITA, just additional processing applied to specific
>>> specializations).
>>>
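>>> For anyone unfamiliar with it, a conref is an ordinary attribute-based
>>> link that the application, not the parser, resolves, e.g. (the IDs are
>>> invented):
>>>
>>>   <p conref="common.dita#reuse-topic/product-name"/>
>>>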
>>> I'm not aware of any other documentation XML application that has the
>>> equivalent use-by-reference features, but DITA is somewhat unique in
>>>being
>>> driven primarily by re-use requirements, which is not the case for
>>>older
>>> specifications like DocBook, NLM/JATS, and TEI. Of course, there's no
>>> barrier to adding similar features to any application. However, there
>>>are
>>> complications and policy considerations that have to be carefully
>>>worked
>>> out, such as what are the rules for consistency between referencing and
>>> referenced elements? DITA has one policy, but it may not be the best
>>> policy for all use cases.
>>>
>>> On DTDs and grammars in general:
>>>
>>> I do not say that DTDs (or grammars in general) are evil.
>>>
>>> I only say that the way people applied them was (and is) misguided
>>>because
>>> they misunderstood (or willfully ignored in the face of no better
>>> alternative) their limitations as a way to associate documents with
>>>their
>>> abstract document types. Of course DTDs and grammars in general have
>>>great
>>> value as a way of imposing some order on data as it flows through its
>>> communication channels and goes through its life cycle.
>>>
>>> But grammars do not define document types.
>>>
>>> At the time namespaces were being defined I tried to suggest some
>>>standard
>>> way to identify abstract document types separate from any particular
>>> implementation of them: basically a formal document that says "This is
>>> what I mean by abstract document type 'X'". You give it a URI so it can
>>>be
>>> referred to unambiguously and you can connect whatever additional
>>> governing or augmenting artifacts to it you want. By such a mechanism
>>>you
>>> could have as complete a definition of a given abstract document type
>>>as
>>> you wanted, including prose definitions as well as any number of
>>> implementing artifacts (grammars, Schematrons, validation applications,
>>> phone numbers to call for usage advice, etc.).
>>>
>>> But of course that was too heavy for the time (or for now). Either
>>>people
>>> simply didn't need that level of definitional precision or they used
>>>the
>>> workaround of pointing in the other direction, that is, by having
>>> specifications that say "I define what abstract document type 'X' is".
>>>
>>> This was in the context of the problem that namespace names don't
>>>point
>>> to anything: people had the idea that namespace names told you
>>>something
>>> but we were always clear that they did not--they were simply magic
>>>strings
>>> that used the mechanics of URIs to ensure that you have a
>>> universally-unique name.
>>>
>>> But the namespace tells you nothing about the names in the space (that
>>>is,
>>> what is the set of allowed names, where are their semantics and rules
>>> defined, etc.). The namespace spec specifically says "You should not
>>> expect to find anything at the end of the namespace URI and you should
>>>not
>>> try to resolve it".
>>>
>>> So if the namespace name is not the name of the document type, what is?
>>>I
>>> wanted there to be one because I like definitional completeness.
>>>
>>> But in fact it's clear now that that level of completeness is either
>>>not
>>> practical or is not sufficiently desired to make it worth trying to
>>> implement it.
>>>
>>> So we're where we were 30 years ago: we have grammar definitions for
>>> documents but we don't have a general way to talk about abstract
>>>document
>>> types as distinct from their implementing artifacts (grammars,
>>>validation
>>> processors, output processors, prose definitions, etc.).
>>>
>>> But experience has shown that it's not that big of a deal in practice.
>>>In
>>> practice, having standards or standards-like documents is sufficient
>>>for
>>> those cases where it is important.
>>>
>>> As far as addressing the problem that the reference from a document
>>> instance to a grammar in fact tells you nothing reliable, a solution is
>>>what
>>> DITA does: stop caring about the grammar as an artifact and care only
>>> about the set of (abstract) vocabulary modules the document says it
>>>(may)
>>> use. That is, actually declare the abstract document type in an
>>> unambiguous way and worry about validation details separately.
>>>
>>> DITA does this as follows:
>>>
>>> 1. Defines an architecture for layered vocabulary.
>>>
>>> The DITA standard defines an invariant and mandatory set of base
>>>element
>>> types and a mechanism for the definition of new element types in terms
>>>of
>>> the base types. All conforming DITA element types and attributes MUST
>>>be
>>> based on one of the base types (directly or indirectly) and must be at
>>> least as constrained as the base type (that is, you can't relax
>>> constraints). This is DITA specialization. It ensures that all DITA
>>> documents are minimally processable in terms of the base types (or any
>>> known intermediate types). It allows for reliable interoperation and
>>> interchange of all conforming DITA documents. Because the definitional
>>> mechanism uses attributes it is not dependent on any particular grammar
>>> feature in the way that HyTime is. Any normal XML processor (including
>>>CSS
>>> selectors) can get access to the definitional base of any element and
>>>thus
>>> do what it can with it. The definitional details of an element are
>>> specified on the required @class attribute, e.g. class="- topic/p
>>> mydomain/my-para ", which reflects a specialization of the base type
>>>"P"
>>> in the module "topic" by the module "mydomain" with the name "my-para".
>>> Any general DITA-aware processor can thus process "my-para" elements
>>>using
>>> the rules for "p" or, through extension, can have "mydomain/my-para"
>>> processing, which might be different. But in either case you'll get
>>> something reasonable as a result.
>>>
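>>> A sketch of what that looks like in practice (the CSS rule is
>>> illustrative):
>>>
>>>   <my-para class="- topic/p mydomain/my-para ">...</my-para>
>>>
>>>   /* matches every specialization of topic/p, my-para included */
>>>   *[class~="topic/p"] { display: block; }
>>>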
>>> 2. Defines a modular architecture for vocabulary such that each kind of
>>> vocabulary definition (map types, topic types, or mix-in "domains")
>>> follows a regular pattern. There is no sense of "a" DITA DTD, only
>>> collections of modules that can be combined into document types (both
>>>in
>>> the abstract sense of "DITA document type" and in the implementation
>>>sense
>>> of a "a working grammar file that governs document instances that use a
>>> given set of modules").
>>>
>>> DITA requires that a given version in time of a module is invariant,
>>> meaning that every copy of the module should be identical to every
>>>other
>>> (basically, you never directly modify a vocabulary module's grammar
>>> implementation). Each module is given a name that should be globally
>>> unique, or at least unique within its expected scope of use. Experience
>>> has shown us that it's actually pretty easy to ensure practical
>>>uniqueness
>>> just by judicious use of name prefixes and general respect for people's
>>> namespaces. No need to step up to full GUID-style uniquification à la
>>> XML namespaces.
>>>
>>> In addition to vocabulary modules, which define element types or
>>> attributes, you can have "constraint modules", which impose constraints
>>>on
>>> vocabulary defined in other modules. Constraint modules let you further
>>> constrain the vocabulary without the need to directly modify a given
>>> module's grammar definition. Again, the rule is that you can only
>>> constrain, you can't relax.
>>>
>>> 3. Defines a "DITA document type" as a unique set of modules,
>>>identified
>>> by module name. If two DITA documents declare the use of the same set
>>>of
>>> modules then by definition they have the same DITA document type. This
>>> works because of rule (2): all copies of a given module must be
>>>identical.
>>> So it is sufficient to simply identify the modules. In theory one could
>>>go
>>> from the module names to some set of implementations of the modules
>>> although I don't know of any tools that do that because in practice
>>>most
>>> DITA documents have associated DTDs that already integrate the grammars
>>> for the modules being used. But it is possible. The DITA document type
>>>is
>>> declared on the @domains attribute, which is required on DITA root
>>> elements (maps and topics).
>>>
>>> Note that you could have a conforming DITA vocabulary module that is
>>>only
>>> ever defined in prose. As long as documents reflected the types
>>>correctly
>>> in the @class attributes and reflected the module name in the @domains
>>> attribute the DITA definitional requirements are met. It would be up to
>>> tool implementors to do whatever was appropriate for their domain (which
>>> might be nothing if your vocabulary exists only to provide
>>>distinguishing
>>> names and doesn't require any processing different from the base).
>>>Nobody
>>> would do this *but they could*.
>>>
>>> Thus DITA completely divorces the notion of "document type" from any
>>> implementation details of grammar, validation, or processing, with the
>>> clear implication that there had better be clear documentation of what a
>>>given
>>> vocabulary module is.
>>>
>>> Cheers,
>>>
>>> E.
>>> ----
>>> Eliot Kimber, Owner
>>> Contrext, LLC
>>> http://contrext.com
>>>
>>>
>>>
>>>
>>> On 5/4/16, 11:06 AM, "Steve Newcomb" <srn@coolheads.com> wrote:
>>>
>>>> Eliot,
>>>>
>>>> In order to avoid potential misunderstandings, I think it might be
>>>>worth
>>>> clarifying your position on the following points:
>>>>
>>>> (1) Resolved: the whole idea of entity identity was a mistake, is
>>>> worthless, and is evil.
>>>>
>>>> (2) Resolved: the whole idea of document type identity was a mistake,
>>>>is
>>>> worthless, and is evil.
>>>>
>>>> I have deliberately made these statements extreme and obviously silly
>>>>in
>>>> order to dramatize the fact that, even though there are problems with
>>>> SGML's and/or XML's operational approaches to them, we cannot discard
>>>> these ideas altogether.  The ideas themselves remain profound and
>>>> necessary.  They will always be needed.  The usefulness of their
>>>>various
>>>> operational prostheses will always be limited to certain cultural
>>>> contexts.  Even within their specific contexts, those prostheses will
>>>> always be imperfect.  They will always require occasional repair and
>>>> replacement, in order that they remain available for use even as that
>>>> context's notions of "entity", "document", and "identity" continue to
>>>> evolve and diversify.
>>>>
>>>> The operational prostheses with which these ideas were fitted at
>>>>SGML's
>>>> birth are things of their time.  That was then, this is now, and "time
>>>> makes ancient good uncouth".  Their goodness in their earlier context
>>>>is
>>>> a matter of record; they were used, a lot, for a lot of reasons and in
>>>>a
>>>> lot of ways.  At the time, it was not stupid or evil to make the
>>>>notion
>>>> of document type identity depend on the notion of entity identity, nor
>>>> was it stupid or evil to make the notion of entity identity dependent
>>>>on
>>>> PUBLIC identifiers.  And in many ways, it still isn't.  What is your
>>>> proposed alternative, and why is it better?
>>>>
>>>> Steve
>>>>
>>>> On 05/04/2016 11:23 AM, Eliot Kimber wrote:
>>>>> SGML requires the use of a DTD--there was no notion of a "default"
>>>>>DTD.
>>>>> This requirement was, I'll argue, the result of a fundamental
>>>>>conceptual
>>>>> mistake--understandable at the time but a mistake nevertheless.
>>>>>
>>>>> The conceptual mistake that SGML made was conflating the notion of an
>>>>> abstract "document type" with the grammar definition for (partially)
>>>>> validating documents against that document type. That is, SGML saw
>>>>>the
>>>>> DTD
>>>>> as being equal to the definition of the "document type" as an
>>>>> abstraction.
>>>>> But of course that is nonsense. There was (remains today) the
>>>>>misguided
>>>>> notion that a reference to an external DTD subset somehow told you
>>>>> something actionable about the document you had. But of course it
>>>>>tells
>>>>> you nothing reliable because the document could define its "real"
>>>>>DTD
>>>>> in
>>>>> the internal subset or the local environment could put whatever it
>>>>>wants
>>>>> at the end of the public ID the document is referencing.
>>>>>
>>>>> Consider this SGML document:
>>>>>
>>>>> <!DOCTYPE notdocbook PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" [
>>>>>     <!ELEMENT notdocbook ANY >
>>>>>     <!ELEMENT bogus ANY >
>>>>> ]>
>>>>> <notdocbook>
>>>>>     <bogus><para>This is not a DocBook document</para></bogus>
>>>>> </notdocbook>
>>>>>
>>>>> This document will be taken as a DocBook document by any tool that
>>>>> thinks
>>>>> the public ID means something. But obviously it is not a DocBook
>>>>> document.
>>>>> It is, however, 100% DTD valid. QED DTDs are useless as tools of
>>>>> document
>>>>> type definition. The only reason the SGML (and now XML world) didn't
>>>>> collapse under this fact is that the vast majority of SGML and XML
>>>>> authoring and management tools simply refused to preserve internal
>>>>> subsets
>>>>> (going back to the discussion about DynaBase's problems with entity
>>>>> preservation).
>>>>>
>>>>> Standoff grammars like XSD and RELAX NG at least avoid the problem of
>>>>> internal DTD subsets but they still fail to serve as reliable
>>>>> definitions
>>>>> of document types in abstract because they are still only defining
>>>>>the
>>>>> grammar rules for a subset of all possible conforming documents in a
>>>>> document type.
>>>>>
>>>>> Because of features like tag omission, inclusion exceptions, and
>>>>>short
>>>>> references, it was simply impossible to parse an SGML document
>>>>>without
>>>>> having both its DTD and its SGML declaration (which defined the
>>>>>lexical
>>>>> syntax details). There is a default SGML declaration, but not a
>>>>>default
>>>>> DTD.
>>>>>
>>>>> A lot of what we did in XML was remove this dependency by having a
>>>>>fixed
>>>>> syntax and removing all markup minimization except attribute
>>>>>defaults.
>>>>>
>>>>> XML does retain one markup minimization feature, attribute defaults.
>>>>> Fortunately, both XSD and RELAX NG provide alternatives to DTDs for
>>>>> getting default attribute values.
>>>>>
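>>>>> For instance, the same default expressed in a DTD and in XSD (the
>>>>> names are invented):
>>>>>
>>>>>   <!ATTLIST chapter status CDATA "draft">
>>>>>
>>>>>   <xs:attribute name="status" type="xs:string" default="draft"/>
>>>>>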
>>>>> Cheers,
>>>>>
>>>>> Eliot
>>>>> ----
>>>>> Eliot Kimber, Owner
>>>>> Contrext, LLC
>>>>> http://contrext.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 5/4/16, 6:16 AM, "Norman Gray" <norman@astro.gla.ac.uk> wrote:
>>>>>
>>>>>> Greetings.
>>>>>>
>>>>>> (catching up ...)
>>>>>>
>>>>>> On 29 Apr 2016, at 17:58, John Cowan wrote:
>>>>>>
>>>>>>> On Fri, Apr 29, 2016 at 8:54 AM, Norman Gray
>>>>>>><norman@astro.gla.ac.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>> In the XML world, the DTD is just for validation
>>>>>>>
>>>>>>>
>>>>>>> That turns out not to be the case.  There are a number of XML DTD
>>>>>>> features
>>>>>>> which affect the infoset returned by a compliant parser.  If they
>>>>>>>are
>>>>>>> in
>>>>>>> the internal subset, the parser MUST respect them;
>>>>>> I stand corrected; I was sloppy.  I think this doesn't change my
>>>>>> original point, however, which was that in SGML the DTD was integral
>>>>>>to
>>>>>> the document, and to the parse of the document, and that it's easy
>>>>>>to
>>>>>> forget this after one has got used to two decades of XML[1].  I
>>>>>>can't
>>>>>> remember if there was a trivial or default DTD which was assumed in
>>>>>>the
>>>>>> absence of a declared one, in the same way that there was a default
>>>>>> SGML
>>>>>> Declaration, but taking advantage of that would probably have been
>>>>>> regarded as a curiosity, rather than normal practice.
>>>>>>
>>>>>> In XML, in contrast, the DTD has a more auxiliary role, and at a
>>>>>>first
>>>>>> conceptual look, that role is validation (even though -- footnote!
>>>>>>--
>>>>>> it
>>>>>> may change other things about the parse as well).  Thus _omitting_
>>>>>>an
>>>>>> XML DTD (or XSchema) is neither perverse nor curious.
>>>>>>
>>>>>> Practical aspect: When I'm writing XML, I use a DTD (in whatever
>>>>>> syntax)
>>>>>> to help Emacs tell me if the document is valid, but I don't even
>>>>>>know
>>>>>> whether the XML parsers I use are capable of using a DTD external
>>>>>> subset.  That careless ignorance would be impossible with SGML.
>>>>>>
>>>>>> The rational extension of that attitude, of course, is MicroXML,
>>>>>>which
>>>>>> (as you of course know) doesn't use any external resources at all,
>>>>>>and
>>>>>> doesn't care about validation.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Norman
>>>>>>
>>>>>>
>>>>>> [1] Hang on, _two_ decades?!  I've just checked and ... 1996 doesn't
>>>>>> seem that long ago.
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Norman Gray  :  https://nxg.me.uk
>>>>>> SUPA School of Physics and Astronomy, University of Glasgow, UK
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>
>



