Alaric B Snell wrote:
> There are several data models (and, hence, equivalence tests) in circulation
> for XML - but every application uses at least one of them... Anyone using XSLT
> will be looking at the XPath tree model; likewise, CSS users will be looking
> at something similar. Application developers will be looking at DOM or SAX or
> something else.
>
> So yes, what the XML encodes is the application's problem - but without
> applications to read it, or at least the possibility of applications to read
> it, a bit of XML is just a string of bits, with no meaning or purpose...
>
> Note that I'd include being printed out on paper and read by a human as an
> application. The 'human' data model of XML will, at heart, be a woolly mix of
> SAX and DOM since one will probably look at a small snippet of a few nested
> elements on a few lines as a single tree-structure unit, but will otherwise
> read the file from top to bottom ;-)
>
> I mean, a string of bits is an encoding, although to decide 'of what' you'd
> need to (in general) hunt down where it came from and ask, unless it was an
> encoding you happened to recognise.
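Alaric's point is easily made concrete. The same instance presents itself
quite differently depending on the model through which it is read; a minimal
Python sketch of my own, using only the standard library (the document
itself is hypothetical):

    import xml.sax
    from xml.dom.minidom import parseString

    doc = '<poem><line>Arma virumque cano</line></poem>'

    # The DOM view: the whole instance held at once as a single tree.
    dom = parseString(doc)
    print(dom.documentElement.tagName)             # poem
    print(len(dom.getElementsByTagName('line')))   # 1

    # The SAX view: the same instance as a top-to-bottom event stream.
    class Echo(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            print('start:', name)
        def endElement(self, name):
            print('end:', name)

    xml.sax.parseString(doc.encode('utf-8'), Echo())

Each model imposes its own notion of what the document 'is', and hence its
own test of equivalence.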
Having officially embraced Namespaces and the Infoset, XML years ago forfeited
its defenses against the 'abstract syntax' arguments of the ASN.1 partisans. The
only meaningful way to draw a line separating XML from the abstract syntax camp
is to insist that--at the level of the parseable document--there is no
equivalence except character-by-character lexical equivalence. That is a
fundamental distinction, and one worth insisting upon for the extraordinarily
useful consequences it delivers. Now, granted, the simplest of compliant XML 1.0
parsers will make no distinction between single and double quotes delimiting
attribute values or between various manifestations of whitespace, but accepting
such abstract equivalence is the price of compromise necessary to have an agreed
XML 1.0 Recommendation implementable in a parser. Choose stricter premises and
you will find it necessary to implement something like Simon St.Laurent's
Ripper. However, for the benefits of a global XML community and the bounty of
general-purpose tools it produces, many of us have accepted those particular
assertions of abstract equivalences which are incorporated in the XML 1.0
Recommendation. We accept them as specific, enumerated exceptions to an
otherwise prevailing rule, and as identified exceptions they in fact confirm
that the rule holds wherever they do not apply.
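Those enumerated equivalences are easy to demonstrate with any compliant
parser. A minimal Python sketch of my own (the element and attribute names
are hypothetical): two instances which differ lexically in quote style and
in the use of a character reference, yet which a parser reports as identical
after attribute-value normalization:

    import xml.etree.ElementTree as ET

    a = ET.fromstring('<p q="alpha&#32;beta">text</p>')
    b = ET.fromstring("<p q='alpha beta'>text</p>")

    # Byte for byte the two instances differ, but the parser reports the
    # same names and the same normalized attribute values for both.
    assert a.tag == b.tag and a.attrib == b.attrib and a.text == b.text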
Namespaces, even if we accept only the simplest argument for them as a
mechanism of disambiguation, necessarily imply that there is a level of
abstract equivalence at which lexically identical GIs must be disambiguated
as belonging to different (abstract) vocabularies. That implies the
converse: lexically distinct GIs may in fact be understood as equivalent
once it is accepted that they are separate manifestations, within different
namespaces, of a single abstraction. If we embrace namespaces, this abstract
equivalence becomes a fundamental rule which displaces the very different
premise of XML 1.0. Likewise the Infoset, however light an abstraction of
the instance syntax its original proponents insisted on, introduces
abstraction as the general rule of which lexically variant instances are
equivalent manifestations. Reversing the fundamental assumption of a
lexically grounded XML, that abstraction is a slippery slope on which there
is no meaningful point at which to draw a distinction between XML and any
platonic abstract syntax such as ASN.1.
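Both halves of that implication can be observed directly in any
namespace-aware parser. A minimal Python sketch of my own (the namespace
URIs are hypothetical):

    import xml.etree.ElementTree as ET

    # Lexically identical GIs, disambiguated into different vocabularies:
    book = ET.fromstring('<title xmlns="http://example.org/book"/>')
    film = ET.fromstring('<title xmlns="http://example.org/film"/>')
    assert book.tag != film.tag  # {http://example.org/book}title vs. film

    # Lexically distinct GIs, equivalent as manifestations of one name:
    pfx = ET.fromstring('<b:title xmlns:b="http://example.org/book"/>')
    assert pfx.tag == book.tag   # the prefix vanishes into the abstraction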
I have made these arguments here before, for a number of years now. I
apologize if my emphasis on these points becomes tiresome, but I think that
the struggle of fifty years or more which classical philology required in
the twentieth century to understand the nature of oral poetry demonstrates
why the physical, rather than any abstract, nature of a text is worth
insisting upon. Texts can and do 'encode' physical properties, among them
rhythm, scansion and various devices of assonance. Once encoded in an
instance text, these qualities are inherent: they require neither markup nor
other metadata to impute them, nor can they be removed from the text by
markup instructions. Beginning from
the concrete instance gives us these properties in an unambiguous and clearly
perceived form. On the other hand, to begin from an abstraction of syntax is to
forfeit a concrete means of conveying those properties and to be forced to rely
on metadata--that is, metadata to the abstraction!--if they are to be
communicated at all. And of course any recipient of that metadata is free to
ignore it in realizing some physical instantiation from the text. For the
cases I care most about, ASN.1 and abstract syntax generally are incapable
of a precise and unambiguous encoding of inherent, fundamental textual
properties without resorting to a priori agreements between the creator and
the consumer of a document, and from the very nature of document processing
such agreements are unreliable and easily neglected.
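The loss is easy to exhibit: route an instance through a tree abstraction
and back, and the lexical choices its author made are gone. A minimal
Python sketch of my own (the instance is hypothetical; attribute quoting
and spacing stand in, crudely, for the physical properties at issue):

    import xml.etree.ElementTree as ET

    src = "<l n='1'  meter='hexameter'>Arma virumque cano</l>"
    out = ET.tostring(ET.fromstring(src), encoding='unicode')

    print(out)  # <l n="1" meter="hexameter">Arma virumque cano</l>

    # The abstract content survives the round trip; the quote style and
    # spacing the author chose do not, and nothing in the abstraction
    # remains from which to restore them.
    assert out != src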
This is the fundamental distinction of document and data to which all
permathreads return, but on which I think the recent championing of ASN.1 on
xml-dev gives us a useful new perspective. Can't we now assert that what is
fundamentally data is that of which the most salient properties are abstract?
That is, different lexical manifestations are understood by both their creator
and their consumer to be secondary to some abstract underlying platonic reality
and, conversely, the physical qualities which might be inherent in a particular
lexical manifestation are understood by both creator and consumer to be spurious
and negligible. The content of documents, on the other hand, most specifically
includes, often as the chief concern, those characteristics which come with the
lexical manifestation and cannot be purged from the physical realization.
Respectfully,
Walter Perry