In <1021562444.984.6178.camel@localhost.localdomain>, "Simon St.Laurent"
<simonstl@simonstl.com> wrote:
| It's striking me more and more that developers, myself included, have
| done a poor job of examining and explaining how markup works and what
| the parts do best. That extends to a key discussion which is generally
| considered dull but radioactive: the elements/attributes distinction.
The litmus test is whether one thinks these are or should be equivalent:
1) <foo bar="baz"/>
2) <foo><bar>baz</bar></foo>
I would say that the job so far, poor or not, has almost entirely been one
of propounding the equivalence view.
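To make the litmus test concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The helper name summarize is made up for this illustration. It only shows that a generic parser hands back two different structures for the two forms; any equivalence has to be supplied by the application.

```python
import xml.etree.ElementTree as ET

# The two forms from the litmus test above.
attr_form = ET.fromstring('<foo bar="baz"/>')
elem_form = ET.fromstring('<foo><bar>baz</bar></foo>')


def summarize(elem):
    """Return (tag, attributes, text, child summaries) for an element.

    Hypothetical helper, used here only to compare the two shapes.
    """
    return (
        elem.tag,
        dict(elem.attrib),
        (elem.text or "").strip(),
        [summarize(child) for child in elem],
    )


print(summarize(attr_form))   # "baz" appears as an attribute value
print(summarize(elem_form))   # "baz" appears as text of a child element
```

In the first form "baz" is not part of the text content at all; in the second it is.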
| A lot of people have been storing data in attributes rather than in
| element content. There are a lot of reasons for this, ranging from a more
| compact form to simpler processing in SAX.
And, of course, Keeping Things Safe For Netploder. Is there some taboo on
mentioning this?
| To some extent, the misuse arose because attributes had features
| (defaulting, free order, some types, enumeration) that elements didn't
| have. W3C XML Schema condones those practices for attributes and
| extends the same features to elements. Maybe this is an improvement,
| maybe it isn't.
Taking the minority view, I would say that it isn't. That is, rather than
trying to unify attributes and (sub)elements - especially those that wind
up with the moral equivalent of (#PCDATA) content models - it may be more
fruitful to keep them distinct.
| Separating markup from content - and putting attributes squarely in the
| markup side - seems like one means of at least alleviating the headache.
Well, that's how it all started (see, e.g., [1]). My personal rule of thumb
has always been "elements for analysis, attributes for annotation". The
key is the sense in which attributes are not directly "analytic". In my
own attempts to explain this to (computer-savvy?) people, I've often drawn
a parallel with parsing theory, based on the similarities between content
models and BNFs (extended regular grammars).
Given a set of production rules, a successful parse yields a parse tree
with nonterminals as nodes and terminals as leaves. With one twist, the
SGML/XML serialization of such a parse tree is obvious. (The twist is in
the treatment of what are *taken* to be terminals, in that programs such
as Bison allow terminals of two kinds: variables instantiated by a lexer,
and string constants. The former actually correspond to #PCDATA elements
with obvious expansions, the latter to text directly.)
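The serialization described above can be sketched roughly as follows. This is an illustration under my own assumptions, not anything Bison produces: the classes Node and Token, the function serialize, and the toy "sum" grammar are all hypothetical names. Nonterminals become elements, lexer-instantiated terminals become elements with character-data content, and string constants become text directly.

```python
from dataclasses import dataclass, field
from typing import List, Union
from xml.sax.saxutils import escape


@dataclass
class Token:
    """A terminal whose text was supplied by the lexer, e.g. NUMBER."""
    name: str
    text: str


@dataclass
class Node:
    """A nonterminal with its children in left-to-right order."""
    name: str
    children: List[Union["Node", Token, str]] = field(default_factory=list)


def serialize(item):
    """Serialize a parse tree to XML text (hypothetical sketch)."""
    if isinstance(item, str):
        # String constant from the grammar: emitted as text directly.
        return escape(item)
    if isinstance(item, Token):
        # Lexer-instantiated terminal: an element holding only character data.
        return f"<{item.name}>{escape(item.text)}</{item.name}>"
    # Nonterminal: an element wrapping its serialized children.
    inner = "".join(serialize(child) for child in item.children)
    return f"<{item.name}>{inner}</{item.name}>"


# Parse tree for the input "1+2" under a toy rule: sum -> number '+' number
tree = Node("sum", [Token("number", "1"), "+", Token("number", "2")])
print(serialize(tree))  # <sum><number>1</number>+<number>2</number></sum>
```

Stripping the tags from the output gives back the original input "1+2", with no attributes involved anywhere.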
The basic outcome is a complete partitioning of the data into a hierarchy
of semantically meaningful categories. Turning this around, an SGML/XML
instance basically represents a *complete parsing* of its text content.
That is, while the problem in parsing theory is to recognize input, the
primary intent of generalized markup is to express the result of a prior
process of recognition in the same formalism of parse trees. Pushing the
analogy further, where attributes make their appearance in the semantic
processing of parse trees, markup-attributes are very similar to inherited
(as opposed to synthesized) parse-attributes.
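As a rough illustration of the "inherited" reading, the sketch below passes an attribute value down the tree from parent to child, the way an inherited parse-attribute is computed top-down. The attribute name lang, the sample document, and the function effective are assumptions of mine; nothing here is mandated by SGML or XML themselves.

```python
import xml.etree.ElementTree as ET


def effective(elem, name, inherited=None):
    """Yield (element, value) pairs, passing `name` down from ancestors.

    An element's own attribute overrides the value it inherits; otherwise
    the value flows unchanged from the parent (hypothetical sketch).
    """
    value = elem.get(name, inherited)
    yield elem, value
    for child in elem:
        yield from effective(child, name, value)


doc = ET.fromstring(
    '<doc lang="en"><p>Hello</p><p lang="fr">Bonjour <em>monde</em></p></doc>'
)
result = [(e.tag, v) for e, v in effective(doc, "lang")]
print(result)  # [('doc', 'en'), ('p', 'en'), ('p', 'fr'), ('em', 'fr')]
```

The attribute annotates a subtree without dividing the text into parts; the text content of the document is the same whether or not the attributes are present.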
The basic lesson: Do not use attributes to *analyse* wholes into parts.
[1] http://www.sgmlsource.com/history/AnnexA.htm