   Re: [xml-dev] xml taxonomy

Rick and Len:

Disclaimer: The following is my own terminology that helps me sort out the
world.  I'm not trying to impose it on anyone, just sharing it for what it's worth.

I distinguish the following:

1. "atomic/electronic document" and "xml document"

2. "messages/protocols", "forms", and "documents"

3. "tight" versus "loose" schema

4. "data dictionary", "schema" and "schema framework"

1. "atomic document/electronic document" and "xml document"
"XML document" is an xml document as defined by W3C's XML 1.x

By "atomic" or "electronic" document, I mean one electronic file that has
both style and data information in it, whether xml or not.  For example, a
MS Word, Word Perfect, or PDF document is an atomic/electronic document.
An XML document can also be an atomic/electronic document.  However, if
there is a stylesheet necessary to render the document, then I do not
consider *two* separate files (xml and stylesheet) to be an
atomic/electronic document.  I would only consider an XML document to be
atomic/electronic if the data and style are in the same file.  IMPORTANT:
This is not to say that I advocate mixing data and style, because I do not.
However, it is possible to put both style information and data into the same
electronic file and still keep them separate.  Some "electronic/atomic" document
formats separate data/style better than others.  Separation is always better
in my view, even if the style and data are in the same electronic file.
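
To illustrate the point about keeping style and data apart inside one file, here is a minimal Python sketch.  The element names (document, style, data, rule) and the sample content are invented for this example and do not come from any standard or from the format discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-file ("atomic") document: style and data share one
# file but sit in separate subtrees, so either can change independently.
ATOMIC_DOC = """
<document>
  <style>
    <rule target="caption" font-weight="bold"/>
    <rule target="party" font-style="italic"/>
  </style>
  <data>
    <caption>Smith v. Jones</caption>
    <party>Smith</party>
    <party>Jones</party>
  </data>
</document>
"""

def split_style_and_data(xml_text):
    """Return (style rules keyed by target element name, list of data items)."""
    root = ET.fromstring(xml_text)
    rules = {}
    for rule in root.find("style"):
        props = dict(rule.attrib)
        target = props.pop("target")   # which data element the rule applies to
        rules[target] = props
    items = [(el.tag, el.text) for el in root.find("data")]
    return rules, items

rules, items = split_style_and_data(ATOMIC_DOC)
```

A message-style consumer could read only the data subtree and ignore the style subtree entirely, which is the kind of separation described above.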

2. "messages/protocols", "forms", and "documents"
Generally, a "message" is a machine-to-machine data transfer (e.g., from one
database to another database).  The order in which the data appears and a
precise visual representation is not important.  What is important is moving
data from one system to another system.  I find this is where people care
the most about capturing "relational" structures in XML, because most often
there is legacy data flowing from one relational database into another
relational database.   (XML provides hierarchical structure, so it is not
always intuitive how to capture "relational" structures . . . but this is
another discussion.)

A "protocol" is a series of messages that follow one of many
request/response patterns.   For example, a filing xml might require a
confirmation xml.
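
A minimal Python sketch of that filing/confirmation pattern follows.  The element and attribute names (filing, confirmation, filingId, status) are assumptions made up for illustration, not part of any real filing standard.

```python
import xml.etree.ElementTree as ET

def confirm_filing(filing_xml):
    """Build a confirmation message that answers one filing message."""
    filing = ET.fromstring(filing_xml)
    if filing.tag != "filing":
        raise ValueError("expected a <filing> message")
    confirmation = ET.Element("confirmation")
    # Echo the filing's id so the sender can match response to request.
    confirmation.set("filingId", filing.get("id", ""))
    ET.SubElement(confirmation, "status").text = "received"
    return ET.tostring(confirmation, encoding="unicode")

response = confirm_filing('<filing id="F-1"><title>Complaint</title></filing>')
```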

The important point here is that a message generally does not require a
stylesheet or, if it does, one or several stylesheets might show different
views/subsets of the information, but the order and the format of the final
output is not important.

A "form" is similar to a "message" in that data in a form is highly
structured and "fill-in-the-blank."  Forms are different from messages
because the visual representation of the form is important to a human user.
Forms technologies, therefore, must pay good attention to the final output.
In this area, users tend to want the "electronic" form to look exactly like
the "paper" form.

A "form" is a type of "document"; however, not all "documents" are "forms."
A "form" is primarily "fill-in-the-blank" data.  In contrast, a "document"
includes "prose."  "Prose" is free flowing text, structured as headings,
paragraphs, and outlines/lists.  A "document" can also contain
"fill-in-the-blank" data, but, again, unlike a "form," it includes prose.

In the area of legal forms and documents, in which I work, a form might be,
for example, a "coversheet" on a pleading, whereas the "pleading" would be a
"document."  A brief supporting the pleading would also be a "document."

Note (to add a bit of confusion): In legal practice, lawyers use "form
books" -- which are templates for making legal claims, such as fraud or
medical malpractice.  These "form books" contain what I define above as a
"document".  Also, in some states, especially California, state government
has codified the format of certain traditional "documents" (form books) into
what it calls "forms" (to be more precise - judicial council forms).

Examples of legal documents include court/justice documents, transcripts,
legislative documents, contracts, treaties and letters.

There is a fine line between a "form" and a "document."  To borrow an
analogy from a Supreme Court Justice, distinguishing forms/documents is like
pornography -- you know it when you see it.

3. "tight" versus "loose" schema
A "tight" schema is a schema that precisely validates data in an xml
document.  Qualities of a "tight" schema, for example, are that it is
neither underinclusive nor overinclusive.   Elements have strict/precise
content models.  For example, no mixed content; precise use of
sequence/choice elements; precise, well-defined enumerations, min/max
occurs, data types, other facets.

A "loose" schema is the opposite of a "tight" schema.  For example, there may be
overinclusive elements, there may be mixed content, there may be many
"string/text" nodes that do not define enumerations, min/max occurs, data
types, other facets.

Different applications need schema that are "tighter" or "looser" than
others.  In practice, I find that there is a continuum from tight to loose
when one moves from messages to forms to documents.  That is, message
formats tend to be very "tight" whereas document formats tend to be much
more "loose".  Forms are somewhere in the middle.
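
The contrast can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the rules are hand-written checks rather than a real DTD, XML Schema, or Relax NG schema, and the element names (case, type, party) and the enumeration values are hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_CASE_TYPES = {"civil", "criminal"}   # hypothetical enumeration

def valid_loose(root):
    """Loose: only the root name is checked; any children and text pass."""
    return root.tag == "case"

def valid_tight(root):
    """Tight: exact child sequence, no mixed content, enumeration, bounds."""
    if root.tag != "case":
        return False
    # No mixed content: text directly inside <case> must be blank.
    if (root.text or "").strip():
        return False
    children = list(root)
    if any((c.tail or "").strip() for c in children):
        return False
    tags = [c.tag for c in children]
    # Sequence: exactly one <type>, followed only by <party> elements.
    if len(tags) < 2 or tags[0] != "type" or any(t != "party" for t in tags[1:]):
        return False
    # Occurrence bounds: between one and three <party> elements.
    if not 1 <= len(tags) - 1 <= 3:
        return False
    # Enumeration facet on <type>.
    return children[0].text in ALLOWED_CASE_TYPES

good = ET.fromstring("<case><type>civil</type><party>Smith</party></case>")
sloppy = ET.fromstring("<case>see attached <type>other</type><note>n/a</note></case>")
```

Both instances pass the loose check, but only the first passes the tight one; the second has mixed content, an unknown child element, and a value outside the enumeration.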

4. "data dictionary", "schema" and "schema framework"
A "schema" is a DTD, XML Schema, Relax NG Schema, or the like.

A "data dictionary" is a set of defined terms.  In my view, it does not (or
should not) mandate, define, or require a data structure (such as a
relational structure or a hierarchical structure).   Most data dictionaries
that I run across in my work are petrified in paper documents or MS
Word/Word Perfect/HTML/PDF documents.  This is unfortunate, because it
greatly limits usability.  Every once in a while, I'll get lucky and find a
data dictionary in an electronic spreadsheet or in a database.   A good
dictionary will have a lot of terms in it.  If it contains synonyms, then
there will be some mechanical means to determine that two terms are
synonyms.  I would expect an XML data dictionary to be in a simple XML
format that shows simple relationships among terms or in RDF or perhaps one
of the emerging ontology formats.
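
As one possible reading of "a simple XML format that shows simple relationships among terms," here is a minimal Python sketch.  The format itself (dictionary, term, definition, synonym elements) and the sample entries are invented for illustration; they are not taken from an existing dictionary standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical data dictionary: terms and definitions only, no content models.
DICTIONARY_XML = """
<dictionary>
  <term name="plaintiff">
    <definition>The party who brings a civil action.</definition>
    <synonym ref="complainant"/>
  </term>
  <term name="complainant">
    <definition>The party who files a complaint.</definition>
  </term>
  <term name="defendant">
    <definition>The party against whom an action is brought.</definition>
  </term>
</dictionary>
"""

def load_synonyms(xml_text):
    """Map each term to the set of terms declared synonymous with it."""
    root = ET.fromstring(xml_text)
    synonyms = {term.get("name"): set() for term in root.findall("term")}
    for term in root.findall("term"):
        for syn in term.findall("synonym"):
            a, b = term.get("name"), syn.get("ref")
            # Record the link in both directions so lookup is symmetric.
            synonyms[a].add(b)
            synonyms.setdefault(b, set()).add(a)
    return synonyms

def are_synonyms(synonyms, a, b):
    """Mechanical test: were a and b declared synonyms (directly)?"""
    return b in synonyms.get(a, set())

synonyms = load_synonyms(DICTIONARY_XML)
```

Note that this sketch treats synonymy as symmetric but not transitive; a fuller dictionary might want to group terms into equivalence classes instead.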

In my view, "schema" developers should use "data dictionaries" for element
and attribute names, but *not* for content models.   This is necessary
because different applications need different types of schemas (e.g., tight
versus loose) with different combinations/mixture of terms (e.g.,

A "schema framework" is a set of best practices and conventions for creating
(arbitrary) schema.  We have found that the use of a "schema framework"
*greatly reduces* the time it takes to create, manage, develop, store, and
write code/applications around schema.  We have, for example, a set of rules
that apply to creating message schema (all schema), additional rules that
apply to creating form schema, and yet additional rules that apply to
creating document schema.

In relation to Rick and Len's comments, we have found that the use of a
"schema framework" allows us to automate and speed the development of "data
dictionaries" (or taxonomies).    I would disagree with Len that this is a
purely academic exercise.  We have implemented real, working techniques that
greatly reduce the cost and time of using XML and XML Schema.  For example,
in our schema repository, we have perhaps 400-500 schema.  Because each
schema follows the rules of the schema framework, we are able to
automatically generate one or more data dictionaries based on either all or
a subset of the schema in the repository.  (I am not contending that this
could not be done with schema that are not in a "framework" -- I'm simply
saying it is easier and has more benefits if a "framework" is used.)
Automating the creation of data dictionaries has benefits that Rick touches
on -- I call this "aggregation" -- that is, it is possible to aggregate and
efficiently analyze terms (and potentially content models) in a group of
schema.  Aggregation (just as one would do with financial data) allows one
to observe patterns in terminology use, harmonize terminology, and better
use/reuse and define terms.
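
The aggregation idea can be sketched roughly as follows.  This is not the tooling described above, only a minimal Python illustration: it assumes the schemas are W3C XML Schema documents, uses two tiny made-up schemas held in memory, and collects only declared element and attribute names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XS = "{http://www.w3.org/2001/XMLSchema}"

# Two tiny stand-in schemas; a real repository would hold many files.
SCHEMAS = {
    "filing.xsd": """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Filing">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CaseNumber" type="xs:string"/>
        <xs:element name="FilingDate" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
</xs:schema>""",
    "confirmation.xsd": """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Confirmation">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CaseNumber" type="xs:string"/>
        <xs:element name="Status" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
</xs:schema>""",
}

def build_dictionary(schemas):
    """Map each declared (kind, name) pair to the schemas that declare it."""
    usage = defaultdict(set)
    for schema_name, text in schemas.items():
        root = ET.fromstring(text)
        for kind in ("element", "attribute"):
            for decl in root.iter(XS + kind):
                name = decl.get("name")
                if name:  # skip ref="..." declarations, which carry no name
                    usage[(kind, name)].add(schema_name)
    return usage

usage = build_dictionary(SCHEMAS)
# Terms declared in more than one schema are candidates for harmonization.
shared = sorted(name for (kind, name), files in usage.items() if len(files) > 1)
```

With the two sample schemas, the shared terms are CaseNumber and id, which is the sort of pattern aggregation is meant to surface.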

I hope you find this useful.



----- Original Message ----- 
From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
To: <rjm@zenucom.com>; <xml-dev@lists.xml.org>
Sent: Wednesday, August 27, 2003 9:47 AM
Subject: RE: [xml-dev] xml taxonomy

> That is somewhat like saying take the infoset specification
> apart and analysze how the individual information items in
> combination enable different kinds of provable properties
> given some set of axioms and operations.  Sounds like fun
> but I suspect a rigorous result will require some serious
> resources and that is why I would expect this from the
> academic community presenting papers at conferences, not
> from the developer community on a mail list where as soon
> as the frustration goes past a certain threshhold, someone
> will derail into Godel and use strange loops to
> admonish all about the fruitlessness of universal proofs.
> Proofs are nice to have, but all a real programmer needs is
> to make it run then make it run faster. ;-)
> len
> -----Original Message-----
> From: Rick Marshall [mailto:rjm@zenucom.com]
> following several discussions we've had lately, mostly on relational
> models and document management i'm going to float the idea - which may
> be covered elsewhere, please redirect me if appropriate - that having a
> taxonomy of xml may help us to understand what forms, and when are good
> for different problems.
> if we take numbers as an analogy (and that's all it is, there are plenty
> of others) they can be divided into sets - integer, real, rational,
> irrational, complex, etc and we increase our understanding and use of
> numbers by developing theorems that cover the different sets.
> it seems to me that xml is as diverse as numbers or any similar grouping
> and that by focusing on well defined sets of xml structures and their
> properties we can get the theorems to improve our use and understanding.
> eg one set might be xml with tags only - no attributes; another might be
> xml that is constrained to two levels; etc
> by understanding the properties and operators that are valid on these
> sets we can then see the analogies to other technologies such as
> relational models, markup, etc.
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> The list archives are at http://lists.xml.org/archives/xml-dev/
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
