Data storage, data exchange, data manipulation (was RE: Against the Grain: Pascal commentary about XML and databases)



Well, this issue has been worrying me a lot for two years now, so I'd like
to share my thoughts on the subject...

A] Data storage using XML

I don't think a node-labeled tree structure (the XML model is a tree, which
is more restricted than a graph) can model all kinds of data easily and
efficiently. Likewise, the relational and object models cannot model all
kinds of data easily and efficiently. The key words here are "easily" and
"efficiently" : okay, for any given data to model, you can find a
hierarchical (e.g. XML) representation, a network representation (the
node-labeled graph model), a relational representation, an object
representation, or a more exotic representation (e.g. the Caché model). But
depending on your data, one of these models will stand out as the "best"
one, in terms of ease of implementation and of efficiency in queries and
updates.

So I believe there is a whole set of problems that will benefit from XML
databases (which are, I believe, based on the hierarchical database model*;
maybe Mike can confirm or refute this). The storage, indexing and querying
of a set of document-oriented data is a good example.

But XML databases aren't (and won't be) a revolution that blasts away all
other storage models. We could even say that the XML database model is just
a comeback of the hierarchical model that was supposedly "killed" by the
relational model back in the 80s. I don't think XML databases are the "next
thing".

B] Data exchange using XML

Anyway, whichever database model you choose, you'll have to exchange data
between your database and other systems (a business application, another
database, etc.). Since not everyone uses the same data model, you'll have
to find a data model for your data exchange that :

- can capture most of the semantics of your data => it has to have a way to
express basic structure.
- is as simple as possible, to allow for a wide audience => we should look
for the "greatest common divisor" (from which you can build any other model
by adding things) rather than the "least common multiple" (from which you
can obtain any other model by taking subsets).
- can easily be sent over the wire => the serialized form of the structure
has to be easily parseable and standardized.

Surprise, surprise, XML is AFAIK the right answer to these needs. Formats as
simple as CSV files cannot capture enough semantics, and more complicated
solutions like Java serialized objects or CORBA objects-by-value are
overkill. Here are some other arguments in favor of the hierarchical
model :

- you never exchange a whole complex set of data between two different
systems. Rather, you exchange subsets or views of the whole database. The
hierarchical model should be sufficient to exchange views, even if the
underlying model is more complex (a true node-labeled graph, or an object
model, for instance).

- when considering the particular need of data extraction for presentation
to human beings, the hierarchical model is the most structured model that
is still readable (that is, not only by geeks). After all, human beings
should be considered as potential systems to exchange data with :). AFAIK,
the only ways a computer can exchange data with a human being are serial,
and I feel that hierarchically organized text or speech is the highest form
of structured, serialized data that we can understand.

C] In-memory data storage and manipulation using XML

This last point on presentation is very important to me, as its consequences
finally made me abandon any attempt at modeling data as full-featured
objects in the development of presentation layers (I've been in charge of
the development of a multi-modal - HTML, WAP, iMode, VoiceXML, etc. -
presentation layer for my company).

The current fashion in Java presentation layers (as I saw at JavaOne this
year) is to use JavaBeans to exchange data between the other application
layers and the presentation layer. I attended some very, very strange
sessions where the speaker showed us how he was extracting data from an
RDBMS, mapping it into objects (possibly using an object-relational mapping
tool) and using these objects directly in JSP pages. I attended an even
stranger session where the data was acquired by a call to a Web Service
(the new hype this year), thus directly in XML format, then mapped into
Java objects, then sent to the JSP pages. I've even seen frameworks that
sent XML data to the JSP, with custom taglibs transforming the XML data
into custom Collection objects.

To save development time, deployment time, and memory (all those classes
modeling data come at a high price), we chose not to model data as
full-featured objects, but simply as XML DOM Documents. We map any external
data (from RDBMS, ODBMS, LDAP directories, etc.) directly into XML, then
manipulate this data using XSLT, or Java code when XSLT is not enough, then
apply a final XSLT transformation based on the output device.
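
The pipeline described above could be sketched roughly as follows, using
only the JAXP classes shipped with the JDK. This is an illustration under
stated assumptions, not our actual code: the class name DevicePipeline, the
"person"/"name" data and the two inline stylesheets are all hypothetical
(real stylesheets would live in files, one per output device).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DevicePipeline {

    // Step 1: map external data (here a map standing in for a database
    // row) straight into a DOM Document, with no application-specific class.
    static Document rowToDocument(String rootName, Map<String, String> row)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // createElementNS(null, ...) creates elements in no namespace.
        Element root = doc.createElementNS(null, rootName);
        doc.appendChild(root);
        for (Map.Entry<String, String> field : row.entrySet()) {
            Element child = doc.createElementNS(null, field.getKey());
            child.setTextContent(field.getValue());
            root.appendChild(child);
        }
        return doc;
    }

    // Hypothetical per-device stylesheets, inlined to keep the sketch
    // self-contained. Both use XML output for simplicity.
    static String stylesheetFor(String device) {
        String body = device.equals("wml")
                ? "<wml><card><p><xsl:value-of select='name'/></p></card></wml>"
                : "<p><xsl:value-of select='name'/></p>";
        return "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
                + "<xsl:template match='/person'>" + body + "</xsl:template>"
                + "</xsl:stylesheet>";
    }

    // Step 2: apply the final XSLT transformation chosen by output device.
    static String render(Document data, String device) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(
                        new StringReader(stylesheetFor(device))));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(data), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document person = rowToDocument("person", Map.of("name", "Ada"));
        System.out.println(render(person, "html"));
        System.out.println(render(person, "wml"));
    }
}
```

The same Document is rendered twice; only the stylesheet changes with the
device, which is the point of keeping the data as a generic tree.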

So, there is a third usage of XML, apart from data exchange and persistent
data storage : transitory, in-memory data representation. Of course, it is
more a matter of representing data as a node-labeled tree than representing
it as serialized XML with tags and all, but the "XML spirit" is there. This
usage has a lot of advantages, at least for front-end applications :

- it saves a lot of memory by removing application-specific classes and
replacing them with a small set of classes, the DOM. This means that a single
application server can handle many more different data types. This is
important to us as we designed our presentation layer for Application
Service Provider (ASP) usage. The ASP context means that to keep costs as
low as possible, you run many different applications in the same application
server. If each application had its own set of application-specific classes
to model data, the application server would be crowded with classes.

- it saves a lot of time and energy through the sheer flexibility of XML. If
your data and application code are written in XML, adding data to or
removing data from the presentation is much easier than if the data were
modeled in application-specific classes. You don't have to modify the
application-specific classes, recompile the whole application and redeploy
it. All those who have deployed applications using entity EJBs as the
object-relational mapping layer know what I'm talking about.

- data exchange is straightforward : just parse the XML document you've been
sent, or serialize the Document object, et voilà ! No more mapping.
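
As an illustration of the last two points, here is a minimal sketch of the
parse/modify/serialize round trip using the standard JAXP classes. The
class name XmlRoundTrip and the sample "person" document are hypothetical;
the serialization uses an identity Transformer, which is one common way to
write a DOM Document back out, not the only one.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlRoundTrip {

    // Receiving data: parse the XML text into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Sending data: serialize the Document with an identity transformation.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<person><name>Ada</name></person>");

        // Adding a new piece of data needs no new class, no recompilation
        // of a data model and no redeployment: it is just another node.
        Element colour = doc.createElement("favoriteColour");
        colour.setTextContent("blue");
        doc.getDocumentElement().appendChild(colour);

        System.out.println(serialize(doc));
    }
}
```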

There are of course some disadvantages, but I think it's just a matter of
work and time before they disappear :

- Java APIs for XML document manipulation are awkward. Even though some new
DOM APIs are appearing (e.g. JDOM and dom4j), you can't beat the simplicity
of just writing <foo><bar/></foo> to create a foo element containing a bar
element. Moreover, there is currently no standard for XPath APIs (though an
API is being specified by the W3C at
http://www.w3.org/TR/2001/WD-DOM-Level-3-XPath-20010618/). To solve this
problem we have developed an extensible XML/Java based language, very much
in the same spirit as the Apache Cocoon XSP pages. This language enables us
to write <foo><bar/></foo> directly in Java code, as well as XPath
expressions, which saves us a considerable amount of time.

- contrary to Java class definitions, XML schemas (or schemata if you prefer
:) are quite difficult to read and write. The "difficult to read" issue can
be solved by schema documentation tools. XML Spy, for example, can generate
pretty good documentation from a W3C XML Schema (though some current
limitations prevent us from using this feature efficiently). The "difficult
to write" issue can be tackled using tools, but unfortunately a good editor
is not much help if the schema meta-model is inherently complex.
This is why we are looking for a simple, readable schema language.

- compile-time checks are not performed. If you call
person.setFavoriteColour() on a Person instance, and the Person class does
not have this method, you will get a compile-time error. Using Java + DOM,
the compiler cannot see an error when you try to add the "favoriteColour"
attribute or child element to a "person" element. As we have developed a
custom XML language compiled to Java code, we feel that it is possible to
make the compiler schema-aware, thus enabling compile-time checks when the
schemas of the manipulated documents are known.

- "this is not pure object oriented programming !" I don't know if it is the
same out there, but here in France it matters to a lot of people. My current
five-second answer is that presentation layers usually do not require a pure
object oriented model for data. This does not mean that the underlying
framework of the layer is not object-oriented, far from it !

- "yeah, but then how do I associate behaviour with my data ?" My short
answer is "why would you want to do that in a presentation layer ?". If it's
for validation, then either the validation is simple and can be done using
schemas, or the validation is complex and cannot be done at all in the
presentation layer, so you have to send the object to another layer, and
once again you benefit from the easy serialization/parsing. If it is for
another purpose, I reckon the model has a limit : no encapsulation of data.
So far, we have not found this to be a blocker, nor encapsulation to be
required, in the presentation layer (after all, you're here to show the
data, not hide it), but we are thinking about the problem.
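
To make the first and third disadvantages concrete, here is a small sketch
of what plain Java + DOM code looks like. The class name DomTradeoffs is
hypothetical. Note also that the javax.xml.xpath package used below is an
assumption about the reader's environment: it was standardized in the JDK
after this message was written, so it is shown only as a convenient way to
query the tree, not as the API referred to above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomTradeoffs {

    static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
    }

    // The DOM equivalent of simply writing <foo><bar/></foo>.
    static Document buildFooBar() throws Exception {
        Document doc = newDocument();
        Element foo = doc.createElement("foo");
        foo.appendChild(doc.createElement("bar"));
        doc.appendChild(foo);
        return doc;
    }

    // This compiles whether or not any schema allows a "favoriteColour"
    // attribute on <person>: to the compiler the name is just a string.
    static Element personWithColour(String colour) throws Exception {
        Document doc = newDocument();
        Element person = doc.createElement("person");
        person.setAttribute("favoriteColour", colour);
        doc.appendChild(person);
        return person;
    }

    // Evaluates an XPath expression against a node, as a string.
    static String evaluate(Object node, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, node);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate(buildFooBar(), "name(/foo/*)"));

        Element person = personWithColour("blue");
        System.out.println(evaluate(person, "@favoriteColour"));
        // A misspelt name is not a compile-time error either; the query
        // simply selects nothing and yields an empty string.
        System.out.println("[" + evaluate(person, "@favouriteColor") + "]");
    }
}
```

With a Person class, the misspelling in the last call would be rejected by
the compiler; with the DOM it only shows up at run time, which is the gap a
schema-aware compiler would aim to close.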

So, there is still a lot of work to be done, but we are already benefiting
from this approach. Are there people out there who share the same views ?

Regards,
-----------------------------------------------------------
Nicolas Lehuen
Responsable R&D - Head of R&D
Ubicco - Multi Access Software Solutions
http://www.ubicco.com/


* see for example http://www.cs.pitt.edu/~chang/156/14hier.html 

-----Original Message-----
From: Joshua Allen [mailto:joshuaa@microsoft.com]
Sent: Friday, 29 June 2001 01:02
To: Mike.Champion@SoftwareAG-USA.com
Cc: xml-dev@lists.xml.org
Subject: RE: Against the Grain: Pascal commentary about XML and databases


>I keep hoping that there is some middle ground where the rigorous
>mathematics of the relational model and the pragmatic usability of XML can
>meet and inform one another.  In private correspondence, Mr. Pascal assured
>me that a truly mathematical model of XML is impossible, but I'm keeping an
>open mind.

Hehe, this is pretty good reading.  The only reason that RDBMS software
dominates the market right now is because we are good at solving these
problems, and RDBMS design has evolved to disallow users from asking
questions that the database isn't good at answering.  The fact that we
ship databases that only permit things that we know how to answer
efficiently does NOT imply that we will never be able to answer other
questions more efficiently (in fact, RDBMS systems have evolved and
gobbled up much of the research on data warehousing to include those
techniques into the engines -- witness materialized views and bitmapped
indexes).  It is quite easy to see a trend in the industry that shows
consistent continual progress at solving hard query problems.  Of course
some problems will always be hard (distributed cost-based query
optimization is one), but I would point out that research on RDBMS
optimizations has tapered off quite a bit and we have seen major
increases in research geared towards semi-structured data in the past
decade.  So we are simply easing off on some of the traditional RDBMS
constraints and beginning to allow things like recursive self-joins,
ragged hierarchies, etc. and we are optimizing these things.  I mean, we
already solved the RDBMS optimization challenge (and remember that there
were people predicting that SQL would never fly back in 1980) and now it
is time to move to the next thing.  XML seems like a very appropriate
evolutionary step.

As for saying that a truly mathematical model of XML is impossible; XML
is simply a node-labeled graph.  This is about as pure a discrete
mathematics concept as you can get.  It is easy to find graph traversal
challenges that are NP-hard or need O(n^2) or worse.  So?  I think that
areas of discrete mathematics that deal with graphs are currently the
most vibrant area of research in the industry.  The web itself is one
huge graph structure, and research on ways to index the web, optimize
routing, etc. all feed directly into techniques for optimizing XML
processing.  And it seems that TSPs and NP-Optimizations are all the
rage these days.  XML *is* math, and it's the *cool* math these days.
Data processing married with XML is about as real as it gets.

But I know this is all twice-told tale for you Mike.

Regards,
Joshua

------------------------------------------------------------------
The xml-dev list is sponsored by XML.org, an initiative of OASIS
<http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To unsubscribe from this elist send a message with the single word
"unsubscribe" in the body to: xml-dev-request@lists.xml.org
