xml-dev - Seeking advice on handling large industry-standard XML data models [long

To: xml-dev@lists.xml.org
Subject: Seeking advice on handling large industry-standard XML data models [long]
From: Jeff Lowery <jlowery@scenicsoft.com>
Date: Mon, 13 Jan 2003 15:11:08 -0800
I'd say "document models", but this is really a data model in XML guise.

We have a fairly large XML data-transfer document. By "large" I mean that the
document itself is more or less described in around 25,000 lines of
(inadequate) XML Schema. This is one of those models like I imagine some of
the OASIS documents to be: a data interchange format targeted at a wide
variety of processing domains within a vertical or horizontal market.  

It's really not a common data model; each processing domain may use the same
data, but with somewhat different names, constructs, formats, and concepts.
Plus there are many "uncommon" types of data that have not found their way
into this standard (which will, of course, be added in using elements and
attributes under non-standard namespaces).

I'm sure for some of you this sounds mundane. Good! -- that's what I'm
hoping for.

Well, of course, each domain needs to populate documents based on this
model.  I can see several approaches:

1)  Build a type-specific document object model

I'm pretty sure I don't want to go this route.  This approach may make sense
for small, stable document models; but for this one in particular, I'm
having a hard time seeing the payback. Reasons:

a) it's large
This is going to be one monster object model.  Source generation from the
Schema using Castor or JAXB would offer a good first approach, but neither
tool is mature enough yet (Castor can't handle simple-type elements; JAXB
chokes on circular schema includes). 

Even with code generation, however, there's still a lot of work left to be
done.  The code that is generated will be fine-grained Java beans, which is
all well and good but not at a high enough level of abstraction to be
useful for an API. There are also a lot of co-occurrence constraints in this
particular document model that would have to be coded up.  And due to a
limitation of XML Schema (or a limitation in the XML mindset of the
designers, depending on where you sit in the unordered content debate), many
of the one-to-one cardinality constraints are defined as one-to-many, so the
generated objects will need tweaking.  So it's a generate once, tweak many
proposition.
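
To make the "generate once, tweak many" point concrete, here is a rough sketch of the kind of hand-written layer that ends up sitting on top of a generated bean. Every name in it (GeneratedSheet, Sheet, mediaRefs, width, height) is hypothetical and not taken from the actual schema; it only illustrates a one-to-one accessor over a one-to-many generated property, plus one co-occurrence check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a schema-generated bean: the schema says
// maxOccurs="unbounded" where the domain really means "exactly one".
class GeneratedSheet {
    private final List<String> mediaRefs = new ArrayList<>();
    private Double width;   // optional attribute
    private Double height;  // optional attribute

    public List<String> getMediaRefs() { return mediaRefs; }
    public Double getWidth() { return width; }
    public void setWidth(Double width) { this.width = width; }
    public Double getHeight() { return height; }
    public void setHeight(Double height) { this.height = height; }
}

// Hand-written tweak layer: this is the code that must be revisited
// every time the bean is regenerated from a new schema version.
class Sheet {
    private final GeneratedSheet raw;

    Sheet(GeneratedSheet raw) { this.raw = raw; }

    // One-to-one view over the one-to-many generated property.
    String getMediaRef() {
        List<String> refs = raw.getMediaRefs();
        if (refs.size() != 1) {
            throw new IllegalStateException(
                "expected exactly one media reference, found " + refs.size());
        }
        return refs.get(0);
    }

    // Co-occurrence constraint: width and height must be given together
    // or not at all. XML Schema 1.0 cannot express this directly.
    boolean hasConsistentDimensions() {
        return (raw.getWidth() == null) == (raw.getHeight() == null);
    }
}
```

Multiply this by every constrained element in a 25,000-line schema and the maintenance cost becomes clearer.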

b) it's somewhat undefined
There's also the issue of all those custom attributes and elements that are
going to be added at various processing stages. We of course don't know what
they will be (except for the ones we create), but we do have to preserve
them.  This argues for a generic data structure to store such (meta)data, so
you're going to have to have a bit of a DOM thrown in as well.
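
As a minimal sketch of what "a bit of a DOM thrown in" might look like, the following uses the standard JAXP and W3C DOM APIs to pick out child elements that are not in the standard's namespace, so a typed object can carry them along untouched. The namespace URI and the class and method names are assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ForeignContent {
    // Hypothetical namespace of the interchange standard.
    static final String SPEC_NS = "urn:example:spec";

    // Namespace-aware parse; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    // Collects direct child elements outside the spec namespace. A typed
    // object built from 'parent' could keep this list and write the
    // elements back out unchanged when it is serialized.
    static List<Element> foreignChildren(Element parent) {
        List<Element> extras = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && !SPEC_NS.equals(n.getNamespaceURI())) {
                extras.add((Element) n);
            }
        }
        return extras;
    }
}
```

Attributes under foreign namespaces would need the same treatment via getAttributes(), which is omitted here.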

c) it's still young
New version of spec leads to new, improved class definitions. Oh, and
object-to-object data migrations, all hand-coded (or maybe using reflection,
if you have the talent and patience). What fun.
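
For the reflection route, a bare-bones sketch might look like the following. SheetV1 and SheetV2 are invented stand-ins for classes generated from two versions of the spec. The migrator copies only properties whose name and type are unchanged and silently skips the rest, which is exactly the part that would still need hand-coding.

```java
import java.lang.reflect.Method;

// Hypothetical bean generated from version 1 of the spec.
class SheetV1 {
    private String id;
    private int copies;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public int getCopies() { return copies; }
    public void setCopies(int copies) { this.copies = copies; }
}

// Hypothetical bean from version 2: 'copies' was retyped from int to long.
class SheetV2 {
    private String id;
    private long copies;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public long getCopies() { return copies; }
    public void setCopies(long copies) { this.copies = copies; }
}

class BeanMigrator {
    // Copies each property for which the source has getX() and the target
    // has setX(T) with exactly the getter's return type T.
    // Returns the number of properties copied.
    static int copyMatchingProperties(Object source, Object target) {
        int copied = 0;
        for (Method getter : source.getClass().getMethods()) {
            String name = getter.getName();
            if (!name.startsWith("get") || name.equals("getClass")
                    || getter.getParameterCount() != 0) {
                continue;
            }
            try {
                Method setter = target.getClass()
                    .getMethod("set" + name.substring(3), getter.getReturnType());
                setter.invoke(target, getter.invoke(source));
                copied++;
            } catch (NoSuchMethodException e) {
                // Property was dropped, renamed, or retyped in the new
                // version: this is where hand-written migration is needed.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        return copied;
    }
}
```

Here only 'id' survives the trip from SheetV1 to SheetV2; the retyped 'copies' is left at its default.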

d) monster potential
Given enough implementation cost, managers will seek to reuse.  I see a real
danger here of this code developing into the Common Object Model.  Hey, it
talks to all these processes, right?  So each process can just use this
model directly, right?  We can build all our applications on this data
model!   

Unfortunately this makes a lot of superficial sense, which is what managers
tend to go by (I was one, once).  Given the disparity of the domains this
addresses, I just don't think it would work out.  Better to populate from
app-specific data models to a common  *interchange* format, IMNSHO.

Granted, Java has a lot of XML libraries; language support is not the
concern, it's the implementation of a large document-type specific class
hierarchy that I question the value of.

2)  Use a generic object model

Say, fer instance, DOM 3 (which will be finished real soon now).

Advantages:

a) it can handle any content
Which means it can handle all those custom doc components under non-spec
namespaces the same way it handles the spec'd components. 

b) continuous validation
Although the DOM 3 WG tossed aside Abstract Schemas, they still intend to
support continuous validation last I checked.  I think Xerces already
implemented continuous validation support for JDOM, but I'm too lazy to
check right now. Anyhow, the upshot is that today or someday, all the
validation against newly-entered data will be 'free'.  Which beats custom
code, IMO.
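
To give a feel for 'free' validation of an in-memory tree, here is a sketch that revalidates a whole DOM after an edit. It is not DOM 3 continuous validation: it assumes the javax.xml.validation package, which was added to JAXP after this message was written, and it rechecks the entire document each time. The tiny schema and the 'job' and 'copies' names are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomRevalidation {
    // Hypothetical one-element schema standing in for the real 25,000 lines.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='job'><xs:complexType>"
      + "<xs:attribute name='copies' type='xs:positiveInteger' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    static Schema schema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    // Validates the current state of the tree. With no ErrorHandler set,
    // the validator throws SAXException on the first validity error.
    static boolean isValid(Schema schema, Document doc) {
        try {
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = schema();
        Document doc = parse("<job copies='2'/>");
        System.out.println("as parsed: " + isValid(schema, doc));
        // Simulate newly entered data that breaks the type constraint.
        doc.getDocumentElement().setAttributeNS(null, "copies", "-1");
        System.out.println("after edit: " + isValid(schema, doc));
    }
}
```

The appeal is that the constraint lives in the schema only; no per-field checking code is written by hand.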

But there are disadvantages:
c) everybody uses DOM, nobody likes DOM
I don't know why.... must be documentation.  JDOM is certainly nicer for
Java volken.  

d) not so fast...
No matter how you slice it, DOM will be a tiny bit slower in the validation
dept. than generated or custom objects. At least that's what I think.

3) Web publishing
This is the term I use for the "take XML document, transform, post on
server" approach.   A lot of people on this list would be comfortable with
this approach, but it's kind of radical in my little circle.

The basic idea is to build HTML-based forms to handle input.  Theoretically,
creating HTML forms specific to each domain (think custom views)  is a
matter of writing new XSLT scripts. 

Advantages:

a) modular and pipelined
Take a complex XML document, cut it down to size, and add abstractions where
appropriate.  Generate an HTML form using another transform. Handle input,
run the reverse process, voila! populated interchange data.
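
A toy version of the forward half of that pipeline, using the standard javax.xml.transform API, could look like this. Both stylesheets and the 'job', 'sheet' and 'view' vocabulary are invented for the sketch; the first stage cuts the document down to a domain view, the second turns the view into an HTML form. The reverse process (form input back to interchange XML) is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FormPipeline {
    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    // Stage 1 (hypothetical): keep only what this domain cares about.
    static final String TO_VIEW =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>"
      + "<xsl:template match='/job'>"
      + "<view id='{sheet/@id}' copies='{sheet/@copies}'/>"
      + "</xsl:template></xsl:stylesheet>";

    // Stage 2 (hypothetical): one text input per attribute of the view.
    static final String TO_FORM =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/view'>"
      + "<form method='post'>"
      + "<xsl:for-each select='@*'>"
      + "<label><xsl:value-of select='name()'/>"
      + "<input type='text' name='{name()}' value='{.}'/></label>"
      + "</xsl:for-each>"
      + "</form></xsl:template></xsl:stylesheet>";

    // Applies one stylesheet to one document, both given as strings.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    // The pipeline: interchange document -> domain view -> HTML form.
    static String run(String interchangeXml) {
        return transform(TO_FORM, transform(TO_VIEW, interchangeXml));
    }

    public static void main(String[] args) {
        System.out.println(
            run("<job><sheet id='S1' copies='2' internal='x9'/></job>"));
    }
}
```

A different domain would swap in its own stage 1 stylesheet and reuse stage 2 as is, which is the modularity being claimed.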

b) XML tools for managing XML
This means experts in XML (the tool authors) handle much of the in-memory
management of the document data, including validation. 

Generic programming languages like Java are nice, but there's a tendency for
overkill and "creativity" on the part of developers.  It's a little Siren
song Java sings to you as you type....  XML-specific tools tend to push you
to be task-focused.

Disadvantages:
a) browser interface
I work with Mac developers.  If it ain't Aqua, forgeddabowdit.

b) tool maturity
We need some interactive graphical component to the UI.  Nothing fancy, at
least nothing SVG can't handle (I don't think).  But... binary input is via
PDF.  There are some converters out there, but I doubt they're going to
handle anything complex.  And we get some off-the-wall PDF sometimes. 

We'll also need to generate PDF for reports.  XSL-FO may be sufficient, but
can't say for sure.

c) I don't know scratch about XSL-FO, XSP, JSP, ASP, etc. etc.
I'm not a Web developer by any stretch of the imagination, although I have
done a proof-of-concept on this.  Perhaps just too bleeding edge, eh? 


Conclusion:
First of all, I'd like to thank those of you who've bothered to read this
far.  (I know I haven't bothered to reread it, so I expect you've seen a lot
of typos.)

None of these implementation choices are completely exclusive.   If I were
to put them on a scale from safe and traditional (1) to bleeding edge and
radical (3), I'm about at a 2.4.   I know some of you have worked with large
XML interchange formats for many years, so your insight (or incite, if
you're like me)  is appreciated.   I'm looking forward to hearing about
alternatives and better assessments of risk.