The political pressure is to do it all right _now_.
But I agree that automating the highest-return, smallest-possible portion
of the data store updates immediately, and deferring the rest for later
processing, makes sense logistically, though not really from a process
perspective for the cases of which I am aware.
In response to your question, I would say, no, assuming such existed today
(they do not). But the case presented to me was that terabytes of XML is
massively preferable to terabytes of data in data elements in a relational
structure in a dbms served up or delivered as xml docs on demand. Drawing a
comparison between the two approaches, (1) pure XML and (2) relational
dbms discrete data elements delivered or served up as XML, my conclusion
was (a few years ago, and remains to this day) that the latter (2) was much
more feasible, viable, doable and desirable than the former based on a long
list of factors, some of which I posted earlier. I agree with posts that
say that XML only, or XML without dbms aka XML non-dbms, is just stupid
for very large data stores (gigabytes or terabytes) especially in the
presence of known growth factors of any significance (> 1/10th of 1% per year).
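To make approach (2) concrete, here is a minimal sketch of discrete data elements held in a relational store and served up as XML on demand. The table layout, the element names and the use of SQLite are assumptions chosen only for illustration; the population figure is borrowed from the Miami example quoted further down the thread. It is not meant as a production design.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_store():
    """Create an in-memory table of discrete data elements (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE city (name TEXT PRIMARY KEY, state TEXT, population INTEGER)"
    )
    conn.execute("INSERT INTO city VALUES (?, ?, ?)", ("Miami", "Florida", 1234000))
    conn.commit()
    return conn


def city_as_xml(conn, name):
    """Fetch one row and serialize it as a small XML document on demand."""
    row = conn.execute(
        "SELECT name, state, population FROM city WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None  # no such city in the store
    city = ET.Element("city")
    for tag, value in zip(("name", "state", "population"), row):
        ET.SubElement(city, tag).text = str(value)
    return ET.tostring(city, encoding="unicode")


conn = build_store()
print(city_as_xml(conn, "Miami"))
```

The point of the sketch is that the XML exists only at delivery time, so an interface or presentation change touches the serializer rather than every stored document.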
As for your post, it was the absence of a comment regarding scaling that
concerned me. It was not my intent to imply that you stated that such
would scale into terabytes; rather, you did not indicate what would
scale and what would not, and this lack of distinction causes problems
for folks who read this list thoroughly and may draw unsupported
conclusions. This is not your fault, per se, but it seemed to me that you,
and various other posters, might have some very valuable observations about
scaling issues that could help a lot of other readers. Ergo my comments.
Please keep in mind that any gov't or scientific data oriented system is
quite likely to grow into a very large data store over time, so it is
_already_ a real problem that current gov't and scientific document
applications featuring "XML only" or XML and _non-dbms_ implementations
already delivered and in use will probably not scale into gigabytes much
less terabytes. In fact in most of such cases that I know of I expect the
primary stakeholder to face support, maintenance and revision outlays that
exceed the original application cost, year after year after year. This is
going to be a rude surprise, and probably will cause abrupt career changes
for lots and lots and lots of people as time goes on.
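As a rough illustration of the scaling concern, the sketch below estimates how long one full pass over a large XML corpus might take. The corpus figures (5,000,000 documents of 5 MB each) come from the earlier post quoted below; the sustained rate of 50 MB per second per worker and the worker counts are assumptions picked purely for illustration, not measurements.

```python
def corpus_pass_days(doc_count, doc_size_mb, mb_per_second, workers=1):
    """Estimate the days needed to read and parse every document once.

    Assumes throughput scales linearly with the number of workers,
    which is optimistic for a shared disk or network.
    """
    total_mb = doc_count * doc_size_mb
    seconds = total_mb / (mb_per_second * workers)
    return seconds / 86400  # seconds per day


doc_count, doc_size_mb = 5_000_000, 5
total_tb = doc_count * doc_size_mb / 1_000_000  # decimal terabytes

for workers in (1, 10):
    days = corpus_pass_days(doc_count, doc_size_mb, mb_per_second=50, workers=workers)
    print(f"{total_tb:.0f} TB corpus, {workers} worker(s): about {days:.1f} days per pass")
```

Even under these generous assumptions a single pass is measured in days, and a change that has to be applied and then re-tested needs more than one pass.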
I respectfully suggest that some W3C working group development guidelines
as to XML Scalability, XML Performance, XML Life-cycle planning (CMM even),
XML Data scoping and appropriate XML Application Engineering techniques be
considered, and published or distributed, for the masses.
Thanks for your response.
A 'document file' according to the other parties in this f2f discussion
here is any file composed of XML and data in an ASCII file format stored
electronically, regardless of size or how many hard drives it spans.
I do not advocate using strictly relational dbms systems either, but rather
seek as a general principle to use the best tool for the job, as
intelligently as possible. However this implies a certain amount of what I
call common-sense, which is probably more training and experience than
Anyway, some dedicated XML purists hereabouts have been working quite hard
to deliver very large scale (gigabyte or terabyte) "XML only" applications.
Sigh. Our US tax dollars at work !!!
mutter mutter mutter
At 03:03 PM 8/19/2003 -0400, you wrote:
>I'm not sure what part of my response you took to suggest that one-off
>documents would scale well into terabytes. Without some serious supporting
>storage management infrastructure you aren't approaching that scale.
>On the other hand, I wonder how many of these huge document systems are
>motivated by corporate, academic, or gub'ment politics? Avoidance and
>deferred update may be a better strategy than trying to treat the whole
>corpus statically. Does a couple of terabytes of XML present a worse case
>than, say, a collection of a couple of terabytes of Interleaf, various
>versions of MS Word, Framemaker, PDF, and ASCII text documents?
>In any case, you do raise good points regarding the CMS facilities needed
>(in some edge cases).
>>I am concerned to hear this approach, and others here, discussed, without
>>comment as to scaling issues regarding very large datastores (in XML
>>documents or in relational dbms) that might be ten to several hundred
>>terabytes in size.
>>Specifically, in the following respects:
>>1- sheer size problems such as disk access time, out of memory
>>conditions, and processor time to parse very large XML documents (say,
>>1,000 documents of 1 terabyte each) or a very large number of XML
>>documents of smaller size (say, 5,000,000 5MB docs).
>>2- maintenance issues driven by the smallest of interface changes or
>>presentation changes, that result in hundreds of thousands if not millions
>>of manual static schema modifications, rippling across either a very
>>large number of smaller XML documents and their specific schemas or
>>through as many as a thousand or so documents of 1 terabyte each in size.
>>Even if such ripple effect maintenance can be automated, the processing
>>time required to update, say, 5,000,000 XML doc files of 5MB each cannot
>>be said to be real time, so perhaps weeks of processing time is required
>>before the interface mods can be subject to just one full test.
>>3- consistency across versions, releases, XML standards and tool sets
>>(MS, SQL Server, MySQL, Oracle, etc) considering that a very large scale
>>project will take some time to mature (possibly years), and that a lack
>>of backward compatibility could drive massive changes into the basic XML
>>design structure and overall document architecture.
>>4- transmission time across interchanges - whether lan, web or intranet
>>based, the time to transmit and parse result sets to XQuery is often
>>very long, and for very large XML documents this processing time is
>>unacceptably long. People want results in five to eleven seconds, not
>>minutes, not hours.
>>I have specific experience in very large paper based, and relational
>>database systems. From time to time, I see folks scale up systems that
>>work fine, up to a point, past which they are forced to redesign from scratch.
>>While I agree that broadly generalized discussions are the most common
>>form of technical exchange of information, having seen several of these
>>pilot efforts crash and burn, I feel a moral obligation to suggest that
>>some comment be made as to scaling issues, known propagation or ripple
>>effects, and sheer size problems that come into play when viable
>>"average" architectures are scaled beyond their design parameters.
>>In reference to this specific method, I submit that when dealing with a
>>very large repository of prose, a very large number of "profile
>>documents" is possible, and that the number of possible "profile
>>documents" correlates to some index of the context and the subject matter
>>and the usage purposes (inquiry / result pairs), a result that to my mind
>>increases or scales up as the number of prose entities scales up. I will
>>go further and say that, for instance, for all articles ever published in
>>the scientific journal "Nature", or perhaps all items in the U.S. Library
>>of Congress or all pending applications and issued patent files in the
>>U.S. Patent Office, this number of possible "profile documents" becomes
>>very large indeed. Though it may be possible to satisfy as much as a
>>majority of inquiries with a small number of such structures, the rest of
>>the inquiries, it seems to me, will require an ever increasing number of
>>"profile documents" to satisfy so that satisfying the last 1 percent of
>>such inquiries might require several thousands of such "profile
>>documents", if not tens of thousands or hundreds of thousands.
>>So, I am interested to hear about practical applications using an XML-only
>>implementation (XQuery, XML, XSLT, XPath, etc.) that deal with wide
>>ranging subject matter, such as is found in the scientific journal
>>"Nature", or perhaps all items in the U.S. Library of Congress or all
>>pending applications and issued patent files in the U.S. Patent Office,
>>to a very broad audience, across scientific disciplines and cultures (and
>>possibly languages), for a very large data repository of mixed content
>>(prose, graphics, slides, photos, video, sound, other streaming data
>>sources or media) measured in tens or hundreds of terabytes.
>>While XML is superb at document mark up, in my experience almost as good
>>as TeX, it does not strike me as the best tool for the job when dealing
>>with very large scale data repositories. Still, I have an open mind and
>>perhaps someone here can enlighten me.
>>At 10:28 PM 8/18/2003 -0400, you wrote:
>>>One of the difficulties in considering factoring out functionally
>>>dependent entities from prose, is that the block of prose may itself not
>>>be worth reusing. That is, the prose may be a one-shot document whose
>>>original intent is simply to present information, not to act as a
>>>reliable container for access by clients with a variety of intents.
>>>One thing I've done is to try to identify those concepts which are best
>>>understood, are most firmly established, and which serve as the focus of
>>>the stakeholders' activities and communications. Then design a profile
>>>document for each of these high-level concepts, which provide context
>>>for making pointers and for generating identifiers. The profiles are
>>>designed to provide some elements which are rigidly structured, and
>>>other elements which are prose with mixed content. In one case at least,
>>>this allowed me (with a stylesheet) to resolve most cross references
>>>internal to the document itself, minimizing calls to scan external
>>>documents. Also, depending upon the nature of your data and your
>>>validation techniques, you may be able to use the mixed content prose as
>>>the source of the definitive information, rather than just as glue.
>>>It is certainly something a good CMS can help with, but I've also used
>>>DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
>>>results. You might also want to check out DITA by Michael Priestley et
>>>al. of IBM, which I think intends to facilitate topical reuse.
>>>Roger L. Costello wrote:
>>>>I am working with some people who wish to migrate from an
>>>>all-prose format to a prose-plus-reusable-XML-fragments format.
>>>>They have some data in prose that is useable in many contexts. They
>>>>want to break out that reusable data into XML fragments. However,
>>>>they want to continue to provide the prose style.
>>>>For example, consider this prose data:
>>>><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
>>>>with many attractions. Miami Beach is a popular attraction. The
>>>>spring tide is ... The neap tide is ... </para>
>>>>Examining this prose we can extract reusable info about the city of
>>>>We can also extract reusable info about tide data on Miami Beach:
>>>>The problem now is to create a framework which allows the prose
>>>>to bring-together the independent, reusable XML components.
>>>>Conceptually, what is desired is a "glue framework" like this:
>>>><para>The <ref href="Miami.xml"/> is a sprawling city with
>>>>many attractions. Miami Beach is a popular attraction. The
>>>>tides are <ref href="MiamiBeachTides.xml"/></para>
>>>>Thus, the prose is "glueing" together the XML fragments.
>>>>Is this a problem that you have experience with? What "glue
>>>>framework" have you used? What strategy did you use to merge
>>>>the XML fragments with the prose? Is there a standard way
>>>>of combining semi-structured data with structured data?
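A minimal sketch of the "glue" idea in the example above, assuming the ref elements are empty and each href names a standalone XML fragment. The fragment contents and the in-memory FRAGMENTS dictionary are hypothetical stand-ins (in practice the fragments would be separate files), and resolve_refs is an illustrative helper, not an existing tool. Each ref is replaced by the fragment it points to, while the prose that follows it is kept.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment store: href -> reusable XML fragment.
FRAGMENTS = {
    "Miami.xml": (
        "<city><name>Miami</name><state>Florida</state>"
        "<population>1234000</population></city>"
    ),
    "MiamiBeachTides.xml": "<tides><spring>...</spring><neap>...</neap></tides>",
}


def resolve_refs(para_xml, fragments):
    """Replace every <ref href="..."/> in the prose with the fragment it names."""
    root = ET.fromstring(para_xml)
    # Snapshot the elements first so inserted fragments are not re-scanned.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "ref":
                fragment = ET.fromstring(fragments[child.get("href")])
                fragment.tail = child.tail  # keep the prose that followed the ref
                parent[index] = fragment
    return ET.tostring(root, encoding="unicode")


prose = (
    '<para>The <ref href="Miami.xml"/> is a sprawling city with many '
    'attractions. The tides are <ref href="MiamiBeachTides.xml"/></para>'
)
print(resolve_refs(prose, FRAGMENTS))
```

An href that is missing from the store raises KeyError here; a real implementation would need to decide whether to fail, skip, or leave the ref in place.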
>>>>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>>>>initiative of OASIS <http://www.oasis-open.org>
>>>>The list archives are at http://lists.xml.org/archives/xml-dev/
>>>>To subscribe or unsubscribe from this list use the subscription