   Re: [xml-dev] A standard approach to glueing together reusable XML fragments


On Wed, 2003-08-20 at 09:08, Chiusano Joseph wrote:
> <Quote1>
> processor time to parse very large XML documents (say, 1,000 documents 
> of 1 terabyte each)
> </Quote1>
> 
> If one's XML documents are 1 terabyte in size, then one had better rethink
> the system architecture and design, and chop the documents up into
> smaller pieces. A 32-bit processor can itself address only up to 4GB of
> memory.

<;)>that's why we have 64-bit processors</;)>

> 
> <Quote2>
> maintenance issues driven by the smallest of interface changes or 
> presentation changes, that result in hundreds of thousands if not
> millions  of manual static schema modifications, rippling across either
> a very large  number of smaller XML documents and their specific schemas
> or through as many as a thousand or so documents 
> </Quote2>
> 
> One should never have to perform "hundreds of thousands if not millions
> of manual static schema modifications" - an XML registry and/or a robust
> content management system should enable updates to be made in one
> central location and propagated to all of the pertinent places (which
> reference the central location by pointers). This also addresses your #3
> point.
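A minimal sketch of the "reference the central location by pointers" idea in schema terms: each document-specific schema includes one centrally maintained schema instead of copying its type definitions, so an update made in that single file propagates to every schema that points at it. The registry URL, the file name common-types.xsd, and the AddressType type below are made up purely for illustration.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Shared type definitions live in one centrally managed schema;
       editing that file updates every schema that includes it. -->
  <xs:include schemaLocation="http://registry.example.org/common-types.xsd"/>

  <xs:element name="Report">
    <xs:complexType>
      <xs:sequence>
        <!-- AddressType is defined once, in common-types.xsd, rather
             than being repeated in each document-specific schema. -->
        <xs:element name="origin" type="AddressType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>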
> 
> <Quote3>
> transmission time across interchanges - whether LAN, web, or intranet
> based, the time to transmit and parse XQuery result sets is often very
> long, and for very large XML documents this processing time is
> unacceptably long.
> </Quote3>
> 
> A very valid and well-known issue - and one of the reasons that some
> brainstorming over binary XML is going on these days.
> 
> Kind Regards,
> Joe Chiusano
> Booz | Allen | Hamilton
> 
> dbexcom wrote:
> > 
> > I am concerned to hear this approach, and others here, discussed without
> > comment as to scaling issues regarding very large datastores (in XML
> > documents or in a relational DBMS) that might be ten to several hundred
> > terabytes in size.
> > 
> > Specifically, in the following respects:
> > 1- sheer size problems such as disk access time, out of memory conditions,
> > and processor time to parse very large XML documents (say, 1,000 documents
> > of 1 terabyte each) or a very large number of XML documents of smaller size
> > (say, 5,000,000 5MB docs).
> > 2- maintenance issues driven by the smallest of interface changes or
> > presentation changes, that result in hundreds of thousands if not millions
> > of manual static schema modifications, rippling across either a very large
> > number of smaller XML documents and their specific schemas or through as
> > many as a thousand or so documents of 1 terabyte each in size. Even if such
> > ripple-effect maintenance can be automated, the processing time required to
> > update, say, 5,000,000 XML doc files of 5MB each cannot be said to be real
> > time, so perhaps weeks of processing time is required before the interface
> > mods can be subject to just one full test.
> > 3- consistency across versions, releases, XML standards, and tool sets (MS
> > SQL Server, MySQL, Oracle, etc.), considering that a very large scale project
> > will take some time to mature (possibly years), and that a lack of backward
> > compatibility could drive massive changes into the basic XML design
> > structure and overall document architecture.
> > 4- transmission time across interchanges - whether LAN, web, or intranet
> > based, the time to transmit and parse XQuery result sets is often very
> > long, and for very large XML documents this processing time is
> > unacceptably long. People want results in five to eleven seconds, not
> > minutes, not hours.
> > 
> > I have specific experience with very large paper-based and relational
> > database systems. From time to time, I see folks scale up systems that work
> > fine up to a point, past which they are forced to redesign from scratch.
> > 
> > While I agree that broadly generalized discussions are the most common form
> > of technical exchange of information, having seen several of these pilot
> > efforts crash and burn, I feel a moral obligation to suggest that some
> > comment be made as to scaling issues, known propagation or ripple effects,
> > and sheer size problems that come into play when viable "average"
> > architectures are scaled beyond their design parameters.
> > 
> > In reference to this specific method, I submit that when dealing with a
> > very large repository of prose, a very large number of "profile
> > documents" is possible, and that the number of possible "profile documents"
> > correlates to some index of the context and the subject matter and the
> > usage purposes (inquiry / result pairs), a result that to my mind increases
> > or scales up as the number of prose entities scales up. I will go further
> > and say that, for instance, for all articles ever published in the
> > scientific journal "Nature", or perhaps all items in the U.S. Library of
> > Congress or all pending applications and issued patent files in the U.S.
> > Patent Office, this number of possible "profile documents" becomes very
> > large indeed. Though it may be possible to satisfy as much as a majority of
> > inquiries with a small number of such structures, the rest of the
> > inquiries, it seems to me, will require an ever increasing number of
> > "profile documents" to satisfy so that satisfying the last 1 percent of
> > such inquiries might require several thousands of such "profile documents",
> > if not tens of thousands or hundreds of thousands.
> > 
> > So, I am interested to hear about practical applications using XML-only
> > implementations (XQuery, XML, XSLT, XPath, etc.) that deal with wide-ranging
> > subject matter, such as is found in the scientific journal "Nature", or
> > perhaps all items in the U.S. Library of Congress or all pending
> > applications and issued patent files in the U.S. Patent Office, to a very
> > broad audience, across scientific disciplines and cultures (and possibly
> > languages), for a very large data repository of mixed content (prose,
> > graphics, slides, photos, video, sound, other streaming data sources or
> > media) measured in tens or hundreds of terabytes.
> > 
> > While XML is superb at document markup, in my experience almost as good as
> > TeX, it does not strike me as the best tool for the job when dealing with
> > very large scale data repositories. Still, I have an open mind and perhaps
> > someone here can enlighten me.
> > 
> > Thank you.
> > 
> > At 10:28 PM 8/18/2003 -0400, you wrote:
> > >One of the difficulties in considering factoring out functionally
> > >dependent entities from prose is that the block of prose may itself not
> > >be worth reusing. That is, the prose may be a one-shot document whose
> > >original intent is simply to present information, not to act as a reliable
> > >container for access by clients with a variety of intents.
> > >One thing I've done is to try to identify those concepts which are best
> > >understood, are most firmly established, and which serve as the focus of
> > >the stakeholders' activities and communications.  Then design a profile
> > >document for each of these high-level concepts, which provides context for
> > >making pointers and for generating identifiers. The profiles are designed
> > >to provide some elements which are rigidly structured, and other elements
> > >which are prose with mixed content. In one case at least, this allowed me
> > >(with a stylesheet) to resolve most cross references internal to the
> > >document itself, minimizing calls to scan external documents. Also,
> > >depending upon the nature of your data and your validation techniques, you
> > >may be able to use the mixed content prose as the source of the definitive
> > >information, rather than just as glue.
> > >It is certainly something a good CMS can help with, but I've also used
> > >DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
> > >results. You might also want to check out DITA by Michael Priestley et al.
> > >of IBM, which I think intends to facilitate topical reuse.
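A minimal sketch of the stylesheet-based, document-internal cross-reference resolution described above, assuming (purely for illustration) that the profile document marks its rigidly structured elements with id attributes and that its prose references them with a hypothetical xref element carrying a target attribute:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every element of the profile document by its id attribute. -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Identity transform: copy prose and structured content through. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Resolve an internal cross reference by pulling in the element it
       points to, so no external document needs to be scanned. -->
  <xsl:template match="xref[@target]">
    <xsl:apply-templates select="key('by-id', @target)"/>
  </xsl:template>

</xsl:stylesheet>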
> > >
> > >Roger L. Costello wrote:
> > >
> > >>Hi Folks,
> > >>I am working with some people who wish to migrate from an
> > >>all-prose format to a prose-plus-reusable-XML-fragments
> > >>format.
> > >>They have some data in prose that is useable in many contexts.  They
> > >>want to break out that reusable data  into XML fragments.  However,
> > >>they want to continue to provide the prose style.
> > >>For example, consider this prose data:
> > >><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
> > >>with many attractions.  Miami Beach is a popular attraction.  The
> > >>spring tide is ... The neap tide is ... </para>
> > >>Examining this prose we can extract reusable info about the city of
> > >>Miami:
> > >><City id="Miami">
> > >>     <state>Florida</state>
> > >>     <population>1,234,000</population>
> > >></City>
> > >>We can also extract reusable info about tide data on Miami Beach:
> > >><TideData id="MiamiBeachTides">
> > >>     <springTide>...</springTide>
> > >>     <neapTide>...</neapTide>
> > >></TideData>
> > >>The problem now is to create a framework which allows the prose
> > >>to bring-together the independent, reusable XML components.
> > >>Conceptually, what is desired is a "glue framework" like this:
> > >><para>The <ref href="Miami.xml"/> is a sprawling city with
> > >>many attractions.  Miami Beach is a popular attraction.  The
> > >>tides are <ref href="MiamiBeachTides.xml"/></para>
> > >>Thus, the prose is "glueing" together the XML fragments.
> > >>Is this a problem that you have experience with?  What  "glue
> > >>framework" have you used?  What strategy did you use to merge
> > >>the XML fragments with the prose?  Is there is a standard way
> > >>of combining semi-structured data with structured data?
> > >>/Roger
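One hedged sketch of how such a glue framework could be realized in XSLT (the ref element and href attribute are taken from Roger's example above; everything else is illustrative): an identity transform that expands each ref in place with the root element of the fragment it points to.

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy the prose through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each ref with the root element of the fragment it names,
       e.g. the City element inside Miami.xml, resolved relative to the
       prose document's own URI. -->
  <xsl:template match="ref[@href]">
    <xsl:apply-templates select="document(@href)/*"/>
  </xsl:template>

</xsl:stylesheet>

XInclude (xi:include with an href attribute) covers much the same ground and is probably the closest thing to a standard inclusion mechanism, if a standards-track answer to the last question is preferred.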