Validation vs performance - was Re: [xml-dev] Fast text output from SAX?



On Apr 16, 2004, at 2:30 PM, Elliotte Rusty Harold wrote:

> . I have seen any number of binary formats that achieve speed gains 
> precisely by doing this. And it is my contention that if this is 
> disallowed (as I think it should be) much, perhaps all, of the speed 
> advantages of these binary formats disappears.
>

Well, there is an immense amount of truth in this, but I have to take
issue with the "as I think it should be" aside.  For example, there
are AFAIK plenty of enterprise systems out there that do a billion
transactions a day during peak times.  Even on big honking hardware,
that doesn't leave many cycles per transaction for data validation
when you have to handle more than 10,000 transactions per second.
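For concreteness, here is the back-of-envelope arithmetic behind that
rate, as a minimal sketch; the billion-a-day figure is just the round
number used above, not a measurement:

    public class Throughput {
        public static void main(String[] args) {
            long perDay = 1_000_000_000L;        // a billion transactions a day
            long secondsPerDay = 24L * 60 * 60;  // 86,400 seconds in a day
            // ~11,574 transactions/second just to keep up with the daily average
            System.out.println(perDay / secondsPerDay + " tx/s on average");
            // Peak load is higher than the average, so the per-transaction
            // budget left over for validation is tiny.
        }
    }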

As best I understand it, people get this kind of performance in an
enterprise environment by various methods, including a) doing the
business-rule validation and data cleansing earlier in the pipeline;
b) trusting the overall business process to have produced valid data
by crunch time; and c) auditing the results so that if somebody tries
to exploit this trust, sooner or later they will be caught.  The same
basic approaches are available in "XML" environments, e.g. validating
and optimizing the data early in the pipeline, and using efficiently
formatted and trusted data for downstream processing.  AFAIK
essentially everyone using XML in a performance-critical environment
(such as a DBMS or an enterprise messaging system) does something
along these lines, including a couple of mega-corporations that do
not see the value of *standardizing* the efficient XML formats.
<duck>
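To make the "validate early, trust downstream" idea concrete, here is
a minimal sketch using the standard javax.xml APIs; the schema file
name and the two-step structure are illustrative assumptions, not a
description of any particular system mentioned above:

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class Pipeline {
        public static void main(String[] args) throws Exception {
            File doc = new File(args[0]);

            // Step 1: at the edge of the pipeline, pay for schema
            // validation once.
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new File("orders.xsd")); // hypothetical schema
            schema.newValidator().validate(new StreamSource(doc)); // throws if invalid

            // Step 2: downstream, the data is trusted; parse without
            // validation and spend the cycles on the real work instead.
            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setValidating(false);
            SAXParser parser = pf.newSAXParser();
            parser.parse(doc, new DefaultHandler() { /* business processing here */ });
        }
    }

The point is only that the expensive check happens once, at a stage
that can afford it, rather than at every hop.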

Echoes of the great RSS well-formedness debate: the choice isn't
between unquestioningly accepting whatever data you are given and
doing draconian checking at every single step in the pipeline.  It's
a question of how to set up the pipeline to detect corrupt data early
on, do what it takes to get it fixed or rejected, and then
efficiently process the data in those parts of the pipeline where
speed is critical.  Sometimes XML syntax-level validation against a
DTD or schema is useful as part of this, sometimes not.  Sometimes
double- and triple-checking data validity against business rules in
procedural code makes good business sense, sometimes not.  Sometimes
you can get away with throwing the data back at the originator to
fix, and sometimes you gotta fix it yourself.
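As a sketch of that "fix it or reject it at intake, then run fast"
shape of pipeline, under the same assumptions as above; the routing
method names here are hypothetical, not from any real system:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class Intake {
        // One cheap well-formedness check at the boundary; downstream
        // stages then process the document without re-checking it.
        public static void accept(File doc) {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(doc, new DefaultHandler());
                forwardToFastPath(doc);             // trusted from here on
            } catch (SAXException notWellFormed) {
                bounceOrRepair(doc, notWellFormed); // throw it back, or fix it here
            } catch (Exception ioOrConfig) {
                bounceOrRepair(doc, ioOrConfig);
            }
        }

        static void forwardToFastPath(File doc)             { /* hand off downstream */ }
        static void bounceOrRepair(File doc, Exception why) { /* reject or queue for fix-up */ }
    }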

  I cringe at the "Right Thing vs the Cowboy Way" characterizations at 
various points in these threads. There are a lot of ways to set up 
a business process or transformation/aggregation pipeline to get both 
scalability and validity, and recommendations "disallowing" particular 
approaches at one step by global fiat are certain to be ignored.  It 
would be nice to get these threads turned into a discussion of best 
practices that people see in real life to find the optimal tradeoffs 
between desirable but somewhat incompatible properties such as loose 
coupling and high performance ... and away from discussion of alleged 
universal principles that should be promoted or disallowed.






 
