Re: Are we losing out because of grammars?
- From: Charles Reitzel <firstname.lastname@example.org>
- To: email@example.com
- Date: Sat, 03 Feb 2001 15:32:48 -0500 (EST)
First, let me say that I really appreciate this discussion. It is helping
me understand schema development on a deeper level. Thank you.
If I may, I'd like to give some feedback to schema language developers as a
potential customer. And please don't tell me "it's free". Purchase price
is a vanishingly small part of TCO.
References to "DOM" below should be read as "any processing approach that
requires a generalized model of the entire XML document to be memory resident".
1) Simple things should be simple. Although this is subjective, I think
grammars win easily by this measure.
2) Scalability is important. DOM dependencies need to be explicit. AFAIK,
XPath typically requires a DOM. Thus, XML Schema features like "unique",
"key" and "keyref" will most likely end up w/ a DOM under the hood. The
application designer/developer must be able to make an informed decision
about the cost of using schema features.
3) Mr. Jelliffe's real world example is a good one. The solution can be
easily implemented by a simple application using XPath+DOM. To my mind,
such validation is not a schema language requirement, per se. Note that to
apply these rules to some external schema (e.g. DocBook), some additional
metadata will be required to decide which DocBook instances to apply the
rules to and which not. I.e., this processing needs to take place in the
context of an application, not in a generic process.
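To make that point concrete, here is a minimal Python sketch of such
application-level dispatch. The "role" attribute and its "newsstory" value
are purely hypothetical conventions I made up for illustration - nothing
from DocBook itself.

```python
# Application-level decision: does this instance get the extra rules?
# The marker attribute checked here is a hypothetical convention.
import xml.etree.ElementTree as ET

def applies_news_rules(xml_text):
    """Return True if this document instance should get the news rules."""
    root = ET.fromstring(xml_text)
    # hypothetical convention: only articles marked as news stories
    return root.get("role") == "newsstory"

print(applies_news_rules('<article role="newsstory"/>'))  # True
print(applies_news_rules('<article/>'))                   # False
```

The point is simply that the decision lives in the application, not in the
schema processor.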
Note the use of an aggregate operator "count(//news:who)". This is a job
for the DOM - at least indirectly. Don't get me wrong, the DOM is useful
and makes useful features like compound keys and aggregate operations (long
enjoyed by SQL developers) tractable. In the SQL world, the DOM-equivalent
layer is always proprietary and hidden. But SQL developers know to avoid
certain features when performance is important. For example, any data mart
developer knows that - although not logically necessary - summary data
should be pre-calculated to get decent, predictable response times and
support more simultaneous users.
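For what it's worth, the who/what/where/when/how rule is easy to sketch
against an in-memory tree - which is exactly the DOM-style cost under
discussion, since count(//news:who) ranges over the whole document. The
namespace URI and sample story below are made up for illustration.

```python
# One-per-document check over a fully parsed tree (the DOM-style
# approach). Namespace and element names are assumptions.
import xml.etree.ElementTree as ET

NEWS_NS = "urn:example:news"  # hypothetical namespace

def validate_story(xml_text):
    """Return a list of rule violations for one news story."""
    root = ET.fromstring(xml_text)
    errors = []
    for name in ("who", "what", "where", "when", "how"):
        # count(//news:NAME)=1, expressed as a findall over the tree
        n = len(root.findall(".//{%s}%s" % (NEWS_NS, name)))
        if n != 1:
            errors.append("expected exactly one <%s>, found %d" % (name, n))
    return errors

story = """<story xmlns:n="urn:example:news">
  <n:who>Alice</n:who><n:what>spoke</n:what>
  <n:where>Boston</n:where><n:when>today</n:when><n:how>briefly</n:how>
</story>"""
print(validate_story(story))  # []
```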
4) I must respectfully disagree w/ Mr. Bullard when he says,
On Fri, 02 Feb 2001 13:18:15 Len Bullard wrote:
>Yes: systems for choosing. If there is only one,
>there is no ambiguity. But is that a good thing?
>I think it an attractive thing to mammal brains
>that strive for closure instinctively and crave
>power and esteem physically, but a bad
>thing for systems that reciprocally evolve environments.
I think the mammals' requirements take precedence. Systems will evolve in
healthier ways when the people that write and use them don't waste a lot of
what I call "organizational bandwidth" discussing arcana like ambiguity
resolution algorithms. That discussion needs adequate resolution here on
this list - or someplace like it. Please don't pass the buck to the users.
BTW, +1 for "document order" type resolution. I think most people will find
it more intuitive than "most restrictive type". I do, anyway.
5) The open/closed schema issue is probably important. However, there have
been several discussions on this list lately about schema extensibility that
address the issue more directly.
If I have gained anything from this discussion, however, it is probably that
layering of rules over schema should not be an afterthought. My modest
proposal follows. Layer 1 should be approved "yesterday" to allow the world
to start using this stuff!
Layer 1: content model + data types
DOM never required. One-pass validation and data type determination. If you
can't resolve it, unambiguously, by the time the end element tag appears, it
doesn't belong here. Note, required elements and attributes will, by
necessity, be the loosest allowed during the life of a document (just like
NOT NULL in SQL). Ancestor knowledge is ok. XML Schema supports different
definitions for the same element name, based on parent element. I think
this feature is overkill, but it is streamable, so ok.
For XML Schema, data types also include some basic constraints: minInclusive,
maxInclusive, minOccurs, maxOccurs, list vs. scalar. These all look streamable.
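A rough sketch of the Layer 1 idea: occurrence constraints checked in one
streaming pass, fully resolved by the time each end tag appears, with no
tree kept in memory. The toy content model (a minOccurs/maxOccurs table for
children of a "record" element) is my own invention, not XML Schema syntax.

```python
# One-pass occurrence checking: everything is decidable at the end
# tag, so no DOM is ever built. The content model is a toy example.
import xml.etree.ElementTree as ET

# child element -> (minOccurs, maxOccurs) within <record>
MODEL = {"id": (1, 1), "note": (0, 2)}

def stream_validate(xml_text):
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    counts = {}
    errors = []
    for event, elem in parser.read_events():
        if event == "start" and elem.tag == "record":
            counts = dict.fromkeys(MODEL, 0)
        elif event == "end":
            if elem.tag in counts:
                counts[elem.tag] += 1
            elif elem.tag == "record":
                # resolved, unambiguously, at the end element tag
                for name, (lo, hi) in MODEL.items():
                    if not lo <= counts[name] <= hi:
                        errors.append("%s occurs %d times" % (name, counts[name]))
    return errors

print(stream_validate("<record><note/><note/></record>"))  # ['id occurs 0 times']
```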
Layer 2: constraints and intra-doc references
DOM may be required: XML Schema identity constraints, ID/IDREF integrity.
In theory, key selectors that choose only child elements - as in the example
from the XML Schema Primer - do not need DOM support. Implementations will
probably vary. Layer 2 is important to allow much more compact documents
(e.g. the lookup table in that same example).
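To illustrate why a full DOM is not strictly necessary for Layer 2: one
streaming pass can collect declared IDs and referenced IDs into two sets and
compare them at the end. The "id" and "ref" attribute names below are
assumptions, not DTD-declared ID/IDREF types.

```python
# One-pass ID/IDREF integrity: two sets instead of a resident tree.
# Attribute names are assumptions for illustration.
import xml.etree.ElementTree as ET

def check_idrefs(xml_text):
    """Return the sorted list of dangling references."""
    ids, refs = set(), set()
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(xml_text)
    for _, elem in parser.read_events():
        if "id" in elem.attrib:
            ids.add(elem.attrib["id"])
        if "ref" in elem.attrib:
            refs.add(elem.attrib["ref"])
    return sorted(refs - ids)

doc = '<doc><item id="a"/><link ref="a"/><link ref="zzz"/></doc>'
print(check_idrefs(doc))  # ['zzz']
```

Memory grows with the number of distinct IDs, not with document structure -
a much better scaling story than a generalized in-memory model.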
Layer 3: processing rules
Anything that looks or acts like an "if-then-else". Aggregate operations.
External doc references?
Some rules may not require a DOM, but the analysis required to make the
determination may cost more. Some kind of "EXPLAIN PLAN" equivalent would
be necessary to let the schema designer know what he is in for at runtime. A
debugger would be nice, too.
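Here is a crude sketch of what that "EXPLAIN PLAN" equivalent might look
like: an analyser that flags which rules of a schema would force a
memory-resident tree. The classification heuristics (scan the XPath text for
whole-document or backward-looking constructs) are my own rough assumptions,
not how any real implementation decides.

```python
# A toy "EXPLAIN PLAN" for rule sets: flag rules whose XPath would
# force a memory-resident tree. Heuristics are assumptions only.
def explain(rules):
    """Map each rule name to 'DOM' or 'streamable'."""
    report = {}
    for name, xpath in rules.items():
        needs_dom = ("//" in xpath or "count(" in xpath
                     or "preceding" in xpath or "last()" in xpath)
        report[name] = "DOM" if needs_dom else "streamable"
    return report

rules = {
    "one-who": "count(//news:who)=1",   # aggregate over the whole doc
    "b-empty": "not(b/*)",              # local to the context node
}
print(explain(rules))  # {'one-who': 'DOM', 'b-empty': 'streamable'}
```

This is exactly the question posed below about Schematron implementations:
will they inspect the rule set and decide whether or not to load a DOM?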
Thanks for reading,
On Fri, 02 Feb 2001, Rick Jelliffe wrote:
>From: James Clark <firstname.lastname@example.org>
>>Whilst I think the approach used by Schematron is a valuable
>>complement to grammar based schemas (obviously I'm personally
>>delighted to see XPath getting used for validation), I really
>>find it very hard to take seriously the idea that the time has
>>come to completely discard grammars in favour of path-based rule
>XPath and XSLT are great.
>>Let's take a really simple example:
>><!ELEMENT a (b?, c)>
>><!ELEMENT b (#PCDATA)>
>><!ELEMENT c (#PCDATA)>
>>or as a TREX pattern:
>> <element name="b">
>> <element name="c">
>If efficiency and terseness are the criteria, what about:
Efficiency: yes. Terseness: only up to a point. I would suggest
efficiency, clarity and maintainability as criteria.
> <rule context="a">
> <assert test=
> "b[following-sibling::c[position()=last()]] or
> <rule context="b[* or @*] | c[* or @*]">
> <report test="1=1" >Should be empty.</report>
>This has 5 functioning elements compared to TREX's 6. It only
>requires looking at the first child. (This is an example of
>elaborating each path, which is nasty for larger rules.) But
>it is not particularly the way I'd envision people will use
Q: Will a Schematron implementation actually look at the rule set and decide
whether or not to load a DOM?
>This can be pretty printed to give a very direct list of rules
>about the schema. I note again that the comprehensibility of a
>schematron schema comes not from its paths (though often these
>are simple) but because there is a pretty direct path for making
>everything explicit in simple natural language statements. If
>one element can follow another, we can explain "why".
Can you show an example of such pretty printing?
>But let's try a different example, quid pro quo. This is a real
>one, coming from discussion on how to mark up news stories.
>The client gave the following requirement:
> "Every news story must have elements to mark up who, what,
> where, when and how. There must be one and only one of
> each in every story. They can appear anywhere."
>This requirement is very easy to express in words. It is also
>trivially easy to express in Schematron:
> <rule context="/">
> <assert test="count(//news:who)=1 and
> count(//news:what)=1 and
> count(//news:where)=1 and
> count(//news:when)=1 and
> >Every news story must have elements to mark up who,
> what, where, when and how. There must be one and
> only one in every story. They can appear
>One can take these constraints and add them to any schematron
>schema without change and it will work. (If the other schema is
>closed, then that is an internal inconsistency, which is a
>different matter. However, schematron schemas are open by default.)
>Let's say we add this to a schematron schema for full DOCBOOK. The
>addition in Schematron is just a single rule, and it will fit in
>with all the constraints already in place. It seems to me that
>this would cause a grammar-based schema language to explode, if
>it could cope at all: XML Schemas could not cope (if the schema
>had used <or> groups, and we have to assume that there are already
>"<any>" wildcards in place so these new elements are allowed