Fwd: [xml-dev] It is okay for things to break in the future!



One answer is to use a schema language with an ability to distinguish between the strengths or natures of different constraints. 

You need a schema language that helps you ask, and specify, for any constraint: WHY have that constraint, HOW should it be reported, and WHAT are the processing ramifications? If your schema language treats documents as existing outside the SDLC, with no separation of concerns between different constraints and their rationales, then why would you expect not to be mugged by reality over time?

For example, in Schematron,

  • every assertion (or rule, or pattern) can have a "@role" attribute, which lets you classify the constraint;
  • you can attach boilerplate (or dynamic, specific) further information for humans in "diagnostic" elements;
  • you can attach "property" elements to drive subsequent computer-oriented processing; these may be dynamic and contain any foreign elements;
  • you can use the @flag attribute to raise a flag for dispatching (flags are raised on an assertion, but are true for the whole document).
<sch:rule context="book">
    <sch:assert id="r1a1" test="count(author) &lt;= 100"  role="warning" diagnostics="as-of-2022" properties="book-special-treatment" flag="non-2022"
    >A book is expected to have no more than 100 authors.</sch:assert>
...
</sch:rule>
...
<sch:diagnostics>
 <sch:diagnostic id="as-of-2021">This constraint is a processing capacity required for
systems developed starting 2021; documents that do not meet this constraint may need
special attention.</sch:diagnostic>
<sch:diagnostic id="as-of-2022">This constraint is a processing capacity required for
systems developed starting 2022; documents that do not meet this constraint may need
special attention</sch:diagnostic>
...
</sch:diagnostics>
<sch:properties>
 <sch:property id="book-special-treatment">
     <my:handle-specially type="book" />
     <sch:span class="title"><sch:value-of select="@id"/></sch:span>
   </sch:property>
</sch:properties>

This would typically produce an SVRL (Schematron Validation Report Language) result like this:

<svrl:failed-assert
  location="/book[1]"
  id="r1a1"
  test="count(author) &lt;= 100"
  flag="non-2022"
  role="warning">
   <svrl:text>A book is expected to have no more than 100 authors.</svrl:text>
   <svrl:diagnostic-reference id="as-of-2022">
    <svrl:text>This constraint is a processing capacity required for
    systems developed starting 2022; documents that do not meet this constraint
    may need special attention</svrl:text>
    </svrl:diagnostic-reference>

    <svrl:property-reference id="book-special-treatment">
      <my:handle-specially type="book"/>
      <svrl:text>
        <svrl:span class="title">NCD Risk Factor Collaboration</svrl:span>
     </svrl:text>
   </svrl:property-reference>
</svrl:failed-assert>

You can see that there is a separation of concerns, allowing:

1) The constraint information provided by SMEs to be expressed in a way that notional users can understand (the constraint text).

2) Extra metadata to be attached to state the severity or nature of that constraint (@role="warning"), which can be used for triage, GUI, notifications and pipeline dispatch.

3) Information to categorize the whole document accordingly (@flag="non-2022").

4) Extra information for humans (e.g. whoever is first in line to fix the issue) to explain why this is a problem (the text in the diagnostic element).

5) Extra information for subsequent handling systems to route and process this document (here, an element in a foreign namespace, and the title of the document).

So we see that in this schema there were some constraints for systems developed in 2021 and some for systems developed in 2022: perhaps the permitted number of authors was raised. It is possible to model this in Schematron using the phases mechanism: you could have one phase containing all the patterns applicable to systems developed in 2022, which those systems can use to validate, and another phase for systems developed in 2021. (Or you could have a phase which lets you detect whether a document requires a 2021 or a 2022 system, validate with that first, then forward the document to the appropriate system or to triage.)
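
As a rough sketch of how the phases idea might look, assuming hypothetical phase and pattern ids ("systems-2021", "limits-2021" and so on) and an assumed 2021 limit of 10 authors (only the 100-author limit comes from the example earlier in this message):

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <!-- Phase for systems developed in 2021 (hypothetical id) -->
  <sch:phase id="systems-2021">
    <sch:active pattern="limits-2021"/>
  </sch:phase>

  <!-- Phase for systems developed in 2022 (hypothetical id) -->
  <sch:phase id="systems-2022">
    <sch:active pattern="limits-2022"/>
  </sch:phase>

  <!-- Triage phase: run both sets of limits and let the flags say
       which generation of system the document needs -->
  <sch:phase id="triage">
    <sch:active pattern="limits-2021"/>
    <sch:active pattern="limits-2022"/>
  </sch:phase>

  <sch:pattern id="limits-2021">
    <sch:rule context="book">
      <!-- The limit of 10 is an assumption made for illustration -->
      <sch:assert test="count(author) &lt;= 10" role="warning"
          flag="non-2021">A book is expected to have no more than 10 authors.</sch:assert>
    </sch:rule>
  </sch:pattern>

  <sch:pattern id="limits-2022">
    <sch:rule context="book">
      <sch:assert test="count(author) &lt;= 100" role="warning"
          flag="non-2022">A book is expected to have no more than 100 authors.</sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

A 2021 system would validate with the "systems-2021" phase selected and a 2022 system with "systems-2022". A front-end dispatcher could select "triage" and route on whichever of the "non-2021" and "non-2022" flags come back raised. How the phase is selected depends on the Schematron implementation in use.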

Let's face it, it was common practice by the late 1970s to classify log messages by criteria (such as fatal, error, warning, info). Indeed it still is. But the grammar-based schema languages do not provide any means to classify their constraints in useful ways, which means that they are only useful for the most rudimentary kinds of validation: for constraints that are likely true at all stages in the SDLC and document pipeline, and that have equal significance. Consider this: some schema languages provide mechanisms for one schema to build on another in some way, but again they fail to provide simple annotations to convey to humans or systems why the change is needed or how it applies.
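
To make the log-level analogy concrete, here is a minimal sketch of one rule carrying constraints of different severities. The role values are just one possible scheme (Schematron does not fix them), and the "title" child element and the threshold of 10 in the last check are assumptions made for illustration:

```xml
<sch:pattern id="book-constraints"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="book">

    <!-- error: a structural requirement expected to hold at every stage
         (the title element is hypothetical) -->
    <sch:assert test="title" role="error">A book must have a title.</sch:assert>

    <!-- warning: a capacity limit that may change as systems evolve -->
    <sch:assert test="count(author) &lt;= 100"
        role="warning">A book is expected to have no more than 100 authors.</sch:assert>

    <!-- info: sch:report fires when its test is true, so this simply
         notes an unusual but acceptable document -->
    <sch:report test="count(author) &gt; 10"
        role="info">This book has more than 10 authors.</sch:report>

  </sch:rule>
</sch:pattern>
```

A downstream system could then reject on "error", queue for review on "warning", and merely log "info", all from the same validation run.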

To put it another way, constraints themselves need to be "typed" or, rather, support standard and ubiquitous metadata annotations that progress into the PSVI or equivalent (in Schematron, the PSVI is the SVRL linking into the original document(s) being validated in a session.)  The particular scheme used for those values may be specific to the use-cases and schema, but the standard elements or attributes need to be first-class parts of the schema language. (Note also that the XPaths themselves are not adequate to express anything useful for humans and systems in the rest of the SDLC or document pipeline.)

The problem of describing what goes in a single instance of a document is, in a sense, trivial (e.g. a schema to allow some objects to be serialized, transmitted, parsed and bound as objects at the other end.)   There is no time, no change, no gotchas, no intent, no systems, no humans.

The problem of how to do software engineering with documents, where your schema needs to fit into a dynamic, evolving, multi-party, non-uniform web, is much more challenging, I think. And violently practical.  I think those challenges demand first-class support in the schema language (and because the future is unpredictable, it may well demand an approach based on mix-ins rather than inheritance: I don't think that is too controversial to suggest...)  I think Schematron shows that it is entirely practical to have systems that handle those challenges pretty well, and consequently entirely reasonable to require that schema languages should support them.  

Regards
Rick
On Tue, Aug 30, 2022 at 9:18 PM Roger L Costello <costello@mitre.org> wrote:

Hi Folks,

Scenario: You are designing an XML Schema for validating XML instances that contain Book data. Each Book element contains Title, Author, Date of Publication, and ISBN. Some Books have multiple Authors. In your current environment, in your current worldview, no Book has more than 10 Authors. So you constrain the Author element to maxOccurs="10":

<xs:element name="Author" maxOccurs="10" type="…"/>

But what if, in the future, there are Books with 100 Authors? Then XML instances will fail validation. Should you set maxOccurs to unbounded?

No!

Here’s why:

You want to be informed when the world has changed, when the choices you made are no longer relevant. A world in which Books contain 10 times more Authors than you thought they would is a different world. You want to be informed of this. Your XML Schema was originally written with one world view; if validation starts breaking, that means you have got to rethink the initial stuff.

There is an analogous situation in programming. Should you constrain the size of an array or make it variable length? Here’s what John Carmack says: (https://youtu.be/I845O57ZSy4?t=4005)

I'm kind of fond in a lot of cases of static array size declarations. I went through this period where we should just make everything variable length because I had this history in the early days where Doom had some fixed limits on it and then everybody started making crazier and crazier things and they kept bumping up the different limits -- this many lines, this many sectors -- and it seemed like a good idea that we should just make it completely generic so it can go up to whatever. There are cases where that's the right thing to do, but the other aspect of the world changing around you is it's good to be informed when the world has changed more than you thought it would. If you've got a continuously growing collection, you're never going to find out. You might have this quadratic slowdown on something where you thought “Oh, I'm only ever going to have a handful of these,” but something changes and there's a new design style and all of a sudden you've got 10,000 of them. So I kind of like in many cases picking a number, some nice round power of two, and setting it up in there and having an assert saying “Hey, if you hit this limit, I need to know.” When that occurs, you should probably think: “Are the choices that I've made around all of this still relevant if somebody's using 10 times more than I thought they would? This code was originally written with this kind of world view, with this kind of set of constraints, and I was thinking of the world in this way.” If something breaks that means I’ve got to rethink the initial stuff.


