Re: [xml-dev] It is okay for things to break in the future!

I tend to agree with John.

We need to factor in the likely SDLC for our schemas and modeling.

In particular, if you are a multinational company or a company that will likely merge with other companies or buy in data, you can be pretty sure that the structure you derive from your document analysis will be revealed as incorrect/inappropriate/bogus as soon as it is faced with these new kinds of data.

Therefore you need to be much clearer about what is vocabulary (e.g. element names), what is semantically necessary structure (e.g. that a table cell only occurs in a row, and a row only in a table), and what are document constraints (e.g. that a section starts with one title, or that an address has one ZIP code at the end).

Failure to model these separately (e.g. by an open and loose base schema for the first two, and derived schemas or Schematrons for the last) causes extra work for later integration. Indeed, sometimes this is compounded by management: embarrassed that the new documents could not be shoe-horned into the standard schemas that so much effort had been spent modeling, they blame the poor suckers who have to do the shoe-horning, rather than attributing it to a failure of awareness of the SDLC.
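To make the separation concrete, here is a rough sketch in plain Python. It is a stand-in for a loose base schema plus Schematron-style rules, not either of those technologies. The element names, `VOCABULARY`, `REQUIRED_PARENT`, `check_base` and `check_section_titles` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Layer 1: vocabulary (hypothetical element names).
VOCABULARY = {"doc", "section", "title", "p", "table", "row", "cell"}

# Layer 2: semantically necessary structure, as child -> allowed parents.
REQUIRED_PARENT = {"cell": {"row"}, "row": {"table"}}


def check_base(root):
    """Open, loose base checks: known names and necessary nesting only."""
    problems = []

    def walk(node, parent_tag):
        if node.tag not in VOCABULARY:
            problems.append(f"unknown element: {node.tag}")
        allowed = REQUIRED_PARENT.get(node.tag)
        if allowed is not None and parent_tag not in allowed:
            problems.append(f"{node.tag} outside {sorted(allowed)}")
        for child in node:
            walk(child, node.tag)

    walk(root, None)
    return problems


# Layer 3: document constraints, kept apart so they can vary per data source.
def check_section_titles(root):
    """House rule: every section starts with exactly one title."""
    problems = []
    for section in root.iter("section"):
        children = list(section)
        titles = [c for c in children if c.tag == "title"]
        if len(titles) != 1 or children[0].tag != "title":
            problems.append("section must start with exactly one title")
    return problems


doc = ET.fromstring("<doc><section><p>text</p><title>Late</title></section></doc>")
base_problems = check_base(doc)             # passes the loose base layer
house_problems = check_section_titles(doc)  # fails only the house rule
```

When documents arrive from a merged company, only the third layer should need replacing. The first two keep doing their job.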

Regards

Rick

On Sat, Jan 28, 2023 at 12:12 PM John Cowan <johnwcowan@gmail.com> wrote:

On Sun, Sep 4, 2022 at 6:11 PM Roger L Costello <costello@mitre.org> wrote:

Roger's Perspective: It is possible to know the current world. Developers can and should model the current world. The benefits of flagging data that violates the model outweigh the benefits of "coding for the future."

I Guess Everyone Else's Perspective: It is not possible to model the world. Even in incredibly simple ways. The costs of breaking the model when the world doesn’t agree with the model outweigh the benefits of flagging invalid data.

I wouldn't put it that way at all. It's possible to model the world, and we do, all the time. But we always do so on the basis of insufficient data. At Lexis-Nexis, the 1-billion-document company, modelers would typically ask for a sample of documents (already very roughly XMLized) from which an XML Schema would be built. It turned out that even a few hundred documents of a given type (too many to examine individually) were not enough to capture all possible structural features, never mind refinements like maximum length. So we turned up the knob and started to ask for thousands or tens of thousands of documents and used some simple-minded software (which I wrote) to look at and count features and to determine which features were subordinate to which other features.
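For a flavor of what such a survey might look like, here is a minimal sketch. It is not the original tool. The function name `survey`, the two sample documents and their element names are invented for illustration. It counts how often each element name occurs and which parent elements it appears under.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def survey(documents):
    """Count element names and tally the parent each occurrence sits under."""
    counts = Counter()
    parents = defaultdict(Counter)
    for text in documents:
        root = ET.fromstring(text)
        counts[root.tag] += 1
        parents[root.tag][None] += 1  # None marks the document root
        stack = [root]
        while stack:
            node = stack.pop()
            for child in node:
                counts[child.tag] += 1
                parents[child.tag][node.tag] += 1
                stack.append(child)
    return counts, parents


# Two invented documents standing in for a sample of thousands.
sample = [
    "<doc><title>A</title><sec><title>B</title><p>x</p></sec></doc>",
    "<doc><sec><p>y</p><p>z</p></sec></doc>",
]
counts, parents = survey(sample)
```

Here `parents["p"]` shows `p` only ever under `sec`, while `title` turns up under both `doc` and `sec`. A larger sample may well contradict either observation.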

Essentially the second perspective is a warning against overfitting <https://en.wikipedia.org/wiki/Overfitting>. If we see that in our sample (and that's all we ever have, a sample) the longest given name ("first name", though it isn't always first) is 17 characters long, we probably don't want to introduce a constraint saying "maxlength(firstname) = 17". As the WP article says, "The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure."
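A toy illustration of the maxlength case, with invented names and hypothetical helpers `fit_max_length` and `is_valid`: the limit is derived from the sample, so every sample value passes, and a perfectly good name outside the sample is rejected.

```python
def fit_max_length(sample):
    """Derive a maxlength constraint from the longest value in the sample."""
    return max(len(value) for value in sample)


def is_valid(value, max_length):
    """Apply the derived constraint to a value."""
    return len(value) <= max_length


# Invented sample; the longest entry happens to be 17 characters.
sample_names = ["Ann", "Roger", "Christopher-James"]
limit = fit_max_length(sample_names)

# A legitimate name that was simply not in the sample.
unseen = "Maria de los Angeles"
rejected = not is_valid(unseen, limit)
```

The constraint fits the sample exactly, which is the problem: it has captured an accident of the sample rather than a property of given names.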