Roger's Perspective: It is possible to know the current world. Developers can and should model the current world. The benefits of flagging data that violates the model outweigh the benefits of "coding for the future."
I Guess Everyone Else's Perspective: It is not possible to model the world, even in incredibly simple ways. The costs of breaking the model when the world doesn't agree with the model outweigh the benefits of flagging invalid data.
I wouldn't put it that way at all. It's possible to model the world, and we do, all the time. But we always do so on the basis of insufficient data. At Lexis-Nexis, the 1-billion-document company, modelers would typically ask for a sample of documents (already very roughly XMLized) from which an XML Schema would be built. It turned out that even a few hundred documents of a given type (too many to examine individually) were not enough to capture all possible structural features, never mind refinements like maximum length. So we turned up the knob and started asking for thousands or tens of thousands of documents, and used some simple-minded software (which I wrote) to look at and count features and to determine which features were subordinate to which other features.
Essentially the second perspective is a warning against overfitting <https://en.wikipedia.org/wiki/Overfitting>. If we see that in our sample (and that's all we ever have, a sample) the longest given name ("first name", though it isn't always first) is 17 characters long, we probably don't want to introduce a constraint saying "maxlength(firstname) = 17". As the WP article says, "The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure."