Re: [xml-dev] It is okay for things to break in the future!



On Sun, Sep 4, 2022 at 6:11 PM Roger L Costello <costello@mitre.org> wrote:
 
Roger's Perspective: It is possible to know the current world. Developers can and should model the current world. The benefits of flagging data that violates the model outweigh the benefits of "coding for the future."

I Guess Everyone Else's Perspective: It is not possible to model the world. Even in incredibly simple ways. The costs of breaking the model when the world doesn't agree with the model outweigh the benefits of flagging invalid data.

I wouldn't put it that way at all.  It's possible to model the world, and we do, all the time.  But we always do so on the basis of insufficient data.  At Lexis-Nexis, the 1-billion-document company, modelers would typically ask for a sample of documents (already very roughly XMLized) from which an XML Schema would be built.  It turned out that even a few hundred documents of a given type (too many to examine individually) were not enough to capture all possible structural features, never mind refinements like maximum length.  So we turned up the knob and started to ask for thousands or tens of thousands of documents.  We then used some simple-minded software (which I wrote) to look at and count features and to determine which features were subordinate to which other features.
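
To make the idea of "looking at and counting features" concrete, here is a minimal sketch in Python.  It is not the software described above, whose details are not given; the function name survey_features, the tiny sample, and the element names in it are all made up for illustration.  It counts how often each element occurs, how many documents contain it, and which elements appear as children of which others.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def survey_features(xml_documents):
    """Count elements and parent/child pairs across a sample (hypothetical helper)."""
    element_counts = Counter()          # total occurrences of each element name
    doc_counts = Counter()              # number of documents containing each element
    children_of = defaultdict(Counter)  # parent name -> counts of its child names

    for text in xml_documents:
        root = ET.fromstring(text)
        seen_in_doc = set()
        for parent in root.iter():      # visits the root and every descendant
            element_counts[parent.tag] += 1
            seen_in_doc.add(parent.tag)
            for child in parent:        # direct children only
                children_of[parent.tag][child.tag] += 1
        doc_counts.update(seen_in_doc)

    return element_counts, doc_counts, children_of


# Illustrative sample only; these element names are assumptions.
sample = [
    "<case><title>A v. B</title><judge>Smith</judge></case>",
    "<case><title>C v. D</title><judge>Jones</judge><judge>Lee</judge></case>",
    "<case><title>E v. F</title><note><title>Editor's note</title></note></case>",
]
element_counts, doc_counts, children_of = survey_features(sample)

for tag, count in element_counts.most_common():
    print(tag, count, "occurrences in", doc_counts[tag], "documents")
```

In this toy sample the note element, and the fact that a title can sit inside it, shows up in only one document of three.  A smaller sample could easily have missed both, which is the reason for asking for thousands of documents rather than hundreds.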

Essentially the second perspective is a warning against overfitting <https://en.wikipedia.org/wiki/Overfitting>.  If we see that in our sample (and that's all we ever have, a sample) the longest given name ("first name", though it isn't always first) is 17 characters long, we probably don't want to introduce a constraint saying "maxlength(firstname) = 17".  As the WP article says, "The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure."
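
A small Python sketch of how such a constraint overfits, assuming made-up names and hypothetical helper functions (nothing here comes from a real schema): the limit is derived from the longest name seen in the sample, and a perfectly legitimate name that happens not to be in the sample then fails it.

```python
def observed_max_length(names):
    """Longest length seen in the sample: a fact about the sample, not the world."""
    return max(len(name) for name in names)


def violates(name, limit):
    """True if the name would be rejected by a maxlength constraint of `limit`."""
    return len(name) > limit


# Illustrative sample only; a real one would hold thousands of names.
sample_names = ["Ann", "José", "Priya", "Christopher"]
limit = observed_max_length(sample_names)

# A valid given name that simply was not in the sample.
unseen_name = "Maria-Magdalena"

print("limit derived from sample:", limit)
print("unseen name rejected:", violates(unseen_name, limit))
```

Every name in the sample passes by construction, so the constraint looks correct until data from outside the sample arrives.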

