Re: [xml-dev] The impact of data format selection on application develop

On Sun, Jul 10, 2022 at 10:29 AM Roger L Costello <costello@mitre.org> wrote:

Ihe gave this as an example of a data format that is simple and is XML:

<thoughtProvoker name="Roger Costello"/>

I think a good measure of the simplicity of a data format is how much prose (ink) is needed to explain the format. Clearly not a lot of ink would be needed to explain the above format to an XML person. But if the person knows nothing about XML, then large amounts of ink would be needed; likely a substantial amount of the XML specification.

Contrast with this data format:

Jon Bentley Avaya
Brian Kernighan Princeton University
Paul Hudak Yale University

Explaining that data format is simple:

Well, for starters, it omits a material piece of information conveyed by the XML format: whether they are thought provokers or not.

The data format consists of lines. Each line contains two fields separated by a tab symbol. The first field is the name of a person. The second field is the person’s employer.

Says who? How do I know from looking at that data that Kernighan and Hudak aren't students, or that the second field isn't just a location? Suppose I don't know where they work: do I have one field or two, and will there be a trailing tab in that scenario?
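To make the trailing-tab question concrete, here is a minimal Python sketch. The helper name `parse_line` and the two records with an unknown employer are my own assumptions for illustration; the format description above says nothing about either case.

```python
def parse_line(line):
    """Split one record on tabs and return the list of fields."""
    return line.rstrip("\n").split("\t")

# Hypothetical records: employer known, employer unknown with a trailing tab,
# and employer unknown with no tab at all.
known = parse_line("Jon Bentley\tAvaya\n")
trailing_tab = parse_line("Brian Kernighan\t\n")
no_tab = parse_line("Paul Hudak\n")

# Number of fields each variant yields.
print(len(known), len(trailing_tab), len(no_tab))
```

The two "unknown employer" spellings give a different number of fields, so every consumer has to agree on which one is legal, and that agreement is exactly the prose the format description left out.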

Ihe also wrote:

> Simplicity is good but if the data format (and/or its supporting ecosystem)
> is too simple for the application the work simply shifts to the application
> where it will probably be duplicated and subjected to multiple approaches.

What does this mean: "the data format is too simple for the application to work"?

No. It means that because complexity is a feature of the domain, rather than simplifying anything, applying a "too simple" data format to the domain is going to create an order of magnitude more work elsewhere (usually for other groups).

Almost any non-trivial domain entails subtyping, e.g., an insurance policy would have subtypes of home and auto, and within those personal and commercial, etc.

Suppose I decide to attack that with JSO..... oops.... a format whose ecosystem has no support for subtypes. To get subtype semantics, they will have to be implemented in the application, so that work doesn't go away; it just shifts from the data model to the application programs and from one group to another. True, there will be those who will argue that's where it should be.
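A rough sketch of what that shifted work looks like, in Python. Everything here is hypothetical: the field names (`kind`, `line`, `address`, `vin`), the `REQUIRED` table, and the `validate` helper are illustrative assumptions, not taken from any real policy schema.

```python
# Hypothetical flat records in a format with no notion of subtypes.
policies = [
    {"kind": "home", "line": "personal", "address": "1 Main St"},
    {"kind": "auto", "line": "commercial", "vin": "VIN123"},
]

# The subtype rules live in the application, not in the data model.
REQUIRED = {"home": {"address"}, "auto": {"vin"}}

def validate(policy):
    """Return the required fields that are missing for this policy's subtype."""
    kind = policy.get("kind")
    if kind not in REQUIRED:
        raise ValueError(f"unknown policy kind: {kind!r}")
    return REQUIRED[kind] - set(policy)

for p in policies:
    print(p["kind"], sorted(validate(p)))
```

Every program that consumes these records needs its own copy of something like `REQUIRED`, which is the duplication I was referring to.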

As for an ecosystem for a data format, the neat thing about little languages is that it's easy to build little tools to support them. That is, it's easy to quickly develop an ecosystem.

Lastly, Ihe wrote:

> Simplicity is good but it can also be a one way bet at the outset from which
> there is no upgrade path. So if you subsequently decide that there are
> structures latent in data you once thought was suitable to be treated as
> text you are going to have a hard time shifting a text based ecosystem
> to exploit that.

I think this is talking about enhancing a data format with additional kinds of data. For example, in the data format above, add the person's age:

Jon Bentley Avaya 61
Brian Kernighan Princeton University 80
Paul Hudak Yale University 62

From my (admittedly brief) experience with little tools such as AWK, such an enhancement would be seamlessly handled.

What if Bentley retires or leaves Avaya? Is there a tabbed placeholder for him, in which case you now have the complications of 3-valued logic? If not, the cardinality of fields in each row becomes variable.
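A small Python sketch of both outcomes, assuming a consumer that maps fields to names purely by position. The `parse` helper and the two input lines are hypothetical; they are just the three-column example with the employer taken out in the two possible ways.

```python
def parse(line):
    """Map tab-separated fields to names purely by position."""
    names = ["name", "employer", "age"]
    return dict(zip(names, line.split("\t")))

# Employer removed but the tab kept as a placeholder.
with_placeholder = parse("Jon Bentley\t\t61")

# Employer removed along with its tab.
without_placeholder = parse("Jon Bentley\t61")

print(with_placeholder)
print(without_placeholder)
```

With the placeholder, the consumer has to decide what an empty employer means (unknown, none, not applicable). Without it, the age silently lands in the employer slot and the row has two fields instead of three.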