Re: [xml-dev] Flatter is Better (part two)

When Codd developed his ideas about relational data-banks, it was to allow the development of 'a universal retrieval language based on the second order predicate calculus.'

To me, there are strong reasons for endorsing each kind of data model but also weak or bad ones: Codd's ideal of predicate calculus was a really strong reason; Goldfarb's hope of tree grammars was also a strong reason. Determining what category of data model you have allows efficient implementations and possibilities to be explored. (Is your data just facts, or is your data just annotated regions? When it is both, for the same process, we have problems.)

But the more you drill down from that top-level decision, the fewer absolute recommendations can be made, it seems to me. Being able to cut and paste a single arbitrary element, bundling all you need, is a nice idea; but the processing systems need to handle that kind of mix and match. So the question becomes: how can I make data models that may simplify mix-and-match data?

One way I have seen this supported is by first defining the basic categories of information you have, at a level above data types and below conventional semantics: for example, instead of a date element, you have a generalized "event" element that allows the bundling of dates and other properties. (I think the ETL people have good approaches for this too.)

In the example, the fat design would have most elements replaced by e.g. <area kind='state'>, while in the flat example you might have e.g. <place kind='house'>. You could combine them: area elements drill down to the region, and place elements reference that region.

This design is 'fat', but it avoids having to drill down through area information if you just need to search by year-built or other non-area properties. And identity matching is easy. So it may have different efficiency characteristics than either Roger's hierarchical or flat models.
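
A minimal sketch of how such a combined design might be queried, using Python's standard xml.etree.ElementTree. The id and area attribute names, and the exact nesting of the <area> elements, are assumptions made up for illustration; nothing here is a fixed proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical combined document: the "id" and "area" attribute names
# are assumptions for illustration only.
DOC = """
<Iowa>
    <area kind="city" name="Davenport">
        <area kind="street" name="Arlington Court" id="a1"/>
    </area>
    <place kind="house" area="a1">
        <street-number>1009</street-number>
        <year-built>1951</year-built>
    </place>
    <place kind="house" area="a1">
        <street-number>1008</street-number>
        <year-built>1955</year-built>
    </place>
</Iowa>
"""

root = ET.fromstring(DOC)

# Search by a non-area property without walking the area hierarchy.
built_1955 = [p for p in root.findall("place")
              if p.findtext("year-built") == "1955"]

# Identity matching: resolve a place's area reference with one lookup.
areas_by_id = {a.get("id"): a for a in root.iter("area") if a.get("id")}
street = areas_by_id[built_1955[0].get("area")]

print(street.get("name"), built_1955[0].findtext("street-number"))
```

The places stay a flat list, so a search on year-built never touches the area tree; the area tree is consulted only when the reference has to be resolved.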

On 02/12/2014 9:31 PM, "Costello, Roger L." <costello@mitre.org> wrote:

Hi Folks,

The flat design is about creating XML documents that consist of a long series of standalone components.

A component in the document can be combined with other data (mashup).

Let’s take a concrete example to compare the flat design versus the fat design.

Here is a flat design:

<Iowa>
    <house>
        <street>1009 Arlington Court</street>
        <city>Davenport</city>
        <style>Ranch</style>
        <porch>open</porch>
        <year-built>1951</year-built>
        <square-feet>1700</square-feet>
    </house>
    <house>
        <street>1008 Arlington Court</street>
        <city>Davenport</city>
        <style>Ranch</style>
        <porch>closed</porch>
        <year-built>1955</year-built>
        <square-feet>1850</square-feet>
    </house>
    ...
</Iowa>

The document consists of a long series of standalone <house> components. Any of those <house> components could be mashed-up with other data, e.g., mashup a <house> component with a <GPS> component.
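
As a rough sketch of such a mashup, here is one way to do it with Python's standard xml.etree.ElementTree. The <listing> wrapper, the children of <GPS>, and the coordinate values are all hypothetical placeholders, not part of any real format.

```python
import copy
import xml.etree.ElementTree as ET

# A standalone <house> component, as it appears in the flat design.
house = ET.fromstring("""
<house>
    <street>1009 Arlington Court</street>
    <city>Davenport</city>
    <year-built>1951</year-built>
</house>
""")

# Hypothetical <GPS> component from another source; the child element
# names and the coordinate values are placeholders.
gps = ET.fromstring("<GPS><lat>41.5</lat><lon>-90.6</lon></GPS>")

# The mashup: a new (hypothetical) wrapper holding copies of both
# components. Nothing outside <house> is needed to make sense of it.
listing = ET.Element("listing")
listing.append(copy.deepcopy(house))
listing.append(copy.deepcopy(gps))

print(ET.tostring(listing, encoding="unicode"))
```

Because the <house> carries its own street and city, it can be lifted out and combined without consulting any ancestor elements.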

Here is a fat design:

<Iowa>
    <city name="Davenport">
        <street name="Arlington Court">
            <house>
                <street-number>1009</street-number>
                <style>Ranch</style>
                <porch>open</porch>
                <year-built>1951</year-built>
                <square-feet>1700</square-feet>
            </house>
            <house>
                <street-number>1008</street-number>
                <style>Ranch</style>
                <porch>closed</porch>
                <year-built>1955</year-built>
                <square-feet>1850</square-feet>
            </house>
        </street>
        ...
    </city>
    <city name="Cedar Rapids"> ... </city>
    ...
</Iowa>

The flat design and the fat design are radically different!

In the fat design the houses have been grouped into streets and the streets have been grouped into cities. The street name data has been removed from each <house> and also the city name data has been removed from each <house>. Consequently, each <house> is no longer a standalone component. House data is now fragmented, scattered over the document. The ability to do mashups has been lost (or, at least, greatly hampered). The fat design has normalized the data and, as I argued in my last message: Normalization is horrible for data exchange formats.

It’s best to exchange the data in the flat design. Consumers can transform it into the fat design, if needed.
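
A minimal sketch of what such a consumer-side transformation could look like, using Python's standard xml.etree.ElementTree. The function name flat_to_fat is made up for illustration, and the code assumes every <street> value has the form "number street-name", which real data may not guarantee.

```python
import xml.etree.ElementTree as ET

FLAT = """
<Iowa>
    <house>
        <street>1009 Arlington Court</street>
        <city>Davenport</city>
        <style>Ranch</style>
    </house>
    <house>
        <street>1008 Arlington Court</street>
        <city>Davenport</city>
        <style>Ranch</style>
    </house>
</Iowa>
"""

def flat_to_fat(flat_root):
    """Group standalone <house> components by city, then by street."""
    fat_root = ET.Element(flat_root.tag)
    cities, streets = {}, {}
    for house in flat_root.findall("house"):
        city_name = house.findtext("city")
        # Assumption: the street text is "<number> <street name>".
        number, street_name = house.findtext("street").split(" ", 1)
        if city_name not in cities:
            cities[city_name] = ET.SubElement(fat_root, "city", name=city_name)
        key = (city_name, street_name)
        if key not in streets:
            streets[key] = ET.SubElement(cities[city_name], "street",
                                         name=street_name)
        fat_house = ET.SubElement(streets[key], "house")
        ET.SubElement(fat_house, "street-number").text = number
        # Copy the remaining properties; street and city now live on ancestors.
        for child in house:
            if child.tag not in ("street", "city"):
                ET.SubElement(fat_house, child.tag).text = child.text
    return fat_root

fat = flat_to_fat(ET.fromstring(FLAT))
print(ET.tostring(fat, encoding="unicode"))
```

The grouping is a single pass over the houses, which is the point: going from flat to fat is cheap for a consumer who wants it.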

Recommendation: When designing a data exchange format, create a flat design.

Comments?

/Roger