3 approaches to structure lists, plus an analysis of each approach

From: "Costello, Roger L." <costello@mitre.org>
To: "'xml-dev@lists.xml.org'" <xml-dev@lists.xml.org>
Date: Sat, 14 Feb 2009 17:40:31 -0500

Hi Folks,

What are the different approaches to structuring lists? What are the pros and cons of each approach? Is there a way to structure lists that maximizes their utility and minimizes their overhead?

The purpose of this message is to document and analyze several approaches to structuring lists. I use a country list to illustrate the different approaches.

ASSERTION: LISTS THAT CAN BE USED FOR MULTIPLE PURPOSES ARE GOOD

Lists should be structured so that they can be used for multiple purposes. For example, a country list may be:

    - used as values in an XForms pick list.

    - transformed into a document that contains, for each country, 
      sales figures (or death rates, births, political leadership, 
      religions, etc).

    - used to validate an element's content, e.g., the value of the 
      <country-visited> element must be a country.

Those are only a few of the myriad uses of a country list. A well-designed country list should support all of them.


xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
         THREE APPROACHES
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 

Below I show three approaches to structuring lists. Other approaches are possible, such as comma-separated values.
 
I illustrate the three approaches using the country list example and then follow with an analysis of each approach.


APPROACH #1: Express lists using the XML Schema vocabulary:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.countries.org"
           xmlns="http://www.countries.org"
           elementFormDefault="qualified">

    <xs:element name="countries" type="countriesType" />

    <xs:simpleType name="countriesType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Afghanistan"/>
            <xs:enumeration value="Albania"/>
            <xs:enumeration value="Algeria"/>
            ...
        </xs:restriction>
    </xs:simpleType>
</xs:schema>
---------------------------------------------


APPROACH #2: Express lists using the RELAX NG vocabulary:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.countries.org">

    <define name="countriesElement">
        <element name="countries">
            <ref name="countriesType" />
        </element>
    </define>

    <define name="countriesType">
        <choice>
            <value>Afghanistan</value>
            <value>Albania</value>
            <value>Algeria</value>
            ...
        </choice>
    </define>
</grammar>
---------------------------------------------


APPROACH #3: Express lists using domain-specific vocabularies. The markup comes from terminology used by Subject Matter Experts (SMEs):

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<countries xmlns="http://www.countries.org">

    <country>Afghanistan</country>
    <country>Albania</country>
    <country>Algeria</country>
    ...
</countries>
---------------------------------------------


xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
         ANALYSIS
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


ANALYSIS OF APPROACH #1 AND APPROACH #2

Approach #1 and approach #2 make it easy to use a list for validation purposes. A schema simply imports (XML Schema) or includes (RELAX NG) the list schema, and the list's values are then immediately available for validating element content.

Here is an XML Schema that imports the country list XML Schema and uses its simpleType as the datatype for the <country-visited> element:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.org"
           xmlns:c="http://www.countries.org"
           elementFormDefault="qualified">

    <xs:import namespace="http://www.countries.org"
               schemaLocation="countries.xsd" />

    <xs:element name="country-visited" type="c:countriesType" />

</xs:schema>
---------------------------------------------
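
To make the validation concrete, here is a sketch of an instance document that should validate against the schema above. The file itself is hypothetical; it assumes <country-visited> is the document element and that Albania is among the enumerated values:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance document. The element is in the
     http://www.example.org namespace because the schema above
     declares that targetNamespace. -->
<country-visited xmlns="http://www.example.org">Albania</country-visited>
```

Because countriesType restricts xs:string, whitespace is not collapsed before the comparison. A value that is not in the enumeration, or " Albania " with surrounding spaces, would be rejected.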

Here is a RELAX NG schema that includes the country list RELAX NG schema and references its countriesType definition as the content of the <country-visited> element:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.example.org">

    <include href="countries.rng"/>

    <start>
        <element name="country-visited">
           <ref name="countriesType" />
        </element>
    </start>

</grammar>
---------------------------------------------

The two formats are not interchangeable: an XML Schema cannot import a list that is expressed using RELAX NG, and a RELAX NG schema cannot include a list that is expressed using XML Schema.

Although these two approaches enable the efficient usage of lists for validation, it's not clear that they are the most efficient format for the myriad other ways that a list may be used (rendering in a pick list, merging with other lists, searching, and so forth). This is discussed further in the analysis of approach #3 below.


ANALYSIS OF APPROACH #3

Recall that approach #3 uses domain-specific terminology. This can be helpful to Subject Matter Experts (SMEs) as they maintain the lists.

Validation can be accomplished using a Schematron schema. Here is a Schematron schema which validates that the content of the <country-visited> element matches one of the values in the country list:

---------------------------------------------
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
   <sch:ns uri="http://www.countries.org"
           prefix="c" />

   <sch:pattern name="Country List Check">

      <sch:rule context="country-visited">

         <sch:assert test=". = document('countries.xml')//c:country">
             The value of country-visited must be one of the
             countries in the country list.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>
---------------------------------------------
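
Note that the rule context above matches a <country-visited> element in no namespace. If the element lives in a namespace, as it does under the XML Schema shown earlier (targetNamespace http://www.example.org), the context needs a prefix. Here is a sketch of that variant; the prefix "e" and the pattern name are arbitrary choices for illustration:

```xml
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
   <!-- Prefix for the country list vocabulary -->
   <sch:ns uri="http://www.countries.org" prefix="c" />
   <!-- Assumed prefix for the namespace of country-visited -->
   <sch:ns uri="http://www.example.org" prefix="e" />

   <sch:pattern name="Country List Check (namespaced)">
      <sch:rule context="e:country-visited">
         <!-- '=' against a node-set is true if ANY country matches -->
         <sch:assert test=". = document('countries.xml')//c:country">
             The value of country-visited must be one of the
             countries in the country list.
         </sch:assert>
      </sch:rule>
   </sch:pattern>
</sch:schema>
```

The assertion relies on XPath's node-set comparison: the test succeeds as soon as one <country> element has the same string value as the context element.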

With approach #3 the markup used to construct the list has semantics specific to the list:

{http://www.countries.org}countries
{http://www.countries.org}country

This makes possible the creation of programs that are readily understood, as they use terminology consistent with the domain. For example, this XSLT program uses the country list to generate an HTML list of all countries:

---------------------------------------------
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:c="http://www.countries.org"
                version="2.0">
 
    <xsl:output method="html"/>

    <xsl:template match="c:countries">

        <html>
            <head>
                <title>Countries of the World</title>
            </head>
            <body>
                <ol>
                    <xsl:apply-templates />
                </ol>
            </body>
        </html>

    </xsl:template>

    <xsl:template match="c:country">

        <li>
            <xsl:value-of select="." />
        </li>

    </xsl:template>

</xsl:stylesheet>
---------------------------------------------

Note the template match values. They match on:

{http://www.countries.org}countries
{http://www.countries.org}country
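
The same domain-level markup can feed an XForms pick list without any conversion. The following is only a sketch: it assumes an XHTML host document, that the list is available as countries.xml, and a hypothetical <trip> instance that holds the selected value:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms"
      xmlns:c="http://www.countries.org">
    <head>
        <title>Trip Report</title>
        <xf:model>
            <!-- Hypothetical form data: holds the value the user picks -->
            <xf:instance id="trip">
                <trip xmlns="">
                    <country-visited/>
                </trip>
            </xf:instance>
            <!-- The approach #3 country list, loaded as-is -->
            <xf:instance id="countries" src="countries.xml"/>
        </xf:model>
    </head>
    <body>
        <xf:select1 ref="country-visited">
            <xf:label>Country visited</xf:label>
            <!-- One pick-list item per <country> element -->
            <xf:itemset nodeset="instance('countries')/c:country">
                <xf:label ref="."/>
                <xf:value ref="."/>
            </xf:itemset>
        </xf:select1>
    </body>
</html>
```

As in the XSLT program, the only list-specific knowledge the form needs is the element name c:country.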
 

Conversely, with approach #1 and approach #2 the markup used to construct the list has semantics that are specific to the schema language:

{http://www.w3.org/2001/XMLSchema}element
{http://www.w3.org/2001/XMLSchema}simpleType
{http://www.w3.org/2001/XMLSchema}restriction
{http://www.w3.org/2001/XMLSchema}enumeration

{http://relaxng.org/ns/structure/1.0}define
{http://relaxng.org/ns/structure/1.0}choice
{http://relaxng.org/ns/structure/1.0}value

Consequently programs must operate using schema terminology rather than domain terminology. For example, this XSLT program generates an HTML list of all countries from the countries list specified by the XML Schema document:

---------------------------------------------
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">
 
    <xsl:output method="html"/>

    <xsl:template match="xs:simpleType">

        <html>
            <head>
                <title>Countries of the World</title>
            </head>
            <body>
                <ol>
                    <xsl:apply-templates />
                </ol>
            </body>
        </html>

    </xsl:template>

    <xsl:template match="xs:enumeration">

        <li>
            <xsl:value-of select="@value" />
        </li>

    </xsl:template>

</xsl:stylesheet>
---------------------------------------------

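Note the template match values. Rather than operating on <countries> and <country> elements, the XSLT program operates on <simpleType> and <enumeration> elements, and it passes through the enclosing <schema> and <restriction> elements via the built-in template rules. The program is therefore coupled to the structure of the schema language rather than to the domain, which makes programming challenging and error-prone.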
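
The RELAX NG list needs the same kind of schema-aware stylesheet. Here is a sketch, assuming the input is the countries.rng grammar shown under approach #2 (the prefix "rng" is my choice for the RELAX NG namespace):

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rng="http://relaxng.org/ns/structure/1.0"
                version="2.0">

    <xsl:output method="html"/>

    <xsl:template match="/">
        <html>
            <head>
                <title>Countries of the World</title>
            </head>
            <body>
                <ol>
                    <!-- Select only the values of the countriesType
                         definition, skipping the rest of the grammar -->
                    <xsl:apply-templates
                        select="rng:grammar/rng:define[@name='countriesType']
                                /rng:choice/rng:value"/>
                </ol>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="rng:value">
        <li>
            <xsl:value-of select="." />
        </li>
    </xsl:template>

</xsl:stylesheet>
```

Here too the stylesheet has to know about <grammar>, <define>, <choice>, and <value>, none of which say anything about countries.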

With approach #3 a list can be used as a building block (data component) which can be immediately dropped into other documents to create compound documents. For example, consider a list of religions, also formatted using approach #3:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<religions xmlns="http://www.religions.org">

    <religion>Baha'i</religion>
    <religion>Buddhism</religion>
    <religion>Catholicism</religion>
    ...

</religions>
---------------------------------------------

It is easy to construct a compound document composed of the country and religion lists:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<religions-per-country>
    <countries xmlns="http://www.countries.org">
        <country>Afghanistan</country>
        <country>Albania</country>
        <country>Algeria</country>
        ...
    </countries>
    <religions xmlns="http://www.religions.org">
        <religion>Baha'i</religion>
        <religion>Buddhism</religion>
        <religion>Catholicism</religion>
        ...
    </religions>
    <!-- markup that maps religions to countries -->
</religions-per-country>
---------------------------------------------

Due to the modularity provided by approach #3, it is possible to perform list-specific processing on this compound document. That is, a country-list-aware application would be able to extract the country list from this compound document and process it. Ditto for a religion-list-aware application.
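
As a sketch of such a country-list-aware application, the following stylesheet pulls the country list out of the compound document and returns it as a standalone document. It assumes the compound document contains a single <countries> element:

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:c="http://www.countries.org"
                version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <!-- The namespace alone identifies the country list;
             [1] guards against more than one list in the input -->
        <xsl:copy-of select="(//c:countries)[1]"/>
    </xsl:template>

</xsl:stylesheet>
```

The stylesheet knows nothing about <religions-per-country> or the religion list. It finds the country list purely by its namespace-qualified name.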

With approach #1 and approach #2 the XML vocabulary used to construct the list is the same regardless of the list. Here is the <religions-per-country> document using lists that are defined using the XML Schema vocabulary:

---------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<religions-per-country>
    <xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema"
                   name="countriesType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Afghanistan"/>
            <xs:enumeration value="Albania"/>
            <xs:enumeration value="Algeria"/>
            ...
        </xs:restriction>
    </xs:simpleType>
    <xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema"
                   name="religionsType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Baha'i"/>
            <xs:enumeration value="Buddhism"/>
            <xs:enumeration value="Catholicism"/>
            ...
        </xs:restriction>
    </xs:simpleType>
    <!-- markup that maps religions to countries -->
</religions-per-country>
---------------------------------------------

Both lists are expressed in the same namespace (the XML Schema namespace), so the country list cannot be distinguished from the religion list by namespace. Thus, the benefits namespaces provide in terms of modularity are negated. It is not easy to create country-list-aware applications or religion-list-aware applications.
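
To illustrate, here is a sketch of what a country-list-aware stylesheet would have to do with this document. Since both lists are xs:simpleType elements, it must select on the value of the name attribute (countriesType, taken from the example above), which is a naming convention that nothing in the markup guarantees:

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <!-- The namespace cannot tell the two lists apart, so the
             selection depends on the assumed name 'countriesType' -->
        <xsl:copy-of select="//xs:simpleType[@name='countriesType']"/>
    </xsl:template>

</xsl:stylesheet>
```

If the list author had chosen a different type name, the stylesheet would silently select nothing.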
 
Approach #3 also has minimal markup overhead: one wrapper element plus one element per value.


ANALYSIS OF ALL APPROACHES

Regardless of which approach is used, the meaning of the list and its values must be clearly documented. It may be challenging to achieve consensus on meaning:

- The same terminology may be used by different people to mean different things. For example, one person expects to see Puerto Rico in a country list, whereas another person does not. This is because one person defines "country" only as principal sovereignties whereas another person defines "country" to include territories and protectorates.

- Further, some people use different terminology to mean the same thing. For example, one person calls it "country," another calls it "principality."

Thus, with all approaches the issue arises of which terminology and definitions to adopt.


OTHER FACTORS?

Above is my initial stab at analyzing the three approaches. Are there other factors of each approach that I have not considered?

/Roger