XPath and a continuous, uniform information space

Hi Folks,

Every once in a long while you come upon a description so masterful and so beautifully written that it takes your breath away. It takes you on a journey, starting simply and moving gently along the path of knowledge. At the end of the journey you sit back in awe of all that you have seen and learned.

That is how I felt when I read the description below by Hans-Juergen Rennau, "XPath and a continuous, uniform space of information." It is the first part of the paper he presented at Balisage last week, and I highly recommend reading it. Even if you already know XPath, it is worth reading, if only for the sheer beauty of its exposition. Plus, it provides a perspective on information spaces that you may never have considered.

---------------------------------------

The following XML (persons.xml) holds person data -- each person's name and the country they reside in:

<persons>
    <person>
        <name>Michael Kay</name>
        <country>United Kingdom</country>
    </person>
    <person>
        <name>Hans-Juergen Rennau</name>
        <country>Germany</country>
    </person>
    <person>
        <name>Roger Costello</name>
        <country>United States</country>
    </person>
    <person>
        <name>John Smith</name>
        <country>Blah</country>
    </person>
</persons>

Obviously the last <country> value is invalid.

We want to retrieve all <country> elements, no matter where they occur in the document. The following XPath expression does just that:

doc('persons.xml')//country

This XML (countries.xml) is an authoritative dictionary of countries:

<countries>
    <country>Afghanistan</country>
    <country>Albania</country>
    <country>Algeria</country>
    <country>American Samoa</country>

...

</countries>

We want to retrieve all <country> elements in persons.xml that are valid according to countries.xml and ignore those that are not valid. We can accomplish this by adding a simple filter expression:

doc('persons.xml')//country[. = doc('countries.xml')//country]
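
For readers more at home in a general purpose language, the filter can be read as a membership test: the XPath "=" is true when the value of the context node equals the value of at least one node on the right-hand side. Here is a minimal Python sketch of that reading, using only the standard library. The data is inlined, and the dictionary excerpt (including its three matching entries) is an assumption, since countries.xml is shown only in part above.

```python
import xml.etree.ElementTree as ET

# Sample data inlined so the sketch is self-contained.
PERSONS = """<persons>
    <person><name>Michael Kay</name><country>United Kingdom</country></person>
    <person><name>Hans-Juergen Rennau</name><country>Germany</country></person>
    <person><name>Roger Costello</name><country>United States</country></person>
    <person><name>John Smith</name><country>Blah</country></person>
</persons>"""

# Assumed excerpt of the dictionary; the real countries.xml is longer.
COUNTRIES = """<countries>
    <country>Afghanistan</country>
    <country>Germany</country>
    <country>United Kingdom</country>
    <country>United States</country>
</countries>"""

persons = ET.fromstring(PERSONS)
dictionary = {c.text for c in ET.fromstring(COUNTRIES).iter("country")}

# Rough equivalent of //country[. = doc('countries.xml')//country]:
# keep a <country> element if its text occurs anywhere in the dictionary.
valid = [c for c in persons.iter("country") if c.text in dictionary]
valid_names = [c.text for c in valid]
```

As in the XPath version, `valid` holds the element nodes themselves, not just their text.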

This XML (foods.xml) has data about various foods -- the name of each food and its country of origin:

<foods>
    <food>
        <name>curry</name>
        <country>India</country>
    </food>
    <food>
        <name>taco</name>
        <country>Mexico</country>
    </food>
</foods>

When the country check is to be applied to several files, we only have to replace the first XPath step with an expression that yields several documents:

(doc('persons.xml'), doc('foods.xml'))//country[. = doc('countries.xml')//country]

As there may be many documents to process, we introduce a helper resource (docs.xml), an XML document listing all documents concerned:

<docs>
    <uri>persons.xml</uri>
    <uri>foods.xml</uri>
    ...
</docs>

The list represents each document by a <uri> element containing the document URI, which may be a file name or an HTTP address. In order to apply the country check to all documents referenced, we again adapt the first step of the XPath:

doc('docs.xml')//uri/doc(.)//country[. = doc('countries.xml')//country]
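
To see what the doc(.) step is doing for us, it may help to spell the same traversal out by hand. The following Python sketch is one possible rendering, not a definitive one: the file contents are hypothetical stand-ins written to a temporary directory, the function name valid_countries is made up for illustration, and only local files are handled (the XPath doc() function may also resolve HTTP addresses).

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample files, written to a temporary directory below.
FILES = {
    "persons.xml": "<persons><person><country>Germany</country></person>"
                   "<person><country>Blah</country></person></persons>",
    "foods.xml": "<foods><food><country>India</country></food></foods>",
    "countries.xml": "<countries><country>Germany</country>"
                     "<country>India</country></countries>",
    "docs.xml": "<docs><uri>persons.xml</uri><uri>foods.xml</uri></docs>",
}

def valid_countries(base: Path) -> list[str]:
    """Mimic doc('docs.xml')//uri/doc(.)//country[. = doc('countries.xml')//country]."""
    known = {c.text for c in ET.parse(base / "countries.xml").iter("country")}
    found = []
    for uri in ET.parse(base / "docs.xml").iter("uri"):
        # An explicit parse call here; in XPath this is just another step, doc(.)
        referenced = ET.parse(base / uri.text)
        found += [c.text for c in referenced.iter("country") if c.text in known]
    return found

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name, xml in FILES.items():
        (base / name).write_text(xml, encoding="utf-8")
    result = valid_countries(base)
```

The loop, the parse calls and the accumulator variable are exactly the kind of bookkeeping that the single XPath expression leaves out.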

The referenced documents could be on the local file system, on an intranet, or on the internet. The number of documents could be large.

A helper document dedicated to just persons and foods might be limiting. We can create an inventory document (doctree.xml) which describes a whole domain of documents:

<doctree>
    <department name="culinary">
        <project name="people-and-foods">
            <uri application="persons">persons.xml</uri>
            <uri application="foods">foods.xml</uri>
        </project>
        <project name="equipment">
            <uri application="blenders">blenders.xml</uri>
            <uri application="coffee-makers">coffee-makers.xml</uri>
        </project>
    </department>
    ...
</doctree>

Only some of the URIs might be relevant to any particular application.

The document is a tree-structured list in which the leaf elements are <uri> elements. All the inner nodes have the purpose of adding structure, implicitly creating groups of documents. All elements -- inner nodes and <uri> elements -- may have attributes supplying metadata to describe the document(s) referenced by the element itself or its descendants.

We can combine the selection of the relevant documents and their country check into a single expression:

doc('doctree.xml')//project[@name='people-and-foods']/uri/doc(.)//country[. = doc('countries.xml')//country]
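
The selection step on its own can be tried out with Python's standard library, whose ElementTree module supports a small subset of XPath (no doc() function, so this sketch stops at the list of URIs). The inventory is inlined and shortened; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOCTREE = """<doctree>
    <department name="culinary">
        <project name="people-and-foods">
            <uri application="persons">persons.xml</uri>
            <uri application="foods">foods.xml</uri>
        </project>
        <project name="equipment">
            <uri application="blenders">blenders.xml</uri>
            <uri application="coffee-makers">coffee-makers.xml</uri>
        </project>
    </department>
</doctree>"""

doctree = ET.fromstring(DOCTREE)

# The supported XPath subset is enough for the attribute-based selection.
selected = doctree.findall(".//project[@name='people-and-foods']/uri")
uris = [u.text for u in selected]

# Metadata on the <uri> elements is available for further narrowing.
by_application = {u.get("application"): u.text for u in selected}
```

Any inner node of the inventory could serve as the selection point in the same way, for example a whole department instead of a single project.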

Navigating across multiple resources -- through an inventory, through all resources referenced by <uri> elements which match certain conditions, and finally through the country dictionary -- is achieved by a single expression, without taking actions such as opening files and without shifting information from data sources into intermediate variables.

The XPath expression not only yields the valid countries; it resolves to the nodes containing them, which means we have information and its location. Assigning the nodes to a variable ($countries), we can later resume navigation, using those nodes as starting points.

Suppose each document has a <change-log> element, like so:

<persons>
    <person>
        <name>Michael Kay</name>
        <country>United Kingdom</country>
    </person>
    ...
    <change-log>
        <change>
            <from>Sally Smith</from>
            <to>John Doe</to>
        </change>
    </change-log>
</persons>

We can collect the change logs of all documents:

$countries/ancestor::document-node()//change-log
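
For comparison, here is roughly what resuming navigation from a previously found node looks like with Python's standard ElementTree module, where elements carry no link to their parent. This is a sketch under assumptions: the sample document is shortened, and the parent_of table and top_of helper are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

PERSONS = """<persons>
    <person><name>Michael Kay</name><country>United Kingdom</country></person>
    <change-log>
        <change><from>Sally Smith</from><to>John Doe</to></change>
    </change-log>
</persons>"""

root = ET.fromstring(PERSONS)

# ElementTree nodes do not know their parent, so build a lookup table first.
parent_of = {child: parent for parent in root.iter() for child in parent}

def top_of(node):
    """Climb parent links until the outermost element is reached."""
    while node in parent_of:
        node = parent_of[node]
    return node

country = root.find(".//country")  # stands in for a node found earlier
logs = top_of(country).findall(".//change-log")
changes = [(c.findtext("from"), c.findtext("to")) for c in logs[0].iter("change")]
```

In XPath the upward leap needs no preparation; it is simply the ancestor axis.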

Note: the above examples filtered the <country> elements to select the valid countries. In other scenarios, such as quality control, we may wish to filter the <country> elements to select the invalid countries. Such a filter is easily obtained by adding a "not":

[not(. = doc('countries.xml')//country)]
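
One point worth spelling out: the "not" has to wrap the whole comparison. Writing [. != doc('countries.xml')//country] would not do the job, because XPath general comparisons are existential: "!=" is true as soon as any one dictionary entry differs from the value. The following Python sketch models both comparisons on assumed sample values (the function names are made up for illustration).

```python
# Assumed sample values; the dictionary is an excerpt, not the full list.
dictionary = ["Germany", "United Kingdom", "United States"]
candidates = ["Germany", "Blah"]

def equals_any(value, sequence):
    """Model of XPath 'value = sequence': true if any item matches."""
    return any(value == item for item in sequence)

def differs_from_any(value, sequence):
    """Model of XPath 'value != sequence': true if any item differs."""
    return any(value != item for item in sequence)

# [not(. = ...)] keeps only values absent from the dictionary.
invalid = [c for c in candidates if not equals_any(c, dictionary)]

# [. != ...] keeps nearly everything, because some entry always differs.
too_many = [c for c in candidates if differs_from_any(c, dictionary)]
```

Here `invalid` contains only "Blah", whereas `too_many` also lets "Germany" through.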

Observations
This brief programming experience may give you a feeling of handling information in a different way than when using a general purpose language such as, say, Java. What is the difference?

First, we did not "load" resources -- we just addressed them using an expression, doc(__). That expression is equivalent to other expressions for navigating the information content. This equivalence enables us to compose the addressing expression with navigation expressions:

doc(__)//country

Contrast this with Java code for extracting data from a CSV file, which would involve opening the file, scanning through it, and closing it -- all very different, disjointed operations.

Second, navigation can be downward (e.g., "__//__") or upward (e.g., "ancestor::__") in a vertically unbounded way, moving from any current location in a single leap down to the leaf nodes under it or up to the very root above it. Compare this to the navigation in an object tree: a downward move can only be achieved recursively, one level per invocation, and upward navigation is impossible unless each object keeps a reference to its parent.

Third, resource boundaries do not impose a resistance to navigation: the effort to enter a different document is not greater than the effort to move within the same document. The following expression moves down to <uri> elements, crosses over into the referenced documents and continues its path down into that document:

//uri/doc(.)//country

Fourth, navigation is a parallel operation, multiply rooted and cascading: the starting point may be one or many locations, and subsequent steps resume the movement at every location reached by the preceding step. This is very different from navigation of an object tree, which is a movement from single item to single item.
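
One way to picture this cascading behavior is as a function that takes a list of nodes, applies the same step to each of them, and concatenates the results. The Python sketch below is only a rough model of that idea, built on the standard ElementTree module with hypothetical sample documents; the step helper is not part of any library.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents serve as the multiple starting points.
DOCS = [
    ET.fromstring("<persons><person><country>Germany</country></person>"
                  "<person><country>Blah</country></person></persons>"),
    ET.fromstring("<foods><food><country>India</country></food></foods>"),
]

def step(nodes, path):
    """Apply one path step to every node and concatenate the results."""
    return [hit for node in nodes for hit in node.findall(path)]

# Each step resumes from every node the previous step reached.
records = step(DOCS, "./*")             # <person> and <food> elements
countries = step(records, "./country")  # their <country> children
values = [c.text for c in countries]
```

XPath performs this fan-out implicitly at every "/" in a path expression.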

Fifth, navigation is a generic operation: the inventory, the resources to be checked and the country dictionary have different semantics, but they are navigated in the same way. Navigation of an object tree, on the other hand, must adapt each single step to object types.

Summarizing the observations, using XPath one perceives a continuous, uniform space of information: we can enter and leave resources at will, and within them we can move up, down, forward and backward in a uniform and effortless way. The space is a sum total of information which integrates every single accessible XML document. Within this space, every item of information is visible and addressable.

---------------

Wouldn't it be cool to have a continuous, uniform information space that extends beyond the borders of XML, to all text-based data formats (JSON, CSV, email, etc.)? We can! That is one part of Hans' paper.

You can read all of Hans' paper here:

http://balisage.net/Proceedings/vol10/html/Rennau01/BalisageVol10-Rennau01.html

/Roger