Hello devs,
occasionally I
say something along these lines: the sum total of accessible XML
resources is a continuous space of information, which we can navigate
and process in a unified way.
Awareness of
this space may have a significant impact on our decisions whether or not
to use XML, and how to use it; in fact it may have an impact on what
the term "XML" means to us. Common awareness of this space might have a
marked impact on the role of XML within IT, and it might even change
mainstream approaches to various problems.
Alas, there is
no common awareness of this phenomenon of an info space created by XML.
Typical reactions to statements about XML creating a space of
information range from scepticism through disparagement to ridicule and
hostility. Obviously, those statements do not match the practical
experience gathered by the majority of people speaking about XML.
Therefore I invite anybody to whom the space is not obvious to
contemplate a few small code examples which may be helpful for
understanding why I speak of a space.
[
If anybody is
interested in trying out the code, this can be done, for example, by
installing BaseX, an excellent XQuery processor with full support for
XQuery 3.0 and the XQuery Update Facility. Download and installation
require less than a minute of your time: go to the BaseX download page,
download the latest exe, double-click and press "Exec".
]
Here we go. To
start with, let me share a little idiom which allows you to process any
number of XML resources simultaneously with the same ease as you would
process a single document. If you create a catalog of XML document
references like this:
<docs>
<doc href=""/>
<doc href=""/>
<doc href=""/>
...
</docs>
and pass
the catalog to your XQuery or XSLT as the context node (option -i in
BaseX, -s: in Saxon), you can access the sum total of referenced
documents as conveniently as you would access a single document. See PS
for a five-line XQuery creating such catalogs automatically, taking as
input a root directory name and a file name pattern.
For our examples I created a catalog of NIEM XSDs:
basex -b dir=niem-2.1 -b files=*.xsd writeCat.xq > cat.xml
Now let us explore and process those 123 XSDs. In general, having started a query with this line:
let $space := //@href/doc(resolve-uri(., base-uri(..)))
we can access the forest of referenced XML nodes by letting XPath expressions start with $space, like this:
$space//foo
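To give a first impression (a minimal sketch; counting xs:import elements is just an arbitrary probe), here are a few expressions evaluated against the catalog as context node:
= = = = = = = = = = =
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
(
  count($space),                     (: number of referenced documents :)
  distinct-values($space/*/name()), (: distinct root element names :)
  count($space//xs:import)           (: xs:import elements across all documents :)
)
= = = = = = = = = = =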
Ok.
Suppose we want to view the vocabulary established by those schema
documents. For simplicity's sake, we content ourselves with producing a
sorted list of element names, like this:
_Curve
_CurveSegment
_GeometricPrimitive
...
AbstractContinuousCoverage
AbstractCoordinateOperation
AbstractCoordinateSystem
...
XDescriptionText
xmlContent
XValue
YDescriptionText
Year
YearMonth
YValue
This is achieved by the following five-liner:
= = = = = = = = = = =
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
string-join(
for $name in distinct-values($space//xs:element/@name)
order by lower-case($name) return $name
, '
')
= = = = = = = = = = =
It can be called like this:
basex -s method=text -i cat.xml getElemNames.xq > names.txt
Now
we want to check data quality. We are interested in enumeration values
whose whitespace content is not normalized, e.g. because of trailing
blanks. We catch those black sheep with a two-liner:
= = = = = = = = = = =
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
string-join(distinct-values($space//xs:enumeration/@value[not(. eq normalize-space(.))]), '
')
= = = = = = = = = = =
The
result is two enumeration values with a single trailing blank: 'true '
and 'false '. How to correct our data? BaseX supports the XQuery Update
Facility, so the following three-liner does the job:
= = = = = = = = = = =
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
$space//xs:enumeration/@value[not(. eq normalize-space(.))]/
(replace value of node . with normalize-space(.))
= = = = = = = = = = =
When calling
the query, don't forget to set the -u option of BaseX, which ensures
in-place update of the documents. This is the right way to do it:
basex -u -i cat.xml correctFaultyEnums.xq
Let's
assume we want to get rid of the annotations, as they make it difficult
to read the structures. You can do it with the following query:
= = = = = = = = = = =
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
delete node $space//xs:annotation
= = = = = = = = = = =
Finally,
let's do a search. The following query yields all enumeration types
which contain an enumeration value matching a supplied regex; the
display yields for each enumeration value a concatenation of value and
annotation. This is the query:
= = = = = = = = = = =
declare variable $pattern external;
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
string-join(
$space//xs:simpleType[.//xs:enumeration[matches(@value, $pattern, 'i')]]!
(@name, .//xs:enumeration/(concat(' ', @value, .//xs:documentation/concat(' # ', .))), '')
, '
')
= = = = = = = = = = =
and this is how it may be called:
basex -s method=text -i cat.xml -b pattern=.*shipment.* getEnums.xq > h.txt
These
examples may suffice to convey an impression of the coding experience
on which the talk of a "space" is based. The homogeneity of the
resources actually used in the examples (all XSDs) is of course
irrelevant to the experience as such; a homogeneous set of documents was
only chosen in order to enable minimalistic, self-explanatory examples.
Finally,
I invite you to a thought experiment. Imagine an environment using 123
relational databases. Further imagine that the database definitions were
provided by XML files using an ad-hoc vocabulary allowing for
straightforward transformation into SQL scripts. It is obvious that the
sum total of database definitions could be monitored and analyzed with
amazing simplicity, using sweeping XPath expressions, in a way similar
to how we processed 123 XSDs. And I think that such monitoring and
evaluation would be much more difficult to achieve if the database
definitions were only available as SQL scripts. What this thought
experiment underlines is the difference between information integrated
into the info space, and information not integrated. Contrary to common
belief, it is not
irrelevant whether XML or some other syntax is used, as (currently!) only XML
provides integration into the info space.
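To make the thought experiment a little more concrete: assume a hypothetical ad-hoc vocabulary in which each database definition is a <database> document containing <table> elements with <column> children (all names invented for illustration). Given a catalog of the 123 definition files, a query like the following would report every table containing a column of a given type:
= = = = = = = = = = =
declare variable $coltype external;
let $space :=
//@href/doc(resolve-uri(., base-uri(..))) return
string-join(
  for $table in $space//table[column/@type = $coltype]
  order by lower-case($table/@name)
  return concat($table/@name, ': ',
    string-join($table/column[@type = $coltype]/@name, ', '))
, '
')
= = = = = = = = = = =
Achieving the same overview from 123 SQL scripts would require a SQL parser, or at least fragile text processing.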
Hans-Juergen
PS:
Query
'writeCat.xq' creates an XML catalog referencing all XML documents
found in or under a given root directory and matching a given file name
pattern. This is the query text:
= = = = = = = = = = =
declare variable $dir external;
declare variable $files external;
<docs>{
file:list($dir, true(), $files) ! <doc href="{concat($dir, '/', replace(., '\\', '/'))}"/>
}</docs>
= = = = = = = = = = =
Example call:
basex -b dir=niem-2.1 -b files=*.xsd writeCat.xq > cat.xml
Example output:
<docs>
<doc href=""/>
<doc href=""/>
...
</docs>