I don't normally forward newsletters to xml-dev, but this one has a very
interesting report on Web Services and questions about things like
binary representations of XML infosets. The XSLT 2.0 piece that follows
may also be interesting to people with no interest in the Don Box story.
Thanks to Kurt Cagle, both for writing this and for saying it was okay
to redistribute it.
-----------------------------------------------------
****************************
Kurt Cagle's
Metaphorical Web
****************************
Wednesday, October 16, 2002
http://www.kurtcagle.net
kurt@kurtcagle.net
****************************
==========================================
Out of the Box
Don Box and Microsoft's XML Architecture
==========================================
I had the pleasure last night of listening to Don Box, one of the
principal architects of SOAP and, as of January of this year, the
Program Director for Microsoft's XML Architecture Group. A tall,
energetic man with a salt and pepper beard and owlish glasses, Don held
the audience of developers at the Seattle Dot Net Users Group rapt with
his discussion of what is usually a deadly-dull topic -- technical
standards.
Don is in charge of the group within Microsoft that deals with the pipes
and plumbing of Microsoft's .NET/web services strategies. He is, in
essence, sitting at the very epicenter of the most profound changes that
have taken place within the company since the heady days of the browser
wars in 1995 through 1997. The roadmap that he is laying out now will
likely end up shaping application development at the software giant
easily for the next five to ten years.
The strategy that Don laid out last night was, to say the least,
audacious: push through standards that will rebuild the Internet from
the ground (or perhaps more accurately, the sockets) on up, replacing
not just the HTTP layer but potentially the TCP/IP infrastructure. In
its place would be a more stateful web, utilizing variable-length SOAP
messages that would be more conducive to web service architectures than
the current unreliable, packet-based system.
To get an idea of what that will likely entail (and why it may have such
a high payoff for Microsoft ... if they don't fail), it's necessary to
understand how sockets currently work. In the early 1980s the Berkeley
Socket Architecture was built in order to make it possible to stream
content between two computers using a protocol called the Transmission
Control Protocol, or TCP, with the data carried in packets of limited
size. TCP runs on top of the Internet Protocol (IP), which routes the
individual packets, while TCP itself reassembles them into an ordered
stream. Most
operating systems have integrated the Berkeley Socket architecture and
have built networks using TCP/IP, to the extent that the older
Banyan/Novell IPX architecture is becoming an anachronism.
The WS-Routing specification, in an effort spearheaded by Microsoft and
IBM, would break packets along SOAP boundaries rather than at preset
lengths, and as such would allow for the efficient transmission of
complete SOAP commands, though it would rely upon TCP/IP packets and
even HTTP for the transmission of non-XML attachments such as images,
sounds or multimedia. To do this effectively, it would mean that every
single operating system would have to adopt the WS-Routing architecture
or be shut out of the process; the danger here is that you would end up
for a while with a two tier Internet where much of the world is not on
WS-Routing, with the very real consequence that TCP/IP-HTTP solutions
would need to be built to bridge, actually decreasing the efficiency of
the networks over the few years that it would take for such a changeover
to occur. It also assumes a willingness to modify or even replace
billions of lines of code that have been built to utilize the TCP/IP
architecture in order to go to this supposed next stage.
Don talked about a number of the other standards that Microsoft is
currently trying to develop, either through their own auspices or in
conjunction with IBM, Ariba, and others. These include distributed
agreement protocols (WS-Coordination and WS-Transaction) for performing
stateful transactions, federated oriented security (which includes an
alphabet soup of protocols), and ubiquitous metadata for handling policy
data. In some cases (such as with security) these efforts are being
coordinated with OASIS, and in others they are being proposed through
the WSIA, a standards body that Microsoft co-founded. Significantly,
Microsoft is working only grudgingly with the W3C for the base web
services specifications of SOAP and WSDL -- ironically the two standards
that seem to be the most solid and widely adopted. Whether or not that
is an anomaly or a central datapoint may ultimately determine the fate
of Microsoft's .NET efforts.
One other facet that Don discussed that I think may point to some
significant innovation is his discussion about the XML "stack". XML
actually refers to three different concepts. The first, the one that
most people are familiar with, is the syntactical expression of "frozen"
XML, the angle bracket tag and attribute syntax that most people who
work with XML are familiar with. Above this are the conceptual
underpinnings of XML, the XML Infoset, which is basically the
abstraction of a named tree structure with multiple types of nodes. This
infoset really doesn't care about the syntactical representation of XML
-- it is instead a document object model that can be represented
internally in any number of different but congruent ways between systems
(e.g., the ways that Java and .NET represent XML in memory are almost
certain to be different, but they are equivalent in terms of the
abstract model, the infoset).
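To make the syntax-versus-infoset distinction concrete, here is a minimal sketch in Python (chosen only because it is easy to run; nothing in Don's talk used it). The two sample documents and the summarize() helper are hypothetical. The documents differ in quoting style, attribute order, empty-element syntax and the use of a character reference, yet reduce to the same tree of names, attributes and text:

```python
import xml.etree.ElementTree as ET

# Two different serializations ("frozen" XML) of what should be one infoset.
doc_a = '<phone type="work" ext=\'12\'><number>555-1212</number><note/></phone>'
doc_b = '<phone ext="12" type="work"><number>555&#45;1212</number><note></note></phone>'

def summarize(elem):
    """Reduce an element to (name, attributes, text, children), ignoring syntax."""
    return (elem.tag,
            dict(elem.attrib),            # dict comparison ignores attribute order
            (elem.text or '').strip(),    # character references are already resolved
            [summarize(child) for child in elem])

same = summarize(ET.fromstring(doc_a)) == summarize(ET.fromstring(doc_b))
print(same)
```

This is a simplification (it ignores text that follows a child element, comments and namespaces), but it shows why two systems can disagree on the bytes and still agree on the abstract model.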
The third form he brought up (the Post Schema Validation Infoset) is an
infoset representation of XML, but with each item having a specific
schema association with it. The idea here is an important one, perhaps
even crucial in the realm of programmatic interfaces, though I think
there is a danger here in thinking that simply because you have an
abstract model with intrinsic type associations, that this is equivalent
to an object that can readily be passed between systems. Don brought up
a goal that has occasionally been floated of having a compact, binary
version of XML for intersystem communication, in part because the costs
of parsing on the one hand and extracting on the other add considerably
to the total cost of transactions.
However, the same arguments that applied three years ago when this
idea first arose still apply now -- within a homogeneous environment,
passing binary objects is generally not a problem, and passing an
infoset that has been rendered as a DOM is far more efficient than the
parse/deparse mechanism that currently exists for passing XML. The
problem is that the internal binary representation of that infoset IS
extremely dependent upon the architecture of the host system, and that
fact will likely not change any time soon.
On the other hand, it is possible that a binary to binary translation
layer might actually prove to be an easier sell than the older COM/CORBA
bridge interfaces that (almost) facilitated intersystem communication.
With the establishment of a consistent DOM through the W3C, being able
to work with a schema-aware infoset between systems has at least a
chance to work, provided that there is some effort made to ensure that
the bridges are kept open on both ends.
There was a lot more from the talk that I will try to cover in greater
detail in subsequent columns. While I don't completely agree with every
aspect of what I'm seeing Microsoft do, I can see valid reasons for most
of it. Perhaps as a caution, it's worth noting that there are standards
bodies and then there are standards bodies. The fact that many of the
application-level protocols are running through OASIS is ultimately a
good thing, because with an effort as Herculean as this, the more hands
you can get to push the boulder up the hill, the more likely you'll
reach the top.
============================================
Code: Creating named regexes with XSLT2
============================================
Here's some more exploration with some of the features in XSLT2 and
XPath2, specifically the Regular Expressions capabilities. For those of
you who are not familiar with them, regular expressions (or regexes for
short) use a set of predefined patterns and special characters to
attempt to match a whole class of potential strings. They have two
principal purposes: validating that a given string does in fact fit a
specific profile and transforming one string into another based upon
general pattern matching, rather than specific character matches. For
instance, consider phone numbers. Most American phone numbers follow a
very distinct sequence: three digits giving the area code (or the toll
free code, in some cases), three digits indicating the exchange, and
then four digits containing the local code within that exchange. These
are critical.
The problem is that there are also a number of different ways of
grouping these numbers, and when someone enters such a number into a
form, for instance, it would be nice if you could determine whether the
phone number is valid in the permutation provided. For instance, for the
phone number with area code 800, exchange 555 and local number 1212, the
following are all valid:
800.555.1212
800-555-1212
(800)-555-1212
(800)555-1212
(800)555.1212
while
800.5554.1212
is not because the exchange has four digits instead of three.
XPath2 provides a number of string manipulation functions that accept
regular expressions as arguments, but the two that I wanted to
concentrate on are the matches() function and the replace() function.
The matches() function takes the string to test and the regular
expression to test against, and returns a Boolean value of true() if the
expression matches and false() if it does not. The regular expression
for validating phone numbers can be pretty ugly, but here is at least
one stab at it:
^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$        (1)
Without going into a lot of detail, this basically says:
^              Match from the start of the string
\(?            Accept an optional opening parenthesis
(\d{3})        Find a sequence of three digits (\d) and remember them
\)?            Accept an optional closing parenthesis
\s?\-?\.?\s?   Accept optional white space, a dash, a period, and more
               white space
(\d{3})        Remember the next sequence of three digits
\-?\.?         Accept an optional dash or period
(\d{4})        Remember the final sequence of four digits
$              The string must terminate at this point
The matches() function would take a string (such as a phone number) and
evaluate against the above regular expression, as follows:
matches('(800)555-1212',
        '^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$')
This would return the Boolean true() because the pattern in regex #1 is
satisfied.
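As a quick cross-check of the pattern outside of XSLT, here is a small sketch in Python. Python's re module is not the XPath regex engine, but it accepts the same escapes used in this particular pattern; the matches() helper below is a hypothetical stand-in for the XPath function, not part of any library:

```python
import re

# The phone pattern from regex (1), unchanged.
PHONE = re.compile(r'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$')

def matches(text: str) -> bool:
    """Rough analogue of XPath matches($str, $pattern) for this one pattern."""
    return PHONE.search(text) is not None

valid = ['800.555.1212', '800-555-1212', '(800)-555-1212',
         '(800)555-1212', '(800)555.1212']
invalid = ['800.5554.1212']

for number in valid + invalid:
    print(number, matches(number))
```

One thing this makes easy to see is how permissive the pattern is: because each parenthesis is independently optional, an unbalanced string such as '(800555-1212' is also accepted.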
Similarly, you can use the replace function to perform a substitution of
a new string for an old string within a third string. The replace
function uses the Perl notation of back references -- if an expression
in the regex is contained within parentheses, it is remembered in the
order that it was encountered. The back references provide a way to
retrieve these remembered expressions. For instance, in
replace('(800)555-1212',
        '^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$',
        '$1.$2.$3')
the first expression to be matched (the area code) is assigned to back
reference $1, the second (the exchange) to back reference $2, and the
third (the local code) to back reference $3. This in turn will
provide the output:
800.555.1212
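The same substitution can be sketched in Python for comparison. Python writes back references in the replacement string as \g<1> rather than $1, so the hypothetical xpath_replace() helper below translates one notation to the other; it assumes the replacement uses only $1 through $9 and contains no literal backslashes:

```python
import re

PHONE = r'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$'

def xpath_replace(text: str, pattern: str, replacement: str) -> str:
    """Approximate the XPath replace() call using Python's re.sub()."""
    # Turn each $N into Python's \g<N> group reference before substituting.
    py_replacement = re.sub(r'\$(\d)', r'\\g<\1>', replacement)
    return re.sub(pattern, py_replacement, text)

print(xpath_replace('(800)555-1212', PHONE, '$1.$2.$3'))
```

Because the pattern is anchored with ^ and $, the whole input string is consumed by the match, so the result consists only of the replacement text with the three groups filled in.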
Now, I don't know about you, but
'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$' doesn't exactly stand
up and scream "phone number" to me. This tends to be the case with many
regexes - they can be puzzled out with a lot of work, but in general
they are far from being intuitive. Consequently, I got to thinking about
how I could build a general library of regexes, each of which I could
then refer to by name. As it turns out there are two very different
approaches that you can take, each with its own advantages and
disadvantages.
The first approach places the regexes into an XML file, with each regex
being referenceable by name. For instance, the following illustrates
just such a regular expression library (regexLib.xml):
<regularExpressions>
<regularExpression id="phone">
<pattern>^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$</pattern>
<replace>($1)$2-$3</replace>
</regularExpression>
<regularExpression id="zipcode">
<pattern>^(\d{5})(-\d{4})?$</pattern>
<replace>$1$2</replace>
</regularExpression>
</regularExpressions>
This document establishes two regular expressions - one for phones, one
for zipcodes - along with the standard replacement forms for encoding
these.
With this approach, I can define a set of two XSLT functions in their
own namespace (re:) called re:isValid() and re:format(). The
re:isValid() function takes the string to be validated and tests it
against the regular expression named in the second argument. For
instance,
re:isValid('800.555.1212','phone','') => true()
will return the Boolean value true() indicating that it is a valid phone
number. The third argument is either a local or absolute URL to a
library of regular expressions, and should usually be set to the empty
string '' to use the default regexLib.xml file.
Meanwhile, the re:format() function takes a valid (but not necessarily
conformant) input string and converts it into the standard form given by
the <replace> element:
re:format('800.555.1212','phone','') => '(800)555-1212'
Here is a preliminary regexes.xsl library file, showing how these
functions are implemented.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:re="http://www.solvex.com/schemas/regex"
exclude-result-prefixes="re"
>
<xsl:output method="xml" media-type="text/xml" indent="yes"/>
<xsl:variable name="regexes" select="document('regexLib.xml')"/>
<xsl:function name="re:isValid">
<xsl:param name="str"/>
<xsl:param name="formatType"/>
<xsl:param name="regexLibFile"/>
<xsl:variable name="regexLib" select="if ($regexLibFile) then
document($regexLibFile) else $regexes"/>
<xsl:variable name="re"
select="$regexLib//regularExpression[@id=$formatType]"/>
<xsl:variable name="pattern" select="$re/pattern"/>
<xsl:variable name="target" select="$re/replace"/>
<xsl:result select="matches($str,$pattern)"/>
</xsl:function>
<xsl:function name="re:format">
<xsl:param name="str"/>
<xsl:param name="formatType"/>
<xsl:param name="regexLibFile"/>
<xsl:variable name="regexLib" select="if ($regexLibFile) then
document($regexLibFile) else $regexes"/>
<xsl:variable name="re"
select="$regexLib//regularExpression[@id=$formatType]"/>
<xsl:variable name="pattern" select="$re/pattern"/>
<xsl:variable name="target" select="$re/replace"/>
<xsl:result select="if (matches($str,$pattern)) then
replace($str,$pattern,$target) else ''"/>
</xsl:function>
</xsl:stylesheet>
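For readers who want to experiment with the lookup-by-name idea without an XSLT 2.0 processor, here is a rough Python sketch of the same logic. It is not a translation of the stylesheet: the function names is_valid() and format_value() are hypothetical, the library is embedded as a string rather than loaded with document(), and the optional third argument is left out.

```python
import re
import xml.etree.ElementTree as ET

# Inline copy of the regex library so the sketch is self-contained;
# the stylesheet reads the same structure from regexLib.xml.
REGEX_LIB = r"""
<regularExpressions>
  <regularExpression id="phone">
    <pattern>^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$</pattern>
    <replace>($1)$2-$3</replace>
  </regularExpression>
  <regularExpression id="zipcode">
    <pattern>^(\d{5})(-\d{4})?$</pattern>
    <replace>$1$2</replace>
  </regularExpression>
</regularExpressions>
"""

LIBRARY = ET.fromstring(REGEX_LIB)

def lookup(format_type: str):
    """Return (pattern, replacement) for the entry whose id matches."""
    for entry in LIBRARY.iter('regularExpression'):
        if entry.get('id') == format_type:
            return entry.findtext('pattern'), entry.findtext('replace')
    raise KeyError(format_type)

def is_valid(text: str, format_type: str) -> bool:
    """Test the string against the named pattern."""
    pattern, _ = lookup(format_type)
    return re.search(pattern, text) is not None

def format_value(text: str, format_type: str) -> str:
    """Rewrite a valid string into the library's standard form, else return ''."""
    pattern, replacement = lookup(format_type)
    match = re.search(pattern, text)
    if match is None:
        return ''
    # Substitute $N with captured group N; an unmatched optional group becomes ''.
    return re.sub(r'\$(\d)',
                  lambda m: match.group(int(m.group(1))) or '',
                  replacement)

print(is_valid('800.555.1212', 'phone'))
print(format_value('800.555.1212', 'phone'))
```

Unlike the stylesheet, this sketch raises a KeyError for an unknown name rather than quietly matching against an empty pattern; that is a design choice of the sketch, not something the original code does.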
Finally, I wanted to include an xsl file that imported these routines
and used them in something approaching a real world basis
(regexesTest.xsl):
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:re="http://www.solvex.com/schemas/regex"
exclude-result-prefixes="re"
>
<xsl:import href="regexes.xsl"/>
<xsl:template match="/">
<xsl:variable name="phoneNum1" select="'800.555.1212'"/>
<xsl:variable name="phoneNum2" select="'800-5554-1212'"/>
<xsl:variable name="zipCode" select="'45221'"/>
<html>
<body>
<h1>re:isValid</h1>
<p>The phone number <xsl:value-of select="$phoneNum1"/>
is <xsl:value-of select="if (re:isValid($phoneNum1,'phone','')) then
'valid.' else 'invalid'"/></p>
<p>The phone number <xsl:value-of select="$phoneNum2"/>
is <xsl:value-of select="if (re:isValid($phoneNum2,'phone','')) then
'valid.' else 'invalid'"/></p>
<p>The zipcode <xsl:value-of select="$zipCode"/> is
<xsl:value-of select="if (re:isValid($zipCode,'zipcode','')) then
'valid.' else 'invalid'"/></p>
<h1>re:format</h1>
<p>The properly formatted form of <xsl:value-of
select="$phoneNum1"/> is <xsl:value-of
select="re:format($phoneNum1,'phone','')"/>.</p>
<p>The properly formatted form of <xsl:value-of
select="$zipCode"/> is <xsl:value-of
select="re:format($zipCode,'zipcode','')"/>.</p>
<p>Here is an example of an alternate regex library
implementation for <xsl:value-of select="$phoneNum1"/>, returning
<xsl:value-of disable-output-escaping="yes"
select="re:format($phoneNum1,'phone','regexLibAlt.xml')"/></p>
<h1>re:phone</h1>
<p>You could also use the re:phone() function directly,
returning <xsl:value-of select="re:phone($phoneNum1)"/></p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
The line
<p>Here is an example of an alternate regex library ...</p>
uses an alternate library for performing regexes, regexLibAlt.xml. The
new library itself is significant because it illustrates a way that you
can actually generate XML code using the re:format() function
(regexLibAlt.xml):
<regularExpressions>
<regularExpression id="phone">
<pattern>^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$</pattern>
<replace><![CDATA[
<phone>
<areacode>$1</areacode>
<exchange>$2</exchange>
<localcode>$3</localcode>
</phone>]]></replace>
</regularExpression>
<regularExpression id="zipcode">
<pattern>^(\d{5})(-\d{4})?$</pattern>
<replace>$1$2</replace>
</regularExpression>
</regularExpressions>
Here, I've created a CDATA section that contains the mappings into the
XML code:
<replace><![CDATA[
<phone>
<areacode>$1</areacode>
<exchange>$2</exchange>
<localcode>$3</localcode>
</phone>]]></replace>
The $1,$2,$3 work as they did in the previous example. Normally, when
returned through the <xsl:value-of/> statement, the tagged code is
"escaped", with "<" and ">" characters converted into the &lt; and &gt;
sequences. However, if you set the disable-output-escaping attribute of
the <xsl:value-of/> element to "yes", this escaping is disabled, and you
generate pure XML code that you can then pass directly into a variable.
Thus, you could use regexes in this manner to build rich XML on the fly.
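The idea of using a replacement template to produce markup can be tried outside of XSLT as well. The Python sketch below is only an illustration under stated assumptions: phone_to_xml() and TEMPLATE are hypothetical names, and the template is written on one line rather than read from regexLibAlt.xml. It fills the template from the captured groups and then parses the result to confirm that it is well-formed:

```python
import re
import xml.etree.ElementTree as ET

PHONE = r'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$'
TEMPLATE = ('<phone><areacode>$1</areacode><exchange>$2</exchange>'
            '<localcode>$3</localcode></phone>')

def phone_to_xml(text: str) -> str:
    """Fill the markup template from the captured groups, or return ''."""
    match = re.search(PHONE, text)
    if match is None:
        return ''
    # Each $N in the template is replaced by the text of captured group N.
    return re.sub(r'\$(\d)', lambda m: match.group(int(m.group(1))), TEMPLATE)

markup = phone_to_xml('800.555.1212')
phone = ET.fromstring(markup)   # raises ParseError if the text is not well-formed
print(phone.findtext('exchange'))
```

This is safe here only because the captured groups can contain nothing but digits. With a looser pattern, a captured "<" or "&" would need to be escaped before being dropped into the template, which is the same risk that disabling output escaping carries.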
The alternative approach would be to create an XSLT named function for
each regex and define the code inline:
<xsl:function name="re:phone">
<xsl:param name="str"/>
<xsl:variable name="re"
select="'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$'"/>
<xsl:variable name="replaceStr" select="'($1)$2-$3'"/>
<xsl:result select="if (matches($str,$re)) then
replace($str,$re,$replaceStr) else ''"/>
</xsl:function>
This would then be called as
re:phone('888.555.1212') => '(888)555-1212'
re:phone('888.5554.1212') => ''
Because XPath treats an empty string as having the Boolean value
false(), you can use this in an if() expression to handle both valid and
invalid input:
<xsl:variable name="phoneNum" select="re:phone('888.555.1212')"/>
The phone number is <xsl:value-of select="if ($phoneNum) then $phoneNum
else 'not properly built.'"/>
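The "empty string means invalid" convention carries over to other languages that treat an empty string as false. Here is a minimal Python sketch of the same pattern; phone() mirrors the behaviour described for re:phone(), and describe() is a hypothetical helper added only for illustration:

```python
import re

def phone(text: str) -> str:
    """Return the number in (NNN)NNN-NNNN form, or '' if it does not match."""
    pattern = r'^\(?(\d{3})\)?\s?\-?\.?\s?(\d{3})\-?\.?(\d{4})$'
    match = re.search(pattern, text)
    return '({}){}-{}'.format(*match.groups()) if match else ''

def describe(text: str) -> str:
    """Use the empty string's falsiness to pick between the two outcomes."""
    formatted = phone(text)
    return formatted if formatted else 'not properly built.'

print('The phone number is', describe('888.555.1212'))
print('The phone number is', describe('888.5554.1212'))
```

The trade-off of this convention is that the caller cannot tell an invalid input apart from a pattern whose standard form happens to be empty, which is not a concern for phone numbers but could be for other formats.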
Just as a side note, if you are not familiar with how to run these
examples, you need to use the Saxon 7.2 processor, available from
SourceForge at http://saxon.sourceforge.net. Extract the saxon7.jar file into
a working directory in your classpath, then you can invoke these
routines from the Windows or Unix command line as
currentDir>java -jar saxon7.jar stub.xml regexesTest.xsl
or
currentDir>java -jar saxon7.jar -o outputDoc.htm stub.xml regexesTest.xsl
if you wanted to direct the output to the file outputDoc.htm.
============================================
Pass the Word
============================================
I'm heartened and gratified by the number of people who have joined the
list (60 and counting in two days). I have directed my current domain
http://www.kurtcagle.net so that it now points to the Yahoo site, so you
can see source code samples and archived columns for this work. I have
had a couple of questions as to why I'm using Yahoo groups to do this.
At the moment, it's a matter of expediency. My own server is sitting in a
storage locker in Portland, Oregon, while I look for a job, and until I
land somewhere (and I am available; email me at kurt@kurtcagle.net for
details) it's just easier to use existing tools. Once relocated, I'll
probably move this newsletter on to its own server, if for nothing else than
to escape the annoying advertising (and replace it with my own annoying
advertising).
I'm doing this newsletter as a free service. Please, if you like it, pass
on the link (http://www.kurtcagle.net) to anyone that you know who might
want to keep up with what's going on in my own little corner of the XML
world.
Until next time ...
Kurt Cagle
**********************************************
Copyright 2002 Cagle Communications
All Rights Reserved
**********************************************
-------------
Simon St.Laurent - SSL is my TLA
http://simonstl.com may be my URI
http://monasticxml.org may be my ascetic URI
urn:oid:1.3.6.1.4.1.6320 is another possibility altogether