xml-dev - how HTTP URIs and URI references work (or don't)

To: xml-dev@lists.xml.org
Subject: how HTTP URIs and URI references work (or don't)
From: "Simon St.Laurent" <simonstl@simonstl.com>
Date: Thu, 10 Oct 2002 12:46:32 -0400
The use of HTTP URIs in a number of contexts is important to XML work in
general, and the nature of HTTP URIs is important to particular aspects
of XML processing, notably namespaces and RDDL, so it seems worth
exploring how these things actually work.

RFC 2616[1] defines the HTTP 1.1 protocol and also the http scheme for
URLs:

>3.2.2 http URL
>
>The "http" scheme is used to locate network resources via the HTTP
>protocol. This section defines the scheme-specific syntax and
>semantics for http URLs.
>
>http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
>
> If the port is empty or not given, port 80 is assumed. The semantics
>are that the identified resource is located at the server listening
>for TCP connections on that port of that host, and the Request-URI
>for the resource is abs_path (section 5.1.2)....

Although this defines the scheme in the now-unfashionable (so 1999)
terminology of URLs, it both conforms to common expectations about what
something that starts "http://" is for and defines what a resource does.
A resource identified by a URI using the http scheme is not merely
something that is (or isn't); instead, it is something "listening for
TCP connections..."
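
As a rough illustration of the grammar quoted above, here is a minimal
Python sketch that pulls the listener (host and port) and the
Request-URI out of an http URL.  The function name http_listener is my
own invention for this example; only urlsplit comes from the standard
library, and the default of port 80 and the "/" for an empty abs_path
follow RFC 2616 sections 3.2.2 and 5.1.2.

```python
from urllib.parse import urlsplit

def http_listener(url):
    """Return (host, port, request_uri) for an http URL.

    Hypothetical helper for illustration; it is not defined by any RFC.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("not an http URL: " + url)
    host = parts.hostname
    # "If the port is empty or not given, port 80 is assumed."
    port = parts.port if parts.port is not None else 80
    # An empty abs_path is sent as "/" in the request line.
    path = parts.path or "/"
    request_uri = path + ("?" + parts.query if parts.query else "")
    return host, port, request_uri

print(http_listener("http://www.cnn.com"))
print(http_listener("http://example.org:8080/a/b?x=1"))
```

Nothing in the result says anything about what the listener will send
back; it only says where to knock and what to ask for.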

This notion of resource as listener makes it very easy to discuss HTTP
resources in the abstract, without concern for what the listener might
say in response.  http://www.cnn.com is the web site for CNN, whatever
the news of the day might be or the ownership of the station;
http://dilbert.com is an eternal fount of truth; and so on.  There's a
listener identified by those URIs, perhaps even a distributed listener,
and it works quite nicely.

This level of abstraction, however useful, is a far cry from using HTTP
URIs to identify resources which are not in fact HTTP listeners, which
seems to be a more recent trend since the publication of RFC 2616.
Being able to discuss HTTP URIs as abstract identifiers for listening
resources is very different from being able to use HTTP URIs as abstract
identifiers for arbitrary subjects.

Another set of related issues arises because many of the specifications
that incorporate URIs don't incorporate just URIs themselves.  Rather,
they incorporate URI references, a more fully-featured toolkit that
includes both relative addressing and fragment identifiers.  Those
features are both defined in a different specification, RFC 2396[2],
which is not HTTP-specific.
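
To make the two extra features of URI references concrete, here is a
small Python sketch using the standard urllib.parse module.  The base
URI, the example references, and the helper name resolve are all
made up for illustration.

```python
from urllib.parse import urljoin, urldefrag

def resolve(base, reference):
    """Resolve a URI reference against a base URI, then split off the fragment.

    Hypothetical helper: returns (absolute_uri, fragment).
    """
    absolute = urljoin(base, reference)       # relative addressing
    uri, fragment = urldefrag(absolute)       # fragment identifier
    return uri, fragment

base = "http://example.org/docs/spec.html"    # assumed base, for illustration
for ref in ("../ns/shapes#circle", "#intro", "http://other.example/x"):
    print(ref, "->", resolve(base, ref))
```

The same reference resolves to a different URI under a different base,
and the fragment is carried along separately from the URI proper.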

The appropriate use of relative addressing has been previously discussed
as it applies to namespaces, and the conclusion reached seems pretty
simple: use relative addressing only for information that needs to
change depending on context, and don't use it as a shortcut for
information that should remain stable.  Having concluded that namespace
identifiers should remain stable, the XML Plenary deprecated the use of
relative URI references in namespace identifiers.[3]

Fragment identifiers are a very different set of problems.  Although
fragment identifiers (anything after a #, perhaps including nothing
after a pound) are defined generally by RFC 2396, the interpretation of
fragment identifiers is left to client processing and is dependent on
the media type of the information returned by the resource to the
client, as defined in Section 4.1:

>When a URI reference is used to perform a retrieval action on the
>identified resource, the optional fragment identifier, separated from
>the URI by a crosshatch ("#") character, consists of additional
>reference information to be interpreted by the user agent after the
>retrieval action has been successfully completed....
>
>The semantics of a fragment identifier is a property of the data
>resulting from a retrieval action, regardless of the type of URI used
>in the reference.  Therefore, the format and interpretation of
>fragment identifiers is dependent on the media type [RFC2046] of the
>retrieval result. The character restrictions described in Section 2
>for URI also apply to the fragment in a URI-reference.  Individual
>media types may define additional restrictions or structure within the
>fragment for specifying different types of "partial views" that can be
>identified within that media type.
>
>A fragment identifier is only meaningful when a URI reference is
>intended for retrieval and the result of that retrieval is a document
>for which the identified fragment is consistently defined.

URI references clearly demand a tighter coupling between the identifier
and the type of the thing identified.  With HTTP, it is entirely possible
and perhaps even more and more likely (thanks to XML-based kits like
Cocoon and AxKit) that requests to the same URI will produce
substantially different "data resulting from a retrieval result"
depending on contexts which are not specified in the URI reference
itself.  (XHTML, for instance, has a lot of linking elements with
separate type attributes for optional identification of the MIME
Content-Type desired.)
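
A small Python sketch of that coupling, under heavy assumptions: the
function locate_fragment and its handling of exactly one media type
are mine, and real user agents do considerably more.  The point is
only that the same fragment means something for one media type and
nothing for another.

```python
from html.parser import HTMLParser

class _IdFinder(HTMLParser):
    """Record the first start tag whose id (or <a name>) equals the fragment."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        for name, value in attrs:
            is_anchor_name = (tag == "a" and name == "name")
            if (name == "id" or is_anchor_name) and value == self.wanted:
                self.found = tag

def locate_fragment(media_type, body, fragment):
    """Hypothetical helper: interpret a fragment according to the media type.

    Returns the matching tag name for text/html, or None when this sketch
    knows no fragment semantics for the media type.
    """
    if media_type == "text/html":
        finder = _IdFinder(fragment)
        finder.feed(body)
        return finder.found
    return None

page = '<h1 id="intro">Hello</h1><p>Some text.</p>'
print(locate_fragment("text/html", page, "intro"))
print(locate_fragment("text/plain", page, "intro"))
```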

While it might be nice for multiple formats to have common fragment
identifiers, the difficulties are fairly obvious once you examine the
diversity of types the Web supports, from HTML to plain text to graphics
to audio and video.  To single out a particular (and very useful) case,
SVG defines [4] the svgView() fragment identifier scheme, as in:

MyDrawing.svg#svgView(viewBox(0,200,1000,1000))
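
For what it's worth, here is a minimal Python sketch of how a client
might pick that particular fragment apart.  It assumes the fragment
contains a single viewBox() and nothing else; the SVG specification
allows further svgView() parameters, which this sketch deliberately
does not handle.  The function name parse_svg_viewbox is hypothetical.

```python
import re
from urllib.parse import urldefrag

# Matches only the single-parameter form svgView(viewBox(a,b,c,d)).
_VIEWBOX = re.compile(r"^svgView\(viewBox\(([^()]*)\)\)$")

def parse_svg_viewbox(fragment):
    """Return the four viewBox numbers as floats, or None if not recognized."""
    match = _VIEWBOX.match(fragment)
    if match is None:
        return None
    try:
        numbers = [float(n) for n in match.group(1).split(",")]
    except ValueError:
        return None
    if len(numbers) != 4:
        return None
    return tuple(numbers)

uri, fragment = urldefrag("MyDrawing.svg#svgView(viewBox(0,200,1000,1000))")
print(uri, parse_svg_viewbox(fragment))
```

A fragment like this is meaningless to an HTML or plain-text
processor, which is exactly the media-type dependence RFC 2396
describes.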

The complications that have slowed progress on XPointer are worth
consideration as well, as is the scheme-based approach the XPointer WG
appears to have settled on, with its (I think necessary) options for
diversity of implementation.

The value of fragment identifiers in ordinary linking situations where
the type of "data resulting from a retrieval result" is constrained
through mechanisms beyond the URI reference itself is pretty obvious, I
think.  Pointing to particular locations within documents is frequent
and useful, and a pointer system is necessary for effective use of
out-of-line hypertext.

The value of fragment identifiers in situations where the type of "data
resulting from a retrieval result" is not constrained is far less clear.
Namespaces in XML, for example, provides no information whatsoever
beyond a URI reference.  Many other uses of URI references similarly
provide only the URI reference and no further context.  Many of these
specifications appear to have lost sight of the notion that, for
example, an http-schemed URI reference involves a listening resource
which returns a variety of types of data.

While the use of URI reference syntax for string identifiers may seem
acceptable to URI proponents who have long since abandoned a notion of
resources as active beings participating in conversations, this use has
little if anything to do with the practice defined for URI references
generally and http URIs particularly by RFCs 2396 and 2616.

It may be a stretch to describe URIs and URI references beginning with
"http" as contracts which bring expectations for performance, but there
are clearly both formal and informal descriptions of those expectations.
Within those expectations, http URIs and URI references function very
well.  When pressed beyond those expectations into a world of arbitrary
identification, http URIs and URI references create confusion rather
than reduce it.

For those of us in XML-land, this has a few implications:

1) It's not clear what namespaces containing fragment identifiers (even
if they aren't http) are about; it may make more sense to use URIs, and
if http URIs, put a RDDL document there whose fragment identifiers
identify tools.

2) Pretending that the URI in a namespace identifier identifies the
namespace rather than a listening (for http) resource is foolish; it may
make more sense to redescribe namespaces in a context which offers
namespaces-as-affiliation-with-a-URI than as namespaces-as-a-URI.

3) In other contexts where URI references are used, additional
constraining information regarding the expected type of "data resulting
from a retrieval result" should be provided either in the specification
or explicitly in the document, as XHTML does with type attributes.  This
will help to ensure that fragment identifiers are interpreted in an
appropriate context.  XLink notably fails to do this, leaving
content-type identification to further URI interpretation rather than
MIME type identification.

4) If you provide an identifier which looks like it points to a listener
which provides responses (like an http URI or URI reference), make sure
there's actually a listener.  That listener can then provide
representations describing the affiliation between itself and your use
of the identifier.

5) Seriously consider specifying URIs rather than URI references, even
in contexts where 'just HTTP' is in use, unless you actually need and
are prepared to deal with the additional features/consequences of URI
reference usage. 

I'm not entirely sure why some people prefer Platonic Forms to the
practices defined in the specifications, but the specifications seem to
offer enough abstraction to be useful without the ever-expanding
complications that appear as HTTP identifiers are separated from their
foundations.

[1] - http://www.ietf.org/rfc/rfc2616.txt (June 1999)
[2] - http://www.ietf.org/rfc/rfc2396.txt (August 1998)
[3] - http://lists.w3.org/Archives/Public/xml-uri/2000Sep/0083.html
[4] - http://www.w3.org/TR/SVG/linking.html#SVGFragmentIdentifiers


-------------
Simon St.Laurent - SSL is my TLA
http://simonstl.com may be my URI
http://monasticxml.org may be my ascetic URI
urn:oid:1.3.6.1.4.1.6320 is another possibility altogether