xml-dev - RE: [xml-dev] Re: URIs, concrete (was Re: [xml-dev] Un-ask the question)



To: "Amelia A Lewis" <amyzing@talsever.com>
Subject: RE: [xml-dev] Re: URIs, concrete (was Re: [xml-dev] Un-ask the question)
From: "Julian Reschke" <julian.reschke@gmx.de>
Date: Sun, 4 Aug 2002 11:25:07 +0200
Cc: <xml-dev@lists.xml.org>
Importance: Normal
In-reply-to: <1028408411.722.43.camel@marajen>

> From: Amelia A Lewis [mailto:amyzing@talsever.com]
> Sent: Saturday, August 03, 2002 11:00 PM
> To: Uche Ogbuji
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] Re: URIs, concrete (was Re: [xml-dev] Un-ask the
> question)
>
>
> I'm going to be an irritating little git, Uche.  Sorry.
>
> On Sat, 2002-08-03 at 15:30, Uche Ogbuji wrote:
> > [Amy wrote:]
> > > Sorry, do we have any escaping rules?  I don't recall seeing such a
> > > thing in the Namespaces rec (I'm not considering the anyURI type in W3C
> > > XML Schema; does that have escaping rules?  Or interesting rules for
> > > comparison?  *sigh*  Guess I'll go look ...).
> >
> > Yes we do.  For example:
> >
> > http://bête.com
> >
> > Is an invalid URI, and thus an invalid namespace name.  It must be escaped to
> >
> > http://b%eate.com
> >
> > One thing I don't know is how this URI restriction interacts with the recent
> > opening up of DNS to i18n.
>
> I can't actually find a justification for this.  It isn't in the
> Namespaces recommendation, which is fairly silent on what a URI is.

And that's A Good Thing.

> Instead, the recommendation points at RFC 2396.  Section 2 of RFC 2396
> discusses representations of URIs, and the generalized escape mechanism.

Yes.

> It is important to note, however, that the RFC delegates *all* authority
> over which characters are reserved for which components to the component
> ... that is, to the URI registration specification subsection dealing
> with that particular part of that particular URI scheme.

I disagree. Section 2:

<quote>
URI consist of a restricted set of characters, primarily chosen to aid
transcribability and usability both in computer systems and in non-computer
communications. Characters used conventionally as delimiters around URI were
excluded. The restricted set of characters consists of digits, letters, and
a few graphic symbols were chosen from those common to most of the character
encodings and input facilities available to Internet users.

uric          = reserved | unreserved | escaped

Within a URI, characters are either used as delimiters, or to represent
strings of data (octets) within the delimited portions. Octets are either
represented directly by a character (using the US-ASCII character for that
octet [ASCII]) or by an escape encoding. This representation is elaborated
below.
</quote>

So a URI by definition consists only of US-ASCII characters, independently
of the scheme.
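
To make the character-level point concrete, here is a minimal Python sketch.
It is an illustration only, not anything taken from the RFC: the helper name
is_ascii_only is made up, and the choice of octet encoding (UTF-8 versus
Latin-1) is an assumption, since the escape mechanism only describes how to
write octets, not which octets a character maps to. It also says nothing
about whether the escaped result is a sensible host name.

```python
from urllib.parse import quote


def is_ascii_only(s: str) -> bool:
    """Hypothetical helper: True if every character is in the US-ASCII range."""
    return all(ord(ch) < 128 for ch in s)


# The unescaped example from the thread (U+00EA is e-circumflex).
raw = "http://b\u00eate.com"

# Percent-escape everything outside the unreserved set, keeping ":" and "/".
# The octets produced depend on the character encoding assumed.
escaped_utf8 = quote(raw, safe=":/", encoding="utf-8")
escaped_latin1 = quote(raw, safe=":/", encoding="latin-1")

print(is_ascii_only(raw))           # the raw string contains a non-ASCII character
print(escaped_utf8)
print(escaped_latin1)
print(is_ascii_only(escaped_utf8))  # the escaped form is pure US-ASCII
```

The Latin-1 result corresponds to the %ea form quoted above; the UTF-8 result
uses two escaped octets for the same character.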

> Or in other, other words, you may well have a requirement that URIs be
> legal and valid, per the scheme's constraints, before it is transformed
> into a namespace name.  Once it has been so transformed, it is not
> possible to unescape it.  Since the escape mechanism happens before a
> namespace name can be used, and there is no valid unescape mechanism,
> then it does not make sense to speak of an escape mechanism.  What you
> have, instead, is just a string of characters.  This string should
> follow the rules to create a valid URI in some scheme, encoded for
> computer-based transmission, but it doesn't matter, because the
> namespace recommendation says you can't modify it, or interpret it, in
> any useful fashion.
>
> Note that your example, above, is an invalid URI for computer
> transmission, but would be allowed, pretty explicitly, by RFC 2396.  So

Nope. There's no distinction between a "URI" and a "URI for computer
transmission". There is no such thing as an "unescaped" URI. After unescaping
URI-reserved characters, it stops being a URI.

> blame the mess on TimBL, maybe.  But it seems fairly clear that there is
> no two-way activity happening.  If you get something that contains
> %61%6d%79, you are *not* allowed to read it as 'amy'.  The namespaces
> recommendation gives you no permission to unescape the encoded
> characters.

Indeed.
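
A short Python sketch of what that means in practice, assuming namespace
names are compared as literal strings, character for character, which is how
the Namespaces recommendation defines identity. The function name
same_namespace and the example.org names are made up for illustration.

```python
from urllib.parse import unquote


def same_namespace(a: str, b: str) -> bool:
    """Hypothetical helper: compare namespace names character for character,
    with no unescaping and no other normalization."""
    return a == b


ns_escaped = "http://example.org/%61%6d%79"  # illustrative name only
ns_plain = "http://example.org/amy"          # illustrative name only

# A URI-aware tool could unescape the first name and arrive at the second...
print(unquote(ns_escaped) == ns_plain)

# ...but as namespace names the two are simply different strings.
print(same_namespace(ns_escaped, ns_plain))
```

So two documents using these two names are using two different namespaces,
even though a resolver might treat the names as the same resource.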


Julian

