> From: Amelia A Lewis [mailto:amyzing@talsever.com]
> Sent: Saturday, August 03, 2002 11:00 PM
> To: Uche Ogbuji
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] Re: URIs, concrete (was Re: [xml-dev] Un-ask the
> question)
>
>
> I'm going to be an irritating little git, Uche. Sorry.
>
> On Sat, 2002-08-03 at 15:30, Uche Ogbuji wrote:
> > [Amy wrote:]
> > > Sorry, do we have any escaping rules? I don't recall seeing such a
> > > thing in the Namespaces rec (I'm not considering the anyURI type in W3C
> > > XML Schema; does that have escaping rules? Or interesting rules for
> > > comparison? *sigh* Guess I'll go look ...).
> >
> > Yes we do. For example:
> >
> > http://bête.com
> >
> > Is an invalid URI, and thus an invalid namespace name. It must be escaped to
> >
> > http://b%eate.com
> >
> > One thing I don't know is how this URI restriction interacts with the recent
> > opening up of DNS to i18n.
>
> I can't actually find a justification for this. It isn't in the
> Namespaces recommendation, which is fairly silent on what a URI is.
And that's A Good Thing.
> Instead, the recommendation points at RFC 2396. Section 2 of RFC 2396
> discusses representations of URIs, and the generalized escape mechanism.
Yes.
> It is important to note, however, that the RFC delegates *all* authority
> over which characters are reserved for which components to the component
> ... that is, to the URI registration specification subsection dealing
> with that particular part of that particular URI scheme.
I disagree. Section 2:
<quote>
URI consist of a restricted set of characters, primarily chosen to aid
transcribability and usability both in computer systems and in non-computer
communications. Characters used conventionally as delimiters around URI were
excluded. The restricted set of characters consists of digits, letters, and
a few graphic symbols were chosen from those common to most of the character
encodings and input facilities available to Internet users.

   uric = reserved | unreserved | escaped

Within a URI, characters are either used as delimiters, or to represent
strings of data (octets) within the delimited portions. Octets are either
represented directly by a character (using the US-ASCII character for that
octet [ASCII]) or by an escape encoding. This representation is elaborated
below.
</quote>
So a URI by definition consists only of US-ASCII characters, independently
of the scheme.
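
To make the point concrete, here is a minimal Python sketch of the "uric" character rule quoted above. The character lists are copied from the reserved, mark, and escaped productions in RFC 2396 section 2; the function name uses_only_uric is made up for illustration. It checks only the character repertoire, not the full URI grammar of any scheme, so treat it as a rough test rather than a validator.

```python
import re

# Character classes from RFC 2396, section 2.
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"
ALPHANUM = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# uric = reserved | unreserved | escaped, where escaped = "%" hex hex
# and unreserved = alphanum | mark.
_URIC_RUN = re.compile(
    r"(?:[" + re.escape(RESERVED + MARK + ALPHANUM) + r"]|%[0-9A-Fa-f]{2})*"
)


def uses_only_uric(text):
    """Return True if every character of text fits the uric production.

    Hypothetical helper: it does not check scheme-specific syntax, and it
    does not allow "#", which RFC 2396 treats as a URI-reference delimiter.
    """
    return _URIC_RUN.fullmatch(text) is not None


print(uses_only_uric("http://b\u00eate.com"))  # the unescaped example from the thread
print(uses_only_uric("http://b%eate.com"))     # the escaped form from the thread
```

The first string contains a non-ASCII letter and so falls outside the uric set; the second uses only US-ASCII characters plus a "%" hex hex escape.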
> Or in other, other words, you may well have a requirement that URIs be
> legal and valid, per the scheme's constraints, before it is transformed
> into a namespace name. Once it has been so transformed, it is not
> possible to unescape it. Since the escape mechanism happens before a
> namespace name can be used, and there is no valid unescape mechanism,
> then it does not make sense to speak of an escape mechanism. What you
> have, instead, is just a string of characters. This string should
> follow the rules to create a valid URI in some scheme, encoded for
> computer-based transmission, but it doesn't matter, because the
> namespace recommendation says you can't modify it, or interpret it, in
> any useful fashion.
>
> Note that your example, above, is an invalid URI for computer
> transmission, but would be allowed, pretty explicitly, by RFC 2396. So
Nope. There's no distinction between a "URI" and a "URI for computer
transmission". There is no such thing as an "unescaped" URI. After unescaping
URI-reserved characters, it stops being a URI.
> blame the mess on TimBL, maybe. But it seems fairly clear that there is
> no two-way activity happening. If you get something that contains
> %61%6d%79, you are *not* allowed to read it as 'amy'. The namespaces
> recommendation gives you no permission to unescape the encoded
> characters.
Indeed.
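
A small Python sketch of that last point, assuming namespace names are compared character for character, as the Namespaces recommendation describes. The example.org names and the same_namespace helper are hypothetical; urllib.parse.unquote is used only to show what a reader might be tempted to do, not something the recommendation permits.

```python
from urllib.parse import unquote

# Hypothetical namespace names, used only for illustration.
escaped_name = "http://example.org/%61%6d%79"
plain_name = "http://example.org/amy"


def same_namespace(a, b):
    """Compare two namespace names as plain strings, with no unescaping."""
    return a == b


# Compared as strings, the two names are different namespaces.
print(same_namespace(escaped_name, plain_name))

# Unescaping first would make them look identical, which is exactly the
# interpretation the namespace comparison does not allow.
print(unquote(escaped_name) == plain_name)
```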
Julian