Re: [xml-dev] xml:base and fragments

From: "Andrew S. Townley" <ast@atownley.org>
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Date: Wed, 10 May 2017 13:23:53 +0200

> On May 10, 2017, at 5:27 AM, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> 
>> On May 9, 2017, at 4:02 PM, Andrew S. Townley <ast@atownley.org> wrote:
>> 
>> 
>> if string1 was equal to string2;
>>   and string1 was the base URI; 
>>   and string2 was the target (your http://A/ example),
>> then you have a ‘same-document’ reference (per spec).
>> 
>> If you’ve previously retrieved string1 as the base URI, then what the paragraph actually says was that you shouldn’t need to trigger a new retrieval of string2 because both are equal and refer to the same resource content that you have already loaded.
>> 
>> If you haven’t actually retrieved string1 as the base URI *and* you’re dereferencing the target as part of a retrieval action, then you have no choice but to trigger a new retrieval action (because you don’t have it yet).
> 
> You are disagreeing with RFC 3986 (and with some people
> who commented on the draft) here.  On your interpretation, a fresh
> retrieval will be necessary for any “same-document reference” (as
> RFC 3986 defines the term) for which the base URI does not equal
> the URI from which the document was retrieved.  

You’re right.  The specification says nothing about retrieval other than that, if the reference resolves to a URI identical to the base URI (aside from any fragment), then dereferencing it “should not” result in a new retrieval action.

The whole of Section 4.4 is silent on the retrieval operation that resulted in the octets of the resource being processed.

I think the main issue that potentially causes confusion is due to the particular behavior of URI fragments (from Section 3.5):

>    Fragment identifiers have a special role in information retrieval
>    systems as the primary form of client-side indirect referencing,
>    allowing an author to specifically identify aspects of an existing
>    resource that are only indirectly provided by the resource owner.  As
>    such, the fragment identifier is not used in the scheme-specific
>    processing of a URI; instead, the fragment identifier is separated
>    from the rest of the URI prior to a dereference, and thus the
>    identifying information within the fragment itself is dereferenced
>    solely by the user agent, regardless of the URI scheme.

Fragments, then, and not anything else, seem to be where things potentially become a bit fuzzy, because some systems knowingly or unknowingly violate the above specification and resolve fragment identifiers as part of the scheme-specific processing.
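
As a rough illustration of the separation Section 3.5 describes, here is a minimal Python sketch using the standard library's `urllib.parse.urldefrag`.  The function name `split_for_dereference` and the example URI are made up for illustration; the point is only that the fragment never reaches the scheme-specific request.

```python
from urllib.parse import urldefrag


def split_for_dereference(uri):
    """Split a URI into the part handed to scheme-specific processing
    and the fragment that stays with the user agent.

    Hypothetical helper: the name is not taken from any specification.
    """
    result = urldefrag(uri)
    # result.url is what would be sent to the server (or other scheme
    # handler); result.fragment is kept client-side.
    return result.url, result.fragment


request_uri, fragment = split_for_dereference("http://example.com/foo.xml#sec2")
print(request_uri)  # the URI actually dereferenced
print(fragment)     # interpreted later, against the retrieved representation
```

A system that sends the fragment along as part of the request is the kind of violation described above.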

Since fragment dereferencing is done “solely by the user agent”, the assumption here is that the targets of fragment identifiers exist within the content of the resource being loaded.  However, for many reasons, they may not, and so the dereference of the fragment identifier degrades to a reference to the current base URI of the resource being processed.

If the identified secondary resource exists, then it is “presented” by the user agent from the content of the current resource following the “should not trigger a retrieval action” guidance in the RFC.
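
To make that concrete, here is a small Python sketch of a user agent “presenting” a secondary resource from content it has already loaded.  It assumes, purely for illustration, that a bare-name fragment is matched against `xml:id` attributes (the real rules depend on the media type), and that a missing target falls back to the whole document, as described above.  The name `present_fragment` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def present_fragment(root, fragment):
    """Find `fragment` in an already-parsed document; no retrieval occurs.

    Returns the element whose xml:id equals the fragment, or the root
    element when the fragment is empty or identifies nothing (the
    "degrades to the current resource" case).
    """
    if fragment:
        for element in root.iter():
            if element.get(XML_ID) == fragment:
                return element
    return root


doc = ET.fromstring(
    '<doc><sec xml:id="s1">One</sec><sec xml:id="s2">Two</sec></doc>'
)
print(present_fragment(doc, "s2").text)       # text of the matching section
print(present_fragment(doc, "missing").tag)   # falls back to the root element
```

Nothing in the lookup knows or cares which URI the octets were originally loaded from.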

This is true regardless of the granularity of the base URI and “same document” test scope, e.g. document, entity or element as provided in XML Base.

If the resource being processed has the same base URI and passes the same-document test as the fragment identifier, then dereferencing should not result in another retrieval action as per Section 4.4, and all of the “same-document” tests pass independently of the URI used to load the octet stream in the first place.
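
For what it’s worth, the Section 4.4 test can be sketched in a few lines of Python with the standard `urllib.parse` functions.  This is a simplified reading, not a definitive implementation: it compares the resolved strings as-is, with none of the normalization from Section 6 of the RFC, and `is_same_document` is a name invented here.  Note that the URI the document was actually retrieved from is not an input at all.

```python
from urllib.parse import urldefrag, urljoin


def is_same_document(reference, base_uri):
    """Simplified same-document test in the spirit of RFC 3986, Section 4.4.

    The reference is resolved against the base URI; it counts as a
    same-document reference when the result is identical to the base URI
    once any fragment is set aside.  No normalization is applied (an
    assumption made to keep the sketch short).
    """
    target = urljoin(base_uri, reference)
    return urldefrag(target).url == urldefrag(base_uri).url


base = "http://example.com/foo"
print(is_same_document("#sec", base))   # fragment-only reference
print(is_same_document("", base))       # empty reference
print(is_same_document("bar", base))    # resolves to a different URI
```

With a base URI of http://example.com/foo, the reference “#sec” passes the test whether the octets came from http://example.com/foo, from http://www.Example.com/foo.xml, or from a local cache.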

> This behavior was predicted and deplored by some comments on
> the draft that later became 3986, because it means that any change
> to the base URI in a document risks breaking all document-internal
> links.  The lead editor explained (several times, I believe) that this
> is not in fact what the spec says or recommends.

> 
> RFC 3986 does NOT agree that you have no choice but to trigger
> a new retrieval in this situation; it tells you that the target of the link
> is *defined* as being within the current document and that a new
> retrieval action “should not” be triggered.

Yes.  You are correct, as I said above.  Thanks for pointing out my errant assumption.

> On what basis do you believe a piece of software can determine
> whether a new retrieval action is needed?  Because the document
> was retrieved from URI R and the base URI is a different URI B?

Actually, in deference to the RFC 3986 authors, I don’t think providing guidance for generalized, protocol-independent behavior – even for retrieval operations – should’ve been within the scope of the RFC.

In practice, there are many factors which can and should govern the conditions under which retrieval actions take place, and the majority of them have nothing to do with URIs.  I do appreciate that the specification attempted to give guidance for URIs based on the experience and architecture present in previous RFCs and the architecture of the Web in general, but both of these factors pre-suppose the existence of certain protocols and specifications that are orthogonal to a generalized representation of identifiers for abstract or physical resources.

> What principle of Web architecture says that two different URIs 
> cannot point at the same document?  

I never implied that there was such a principle within Web architecture, nor would I.

>> What I still didn’t understand was your:
>> 
>>>> * the reference is interpreted as a reference to the entity containing the reference, even if that entity is completely unrelated to anything you might find by retrieving the resource at http://A/. 
>> 
>> Because it must be related if you’ve already loaded it, and it by definition is a “same-document” reference.  How could it be completely unrelated?
>> 
>> The only way it would be potentially unrelated is if you held a base URI reference that wasn’t the same as the “target reference” because they would, in fact be different resources.  However, then they wouldn’t pass the test for “same document”.
> 
> You should perhaps re-read the definition of “same-document reference” in RFC 3986.
> It does not depend on performing two document retrievals and comparing the
> results.

Done, and yes.  You are correct.  However, I wasn’t referring to any comparison of the results, only the comparison of the URIs.

Sameness of resources can only be determined by comparison of the octets and metadata.  Until you dereference the resource through a retrieval action, you have no choice but to denote two URIs not conforming to the ‘same-document’ criteria as identifying different abstract or physical resources. 

This is why I said “they would, in fact be different resources”, because knowing only the URI and the rules in Section 4.4, you could logically make no other determination.

> 
>> This seems important (because you mentioned it), but I don’t see how it is possible.
> 
> There are at least two scenarios where I think most people will find it all makes
> perfect sense:  (1) the base URI is a canonical URI for the document, although in 
> fact the document was retrieved from a non-canonical URI.  (E.g., the document’s
> canonical URI is http://example.com/foo, but the document was retrieved by
> requesting the different URI http://www.Example.com/foo.xml.)  (2) the base URI
> is a URI for a partial document, embedded in the current document either by
> entity reference or by XInclude. 
> 

In case 1, they are related in terms of Michael’s original concern, and I can discern this through comparison of the octets and metadata.

In case 2, they’re still related in some way, if by no other means than they are embedded within the same source resource.  The meaning of that relation would only be defined by the author of that container resource.

I still don’t see that these are examples of what Michael was asking.

With a fresher head and more sleep, I think I get what Michael is actually concerned about now, however.  Since they pass the same-document test, if you follow the “should not” guidance and never actually dereference the base URI, then you have no knowledge, beyond the current resource being processed, of what octets and metadata might result from that dereference, because it never happens.  In effect you’ve been told, “these ARE the droids you’re looking for” in relation to the octets comprising the resource being processed.

Again, with the clearer head, Section 4.4 is solely the domain of URI fragments, and since fragments themselves are the responsibility of the client user agent, independent of the URI scheme, it seems that a client user agent operating on content elements identified by fragment identifiers MAY define some other actions that wouldn’t be defined in terms of the “retrieval actions” of Section 1.2.2, and therefore would not be governed by the specification.

In terms of potential impacts, do we really care and why?

Fragments are handled within the media type, according to Section 3.5:

>    The fragment's format and resolution is therefore
>    dependent on the media type [RFC2046] of a potentially retrieved
>    representation, even though such a retrieval is only performed if the
>    URI is dereferenced.  If no such representation exists, then the
>    semantics of the fragment are considered unknown and are effectively
>    unconstrained.  Fragment identifier semantics are independent of the
>    URI scheme and thus cannot be redefined by scheme specifications.

And it means that fragments can only target operations on secondary resources within the context of a primary resource (governed by a media type), so again, we’re scoped to client-side processing within the user agent’s capabilities for the media type of the retrieved primary resource.

Let’s say that “update”, per MK’s example, is the action that must be handled in relation to the fragment, because, if it’s not a fragment, we don’t care if it’s the same resource, and the scheme-specific interpretations of the operation would govern the interaction.

The semantics of this operation upon the media type would be defined independently of the scheme or protocols used to load the media type resource into the user agent.  Doing so allows user-agent processing of the media type before potentially doing something with that resource based on the currently appropriate base URI, at which point the specs take over, and we handle the request in terms of standard Web architecture constructs.

Isn’t this scenario akin to editing a file and then saving/deleting that file to/from a filesystem?

The only thing prevented (in this analogy) is checking to see whether the file changed out from under us because we “should not” initiate another retrieval operation on the resource itself since the empty URI is equivalent to the base URI, and again conforms to the same-document test.  Only fragments and empty relative URIs meet the criteria of “same document” per the RFC.

> 
>> 
>> Either way, I still don’t agree that the paragraph(s) you referenced in Section 4.4 is(are) doing more than providing implementation advice potentially relating to caching in the case of retrieval.  This advice wouldn’t necessarily apply to other protocol operations – especially if they weren’t idempotent – which is why I’m assuming the “should not” is present vs. “must not.”
> 
> The statement that 
> 
>    When a same-document reference is dereferenced for a retrieval
>    action, the target of that reference is defined to be within the same
>    entity (representation, document, or message) as the reference;
> 
> looks clearly normative to me.  The following statement, that 
> 
>    therefore, a dereference should not result in a new retrieval action.
> 
> does pretty clearly allow a same-document reference to be 
> dereferenced either without or with a new retrieval, but ‘should’ is
> a conformance keyword, and describing this sentence as “implementation 
> advice” does not seem to me to capture the situation very well.

To me, anything that isn’t a mandatory minimum requirement of a specification potentially falls under “implementation advice” because you are describing additional behavior or features that are essentially optional to some degree by the very use of the word “should.”

“Should” you attempt to implement as much of the specification as is technically and economically feasible for a given implementation environment?  Of course.

However, in this case, the statement refers to behavior that I, with my architect hat on, would say was an optimization for efficiency rather than causing non-conformance and/or significant incompatible side-effects with the specification if it were to be ignored.

The worst cases if I re-retrieve the resource are: a) the resource really was the same but referenced by two different URIs, and I take a performance hit; or b) the resource really wasn’t the same, and I present the user with a different set of octets relating to the base URI defined by the content creator, which they believed should be the canonical location of the item in the first place.  Maybe it’s updated, maybe it’s a redirect, or maybe it’s a 404.

Maybe it means the user no longer has access to the content, but what if that’s the intent of the content owner who should have authority over the content in the first place?

Answering that question is a philosophical discussion well beyond the scope of URIs, RFCs, and the Web in general, but possibly not xml-dev itself.

I can see pros and cons of each approach.

In this case, it doesn’t matter because the spec says “should not”, so that means I’m still conformant to the RFC in some way regardless of which choice I make as an implementor.

Not all “should”s are created equal, however, and I would not by default paint them with the same “implementation advice” brush.

Cheers,

ast
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org

