Re: [xml-dev] RE: Comprehensive list of issues with dereferencing a link?

From: "Liam R. E. Quin" <liam@w3.org>
To: "Costello, Roger L." <costello@mitre.org>
Date: Mon, 29 Jun 2015 00:37:01 -0400

On Sun, 2015-06-28 at 14:51 +0000, Costello, Roger L. wrote:
> Thank you very much for the outstanding responses.
> 
> I have a few questions about the behavior you expect of XML tools 
> with respect to links (specifically, links to a redirect page).
> 
> 
> 1. Suppose an XML document has an external entity reference (link). 
> The link points to a redirect page and the redirect page points to 
> the real content. What behavior do you expect of an XML parser? 
> Should the XML parser:

I'm going to assume you mean an HTTP redirect here.
> 
> (a) Follow the redirect and retrieve the real content.

Yes. This is mandated by the HTTP spec (or should be; the HTTP specs 
are awfully vague).

> 
> (b) Retrieve the redirect page.
> 
> (c) Generate an error.
> 
> 
> 2. Suppose an XML document is to be validated against an XML Schema. 
> The schema validator is provided the link to the XML instance 
> document and the link to the XML schema document. The link to the 
> XML schema document points to a redirect page and the redirect page 
> points to the real XML schema. What behavior do you expect of an XML 
> schema validator? Should the XML schema validator:
> 
> (a) Follow the redirect and retrieve the real XML schema (and 
> successfully validate against the real XML schema).

Yes. This is mandated by the HTTP spec (or should be; the HTTP specs 
are awfully vague).

Again, this depends on the HTTP status code in the response. If by 
"redirect page" you mean an HTML page that says "The schema you are 
looking for has been moved _here_" and gives a 200 OK response, then 
no, you should get an error, as that's not a schema document and a 
processor is not expected to parse natural language :)
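
To make the 200 OK case concrete, here is a minimal Python sketch of 
the check a validator effectively performs on whatever body it gets 
back. The function name looks_like_xml_schema is made up for this 
example and is not taken from any real validator; the sketch only 
assumes that a schema document is well-formed XML whose root element 
is xs:schema.

```python
import xml.etree.ElementTree as ET

# Namespace of XML Schema documents.
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def looks_like_xml_schema(body: bytes) -> bool:
    """Return True if body is well-formed XML whose root is xs:schema.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML at all (typical for real-world HTML).
        return False
    return root.tag == f"{{{XSD_NS}}}schema"


# A "moved" notice served with 200 OK: well-formed, but not a schema.
moved_page = (
    b"<html><body>The schema you are looking for has been moved "
    b"<a href='new.xsd'>here</a>.</body></html>"
)
real_schema = b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

print(looks_like_xml_schema(moved_page))
print(looks_like_xml_schema(real_schema))
```

The HTML notice fails the check however helpful its wording is to a 
human reader, which is why an error is the expected outcome.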

> 
> (b) Retrieve the redirect page and generate an error.
> 
> (c) Generate an error.
> 
> 
> 3. Suppose an XML document is to be transformed using an XSLT 
> program. The XSLT processor is provided the link to the XML instance 
> document and the link to the XSLT program. The link to the XSLT 
> program points to a redirect page and the redirect page points to 
> the real XSLT program. What behavior do you expect of an XSLT 
> processor? Should the XSLT processor:
> 
> (a) Follow the redirect and retrieve the real XSLT program (and 
> successfully transform using the real XSLT program).

Yes. *Any* thing using HTTP should follow the HTTP spec.

You can answer these questions by looking at RFC 7231 -- e.g. for a 
3xx redirect HTTP response code,

[[
The 3xx (Redirection) class of status code indicates that further
   action needs to be taken by the user agent in order to fulfill the
   request.  If a Location header field (Section 7.1.2) is provided, the
   user agent MAY automatically redirect its request to the URI
   referenced by the Location field value, even if the specific status
   code is not understood.  Automatic redirection needs to done with
   care for methods not known to be safe, as defined in Section 4.2.1,
   since the user might not wish to redirect an unsafe request.
]]

You can see that this has typical HTTP RFC vagueness :-) We can 
analyze it like this:
(1) the user agent (in this case the XSLT processor and its HTTP 
library) "needs" to take further action.
(2) if the user agent doesn't understand the response code and a 
Location is given then it MAY treat it as an automatic redirect (this 
doesn't apply to us; it's for the case that new 3xx codes are added to 
the spec later)
(3) The implementation (RFCs are generally addressed to implementors) 
needs to be careful in implementing automatic redirects.
OK, so from that we conclude that if we receive a 3xx HTTP code with 
a Location, our HTTP library should go fetch the resource at the new 
location.
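
As a rough illustration of that conclusion, here is a minimal Python 
sketch of a redirect-following loop. Everything in it is an 
assumption made for the example: the function name, the "fetch" 
callable (which stands in for a real HTTP request and returns a 
status, a headers dict and a body), the hop limit of 5, and the set 
of codes treated as redirects. It is not how any particular XSLT or 
schema processor does it.

```python
from urllib.parse import urljoin

# Status codes this sketch treats as "go look at Location instead".
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(url, fetch, max_hops=5):
    """Fetch url, following 3xx responses that carry a Location.

    fetch is a hypothetical callable: fetch(url) -> (status, headers, body),
    where headers is a dict keyed by header name.
    Returns (final_url, status, body).
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            # Location may be relative; resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, body
    raise RuntimeError("too many redirects")


# A fake "server" so the sketch runs without a network.
pages = {
    "http://example.org/old.xsd": (301, {"Location": "/new.xsd"}, b""),
    "http://example.org/new.xsd": (200, {}, b"<schema/>"),
}

result = fetch_following_redirects("http://example.org/old.xsd", pages.__getitem__)
print(result)
```

The hop limit is there because of point (3): an implementation has to 
protect itself against redirect loops.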

The likely codes are 301 (moved permanently), 302 (found somewhere 
else) and 307 (temporarily moved). There are other possibilities, such 
as 304 not modified, e.g. in response to the HTTP library keeping a 
cache and asking the server to send the resource only if it has 
changed since the cached copy was fetched.
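
The codes just mentioned can be laid out as a small decision table. 
This is only a sketch in Python: the function name next_step and the 
wording of the returned strings are invented for the example, and it 
deliberately covers just the codes named above rather than every 3xx 
code in the RFC.

```python
from http import HTTPStatus

# The redirect codes named above; other 3xx codes are left out on purpose.
FOLLOW_LOCATION = {301, 302, 307}


def next_step(status: int) -> str:
    """Say what an HTTP client would do next (hypothetical helper)."""
    if status == 304:
        # Answer to a conditional request: the cached copy is still good.
        return "reuse cached copy"
    if status in FOLLOW_LOCATION:
        return "fetch Location"
    if 200 <= status < 300:
        return "use response body"
    return "not handled in this sketch"


for code in (301, 302, 307, 304, 200):
    # HTTPStatus supplies the standard reason phrase for each code.
    print(code, HTTPStatus(code).phrase, "->", next_step(code))
```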

But all of this depends on the remote HTTP server sending a 301 (or 
other 3xx) redirect header with a Location having a valid URI.

If an HTML page with a meta http-equiv="refresh" element is used, 
served with HTTP 200 instead of 3xx, then no, that won't work, and 
shouldn't be expected to work, because HTML redirects are implemented 
by HTML user agents, and an XSLT or XML Schema engine is not an HTML 
user agent.
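
To show where an HTML-level redirect actually lives, here is a small 
Python sketch that digs a meta refresh out of a page body. The class 
and function names are made up for the example. Note that it needs an 
HTML parser to find the target at all, which is exactly the job of an 
HTML user agent; at most, a tool could use something like this to 
produce a friendlier error message.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Record the content of the first <meta http-equiv="refresh"> seen."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "refresh" and self.refresh is None:
            self.refresh = attributes.get("content")


def find_meta_refresh(html_text):
    """Return the refresh content string, or None (hypothetical helper)."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.refresh


page = (
    '<html><head><meta http-equiv="refresh" '
    'content="0; url=http://example.org/new.xsd"></head></html>'
)
print(find_meta_refresh(page))
print(find_meta_refresh("<html><body>plain page</body></html>"))
```

Nothing in the HTTP response itself (status 200, no Location) tells 
the HTTP library that this page is meant as a redirect.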

Liam

> (b) Retrieve the redirect page and generate an error.
> 
> (c) Generate an error.
> 
> 
> 4. Suppose an XML document is to be validated against an XML Schema 
> which contains an xs:import element with a link to a schema 
> document. The schema validator is provided the link to the XML 
> instance document and the link to the XML schema document. The link 
> to the imported XML schema document points to a redirect page and 
> the redirect page points to the real imported XML schema. What 
> behavior do you expect of an XML schema validator? Upon encountering 
> the import link should the XML schema validator:
> 
> (a) Follow the redirect and retrieve the real imported XML schema 
> (and successfully validate against the real imported XML schema).
> 
> (b) Retrieve the redirect page and generate an error.
> 
> (c) Generate an error.
> 
> 
> 5. Suppose an XML document is to be transformed using an XSLT 
> program which contains an xsl:import element with a link to an XSLT 
> document. The XSLT processor is provided the link to the XML 
> instance document and the link to the XSLT program. The link to the 
> imported XSLT program points to a redirect page and the redirect 
> page points to the real XSLT program. What behavior do you expect of 
> an XSLT processor? Upon encountering the import link should the XSLT 
> processor:
> 
> (a) Follow the redirect and retrieve the real XSLT program (and 
> successfully transform using the real imported XSLT program).
> 
> (b) Retrieve the redirect page and generate an error.
> 
> (c) Generate an error.
> 
> 
> Have you tried any of these? Today I tried (2) -- validate an XML 
> document against a schema. The validator did _not_ follow the 
> redirect to the real XML schema document. The validator generated an 
> error. That's not the behavior I expected.
> 
> 
> /Roger
