RE: [xml-dev] The limitations of XPath and navigation- A XPath/XQuery

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]

RE: [xml-dev] The limitations of XPath and navigation- A XPath/XQuery Challenge

From: "Michael Kay" <mike@saxonica.com>
To: <mike@adatinc.com>,<john.snelson@oracle.com>,<james.fuller.2007@gmail.com>
Date: Tue, 12 Feb 2008 11:31:46 -0000


	EmpView XML
 
	<root>
	  <emp empid="Emp01">
	   <dpnd dpndid="Dpnd01"/>
	   <eaddr eaddrid="Addr01"/>
	  </emp>
	  <emp empid="Emp02">
	   <eaddr eaddrid="Addr03"/>
	  </emp>
	</root>

	CustView XML

	<root>  
	 <cust custid="Cust01">
	   <invoice invid="Inv01"/>
	   <invoice invid="Inv02"/>   
	   <addr addrid="Addr01"/>  
	 </cust>  
	 <cust custid="Cust02">
	   <invoice invid="Inv03"/>
	   <addr addrid="Addr02"/>
	   <addr addrid="Addr04"/>
	 </cust>
	 <cust custid="Cust03">
	   <addr addrid="Addr03"/>
	 </cust> 

	</root>
	 
	This is the ANSI SQL multi-path query that will be translated into
XQuery:

	 
	SELECT DpndID, CustID, EmpID, InvID, AddrID 
	FROM EmpView LEFT JOIN CustView  ON EmpCustID=CustID
	WHERE Invoice="Inv02"

	The query description given here is the default SQL operation. The
data fields that are desired in the hierarchical XML result are specified in
the SELECT statement. The selected fields stay within their XML elements
when output (attribute mode is the default). Node promotion will occur
automatically around empty XML elements (with no data selected) in the
output node. The FROM clause identifies the input objects and how they are
related which is indicated on the ON clause. In this case the EmpView XML
document is (left) joined over the Custview XML document linked by the
EmcustID and CustID data values. The combined hierarchical structure is then
filtered by the WHERE clause filter of Invoice='Inv02'. Referenced fields in
the SQL query do not have to be selected for output.

MK>I'm struggling with this because I can't see what EmpCustID might be.
I've come to the conclusion that you left out an attribute of the <emp>
element and it should have been <emp empid="Emp01" empCustId="Cust01">. If I
got that wrong, sorry, please correct me.

Also you haven't explained in English what the query is supposed to do, so
I'm having to rely on my very rusty knowledge of SQL to understand it. And
I'm afraid I'm confused. I thought the idea of LEFT JOIN was that every EMP
would appear in the result whether or not there was a matching CUST? But
perhaps the WHERE clause overrides this - in which case it's surely the same
as an ordinary join? Sorry, I'm really struggling to reverse-engineer the
query specification from a language I don't fully understand.

When you write Invoice="Inv02" I would have expected to see
Invoice.InvID="Inv02". Are you allowing a table name used in a comparison to
be treated as a reference to its primary key?

Anyway, with your sample data you can get the supplied output using:

declare variable $emps := doc('employees.xml')/root/emp;
declare variable $custs := doc('customers.xml')/root/cust;

<root>{
for $e in $emps,
    $c in $custs[@custid = $e/@empcustid],
    $i in $c/invoice[@invid = "Inv02"]
return
  <emp>
    {$e/@*, $e/*} 
    <cust>
       {$c/@*, $i, $c/addr}
    </cust>
  </emp>
}</root>    
   

	As you can see, this full hierarchical processing is being performed
at a high hierarchical conceptual level and can be specified easily and
interactively by non technical users. In XQuery terms, the FLOWR expression
and output XML output template is being automatically created and performed
accurately regardless of the internal processing complexity. So adding a new
field in the SELECT list that will cause a new hierarchical path to also be
accessed and complicate the hierarchical processing further is still all
that is needed. Because the output structure is known, the proper data
replication removal is performed.

MK> The main differences seem to be (a) that in the SQL version (I'm not
actually clear now whether this is standard SQL or some variant of your own
invention?) you are implicitly qualifying names appearing in the query, and
implicitly adding the relationship between customer and invoice, based on
knowledge of the schema, and (b) in the SQL version you are making guesses
about the required output structure, for example that within the output
<cust> element, the <invoice> child should appear before <addr>.

Generally, my feeling looking at these is that the explicit construction of
the result hierarchy in the XQuery solution feels the right thing to do; my
gut feeling is that if the system were left to design its own hierarchy it
would not usually produce what the user wants. On the other hand, I have
rather mixed feelings about making the hierarchic paths to input data less
explicit. It's true that it's very easy to make mistakes (I had several
attempts at getting this right because I wrote $e/empcustid instead of
$e/@emcpcustid). However, use of schema-aware query processing provides
quick diagnosis of such errors (and without a schema, there's no chance of
doing the implicit navigation anyway).

So the debate seems to boil down to a question of whether implicit
navigation is better than explicit navigation: whether the system should
guess that when I say invoiceId I must mean $c/invoices/invoice/invoiceId. I
can see arguments both ways on that. XQuery of course allows you to say
$c//invoiceId if you know it's unambiguous - but you do have to say that
you're looking in the hierarchy rooted at $c.

Michael Kay
http://www.saxonica.com/

References:
- Re: [xml-dev] The limitations of XPath and navigation- A XPath/XQuery Challenge
  - From: mike@adatinc.com

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]