OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: XPath calculation



From: Vakulenko, Andrey V <andrey.vakulenko@eds.com>

>I am looking for Java implementation of an algorithm for finding the most
>generic and short XPath expression for a selected element in DOM (XML
>document). Any ideas and recommendations are welcomed.

Sorry, no code, but here is a possible algorithm, no flames please.

I think the problem comes down to seeing the document as a graph,
with the root element, each element with an ID, and each unique set of
vertices
as potential anchor points. We want to find the shortest distance to a
potential
anchor point, then the simplest way to express the path between the anchor
and the element.

You start only considering the element: is it the root element
or an element with an ID or the only element with that name?  If not you
expand
the search one out:  does it have parent or children or siblings that are a
root element
or an element wth an ID, or is it uniquely named?  Next go to grandparent,
cousins
and grandchildren; at this stage you also consider whether any internal
(within the
current border) paths of distance 1 are unique.  Next to the
great-granchildren etc, and seeing whether any paths of distance 2 within
the current border are unique.

Continue, you will always stop because a document always has a root.

(You would presumably handle the effect of longer uniquess patterns
incrementally coming
into play by giving every element a number, indicating the length of the
shortest unique path in which it participates. This number would be added to
the distance from the element to see if it is within the current boundary as
a candidate IYSWIM)

Now you have found the point you are going to anchor your XPath from. We now
have to work backwards from that anchor to our element.  Treat the anchor as
the root of a new tree (dont allow up access from it, or rather, don' allow
acess from the branch that contains both it and your element, because the
anchor may not be an ancestor).

Procede downwards in a similar way to the upwards way, but this time the
alternatives are to find the element by a direct path or the nearest
similar-sized set of vertices. If you find a unique set of vertices before
you find a path, that is your second anchor. Trim the tree to fit both the
anchor and your element and repeat.

Something like that.  If you wanted "shortness" to be strict string
shortness, you would also have to factor in the sizes of element names and
id values, I guess.

A simpler heuristic might be just to go up ancestor-or-self::* until you
find an element with an id or the root.  Given that element names are often
longer than 6 characters, you may as well just use *[n] as use the element
name unless they are unusually small. Then attempt to make this path even
smaller by checking if it has any unique subpaths in it (i.e. if it is
*[@id="x1223"]/*[2]/*[5]/*[1]   and the *[2] is an x and the *[5] is a y and
there is only one
x/y in the document, it is shorter to use that  x/y/*[1].     This is not
optimal but does not involve the long searches of the previous method.
There would, of course, be lots of intermediate methods for better results.

Cheers
Rick Jelliffe