Re: [xml-dev] XML to graph

The Zorba cleaning library is the most immediately interesting. Since I am not starting from a blank slate (see below) and am applying some domain knowledge, I think the problem I am solving fits into too small a subset of the scope of the paper and Lise Getoor's tutorial. There is definitely commonality in some of the heuristics, but all I am doing is matching. The Zorba library might have changed my approach had I known about it earlier, but it would not have helped with cleaning movie ratings. There the problem was how to convert a completely free-format movie rating, where anything could be entered, such as a number on an unknown scale, a letter grade, or things like

2 out of -4..+4

and

**1/2

to a -2 to +2 scale.
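To make the rating problem concrete, here is a minimal sketch of how those two example forms could be mapped linearly onto -2..+2. It is not the code I used. The function names, the regular expressions and the `max_stars=4` default (the scale behind a star rating is exactly the unknown part) are assumptions for illustration, and letter grades are left unhandled.

```python
import re

TARGET_LO, TARGET_HI = -2.0, 2.0

# "2 out of -4..+4": an explicit value followed by an explicit range.
RANGE_RE = re.compile(
    r"^\s*([+-]?\d+(?:\.\d+)?)\s+out of\s+"
    r"([+-]?\d+(?:\.\d+)?)\s*\.\.\s*([+-]?\d+(?:\.\d+)?)\s*$"
)
# "**1/2": a run of stars with an optional trailing half star.
STARS_RE = re.compile(r"^\s*(\*+)\s*(1/2)?\s*$")


def rescale(value, lo, hi):
    """Linearly map value from [lo, hi] onto [TARGET_LO, TARGET_HI]."""
    if hi <= lo:
        raise ValueError("scale upper bound must exceed lower bound")
    fraction = (value - lo) / (hi - lo)
    return TARGET_LO + fraction * (TARGET_HI - TARGET_LO)


def normalise_rating(text, max_stars=4):
    """Return the rating on the -2..+2 scale, or None if the form is unknown.

    max_stars is an assumed default; the real scale has to come from
    knowledge of the source the rating was harvested from.
    """
    m = RANGE_RE.match(text)
    if m:
        value, lo, hi = (float(g) for g in m.groups())
        return rescale(value, lo, hi)
    m = STARS_RE.match(text)
    if m:
        stars = len(m.group(1)) + (0.5 if m.group(2) else 0.0)
        return rescale(stars, 0, max_stars)
    return None  # letter grades and other forms need their own rules
```

Each recognised form is one rule, so further forms can be added as extra patterns without touching the rescaling step.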

What I have is a bunch of heuristics based on common metadata from different movie silos. The heuristics are prioritised according to my own domain knowledge (not very satisfactory, but easy to change). I do a very superficial form of stemming (lower-case, get rid of non-alphanumerics), after which the highest-ranked heuristic is: if the titles match and they have a director in common, they are the same movie. The lowest-priority heuristic is: if the release dates are the same and they have an actor in common, they are the same movie (many movies don't have release date information). Some of these heuristics entail lookups of information culled from other repositories.
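As a rough illustration of the two heuristics just described, here is a sketch in Python rather than the XSLT I actually wrote. The record layout (dicts with `title`, `directors`, `actors` and `release_date` keys) and all function names are assumptions made up for this example.

```python
def normalise(text):
    """Superficial 'stemming': lower-case and drop non-alphanumerics."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def names(record, field):
    """Normalised set of the names stored under field (empty if absent)."""
    return {normalise(name) for name in record.get(field, [])}


def same_title_and_director(a, b):
    """Highest priority: titles match and a director is shared."""
    return (normalise(a["title"]) == normalise(b["title"])
            and bool(names(a, "directors") & names(b, "directors")))


def same_release_date_and_actor(a, b):
    """Lowest priority: release dates match and an actor is shared."""
    date_a, date_b = a.get("release_date"), b.get("release_date")
    # Many movies have no release date, so a missing date never matches.
    return (date_a is not None and date_a == date_b
            and bool(names(a, "actors") & names(b, "actors")))


# Priority order is just list order, which is what makes it easy to change.
HEURISTICS = [same_title_and_director, same_release_date_and_actor]
```

Alias lookups against other repositories would slot in by widening what `names` returns for a record.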

The Freebase Search API gives me a list of candidate solutions, so I am not starting from a blank slate, and it lets me circumvent concepts like edit distance. Below is a good example of the scope of the problem. We are trying to find the correct movie match for a 1998 release of Treasure Island. These are the ranked matches for that search term from the Freebase Search API. No, the right match isn't the top-ranked one, and yes, it is there, even though 1998 is not the year of any of the candidate matches.

<match mid="/m/0fw837" score="375.183075" year="1950" imdb_id="tt0043067">Treasure Island</match>

<match mid="/m/0gyk56x" score="362.390839" year="2012" imdb_id="tt1820723">Treasure Island</match>

<match mid="/m/05351g" score="357.812256" year="1996" imdb_id="tt0117110">Muppet Treasure Island</match>

<match mid="/m/027hq_7" score="312.396545" year="1990" imdb_id="tt0100813">Treasure Island</match>

<match mid="/m/0d6_3x" score="303.274017" year="1972" imdb_id="tt0069229">Treasure Island</match>

<match mid="/m/0dnv98" score="298.398956" year="1934" imdb_id="tt0025907">Treasure Island</match>

<match mid="/m/02vr1mt" score="291.193634" year="1988" imdb_id="tt0465041">Treasure Island</match>

<match mid="/m/0glqxkk" score="256.178223" year="1982" imdb_id="tt0084452">Treasure Island</match>

<match mid="/m/02_fm2" score="242.696091" year="2002" imdb_id="tt0133240">Treasure Planet</match>

<match mid="/m/076wc3r" score="234.503937" year="1985" imdb_id="tt0090199">Treasure Island</match>

<match mid="/m/04gsb_p" score="232.238983" year="1999" imdb_id="tt0248568">Treasure Island</match>

<match mid="/m/03d8xy4" score="232.115433" year="1972" imdb_id="tt0280371">Treasure Island</match>

<match mid="/m/02vr8jc" score="222.359039" year="1971" imdb_id="tt0067002">Animal Treasure Island</match>

<match mid="/m/05c2_7k" score="213.867325" year="2006" imdb_id="tt0811011">Pirates of Treasure Island</match>

<match mid="/m/04csrh1" score="210.738998" year="1920" imdb_id="tt0011785">Treasure Island</match>

<match mid="/m/0crsd_v" score="191.344147" year="1999" imdb_id="tt0181868">Treasure Island</match>

<match mid="/m/0ztj51t" score="175.443268" year="1987" imdb_id="tt0787225">Treasure Island</match>

<match mid="/m/0dlmcwr" score="166.556198" year="1954" imdb_id="tt0047406">Return to Treasure Island</match>

<match mid="/m/04j09gc" score="164.241943" year="1939" imdb_id="tt0031147">Charlie Chan at Treasure Island</match>

<match mid="/m/0crrmyz" score="161.611359" year="" imdb_id="">Treasure Island</match>

</movie>

The code itself is very compact, just over 100 lines of XSLT, and exploits what I hope is the lazy evaluation of a sequence expression:

MatchedMovie=(xpath expression for top heuristic, xpath expression for next heuristic, .... xpath expression for last heuristic)[1]
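For readers who don't think in XPath, the same first-hit-wins idea can be sketched in Python with a generator, where the laziness is guaranteed rather than hoped for. This is an analogy, not a translation of my stylesheet: `first_match` and its parameters are hypothetical names, and each heuristic is assumed to be a function taking the source record and a candidate and returning true or false.

```python
def first_match(source, candidates, heuristics):
    """Return the first candidate accepted by the highest-priority heuristic.

    heuristics is ordered from highest to lowest priority. The generator
    below is consumed only until its first item, so a lower-priority
    heuristic is never called once a higher-priority one has matched.
    Returns None when no heuristic accepts any candidate.
    """
    attempts = (
        candidate
        for heuristic in heuristics      # outer loop: priority order
        for candidate in candidates      # inner loop: search-result order
        if heuristic(source, candidate)
    )
    return next(attempts, None)
```

Plugging a rule in or out is then just a matter of editing the list passed as `heuristics`.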

This ability to plug and unplug heuristic rules makes me believe this can be the basis of a framework. I could certainly see it being applied to music data.

As you can see, it relies more on harvesting semantic metadata than on algorithms, and yes, it does solve the Treasure Island problem correctly.

I think it's tidier than the GroupLens project approach.

On Wed, Jul 1, 2015 at 7:04 PM, daniela florescu <dflorescu@me.com> wrote:

XQuery needs some serious extensions if you want to do what Helena did in her PhD….
(BTW, I was working with her when I wrote Quilt with Don Chamberlin… so can see some similarities ..)

Two major extensions would be:
1. FLWOR doesn’t stop when there is an exception, but just logs the exception and moves on
2. Group by has to be extended from a simple hash to a more general clustering algorithm

Dana

On Jul 1, 2015, at 3:35 PM, Ihe Onwuka <ihe.onwuka@gmail.com> wrote:

On Wed, Jul 1, 2015 at 2:59 PM, daniela florescu <dflorescu@me.com> wrote:
Ihe,

transforming XQuery to be able to do data cleaning has been a LONG desire of mine.

The problem articulated in the paper with Citeseer publications is similar to the issues I face. For movies there are additional weapons that can be brought to bear, because actors, directors and movie titles all have several aliases documented on various sites. That said, the problem with movies may be harder, because the incidence of two different papers sharing the same title is probably relatively low.

Reading on.....