Re: [xml-dev] XML to graph

The Zorba cleaning library is the most immediately interesting. Since I am not starting from a blank slate (see below) and am applying some domain knowledge, I think the problem I am solving fits into too small a subset of the scope of the paper and Lise Getoor's tutorial. There is definitely commonality in some of the heuristics, but all I am doing is matching. The Zorba library might have changed my approach had I known about it earlier, but it would not have helped to clean movie ratings, where I had to turn a completely free-format rating, in which anything could be entered, such as a number on an unknown scale, a letter grade, or things like

2 out of -4..+4

and 

**1/2

to a -2 to +2 scale.
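
For illustration only, here is a rough Python sketch of that kind of rating normalisation. It is not the code actually used, and it covers just the two example formats quoted above. The function names, the regular expressions and the `max_stars=4` default are all assumptions made for the example, since the real star scale is unknown.

```python
import re

TARGET_LO, TARGET_HI = -2.0, 2.0

def rescale(value, lo, hi):
    """Linearly map value from the source scale [lo, hi] onto -2..+2."""
    if hi <= lo:
        raise ValueError("empty source scale")
    return TARGET_LO + (value - lo) * (TARGET_HI - TARGET_LO) / (hi - lo)

NUM = r"([+-]?\d+(?:\.\d+)?)"
# e.g. "2 out of -4..+4"
RANGE_RE = re.compile(rf"^\s*{NUM}\s+out of\s+{NUM}\s*\.\.\s*{NUM}\s*$")
# e.g. "**1/2" (two and a half stars)
STARS_RE = re.compile(r"^\s*(\*+)\s*(1/2)?\s*$")

def normalise_rating(raw, max_stars=4):
    """Return the rating on a -2..+2 scale, or None if the format is not recognised.

    max_stars is a guess at the star scale (an assumption of this sketch).
    """
    m = RANGE_RE.match(raw)
    if m:
        value, lo, hi = (float(g) for g in m.groups())
        return rescale(value, lo, hi)
    m = STARS_RE.match(raw)
    if m:
        stars = len(m.group(1)) + (0.5 if m.group(2) else 0.0)
        return rescale(stars, 0, max_stars)
    return None
```

Anything the sketch does not recognise (letter grades, bare numbers on an unknown scale) comes back as `None`, so such ratings can be handled separately instead of being guessed at.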

What I have is a bunch of heuristics based on common metadata from different movie silos. The heuristics are prioritised based on my own domain knowledge (not very satisfactory, is it, but easy to change). I do a very superficial form of stemming (lower-casing and stripping non-alphanumerics), after which the highest-ranked heuristic is: if the titles match and they have a director in common, they are the same movie. The lowest-priority heuristic is: if the movie release dates are the same and they have an actor in common, they are the same movie (many movies don't have release date information). Some of these heuristics entail lookups of information culled from other repositories.
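
A minimal Python sketch of the normalisation step and the two heuristics just described, for illustration only. The real implementation is XSLT and is not shown in this message. The `Movie` record, its field names and the function names are hypothetical, and the normalisation here is ASCII-only.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Movie:
    # Hypothetical record; the fields are assumptions made for this sketch.
    title: str
    directors: set = field(default_factory=set)
    actors: set = field(default_factory=set)
    release_date: str = ""  # empty when the silo has no release date

def norm(text):
    """Lower-case and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def names(people):
    """Normalise a collection of person names for comparison."""
    return {norm(p) for p in people}

def same_title_and_director(a, b):
    """Highest-priority rule: titles match and a director is shared."""
    return norm(a.title) == norm(b.title) and bool(names(a.directors) & names(b.directors))

def same_date_and_actor(a, b):
    """Lowest-priority rule: release dates match and an actor is shared."""
    return (bool(a.release_date)
            and a.release_date == b.release_date
            and bool(names(a.actors) & names(b.actors)))
```

Keeping each rule as a separate predicate is what makes the ordering easy to change: the priority lives in the order the predicates are tried, not in the predicates themselves.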

Using the Freebase Search API gives me a list of candidate solutions, so I am not starting from a blank slate, and it allows me to circumvent concepts like edit distance. Below is a good example of the scope of the problem. We are trying to find the correct movie match for a 1998 release of Treasure Island. These are the ranked matches for that search term from the Freebase Search API. No, the right match isn't the top-ranked one, and yes, it is there even though 1998 is not the year of any of the candidate matches.

<movie term="Treasure Island" year="1998" rtLink="/m/1116410-treasure_island/">
    <match mid="/m/0fw837" score="375.183075" year="1950" imdb_id="tt0043067">Treasure Island</match>
    <match mid="/m/0gyk56x" score="362.390839" year="2012" imdb_id="tt1820723">Treasure Island</match>
    <match mid="/m/05351g" score="357.812256" year="1996" imdb_id="tt0117110">Muppet Treasure Island</match>
    <match mid="/m/027hq_7" score="312.396545" year="1990" imdb_id="tt0100813">Treasure Island</match>
    <match mid="/m/0d6_3x" score="303.274017" year="1972" imdb_id="tt0069229">Treasure Island</match>
    <match mid="/m/0dnv98" score="298.398956" year="1934" imdb_id="tt0025907">Treasure Island</match>
    <match mid="/m/02vr1mt" score="291.193634" year="1988" imdb_id="tt0465041">Treasure Island</match>
    <match mid="/m/0glqxkk" score="256.178223" year="1982" imdb_id="tt0084452">Treasure Island</match>
    <match mid="/m/02_fm2" score="242.696091" year="2002" imdb_id="tt0133240">Treasure Planet</match>
    <match mid="/m/076wc3r" score="234.503937" year="1985" imdb_id="tt0090199">Treasure Island</match>
    <match mid="/m/04gsb_p" score="232.238983" year="1999" imdb_id="tt0248568">Treasure Island</match>
    <match mid="/m/03d8xy4" score="232.115433" year="1972" imdb_id="tt0280371">Treasure Island</match>
    <match mid="/m/02vr8jc" score="222.359039" year="1971" imdb_id="tt0067002">Animal Treasure Island</match>
    <match mid="/m/05c2_7k" score="213.867325" year="2006" imdb_id="tt0811011">Pirates of Treasure Island</match>
    <match mid="/m/04csrh1" score="210.738998" year="1920" imdb_id="tt0011785">Treasure Island</match>
    <match mid="/m/0crsd_v" score="191.344147" year="1999" imdb_id="tt0181868">Treasure Island</match>
    <match mid="/m/0ztj51t" score="175.443268" year="1987" imdb_id="tt0787225">Treasure Island</match>
    <match mid="/m/0dlmcwr" score="166.556198" year="1954" imdb_id="tt0047406">Return to Treasure Island</match>
    <match mid="/m/04j09gc" score="164.241943" year="1939" imdb_id="tt0031147">Charlie Chan at Treasure Island</match>
    <match mid="/m/0crrmyz" score="161.611359" year="" imdb_id="">Treasure Island</match>
</movie>

The code itself is very compact, just over 100 lines of XSLT, and exploits what I hope is the lazy evaluation of a sequence expression:

MatchedMovie = (XPath expression for top heuristic, XPath expression for next heuristic, ..., XPath expression for last heuristic)[1]
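
The same "first hit from a prioritised list" pattern can be sketched in Python with generators, which are lazy by construction. This is an analogy to the sequence expression above, not a translation of the actual XSLT. The dictionaries, the two rule functions and the `calls` log are invented for the example.

```python
from itertools import chain

def first_match(query, candidates, heuristics):
    """Return the first candidate accepted by the highest-priority heuristic
    that accepts anything, or None. Heuristics are tried in list order."""
    per_heuristic = ((c for c in candidates if h(query, c)) for h in heuristics)
    # chain.from_iterable is lazy: later heuristics run only if earlier ones find nothing.
    return next(chain.from_iterable(per_heuristic), None)

calls = []  # records which rules were actually evaluated

def title_and_director(q, c):
    calls.append("title_and_director")
    return q["title"] == c["title"] and bool(q["directors"] & c["directors"])

def date_and_actor(q, c):
    calls.append("date_and_actor")
    return bool(q["date"]) and q["date"] == c["date"] and bool(q["actors"] & c["actors"])

# Made-up data, already normalised.
query = {"title": "x", "directors": {"d1"}, "actors": {"a1"}, "date": ""}
candidates = [
    {"id": "c1", "title": "x", "directors": {"d2"}, "actors": {"a1"}, "date": ""},
    {"id": "c2", "title": "x", "directors": {"d1"}, "actors": set(), "date": ""},
]

match = first_match(query, candidates, [title_and_director, date_and_actor])
```

Here the top rule accepts the second candidate, so the lower-priority rule is never called. Plugging a rule in or out is just a matter of editing the list passed to `first_match`.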

This ability to plug and unplug heuristic rules makes me believe this can be the basis of a framework. I could certainly see it being applied to music data. 

As you can see, it relies more on harvesting semantic metadata than on algorithms, and yes, it does solve the Treasure Island problem correctly.

I think it's tidier than the GroupLens project approach.


On Wed, Jul 1, 2015 at 7:04 PM, daniela florescu <dflorescu@me.com> wrote:
XQuery needs some serious extensions if you want to do what Helena did in her PhD….
(BTW, I was working with her when I wrote Quilt with Don Chamberlin… so I can see some similarities.)

Two major extensions would be:
1. FLWOR doesn’t stop when there is an exception, but just logs the exception and moves on
2. Group by has to be extended from a simple hash to a more general clustering algorithm


Dana


On Jul 1, 2015, at 3:35 PM, Ihe Onwuka <ihe.onwuka@gmail.com> wrote:



On Wed, Jul 1, 2015 at 2:59 PM, daniela florescu <dflorescu@me.com> wrote:
Ihe,

Transforming XQuery to be able to do data cleaning has been a LONG-standing desire of mine.


The problem articulated in the paper with Citeseer publications is similar to the issues I face. For movies there are additional weapons that can be brought to bear, because actors, directors and movie titles all have several aliases documented on various sites. That said, the problem with movies may be harder, because the incidence of two different papers sharing the same title is probably relatively low.

Reading on.....





