- To: "'Bullard, Claude L (Len)'" <clbullar@ingr.com>
- Subject: RE: [xml-dev] Beyond Ontologies
- From: "Didier PH Martin" <martind@netfolder.com>
- Date: Wed, 1 Oct 2003 13:40:50 -0400
- Cc: <xml-dev@lists.xml.org>
- Importance: Normal
- In-reply-to: <15725CF6AFE2F34DB8A5B4770B7334EE03F9ED16@hq1.pcmail.ingr.com>
Hi Len,
Didier said:
>For instance, why some sites have more traffic than others?
>Simply, because they are better positioned in search engines like
>Google.
Len replied:
Usually because they have information of interest to some community.
I rank that by my experiment with allowing my song "Sam (for Liz)"
to be offered from free at www.bewitched.net. This has dimensions
in that the name of the site is obvious, but also that the people
who own it are extremely well connected to the producer and cast
of the show. So the content they offer is timely and is rare.
In other words, the dimensions that determine the quality of the
site are both dimensions of the web as a system and what it wants,
and of the users and what they want. In combination, one gets a
compelling site. What metric do I get? A continuous stream of
mail about the song from that site despite the fact that it is
also posted at mp3.com. One might think that mp3.com being a
music site would engender more mail, but it doesn't. The majority
of mail I get from there is from other members asking me to cross
link or to sell me services. I know the site gets a lot of
traffic because it feeds back to me at a higher rate despite the
fact that a song on that site is simply just another piece of content.
Didier replies:
So in that case we can say that:
a) People organize a personal web in their "favorites" section. This is
a personal ontology where keyphrases are associated with URLs. Topic map
people would recognize here a topic map without topic associations or
topic facets. So, in the previous example, it seems that a lot of people
are using their personal ontology to access the mentioned web site.
b) New people reach the mentioned web site either by word of
mouth or through a search engine. In either case, they use
keyphrases to qualify this site or to look for it. People will find
the site if their personal ontology matches the search engine's
classification, or if they trust the "maven" or the "connector" in
their web of relationships. Again, a keyphrase positioned in a
personal ontology triggers a certain behavior: go to that site and
find your song.
In all these cases, the ontology is either tacit or controlled by the
search engine algorithms. Yes, there is a semantic web, but it is not
based on RDF or any formal ontology. Most of it is not yet explicit; it
is still tacit. A lot of commercial interest is involved in keeping it
tacit.
Didier said:
>Why are they better positioned because their site is structured
>in some ways that search engines like and moreover a lot of other well
>ranked sites cite them (Google is a citation based classification
>system). However, this brings some interesting perspective on the
>semantic web.
Len replied:
On the other hand, I wrote a paper on Information Ecosystems that
was posted two years ago by a company in New York. When I google
that term, I get back approximately a half million hits and my
paper is at the top. I have trouble believing that the paper
in pdf format gets that many citations. So other metrics besides
citation are in play.
Didier replies:
Maybe not, if you say so. However, the site in question makes heavy use
of the keyphrases "ecosystem" and "information ecosystem", so the
site's theme is better correlated with this keyphrase. Your document is
also parsed and classified by Google, since it can parse and classify
PDF documents. Since this site is associated with a vector in the theme
space related to the keyphrase "information ecosystem", your page's
vector is positioned closer to the "information ecosystem" locus.
Remember that I said "structured in some ways that search engines
like". This means that the page content makes it more related to the
"information ecosystem" vector. However, if another page on the web had
about the same weight but more links pointing to it, that page would be
classified as closer to the "information ecosystem" locus on the basis
of the votes/citations from pages located in other domains. Just take
some more popular keyphrases and you will notice that PageRank can make
a difference when two pages are equally weighted in terms of keyphrase
relevance strictly from their content/structure. I mention structure
here because a keyphrase included in a header doesn't weigh as much as
one contained in a paragraph. Yes, other metrics are in play; citations
are votes used to discriminate between equal weights. However, in
certain cases the votes weigh more and the entire classification is
broken. This is what is happening with blogs. That said, other
classification schemes, like Teoma's, use the notion of a community
cluster around a certain keyphrase to reduce the influence of free-form
votes. Google is slowly adapting its own algorithm to this kind of
scheme. This implies that the tacit web ontology is translated into
community clusters. Said differently, community cluster sets and
keyphrase sets are related by an "is_part_of" relation with a certain
weight. We are now speaking of the physical incarnation of tacit
ontologies with fuzzy set membership. If a page associated with a
particular theme is referred to by a community cluster, then its
"is_part_of" vote for this keyphrase carries more weight.
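To make the tie-breaking idea concrete, here is a minimal sketch in
Python. It is not how Google or Teoma actually rank pages. It assumes a
content score between 0 and 1 for the keyphrase, one vote per inbound
link, and a made-up weight of 3 for votes that come from members of the
community cluster. All names, URLs and numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    content_score: float                          # keyphrase relevance from content/structure (assumed 0..1)
    inbound: list = field(default_factory=list)   # domains that link to this page


def vote_score(page, cluster, cluster_weight=3.0):
    """Sum the votes; a vote from a community-cluster member counts more.

    cluster_weight is an assumed value chosen only for illustration.
    """
    return sum(cluster_weight if src in cluster else 1.0 for src in page.inbound)


def rank(pages, cluster):
    """Order pages by content relevance; votes only separate equal weights."""
    return sorted(
        pages,
        key=lambda p: (round(p.content_score, 2), vote_score(p, cluster)),
        reverse=True,
    )


# Hypothetical community cluster around "information ecosystem".
cluster = {"ecology.example", "infosystems.example"}

a = Page("http://a.example/paper.pdf", 0.80,
         ["blog1.example", "blog2.example", "blog3.example"])
b = Page("http://b.example/essay.html", 0.80,
         ["ecology.example", "infosystems.example"])
c = Page("http://c.example/", 0.90)

ordered = [p.url for p in rank([a, b, c], cluster)]
```

Page c comes first on content alone. Pages a and b have equal content
weight, and b wins the tie because its two votes come from the cluster,
even though a has three free-form votes.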
Simply said:
a) Currently, the web is based on an implicit or tacit ontology.
b) This ontology finds its physical incarnation in community clusters
and their link structure. There is a relationship between words and
sites.
c) Social networks and their related economics are also based on
implicit or tacit ontologies. Call them brands, constructs or whatever;
nonetheless, they exist in people's minds as tacit ontologies, and we
refer to them by constructs, brands or URLs.
A real semantic web revolution may happen if:
a) Search engines publish their results in RDF, OWL or any other format
that a knowledge engine can process. It could be done today simply by
using a Perl script to translate from HTML into one of these formats and
then processing that. I am not aware of such a script (maybe there is
one; if someone knows of one, please let us know, as it may be useful).
b) There exists a corpus of relations between
keyphrases/topics/themes/concepts. In that case we can make some
inferences. This is precisely what AdSense is doing (sometimes with
glitches, at other times with brio). Just look at my site, where I
unwittingly ran an experiment: http://dsssl.netfolder.com. You'll notice
that the ads are about XML. Google related the main themes, DSSSL and
OpenJade, to XML. It can achieve this kind of relationship through DMOZ
and its explicit ontology, or through some database coming from the
newly acquired companies.
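As a rough illustration of point a), here is a minimal sketch of such a
translation script, written in Python rather than Perl. It assumes the
result page is plain HTML in which every link is a result, which is not
true of any real search engine's markup. The isAbout predicate and the
sample URLs are hypothetical; the title predicate is the Dublin Core one.
The output is N-Triples, one of the RDF serializations.

```python
from html.parser import HTMLParser

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
IS_ABOUT = "http://example.org/terms#isAbout"   # hypothetical predicate


class LinkCollector(HTMLParser):
    """Collect (url, anchor text) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((self._href, title))
            self._href = None


def literal(text):
    """Quote a string as an N-Triples literal (escapes backslash and quote)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def results_to_ntriples(html, keyphrase):
    """Turn a result page into triples relating each URL to the keyphrase."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    lines = []
    for url, title in parser.links:
        lines.append(f"<{url}> <{DC_TITLE}> {literal(title)} .")
        lines.append(f"<{url}> <{IS_ABOUT}> {literal(keyphrase)} .")
    return "\n".join(lines)


# Made-up result page for the query "green tea".
sample = (
    '<ol><li><a href="http://www.example.org/tea">Green tea</a></li>'
    '<li><a href="http://www.example.com/diet">Weight\n loss</a></li></ol>'
)
triples = results_to_ntriples(sample, "green tea")
```

A knowledge engine could then load the triples and treat each result as
a statement that the URL is about the keyphrase.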
Didier said:
>a) Attractors (site having a lot of traffic) are associated to some
>keyphrases (a main theme and some related concept - see how adsense is
>working). Thus we can model the attractor in relation to ontologies by
>associating to a topic/class/object/keyphrase a set of sites.
Len replied:
Yes. Emergent topic maps.
Didier replies:
Precisely. A keyphrase can be considered an attractor. A community
cluster can also be considered a network representing proximity to this
attractor's locus. A keyphrase like "green tea weight loss" can be
decomposed into two themes, "green tea" and "weight loss", and can
therefore potentially be owned by two sets or two attractors: a) "green
tea" and b) "weight loss". The internal structure and content of pages
determine their proximity to an attractor. Votes/links amplify or
reduce the fuzzy set function value. For instance, taking the previous
example, if a "weight loss" community links heavily to a page, then
even if the page's internal structure and content position its vector
equally close to "weight loss" and to "green tea", the community's
votes will bring it closer to "weight loss". The community cluster
simply pushes the page toward the "weight loss" attractor.
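The "green tea" / "weight loss" example can be sketched as a fuzzy
membership function. This is only an illustration under stated
assumptions: content proximity is a number between 0 and 1, each vote
from an attractor's community amplifies it by an assumed 10%, and the
result is capped at 1. Only amplification is modeled here; the function
name and the figures are hypothetical.

```python
def memberships(content_proximity, votes, vote_gain=0.1):
    """Return the fuzzy membership (0..1) of one page in each attractor.

    content_proximity: attractor -> proximity from content/structure (0..1)
    votes: attractor -> number of links from that attractor's community
    vote_gain: assumed amplification per vote (illustrative value)
    """
    return {
        attractor: min(1.0, proximity * (1.0 + vote_gain * votes.get(attractor, 0)))
        for attractor, proximity in content_proximity.items()
    }


# Content alone places the page equally close to both attractors.
content = {"green tea": 0.5, "weight loss": 0.5}

# The "weight loss" community links heavily to the page.
community_votes = {"weight loss": 8}

result = memberships(content, community_votes)
closest = max(result, key=result.get)
```

With equal content proximity, the eight community votes raise the
"weight loss" membership above the "green tea" one, so the page ends up
closer to the "weight loss" attractor.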
Didier said:
>b) some people connected on the web propagate
>keyphrases/brands/concepts. These connectors act as gate keepers or as
>amplifiers.
Len replied:
They are opinion leaders in some cases and that is one of the dangers
of the system. It propagates opinion which can take on a life of its
own. In the Enterprise Engineering papers, I warned about
'superstitious acquisition', the danger of using citation because it
can be only rumor backed up by a cult of personality. Still, let's take
a simpler example.
Because we know that XML-Dev is reasonably well read, it would be
interesting to see stats on how many hits on the search engines the
terminology of chaos and complexity theory recorded this week. We see
the bottom up driving of ontological creation, and if automated, these
are what Costello should be looking at.
Didier replies:
Interesting experiment to do.
Didier said:
>d) Search engine are the real semantic web and they connect URI with
>words. More and more as demonstrated with the "~" operator (in Google)
>or with Adsense, they possess the concept of association or related
>concept to a theme. Search engines own the semantic web and ontologies.
Len replied:
To some extent, yes, but the search engine is just an engine. It is
the feedback loop that creates the ontologies bottom up and then
the direction those ontologies give to the direction of a search
that is the nonlinear dynamic power. Look for the intelligent selector.
One can do this with agents, yes, but so far, we are doing it with
our own gray matter. The web is indeed an amplifier, and its signal
processing clearly demonstrates the effects of controlled feedback,
and that is directed evolution if not a top down hierarchy as such.
In fact, a top down directed evolution is precisely what I fear about
a so-called, semantic web.
Didier replies:
I took my time to think seriously and hard about your statement that it
is the feedback loop that creates the ontologies bottom up. I disagree.
The ontology is already there, whether implicit or tacit, or explicit as
in Yahoo or DMOZ. The aggregation of URLs around an
attractor/concept/theme/keyphrase is simply based on "what this page is
saying about itself" and "what the others are saying about this page".
The former classification is based on the page's content and structure
and the algorithms we use to classify it. Currently, the algorithms are
mostly statistical, but more and more the search engines are going
beyond stemming and starting to dig into named phrases for relevancy (we
are not there yet, since it requires some advances in computational
linguistics, but we have tremendously improved the state of the art in
the last 10 years). The latter simply confirms that we got the right
classification. Currently, the latter carries a lot of weight because
statistical methods are not as good as they should be at classifying
documents. The more linguistic methods and explicit knowledge are used
to classify, the less important the votes will be. Until then, votes or
opinions are what is used to know "what this page is all about". If a
lot of sites related to "green tea" link to a page with anchors
containing "green tea", then a certain inference can be made that the
target page is about "green tea". Have a lot of these and you reinforce
your opinion. Thus, current classification is based on a certain
"social consensus". That game can be corrupted, as we know with blogs.
Objectively, I cannot say whether it is corrupted or whether the social
web, represented by the documents posted and associated with keyphrases,
is simply what is more or less important. Same problem with democracy
:-) we don't necessarily have the best or the most relevant, we have
what the majority voted for :-)
Ouf, enough for today, let's go back to work. I am working on an
interesting project: the fourth-generation web. No more fat servers and
thin clients (it makes me sick to see how we returned to the mainframe
paradigm with different hardware). The project I am working on uses an
XML-based language that we call PDML to transfer from the server to the
client a set of objects defined with an ontology (class hierarchy). The
objects are encoded in XML and reconstructed on the client. They live
for a while on the client and respond to user interactions, then come
back to the server to modify the database. It's REST-based, with GET
(object set) and PUT (object set), and works very well with JavaScript
(a prototype/instance-based language) and Python (a mixed
object-oriented, prototype/instance-based language). It is no longer fat
server, thin client; it is now object storage and an object
instantiation/interaction environment. When you go through Alice's
mirror, it's fun to see again that the dark ages of the last years could
be overcome and progress start again where we left it in the 80s :-)
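To give an idea of the round trip, here is a minimal sketch in Python.
It is not PDML: the element names (objects, object, property), the Song
class and the class registry are all hypothetical, chosen only to show
objects being encoded in XML, reconstructed, modified by an interaction
and encoded again. The GET and PUT steps are only indicated in comments;
no HTTP is performed.

```python
import xml.etree.ElementTree as ET


class Song:
    """Example class standing in for one node of a class hierarchy."""

    def __init__(self, title, plays=0):
        self.title = title
        self.plays = int(plays)   # property values arrive as text from XML

    def play(self):
        """A user interaction that changes the object's state."""
        self.plays += 1


# Hypothetical registry mapping class names in the XML to Python classes.
CLASSES = {"Song": Song}


def encode(objects):
    """Serialize a set of objects to XML (assumed, non-PDML vocabulary)."""
    root = ET.Element("objects")
    for obj in objects:
        el = ET.SubElement(root, "object", {"class": type(obj).__name__})
        for name, value in vars(obj).items():
            ET.SubElement(el, "property", {"name": name}).text = str(value)
    return ET.tostring(root, encoding="unicode")


def decode(xml_text):
    """Reconstruct live objects from the XML produced by encode()."""
    objects = []
    for el in ET.fromstring(xml_text).findall("object"):
        cls = CLASSES[el.get("class")]
        props = {p.get("name"): p.text or "" for p in el.findall("property")}
        objects.append(cls(**props))
    return objects


# Server side: this string would be the body of the GET (object set) response.
wire = encode([Song("Demo song")])

# Client side: objects are reconstructed, live for a while, react to the user.
client_objects = decode(wire)
client_objects[0].play()

# This string would be the body of the PUT (object set) request.
returned = decode(encode(client_objects))
```

The server would then use the returned objects to update the database.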
That was long, but hey, I was silent for a while on this list :-). I was
thinking...
Cheers
Didier PH Martin