


   Re: Success factors for the Web and Semantic Web

  • From: Paul Tchistopolskii <paul@qub.com>
  • To: xml-dev <xml-dev@lists.xml.org>
  • Date: Thu, 21 Dec 2000 02:41:31 -0800

----- Original Message -----
From: Michael Champion <mike.champion@softwareag-usa.com>

> - Does it really meet a real, unmet human and business need?  As several
> people have mentioned, the search engines, especially Google, are getting
> pretty darned useful lately.  True, this is partly due to the promotion of
> metadata and the synergy between the search engines and the HTML <meta>
> tag - if you want good placement in a search engine, you put good metadata
> in your HTML.  But it's due largely, as Paul Tchistopolskii points out, to
> algorithms that extract useful information from the HTML itself, especially
> the "page ranking" technique.  The SW might offer real advantages over what
> we have now, but not enough to overcome the "worse is better" bias built
> into our brains, economic system, etc.

The sad part is that I think Google is good not because it works
together with <meta>. I think Google is good because it *kills*
the <meta>-based stuff. Google uses many technologies
(see [1] and [2]). Unfortunately, I think all of those technologies
are almost orthogonal to markup.

To be good, a search engine should *not* trust <meta> tags, because
we all know about the thousands of morons who (successfully) cheated
'dumb' search engines to get a better rank by abusing
the <meta> tags. (When a 'bad' page gets a good rank, it takes the
spot of a 'good' page. I stopped using AltaVista. As a simple test,
try typing "python java" in both Google and AltaVista ;-) )
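
To make the abuse concrete, here is a minimal sketch in Python of a
'dumb' engine that trusts whatever a page says about itself. Everything
in it is an assumption made for illustration: the scoring rule, the
function name meta_score and the two sample keyword strings. It is not
how AltaVista or any other real engine computed relevance.

```python
def meta_score(meta_keywords, query):
    """Naive relevance: count query terms in the page's own <meta> keywords.

    meta_keywords: the comma-separated content of a keywords <meta> tag.
    query: the user's search string.
    """
    terms = query.lower().split()
    words = [w.strip().lower() for w in meta_keywords.split(",")]
    return sum(words.count(term) for term in terms)


# Hypothetical pages: one honest, one stuffed by its author.
honest = "python, java, language comparison"
stuffed = ", ".join(["python", "java"] * 50)

scores = {
    "honest": meta_score(honest, "python java"),
    "stuffed": meta_score(stuffed, "python java"),
}

# The stuffed page wins, although only its author vouches for it.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

The point of the sketch is only that the input to the score is fully
controlled by the page author, so nothing stops the author from
inflating it.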

"Index of citation" works because it is almost impossible to abuse it.
Any company can spend money on banner ads, verbose 'meta'-based 'hooks',
etc., but usually only really important pages (scientific publications)
are quoted by other pages (publications)!

(Google definitely has much more than just the "citation" rule, but
this rule alone almost kills <meta>, and I think any other rule used
by Google also has greater priority than <meta>, because <meta> is
too easy to abuse.)
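
As a rough illustration of why the "citation" rule is hard to cheat,
here is a minimal sketch in Python of a link-based rank computed by
simple power iteration. This is the simplified textbook form, not
Google's actual implementation. The function name citation_rank, the
damping value 0.85, the iteration count and the toy link graph are all
assumptions made for the example.

```python
def citation_rank(links, damping=0.85, iterations=50):
    """Rank pages by who links to them (simplified power iteration).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "paper" is cited by everybody,
# "spam" is cited by nobody, whatever it says about itself.
links = {
    "a": ["paper"],
    "b": ["paper"],
    "c": ["paper"],
    "paper": ["a"],
    "spam": ["paper"],
}
rank = citation_rank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this sketch a page's rank depends only on the links pointing at it,
so "spam" cannot raise its own rank by editing its own markup; it would
need other ranked pages to cite it.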

By the way, I've downloaded and installed the Google Toolbar. The Google
Toolbar reports the rank of the page while I'm browsing the web. Google
rocks. It is incredibly accurate in page ranking. Really. Every time
I visit some site, I now look at that green indicator on the toolbar,
and that indicator tells me whether the page is 'really important'.
Well, it looks like magic, but the ranking is always very close to reality.

This magic works because cheating on "index of citation" is very hard.
Cheating on <meta> tags is *very* easy.

"XML will replace screen scraping, because every vendor
will begin publishing data in XML" is also too optimistic, I think,
for exactly the same reason.

1. 'Meta' is too idealistic, because <meta> could be abused.

2. 'Vendors will publish everything in XML' is too idealistic
(also, I think, because it is technically not possible to deliver
the 'same' XML to every medium, so there will be
situations where some markup gets 'lost', etc. ==
'screen scraping' == Google ;-) etc.)

3. 'Give everybody an XML parser  and forget about Yacc' is
also too idealistic, I think.


So UNIX, C, Google, Perl are  'worse is better' ... OK ...
we should also admit that the Web ( HTTP + HTML + CGI )
is no doubt the same 'worse is better'.  Right?

I don't think the Semantic Web smells like 'worse is better'.
Does it? If the Semantic Web is 'worse is better', a prototype
of it should be very easy to implement, as it was with the other
'worse is better' technologies.

Maybe somebody already has a prototype of the SW implemented?
I would appreciate the URL.

Many thanks.


[1] http://www.google.com/technology/index.html
[2]  from : http://www.google.com/corporate/execs.html#sergey1

Brin's research interests include search engines, information extraction from unstructured
sources, and data mining of large text collections and scientific data. He has published
more than a dozen publications in leading academic journals, including "Extracting
Patterns and Relations from the World Wide Web"; "Dynamic Data Mining: A New Architecture
for Data with High Dimensionality," which he published with Larry Page; "Scalable
Techniques for Mining Causal Structures"; "Dynamic Itemset Counting and Implication Rules
for Market Basket Data"; and "Beyond Market Baskets: Generalizing Association Rules to


Copyright 2001 XML.org. This site is hosted by OASIS