OASIS Mailing List Archives
   [moved from XSL-List] New Title: What XML formats can the search engines extract reliable meta-data and data from?

  • To: xml-dev@lists.xml.org
  • Subject: [moved from XSL-List] New Title: What XML formats can the search engines extract reliable meta-data and data from?
  • From: "M. David Peterson" <m.david.x2x2x@gmail.com>
  • Date: Sun, 5 Mar 2006 13:56:14 -0700
  • Domainkey-signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:sender:to:subject:mime-version:content-type; b=Klr56RAl5zHRljDJR5OjeFNvPcqHoe4wChbeNsS6rFY4V2MY2oEwS8flz6S2lDhKPKMhJpu8QjcdO3lS9qxx0ZFUf3ZK0N+HI3Kr7KCtYEYhl5Me4+FG6VgUNiifCjYJWf+UlPHgdhpo2vs6OnwONTSE4YmCWTMAEWHQ5U4anMI=
  • Sender: xmlhacker@gmail.com

To summarize the current topic in this conversation: we are trying to determine what types of XML file formats the search engine companies can and do extract reliable meta-data (and subsequent data) from, such that the data appears within the range of returned links an average person will scan before trying a new search phrase or giving up entirely.  This is my last response; anybody who wants to include their most recent response to this thread, please tack it onto this post.

The comment I am responding to is:

I'm stunned that most of you seem to believe that Google ignores XML pages
and you have to transform the XML server-side to feed the search engine.
For evidence of the contrary try the search:
staudinger site:free.pages.at filetype:xml

Manfred


[Begin Response]

Hmmm... we're talking about two different things.  Sure, Google will
locate the XML file and run it through its text-processing
algorithms, extracting and sorting the information it deems
appropriate.  However, in the case of raw XML, there is no real type
per se.  In a defined XML format, such as XHTML, there is a level of
understood document structure about which assumptions can accurately
be made: for example the title, the section headers (tags h1-h6), and
keywords, if included and correctly labeled in a meta tag.
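To make that concrete, here's a minimal sketch (Python's stdlib ElementTree over a toy XHTML snippet of my own invention, not anything Google actually runs) of how a known format lets an indexer map element names to meaningful fields:

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the element names, not heuristics, tell us
# which text is the title and which are the keywords.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Widget Pricing</title>
    <meta name="keywords" content="widgets, pricing" />
  </head>
  <body><h1>Widget Pricing</h1></body>
</html>"""

ns = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(xhtml)
title = root.findtext("x:head/x:title", namespaces=ns)
keywords = [m.get("content")
            for m in root.findall("x:head/x:meta", ns)
            if m.get("name") == "keywords"]
print(title, keywords)
```

The point isn't the parsing itself; it's that the XHTML spec fixes what `<title>` and `<meta name="keywords">` mean, so the extraction is reliable.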

The same can be said about Atom data feeds, the OpenDocument
Format, or any number of XML-based document types in which
human-understandable information can be extracted, stored in the
engine's internal database, and used by its internal query engine to
determine relevancy to a particular search phrase.
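Same idea with Atom: the Atom 1.0 namespace pins down exactly which element is the feed title and which are entry titles (again just an illustrative sketch with made-up feed data):

```python
import xml.etree.ElementTree as ET

atom = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry><title>First Post</title></entry>
  <entry><title>Second Post</title></entry>
</feed>"""

ns = {"a": "http://www.w3.org/2005/Atom"}
feed = ET.fromstring(atom)
# The spec defines the semantics: <title> inside <entry> is that entry's title.
feed_title = feed.findtext("a:title", namespaces=ns)
entry_titles = [e.findtext("a:title", namespaces=ns)
                for e in feed.findall("a:entry", ns)]
print(feed_title, entry_titles)
```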

But with raw XML that conforms to no specification other than the XML 1.x
specification it was built against, it is MUCH more difficult to
reliably extract qualified information.  That doesn't mean a search
engine can't parse the text of the document and pull out what seems to be
relevant information, but the chances of that information ever making
it to the eyeballs of a human performing a search are as close to nil
as you can get and still not be nil.
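Contrast that with raw XML: you can still pull the text out, but nothing in the document says which piece is a title and which is noise (a made-up example):

```python
import xml.etree.ElementTree as ET

raw = """<record>
  <f1>Widget Pricing</f1>
  <f2>2006-03-05</f2>
</record>"""

root = ET.fromstring(raw)
# All an indexer gets is undifferentiated text; whether <f1> is a
# title, a product name, or junk is anybody's guess.
text = [t.strip() for t in root.itertext() if t.strip()]
print(text)
```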

Of course, if you do a document-type-specific search for XML documents
you'll find LOTS of them.  There's just no real presence of human-
understandable data elements that can accurately be displayed.  Maybe
they will get lucky here and there, but you can't build a high-quality
search engine whose foundation rests on the chance that logical
data could be extracted.  Therefore you're not going to find all that
many documents of type XML that are not of an understood XML format
anywhere near the first X pages that the average person
performing a search on Google will look at before giving up.  I'm not
absolutely certain what that exact number happens to be at this moment
in time, but I'm sure someone knows, and I'm guessing it's probably less
than 100.

Of course, if you do a site-specific search, and that particular site
ONLY has XML documents, then obviously all you will find in return is
XML documents.  Then again, the only way Google is going to find those
documents in the first place is if it can extract links from at least
one known document to begin the spidering process.  And to be honest,
I can't say for sure whether they even bother to look for links inside
generic XML documents.
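That's really the crux: XHTML has a fixed convention for links (the href attribute on the a element), so a spider knows exactly where to look; generic XML has no such convention.  A sketch, again with invented markup:

```python
import xml.etree.ElementTree as ET

page = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <a href="http://example.org/a.xml">A</a>
  <a href="http://example.org/b.xml">B</a>
</body></html>"""

# In XHTML the link convention is baked into the format...
XHTML = "{http://www.w3.org/1999/xhtml}"
links = [a.get("href") for a in ET.fromstring(page).iter(XHTML + "a")]
print(links)

# ...whereas in generic XML a URL could hide in any element or
# attribute, and only heuristics (scan every string for "http://"?)
# could hope to find it.
```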

Q: Without turning this into a conversation on "Theories of Google's
Search Algorithms" (please don't... I'm already in enough trouble
with Tommie from this weekend's adventures as it is ;) :D) -- does
anybody know for sure what Google, Yahoo!, MSN Search, and/or any of
the feed-specific search companies such as PubSub, Syndic8, and
Technorati will parse for links, and what they will not?  (NOTE: I can
almost be certain that the feed-specific search engines are just
that, feed specific.  But that too is a guess.)



--
<M:D/>

M. David Peterson
http://www.xsltblog.com/




Copyright 2001 XML.org. This site is hosted by OASIS