- To: "XML Developers List" <xml-dev@lists.xml.org>
- Subject: [Watchers of the Web] [Research Initiative] Measure the Evolving Form of Information on the Web
- From: "Costello, Roger L." <costello@mitre.org>
- Date: Fri, 12 May 2006 09:08:48 -0400
- Thread-index: AcZ1xTS0H+zAPaGNQ3OWm46+VxFkHQ==
- Thread-topic: [Watchers of the Web] [Research Initiative] Measure the Evolving Form of Information on the Web
Hi Folks,
I would like us (the xml-dev community) to collaborate on a worldwide
research initiative. Below is a description of the research. I have two
requests:
1. I have tried to be both clear and complete in the research description.
If you find any point where the description is unclear or incomplete, please
let me know (and, ideally, suggest how to improve it).
2. I need your participation (see below).
/Roger
Research Question
What is the relative usage of the various
content (MIME) types on the Web, and how is that usage evolving over
time?
Background
There are
over 350 different content (MIME) types. Some common content types include
HTML, XML, GIF, JPG, JPEG, MP3, MPEG, RSS, SVG. Information exchanged on
the Web is in the form of one of these content types.
What is the
state of the Web today with respect to the use of the different content
types? For example, 15 years ago HTML was clearly the dominant content
type. Is that true today? Has a shift occurred? Have other
content types nudged out HTML for top ranking? The purpose of this
research is to get answers to these questions.
Methodology
Collect data from the web caches and
logs of one or more large retail Internet Service Providers (ISP) on each of the
following continents:
Europe
North America
South America
Asia
Australia
Africa
The data to be collected is a
numerical count of accesses to resources, per content type. That is, look
at the Internet Service Provider's log file and count the number
of requests that were made by users for HTML documents, count the number of
requests that were made by users for RSS documents, count the number of requests
that were made by users for XML documents, and so forth for each content
type.
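The counting step could be sketched roughly as follows. This is only an illustration: the log layout (one request per line, with the content type as the last whitespace-separated field and no parameters such as a charset) and the sample lines are assumptions, not the format of any particular ISP's log.

```python
from collections import Counter


def count_by_content_type(log_lines):
    """Tally the number of requests per content type.

    Assumption: the content type is the last whitespace-separated
    field on each line. Real ISP logs will need their own parsing.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[-1].lower()] += 1
    return counts


# Hypothetical sample log lines, for illustration only.
sample_log = [
    "GET http://example.com/story.html text/html",
    "GET http://example.com/story.mp3 audio/mpeg",
    "GET http://example.com/feed.rss application/rss+xml",
    "GET http://example.com/index.html text/html",
]

counts = count_by_content_type(sample_log)
for content_type, n in counts.most_common():
    print(content_type, n)
```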
Here is an example to demonstrate the
methodology:
Today (May 8, 2006) cnn.com is running a news story about
using quantum science to determine the best way to score a goal in soccer. CNN
allows you to consume the news story in any of these forms:
- HTML
- audio (MP3)
- video (MPEG)
- RSS
Let's suppose that CNN uses an
ISP, and the ISP log file contains all the requests for that news
story. At the end of the day we open the log file and tally up all the
requests for the news story. And here are the numbers:
- 50 clients consumed the news story in HTML form
- 20 clients consumed the news story in audio (MP3) form
- 10 clients consumed the news story in video (MPEG) form
- 20 clients consumed the news story in RSS form
If these numbers
represented a statistically significant sampling of the Web, then we could
state:
"On May 8, 2006 the information on the Web took this
form:"
Content Type    Percentage
---------------------------
HTML            50%
MP3             20%
MPEG            10%
RSS             20%
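For reference, the percentages follow directly from the example tallies (50, 20, 10 and 20 requests out of 100). A minimal sketch of the arithmetic, using the example numbers from the CNN story and variable names chosen only for illustration:

```python
# Example tallies from the CNN story above (requests per content type).
counts = {"HTML": 50, "MP3": 20, "MPEG": 10, "RSS": 20}

total = sum(counts.values())

# Share of all requests, as a percentage, for each content type.
percentages = {ctype: 100 * n / total for ctype, n in counts.items()}

for ctype, pct in percentages.items():
    print(f"{ctype:<6}{pct:.0f}%")
```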
Obviously, examining just the log file for one story on CNN is
not a statistically significant sampling. We need to collect data from the
log file of a large ISP for all requests that occurred. And we need
to do the measurement in different geographies.
Note: the data to be
counted is the "main content type", not "dependent content types". Let me
explain what I mean. Suppose that the HTML form of the above news story
contains two embedded GIF images. The HTML document is the "main content
type". The two GIF images are the "dependent content types". Only
the HTML document is counted, i.e., the count for HTML is incremented by
one and the count for GIF is left unchanged.
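One way to picture the main/dependent rule is the sketch below. It assumes, purely for illustration, that each request record already says whether the resource was fetched as part of another document. How that would be determined from a real log is left open here, and the record layout and URLs are hypothetical.

```python
from collections import Counter

# Hypothetical records: (url, content_type, embedded_in).
# embedded_in is None for a resource the user requested directly, and holds
# the parent document's URL for a resource fetched only because a page
# embeds it.
requests = [
    ("http://example.com/story.html", "text/html", None),
    ("http://example.com/photo1.gif", "image/gif", "http://example.com/story.html"),
    ("http://example.com/photo2.gif", "image/gif", "http://example.com/story.html"),
]

# Count only main content types; dependent resources are skipped.
main_counts = Counter(
    ctype for _url, ctype, embedded_in in requests if embedded_in is None
)

print(dict(main_counts))
```

With these three records, only the HTML document contributes to the tally; the two embedded GIF images do not.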
Period of Data Collection
24 hours (day and time to be
determined)
Request for Participation
Do you have access to the log file of
a large ISP? Would you be willing to sift through their data? If so,
please contact me.
Acknowledgement
I wish to gratefully acknowledge
the valuable contributions the following people have made to the formulation of
this research initiative:
Len Bullard
Joe Chiusano
Jay Crossler
Ian Graham
Chris Gray
Greg Hunt
Bob Irving
Michael Kay
Tim Kehoe
Frank Manola
Rick Marshall
Marc Nobile
Joe Nyangon
Dave Pawson
Martin Probst
Liam Quin
Bryan Rasmussen
Sterling Stouden
Nathan Vuong