  • To: "XML Developers List" <xml-dev@lists.xml.org>
  • Subject: [Watchers of the Web] [Research Initiative] Measure the Evolving Form of Information on the Web
  • From: "Costello, Roger L." <costello@mitre.org>
  • Date: Fri, 12 May 2006 09:08:48 -0400

Hi Folks,

I would like for us (the xml-dev community) to collaborate on a world-wide research initiative.  Below is a description of the research.  I have two requests:

1. I have tried to make the research description both clear and complete.  If you find any part of it unclear or incomplete, please let me know (and, ideally, suggest how to improve it).

2. I need your participation (see below). 


Research Question

What is the relative usage of the various content (MIME) types on the Web, and how is that usage evolving over time?


There are over 350 different content (MIME) types.  Some common content types include HTML, XML, GIF, JPG, JPEG, MP3, MPEG, RSS, SVG.  Information exchanged on the Web is in the form of one of these content types. 

What is the state of the Web today with respect to the use of the different content types?  For example, 15 years ago HTML was clearly the dominant content type.  Is that true today?  Has a shift occurred?  Have other content types nudged out HTML for top ranking?  The purpose of this research is to get answers to these questions.


Collect data from the web caches and logs of one or more large retail Internet Service Providers (ISPs) on each of the following continents:

North America
South America

The data to be collected is a numerical count of accesses to resources, per content type.  That is, look at the Internet Service Provider's log file and count the number of requests that were made by users for HTML documents, count the number of requests that were made by users for RSS documents, count the number of requests that were made by users for XML documents, and so forth for each content type.
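
The counting step described above could be sketched roughly as follows.  This is only an illustration, not a prescribed tool: it assumes a whitespace-separated log in which the last field of each line is the MIME type of the response (similar in spirit to a web cache's access log), and the function name and sample lines are made up for the example.  A real ISP log may need a different parser.

```python
from collections import Counter


def count_content_types(log_lines):
    """Tally requests per content type.

    Assumption: the last whitespace-separated field of each line is the
    MIME type, and "-" means the type was not recorded.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        mime = fields[-1].lower()
        if mime == "-":
            continue  # no content type recorded for this request
        counts[mime] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    "1147093728.100 120 10.0.0.1 TCP_MISS/200 5120 GET http://example.com/story.html - DIRECT/192.0.2.1 text/html",
    "1147093729.200 95 10.0.0.2 TCP_HIT/200 5120 GET http://example.com/story.html - NONE/- text/html",
    "1147093730.300 80 10.0.0.3 TCP_MISS/200 2048 GET http://example.com/story.rss - DIRECT/192.0.2.1 application/rss+xml",
    "1147093731.400 900 10.0.0.4 TCP_MISS/200 900000 GET http://example.com/story.mp3 - DIRECT/192.0.2.1 audio/mpeg",
    "1147093732.500 10 10.0.0.5 TCP_MISS/304 0 GET http://example.com/other - DIRECT/192.0.2.1 -",
]

counts = count_content_types(sample_log)
for mime, n in counts.most_common():
    print(mime, n)
```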

Here is an example to demonstrate the methodology:

Today (May 8, 2006) cnn.com is running a news story about using quantum science to determine the best way to score a goal in soccer. CNN allows you to consume the news story in any of these forms:

- audio (MP3)
- video (MPEG)

Let's suppose that CNN uses an ISP, and the ISP log file contains all the requests for that news story.  At the end of the day we open the log file and tally up all the requests for the news story.  And here are the numbers:

- 50 clients consumed the news story in HTML form
- 20 clients consumed the news story in audio (MP3) form
- 10 clients consumed the news story in video (MPEG) form
- 20 clients consumed the news story in RSS form

If these numbers represented a statistically significant sampling of the Web, then we could state:

"On May 8, 2006 the information on the Web took this form:"

Content Type    Percentage
HTML            50%
MP3             20%
MPEG            10%
RSS             20%
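
The percentages in this table are simply each count divided by the total number of requests.  A minimal sketch of that arithmetic, using the example tally from above (the function and variable names are just for illustration):

```python
def percentages(counts):
    """Convert per-type request counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was counted, so there are no shares to report
    return {name: 100.0 * n / total for name, n in counts.items()}


# The example tally for the CNN news story.
tally = {"HTML": 50, "MP3": 20, "MPEG": 10, "RSS": 20}

shares = percentages(tally)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<16}{share:.0f}%")
```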

Obviously, examining just the log file for one story on CNN is not a statistically significant sampling.  We need to collect data from the log file of a large ISP for all requests that occurred.  And we need to do the measurement in different geographies.

Note: the data to be counted is the "main content type", not "dependent content types".  Let me explain what I mean.  Suppose that the HTML form of the above news story contains two embedded GIF images.  The HTML document is the "main content type".  The two GIF images are the "dependent content types".  Only the HTML document is counted, i.e., the HTML count is incremented by one and the GIF count is left unchanged.
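
How to tell a main request from a dependent one in a log is not specified here, so the following is only one possible heuristic, sketched under loud assumptions: each request record carries its MIME type and its Referer value (or None), and a request is treated as dependent when its type is one that is usually embedded in a page (images, stylesheets, scripts) and it arrived with a Referer.  The type lists and names are hypothetical, and the heuristic will misclassify some requests (for example, an image opened by clicking a link).

```python
from collections import Counter

# Assumption: types that are usually fetched as parts of a page.
EMBEDDED_PREFIXES = ("image/",)
EMBEDDED_TYPES = {"text/css", "text/javascript", "application/javascript"}


def is_dependent(mime, referer):
    """Heuristic: an embeddable type requested with a Referer is dependent."""
    embeddable = mime.startswith(EMBEDDED_PREFIXES) or mime in EMBEDDED_TYPES
    return embeddable and referer is not None


def count_main_types(requests):
    """Count only the requests that are not classified as dependent."""
    counts = Counter()
    for mime, referer in requests:
        if not is_dependent(mime, referer):
            counts[mime] += 1
    return counts


# The news story example: one HTML page with two embedded GIF images.
requests = [
    ("text/html", None),
    ("image/gif", "http://example.com/story.html"),
    ("image/gif", "http://example.com/story.html"),
]

main_counts = count_main_types(requests)
print(dict(main_counts))
```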

Period of Data Collection

24 hours (day and time to be determined)

Request for Participation

Do you have access to the log file of a large ISP?  Would you be willing to sift through their data?  If so, please contact me.


I wish to gratefully acknowledge the valuable contributions the following people have made to the formulation of this research initiative:

Len Bullard
Joe Chiusano
Jay Crossler
Ian Graham
Chris Gray
Greg Hunt
Bob Irving
Michael Kay
Tim Kehoe
Frank Manola
Rick Marshall
Marc Nobile
Joe Nyangon
Dave Pawson
Martin Probst
Liam Quin
Bryan Rasmussen
Sterling Stouden
Nathan Vuong

