   Re: [xml-dev] [Watchers of the Web] The evolving form of information on


On Sun, May 07, 2006 at 10:12:32AM -0400, Costello, Roger L. wrote:
> I would like to know:
> Of all the information being exchanged on the Web:
> what percentage of the information is in the form of the HTML content
> type, what percentage of the information is in the form of the XML
> content type, [...]

It depends on what you mean by "information" here.  If you mean
information in Shannon's sense, then, for example, the second pass of
an interlaced GIF carries less information than the first in most
cases, and the same is true for JPEG images.

If you're concerned about total volume of data and bandwidth,
my guess right now would be closer to:

music and video files: 70%
  (much more if you count non-Web transfer methods)
Other binary files: 20%
  (e.g. pirated copies of Photoshop, stolen fonts, as well as
  legitimate installation files, Windows Update etc.,
  accessed via a URI-based mechanism)

Of the remaining 10%, I'd guess 95% by size is image content.
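
Worked through, those guesses leave very little room for markup and
plain text: roughly 9.5% of the total would be images and about 0.5%
everything else.  A small Python sketch of the arithmetic, using only
the guessed percentages above (the variable names are mine, and the
inputs are guesses, not measurements):

```python
# Guessed shares of total Web traffic by volume, taken from the estimate above.
media = 0.70          # music and video files
other_binary = 0.20   # installers, fonts, updates and so on
remaining = 1.0 - media - other_binary   # what is left over: 10%

images = remaining * 0.95                # 95% of the remainder, by size
everything_else = remaining - images     # HTML, XML, plain text, ...

print(f"images:          {images:.1%}")
print(f"everything else: {everything_else:.1%}")
```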

The entire King James Bible weighs in at around 5 megabytes as
plain text.  A megabyte isn't all that huge for an image from
a digital camera these days, _vide_ Flickr.

I'd also guess that RSS (XML-based) is significant in traffic.

> In addition, I am interested in seeing how the percentage is changing
> over time - I am interested in seeing the evolving form of information
> on the Web.

You might find that some of the search engine people have some sort
of metric based on number of documents, or numbers of URIs and
corresponding MIME types.  Ian Hickson has done some investigation
of this sort, I think.

Anyone running a large HTTP proxy, e.g. for a school, college,
corporation or ISP, will have figures on bandwidth.
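
As a rough sketch of how such figures could be pulled out of a proxy
log, the Python below tallies request counts and bytes per MIME type.
It assumes whitespace-separated lines in the layout of Squid's native
access.log (time, elapsed, client, code/status, bytes, method, URL,
user, hierarchy, content type); other proxies will need different
field positions.  The function name and the sample lines are made up
for illustration.

```python
from collections import defaultdict

def tally_by_mime(lines):
    """Return {mime_type: (request_count, total_bytes)} from proxy log lines.

    Assumed layout (Squid native access.log, whitespace-separated):
    time elapsed client code/status bytes method URL user hierarchy type
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip malformed or truncated lines
        try:
            nbytes = int(fields[4])
        except ValueError:
            continue  # byte count was not a number
        mime = fields[9].lower()
        counts[mime] += 1
        sizes[mime] += nbytes
    return {mime: (counts[mime], sizes[mime]) for mime in counts}

# Made-up sample lines, for illustration only.
sample = [
    "1146996752.123 120 10.0.0.1 TCP_MISS/200 4096 GET http://example.org/ - DIRECT/192.0.2.1 text/html",
    "1146996753.456 340 10.0.0.2 TCP_MISS/200 1048576 GET http://example.org/a.jpg - DIRECT/192.0.2.1 image/jpeg",
    "1146996754.789 900 10.0.0.1 TCP_MISS/200 5242880 GET http://example.org/b.mp3 - DIRECT/192.0.2.1 audio/mpeg",
    "1146996755.012 80 10.0.0.3 TCP_HIT/200 2048 GET http://example.org/c.html - NONE/- text/html",
]

totals = tally_by_mime(sample)
grand_total = sum(size for _, size in totals.values())
# Largest share of bytes first.
for mime, (count, size) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{mime:12} {count:3d} requests {size / grand_total:6.1%} of bytes")
```

Counting requests and bytes separately matters here: by number of
documents HTML may well dominate, while by volume it barely registers.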

Actually analysing images and text for information content is a much
harder thing. Do the random art criticism texts generated by a program
I wrote [1] contain information? Or the random sonnets from Rich
Salz's program [2]? What about lists of things? Or, worse, lists of
randomly-generated things such as fake fantasy names [3]?
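
One crude, purely statistical proxy (my suggestion here, not an
established measure of meaning) is how well the bytes compress.  The
Python sketch below compares a highly repetitive string with
pseudo-random bytes using zlib.  Note that it only measures
redundancy: random noise comes out as the "most informative" input,
which is part of why the interesting question is so hard.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundant)."""
    return len(zlib.compress(data, 9)) / len(data)

# Highly repetitive input: compresses to a small fraction of its size.
repetitive = b"fake fantasy name " * 100

# Pseudo-random bytes of the same length (seeded so runs are repeatable):
# these are essentially incompressible.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print(f"repetitive text: {compression_ratio(repetitive):.2f}")
print(f"random bytes:    {compression_ratio(noise):.2f}")
```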

Sometimes it's better to tackle the easier problem and get data that
is useful than to tackle the more interesting but probably intractable
one.  The debate about whether illustrations carry information can be
illustrated at [4]. :-)


[1] random art criticism with random artwork:

[2] randomly generated "poetry"

[3] randomly generated fantasy gaming names

[4] http://www.fromoldbooks.org/Blades-Pentateuch/pages/discourse-into-the-night/

Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/  * http://www.fromoldbooks.org/



Copyright 2001 XML.org. This site is hosted by OASIS