   Re: [xml-dev] [Watchers of the Web] The evolving form of information on

Generally, in information theory it is the amount of data, and not the
number of individual files, that is important. After that, one generally
has rules for weeding out data that is considered redundant for one's
particular application. Redundancy in data is, however, specific to the
file type. As an example:

Redundant data for HTML:

site logo image tag on each page,
menus.

These would probably be counted only once, if the application could detect them.
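
A minimal sketch of the "count it only once" idea, assuming that
redundancy means byte-identical content (a site logo fetched from every
page, for instance). The function name unique_bytes and the sample data
are made up for illustration. Menus repeated inside different HTML pages
would not be caught this way; that would need format-specific rules.

```python
import hashlib

def unique_bytes(bodies):
    """Sum the sizes of the given byte strings, counting identical content once."""
    seen = set()
    total = 0
    for body in bodies:
        digest = hashlib.sha256(body).hexdigest()
        if digest not in seen:
            # First time this exact content is seen: count its size.
            seen.add(digest)
            total += len(body)
    return total

# Hypothetical crawl: the same logo is served alongside two different pages.
logo = b"GIF89a" + b"\x00" * 100
page_one = b"<html>page one</html>"
page_two = b"<html>page two</html>"
crawl = [logo, page_one, logo, page_two]

raw_total = sum(len(body) for body in crawl)
deduplicated_total = unique_bytes(crawl)
```

Here the raw total counts the logo twice, while the deduplicated total
counts it once.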

Also, is the data in the head of the document actually counted as data
in this context? I would suppose not.

But if one were trying to get meaningful metrics about what types of
information are the most important on the web, I think one needs
something other than counting the number of files or the sizes of
those files.

Some quick ideas:

I think something similar to what Google did, in defining the link as
the thing that is important, would be necessary, or at least a component
of any study of what is the most important information.

This, however, is just a guess.

Also, user agents.

The prevalence of particular user agents can indicate what the most
important sources of information are. I think we can agree that a GIF
served up to an RSS reader is useless in that context.

Anyway, if one had to decide between size of files and number of files,
I would go with size of files, because the amount of data is
determined by the size and not by how it is split.
But the web may need a different theory.
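
To make the difference between the two measures concrete, here is a
rough sketch that computes each content type's share by file count and
by byte count. The function name shares_by_type and the sample numbers
are invented for illustration; they are not measurements of the web.

```python
from collections import defaultdict

def shares_by_type(resources):
    """Given (content_type, size_in_bytes) pairs, return two dicts:
    share of files per type and share of bytes per type, as fractions."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for content_type, size_in_bytes in resources:
        counts[content_type] += 1
        sizes[content_type] += size_in_bytes
    total_files = sum(counts.values())
    total_bytes = sum(sizes.values())
    count_share = {t: n / total_files for t, n in counts.items()}
    byte_share = {t: b / total_bytes for t, b in sizes.items()}
    return count_share, byte_share

# Made-up sample: three small HTML pages and one large MP3 file.
sample = [
    ("text/html", 10_000),
    ("text/html", 10_000),
    ("text/html", 10_000),
    ("audio/mpeg", 3_970_000),
]
count_share, byte_share = shares_by_type(sample)

for content_type in count_share:
    print(f"{content_type}: {count_share[content_type]:.2%} of files, "
          f"{byte_share[content_type]:.2%} of bytes")
```

In this sample HTML is three quarters of the files but under one percent
of the bytes, which is why the definition of "percentage" has to be
settled before any measuring starts.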

Cheers,
Bryan Rasmussen

On 5/7/06, Costello, Roger L. <costello@mitre.org> wrote:
> Bryan asks an excellent question:
>
> What data should be collected to provide meaningful results?
>
> Bryan notes two kinds of data that could be collected:
>
> (1) File Count Data: count the number of files, i.e., count the number
> of HTML files, count the number of MP3 files, count the number of MPEG
> files, and so forth.
>
> A problem to be resolved is: suppose that an HTML file contains, say,
> three GIF images.  Do you count that as:
>
> 1 for HTML
> 2 for GIF
>
> Will file count yield the best data?
>
> (2) Byte Count Data: count the number of bytes of the information on
> the Web that is in HTML form, count the number of bytes of the
> information on the Web that is in MPEG form, and so forth.
>
> Would byte count yield more meaningful data than file count?
>
> Is there other data besides file count data and byte count data?
>
> If you were to design an experiment to determine the percentage of
> information per content type, what data would you measure?  (I am not
> asking "how" to measure the data; I am asking "what" data you would
> measure)
>
> Any ideas?  /Roger
>
>
> -----Original Message-----
> From: bryan rasmussen [mailto:rasmussen.bryan@gmail.com]
> Sent: Sunday, May 07, 2006 10:32 AM
> To: Costello, Roger L.
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] [Watchers of the Web] The evolving form of
> information on the Web?
>
> How exactly are you defining percentage? For example, percentage of
> actual data size would probably be quite a bit different from
> percentage of number of files.
>
> Cheers,
> Bryan Rasmussen
>
> On 5/7/06, Costello, Roger L. <costello@mitre.org> wrote:
> > Hi Folks,
> >
> > There are over 350 different content (MIME) types.  Some common
> > content types include HTML, XML, GIF, JPG, JPEG, MP3, MPEG, RSS, SVG.
> >
> > Information exchanged on the Web is in the form of one of these
> > content types.  (Sometimes an information exchange contains a
> > collection of items, each item with different content type.)
> >
> > I would like to know:
> >
> > Of all the information being exchanged on the Web:
> >
> > what percentage of the information is in the form of the HTML content
> > type, what percentage of the information is in the form of the XML
> > content type, what percentage of the information is in the form of
> > the GIF content type, what percentage of the information is in the
> > form of the MP3 content type, what percentage of the information is
> > in the form of the MPEG content type, what percentage of the
> > information is in the form of the JPG content type, and so forth, for
> > all the content types.
> >
> > I speculate that the percentages are something like this:
> >
> > Content type   Percentage
> > ---------------------------
> > HTML           90%
> > JPG             2%
> > JPEG            2%
> > GIF             2%
> > MP3             2%
> > XML             1%
> > ...
> >
> > However, that's purely my guess.  (What is your guess?)
> >
> > In addition, I am interested in seeing how the percentage is changing
> > over time - I am interested in seeing the evolving form of
> > information on the Web.
> >
> > Has anyone done such an investigation?
> >
> > /Roger
> >
> > -----------------------------------------------------------------
> > The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> > initiative of OASIS <http://www.oasis-open.org>
> >
> > The list archives are at http://lists.xml.org/archives/xml-dev/
> >
> > To subscribe or unsubscribe from this list use the subscription
> > manager: <http://www.oasis-open.org/mlmanage/index.php>
> >
> >
>