OASIS Mailing List Archives



RE: Request: Techniques for reducing the size of XML instances

On Tue, 31 Jul 2001, HUGHES,MARK (Non-HP-FtCollins,ex1) wrote:

>   That's an excellent point - passing around a tokenized form of an XML
> document to simplify parsing is a reasonable idea.  Personally, I'd just
> use the Pyxie format <http://www.pyxie.org/>, as it's *VERY* easy to
> produce and to parse again, and has the tremendous advantage of still
> being plain-text, so it's easy to debug and test.
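For readers who haven't seen it, PYX (the Pyxie format) encodes each parse event as one plain-text line: `(` opens an element, `A` carries an attribute, `-` carries character data, `)` closes an element. A minimal sketch of an emitter using Python's expat bindings follows - this is an illustration of the idea, not the Pyxie library itself:

```python
# Minimal PYX-style emitter: one line per parse event.
# A sketch in the style of the Pyxie format, not the pyxie library.
import xml.parsers.expat

def to_pyx(xml_bytes):
    lines = []
    def start(name, attrs):
        lines.append("(" + name)
        for k, v in attrs.items():
            lines.append("A" + k + " " + v)
    def end(name):
        lines.append(")" + name)
    def chars(data):
        # PYX escapes literal newlines in character data as "\n"
        lines.append("-" + data.replace("\n", "\\n"))
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = chars
    p.Parse(xml_bytes, True)
    return "\n".join(lines)

print(to_pyx(b"<message id='1'>hello</message>"))
# (message
# Aid 1
# -hello
# )message
```

The appeal is exactly as described above: the output needs no real parser to consume - one `readline` and a switch on the first character will do.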

That's certainly in keeping with some of the binary XML approaches - the
distinction between "binary" and "textual" is bogus, really, but it's the
nomenclature we're stuck with for now.

It's all binary anyway; "text" just uses a fairly standard binary format
(although the Blueberry thread shows that even this "text" format is a bit
shifty).

PS: Just ran a quick test, timing gzip. gzipping 11,449,004 bytes on a K6/2
400 took 10.693 seconds of CPU time to compress to 2,738,792 bytes - that's
roughly 8.6 Mbit/s of input consumed. If this machine were serving
compressed XML, it wouldn't be able to max out a 10Mbit link, even assuming
that whatever processing it was doing to create these data took zero time.
(This was a core dump I compressed rather than a large amount of XML, which
will skew the results a bit.) It looks like three to four times the CPU
power of my laptop would be required just to handle the compression
overhead of generating a 10Mbit gzipped XML stream.

I recently had to help implement a system that read a small amount of data
from disk, performed some computation, and sent the data over a 100Mbit
link to the next stage of servers. It had to pretty much fill that 100Mbit
link to meet spec[1], and it was lower-powered than my laptop. gzipping XML
would not have been an option; the system could only just about fit the raw
data down a 100Mbit link with the required TCP/IP protocol overhead, let
alone with XML markup all over it.
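The back-of-envelope test is easy to repeat on any machine; the sketch below times gzip on some synthetic XML-ish input and redoes the arithmetic from the figures quoted above (your own numbers will of course differ from the K6/2's):

```python
# Measure gzip compression throughput and compare it with a link speed,
# reproducing the back-of-envelope test above. The synthetic input is an
# arbitrary choice; compressibility of the data affects the timing.
import gzip
import time

data = b"<message>0123456789 some moderately compressible text</message>\n" * 120_000

t0 = time.process_time()          # CPU time, as in the original test
compressed = gzip.compress(data)
cpu_seconds = time.process_time() - t0

in_mbit_per_s = len(data) * 8 / cpu_seconds / 1e6
print(f"compressed {len(data)} -> {len(compressed)} bytes "
      f"in {cpu_seconds:.3f} s CPU ({in_mbit_per_s:.1f} Mbit/s of input)")

# The figures quoted in the post: 11,449,004 bytes in 10.693 s of CPU
# is about 8.6 Mbit/s of input consumed - under what a 10Mbit link carries.
posted = 11449004 * 8 / 10.693 / 1e6
print(f"posted figures work out to {posted:.1f} Mbit/s of input")
```

If the measured input rate is below the link rate, the CPU, not the wire, is the bottleneck - which is the whole point of the test.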

Non-gzipped XML would probably have been OK in this situation since,
luckily, the data happens to be a series of strings about 20k in length,
so the overhead of <?xml version='1.0' ?><message>...</message> wouldn't
be an issue. But if it were highly structured or numerical data, the
overhead of <number>123456789</number> over a single 32-bit word (at
least a factor of four) would have meant we'd need four 100Base-T links
coming from this machine - or gigabit Ethernet - to fit the required
just-under-100Mbit/sec of raw data.
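The markup-overhead arithmetic is easy to check; a small sketch, using struct's network-order packing to stand in for the raw 32-bit encoding:

```python
# Compare the wire size of a 32-bit value sent raw versus wrapped in
# element markup, as in the example above.
import struct

value = 123456789
raw = struct.pack(">I", value)                  # 4 bytes, network byte order
xml = f"<number>{value}</number>".encode()      # 26 bytes of markup + digits

print(len(raw), len(xml), len(xml) / len(raw))  # 4 26 6.5
```

For this particular value the markup form is 6.5 times the size of the raw word, before any compression - and shorter values fare even worse, since the tag overhead is fixed.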

Raw data processing took just under 50% of the machine's CPU. If we'd had
to emit XML, we'd have had to gzip it all to fit it down the 100Mbit/sec
Ethernet, and there just wouldn't have been enough CPU to do that.


[1] The spec mandated something along the lines of 1,000 80Kb data packets
a second, IIRC - add TCP/IP overhead to that and you're pushing a
100Mbit/sec Ethernet, which was what the machine had connected to it.
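Reading "80Kb" as kilobits (an assumption, but the only reading that matches "pushing a 100Mbit/sec Ethernet"), the footnote's arithmetic works out as:

```python
# Back-of-envelope check of the footnote: 1,000 packets/s of 80 kbit
# each, plus a rough allowance for TCP/IP and Ethernet framing.
# The 6% overhead figure is an assumed ballpark, not a measurement.
packets_per_second = 1_000
packet_bits = 80_000                       # "80Kb" read as kilobits

payload_mbit = packets_per_second * packet_bits / 1e6
with_overhead = payload_mbit * 1.06        # assumed ~6% framing overhead

print(payload_mbit, with_overhead)         # 80.0 Mbit/s payload, ~85 on the wire
```

That leaves only a few Mbit/s of headroom on a 100Mbit link, consistent with the claim that XML markup on top would not have fitted.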

                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software