[Sorry if this gets through twice; I got weird messages from the mail
server.]
John Cowan wrote:
> Robin Berjon scripsit:
>>>A variety of small-scale studies have shown that general-purpose
>>>compression is generally as good as, or better than, some scheme that
>>>knows it's compressing XML.
>>
>>Err, quite the opposite. XMill beats gzip.
>
> This one is news to me, but I'm looking into it now.
You may also wish to take a look at Box (http://box.sf.net/). I don't
remember how well it compares to gzip in compression, but it's fast to
decode (the website is down today along with all the other SF sites, so
I can't look it up right now).
>>BiM/BiX requires a schema,
>
> Yes: by "knows it's compressing XML" I meant to imply "and doesn't know
> anything more than that".
I know, and that obviously makes things a little bit more complicated.
However in most non-pathological cases it is possible to apply
machine-learning techniques to deduce schema information (it also works
on pathological cases -- i.e. instances for which the only fathomable
pattern is the instance itself -- but it's rather useless there). That's
something we're seriously investigating in order to efficiently support
xs:any and xs:anyAttribute (for instance).
There is also a fair number of cases in which there is no schema per se,
but it can be usefully inferred from other metadata such as a WSDL
document, an XQuery...
>>but there are many ways in which a schema can be deduced, even with just
>>a raw document (and it can be done more intelligently than most tools
>>that deduce schema information from instances that I've seen out there
>>do it).
>
> Pointer(s)?
The schema deducers I was referring to are the one included in Castor
and the one on gotdotnet.com:
http://www.castor.org/
http://gotdotnet.com/team/xmltools/xsdinference/
Those tools are probably useful in cases where you just need a schema
but don't care whether it is the simplest schema for the given instance
or set of instances. They tend to produce schemata that are pretty much
snapshots of the instance and more or less exactly mirror it.
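To make the difference between a snapshot and a generalisation a bit more
concrete, here is a rough sketch in Python. It is not how Castor, the
gotdotnet tool, or our own inferencer works; it only illustrates the idea
of merging every occurrence of an element name into a single declaration
instead of mirroring the tree. The function name `infer`, the choice to
ignore child order, and the rule that "seen more than once" becomes
"unbounded" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def infer(xml_text):
    """Merge all occurrences of each element name into one declaration.

    Returns {element: {child: (min_occurs, max_occurs)}}, where min_occurs
    is 0 or 1 and max_occurs is 1 or "unbounded". Child order, attributes
    and text content are ignored (a simplification for this sketch).
    """
    root = ET.fromstring(xml_text)
    seen = defaultdict(int)  # element name -> number of occurrences
    # element name -> child name -> child count in each parent that has it
    child_counts = defaultdict(lambda: defaultdict(list))

    for elem in root.iter():
        seen[elem.tag] += 1
        counts = defaultdict(int)
        for child in elem:
            counts[child.tag] += 1
        for tag, n in counts.items():
            child_counts[elem.tag][tag].append(n)

    schema = {}
    for name, total in seen.items():
        decl = {}
        for child, per_parent in child_counts[name].items():
            # Optional if at least one occurrence of the parent lacks it.
            lo = 1 if len(per_parent) == total else 0
            # Assumption: anything repeated once may repeat without limit.
            hi = "unbounded" if max(per_parent) > 1 else 1
            decl[child] = (lo, hi)
        schema[name] = decl
    return schema


doc = (
    "<list>"
    "<item><name/><note/></item>"
    "<item><name/></item>"
    "<item><name/><note/><note/></item>"
    "</list>"
)
schema = infer(doc)
```

A snapshot-style tool would describe three distinct item shapes here; the
merged result has a single declaration for item, with a required name and
an optional, repeatable note.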
The schema inferencer we're developing tries hard to find the simplest
schema, because we need a schema that strikes the right balance between
generality and concision. Obviously, if you are going to send a decoder
update (using decoder bytecode) in the stream, you want that extra
information to save more bytes on the data it lets you decode than it
costs to send the decoder itself. I should have something to show in
that area early next year.
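The trade-off above is simple arithmetic, and can be sketched as follows.
The function names and the byte figures are made up for illustration; they
are not measurements from any real encoder.

```python
def update_pays_off(decoder_bytes, bytes_saved_per_doc, docs_expected):
    """True when the bytes saved on later documents exceed the update cost."""
    return bytes_saved_per_doc * docs_expected > decoder_bytes


def break_even_docs(decoder_bytes, bytes_saved_per_doc):
    """Smallest number of documents after which the update has paid off."""
    if bytes_saved_per_doc <= 0:
        return None  # the update never pays for itself
    return decoder_bytes // bytes_saved_per_doc + 1


# Hypothetical figures: a 2000-byte decoder update that saves 150 bytes
# on each subsequent document.
needed = break_even_docs(2000, 150)
```

A more general schema makes the update usable on more documents but saves
less on each one, which is why the inferencer has to balance the two.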
--
Robin Berjon <robin.berjon@expway.fr>
Research Engineer, Expway
7FC0 6F5F D864 EFB8 08CE 8E74 58E6 D5DB 4889 2488