ok, I doubt anyone really cared last time, but the format is a little closer
to stabilizing. well, ok, mostly it was me changing things "around" the
format (eg: hacking things so I could see namespaces during parsing, vs just
after parsing, ...).
note that I am no one of interest; presently I am a hobbyist living with my
parents and attending classes at a local community college (off on a small
east-asian island in the pacific). likewise, I have no money or job, so I am
not willing to try to pay for anything (or try to get my parents to buy it);
hell, most stuff doesn't ship here anyways (not in the us or another major
country? many places don't ship).
ok, my implementation of things is similar to dom.
I have recently run into sax, and can now see clearly how it differs from
dom.
I can say that my api resembles dom; in fact, some ideas were borrowed from
it. however, the w3c did not specify any c bindings afaik, so there is no use
trying to conform exactly to a non-existent spec...
ok, sax looks like it could be cool. its design could allow more direct
competition, eg, with the line-oriented text files I typically use in other
cases.
reconciling dom and sax style designs could be a problem though; likely it
would require a separate api, and possibly separate back-end
parsing/printing code as well.
I can say that, compared with trivially encoded wbxml, my format wins
size-wise.
it is my understanding that, apart from dtd's, all the text needs to go in
the strings table. the problem is the number of bytes it takes to refer to
something in the strings table (eg: 2-3 bytes per name). in my case, many
tags can become a single byte anyways absent a dtd, which may be an unfair
advantage here.
for my tests, wbxml ends up >2x the size of my format, which in turn is about
2x the size of gzip'ing the source xml. wbxml is still about 1/5 the size of
the input file though.
sadly, it is no longer as simple as it was originally, though "mru" is still
a basic technique, along with markov modeling for text strings (which encodes
a little faster than lz77), ...
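roughly, the mru part works something like this (just a sketch, not the
actual code; the table size and the 0x00 escape byte are made up for
illustration):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MRU_MAX 255

static char *mru_tab[MRU_MAX];
static int   mru_cnt = 0;

static void mru_push(const char *s)
{
    if (mru_cnt < MRU_MAX)
        mru_cnt++;
    else
        free(mru_tab[MRU_MAX - 1]);         /* drop the least recently used */
    memmove(mru_tab + 1, mru_tab, (mru_cnt - 1) * sizeof(char *));
    mru_tab[0] = strdup(s);
}

static void emit_tag(FILE *out, const char *tag)
{
    int i;

    for (i = 0; i < mru_cnt; i++) {
        if (!strcmp(mru_tab[i], tag)) {
            char *t = mru_tab[i];
            fputc(i + 1, out);              /* hit: a single index byte */
            memmove(mru_tab + 1, mru_tab, i * sizeof(char *));
            mru_tab[0] = t;                 /* move to front */
            return;
        }
    }

    fputc(0, out);                          /* miss: escape + literal string */
    fwrite(tag, 1, strlen(tag) + 1, out);
    mru_push(tag);
}

so a tag seen recently costs one byte, and the literal string only shows up
the first time (or after it falls out of the table).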
the whole reader/writer setup is presently about 900 loc of c source
(initially, it was closer to about 300 loc).
intended uses:
offline binary file storage (ad-hoc file formats, being intermixed with
other binary formats, ...);
working as a motivator for my use of xml in more places;
cases where the parse-trees "are" the internal representation, and otherwise
I was serializing to other forms;
...
non-intended uses:
particularly large files (well, more on the grounds of worrying about memory
use, though I don't see why not if a sax-like api is created);
over the network (because it is non-standard, after all...);
"interchangeable" files (I also spec'ed mappings for the features to textual
xml, which is intended to be used for any kind of human viewing or
interchange);
the case where the internal structure is not xml (really, why not just use a
specialized format then?...).
in general, the format does not use any "external" means to try to save
space or deal with structure (eg: dtd's or schemas); instead, it relies on a
"context" which is built dynamically during reading or writing to eliminate
common patterns (much as typical data compressors do, though with the term
"context" being rather vague here, eg, the symbol counts/probabilities used
in adaptive huffman or arithmetic coding, or the windows used in lz77 and
friends). in my case, the context includes the mru's and the data related to
markov modeling/prediction.
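eg, the context might hold something like this (the names and sizes here are
made up for illustration, not from the actual code):

#define MRU_MAX    255
#define MARKOV_CTX 256  /* one predictor table per preceding byte */

struct dtx_context {
    /* mru tables, kept separately per kind of string */
    char *mru_tags[MRU_MAX];
    char *mru_attr[MRU_MAX];
    char *mru_ns  [MRU_MAX];
    int   n_tags, n_attr, n_ns;

    /* order-1 markov counts for text strings:
       counts[prev][cur] is bumped each time 'cur' follows 'prev' */
    unsigned short counts[MARKOV_CTX][256];
};

the point being that the reader and writer build the same tables in lockstep,
so nothing like this ever has to be stored in the file itself.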
binary payloads are a feature, among other things.
also vaguely being considered is some possible means of making parts of the
files "indexable". this would mean, eg, either clearing or locking the
context so that each indexed chunk can be read correctly on its own, and, of
course, providing an index mechanism. an alternative: for the
toplevel/indexable structures, use a clearly different representation (eg:
modified IFF, a purely inline/verbose format, ...).
<foo xmlns:dtx="uuid:78422bcf-661a-470d-bcd0-cb592cb0f783">
    <bar>
        <baz dtx:type="binary.base64">
            Afg5G8a5Kw...
        </baz>
    </bar>
</foo>
the whole magical namespace thingy...
attributes in this namespace cause special behavior on the part of the
parser.
otherwise, there is xml-schema, but, of course, I am not using that.
likewise, this namespace will be used for representing/controlling some of
the funky features in the binary variant (binary nodes, ...).
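eg, the parser side might trap these something like (just a sketch; the node
type and node_add_attr() here are stand-ins for whatever the real dom-ish api
is, not my actual code):

#include <string.h>

#define DTX_NS "uuid:78422bcf-661a-470d-bcd0-cb592cb0f783"

#define NODE_BINARY  0x01   /* text content is a base64 binary payload */
#define NODE_INDEXED 0x02   /* toplevel whose children get indexed */

struct xml_node { int flags; /* name, attrs, children elided */ };

void node_add_attr(struct xml_node *n, const char *ns,
                   const char *key, const char *val);

static void handle_attr(struct xml_node *node, const char *ns,
                        const char *key, const char *val)
{
    if (ns && !strcmp(ns, DTX_NS)) {
        if (!strcmp(key, "type") && !strcmp(val, "binary.base64")) {
            node->flags |= NODE_BINARY;
            return;
        }
        if (!strcmp(key, "index") && !strcmp(val, "true")) {
            node->flags |= NODE_INDEXED;
            return;
        }
    }
    node_add_attr(node, ns, key, val);   /* anything else is a plain attribute */
}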
for an indexable toplevel it could look like:
<foo dtx:index="true" xmlns:dtx="uuid:78422bcf-661a-470d-bcd0-cb592cb0f783">
    <bar>...</bar>
    <bar>...</bar>
    <bar>...</bar>
    <bar>...</bar>
    ...
</foo>
for the structure I could make something like:
0x1F 0x01 <STR ns> <STR tag> <ATTRLST2>
    <U32 szdata> <BYTE data[szdata]>
    <U32 szindex> <BYTE index[szindex]>

ATTRLST2 = (<STR ns> <STR key> <STR value> <ATTRLST2>) | (0 0 0)
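reading the attribute list back is then basically (a sketch; read_str() and
the reader/node types here are stand-ins, not my actual api):

struct dtx_reader;
struct xml_node;

char *read_str(struct dtx_reader *rd);    /* returns "" for a 0 entry */
void  node_add_attr(struct xml_node *n, const char *ns,
                    const char *key, const char *val);

static void read_attrlst2(struct dtx_reader *rd, struct xml_node *node)
{
    for (;;) {
        char *ns  = read_str(rd);
        char *key = read_str(rd);
        char *val = read_str(rd);

        if (!*ns && !*key && !*val)       /* hit the (0 0 0) terminator */
            break;
        node_add_attr(node, ns, key, val);
    }
}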
each indexed item would be as before, but with an initial-state context and
possibly a length prefix.
the index would be a large array of offsets. U32's could be used for the
sizes and offsets mainly because they allow seeking around and filling them
in later.
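the "filling them in later" part is the usual trick of writing a placeholder
and backpatching it, eg (sketch only; assumes fixed-width, host-endian
fields):

#include <stdio.h>
#include <stdint.h>

static long begin_sized(FILE *out)
{
    long pos = ftell(out);
    uint32_t zero = 0;
    fwrite(&zero, 4, 1, out);      /* placeholder for szdata */
    return pos;
}

static void end_sized(FILE *out, long pos)
{
    long end = ftell(out);
    uint32_t sz = (uint32_t)(end - (pos + 4));
    fseek(out, pos, SEEK_SET);
    fwrite(&sz, 4, 1, out);        /* backpatch the real size */
    fseek(out, end, SEEK_SET);
}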
alternatively, there are other ways an index could be implemented, eg,
allowing dynamic change of file contents (eg: something b-tree like),
dunno...
an IFF based format is also possible:
FORM sbxe {
    LIST nidx {
        tag {...}
        data {nodes...}
        idx {indices...}
    }
    ...
}
my current files could probably be wrapped, eg, as:
FORM sbxe {
    data {...}
}
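eg, a chunk writer along these lines would do for the wrapping (sketch;
big-endian sizes and even padding as in classic IFF):

#include <stdio.h>
#include <stdint.h>

static void write_be32(FILE *out, uint32_t v)
{
    fputc((v >> 24) & 0xFF, out);
    fputc((v >> 16) & 0xFF, out);
    fputc((v >>  8) & 0xFF, out);
    fputc( v        & 0xFF, out);
}

/* a chunk is a 4-byte id, a big-endian 32-bit size, the payload,
   and a pad byte if the size is odd */
static void write_chunk(FILE *out, const char *id,
                        const void *data, uint32_t size)
{
    fwrite(id, 1, 4, out);
    write_be32(out, size);
    fwrite(data, 1, size, out);
    if (size & 1)
        fputc(0, out);
}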
but, I am sure, probably no one cares anyways.