ok, I doubt anyone really cared last time, but the format is a little closer
to stabilizing. well, ok, mostly it was me changing things "around" the
format (eg: hacking things so I could see namespaces during parsing, vs just
after parsing, ...).
note that I am no one of interest; presently I am a hobbyist living with my
parents and attending classes at a local community college (off on a small
east-asian island in the pacific). likewise, I have no money or job, so I am
not willing to try to pay for anything (or try to get my parents to buy it);
hell, most stuff doesn't ship here anyways (not in the us or another major
country? many places don't ship).
ok, my implementation of things is similar to dom.
I have recently run into sax, and can now see clearly how it differs from
dom.
I can say that my api resembles dom; in fact, some ideas were borrowed from
it. however, the w3c did not specify any c bindings afaik, so there is no use
trying to conform exactly to a non-existent spec...
ok, sax looks like it could be cool. its design could allow more direct
competition, eg, with the line-oriented text files I typically use in other
cases.
reconciling dom and sax style designs could be a problem though; likely it
would require a separate api, and possibly separate back-end
parsing/printing code as well.
I can say that, compared with trivially encoded wbxml, my format wins
size-wise.
it is my understanding that, apart from dtd's, all the text needs to go in
the strings table. the problem is the number of bytes it takes to refer to
something in the strings table (eg: 2-3 bytes per name). in my case, many
tags can become a single byte anyways absent a dtd, which may be an unfair
advantage here.
for my tests, wbxml ends up >2x the size of my format, which in turn is about
2x the size of gzip'ing the source xml. wbxml is still about 1/5 the size of
the input file though.
sadly, it is no longer as simple as it was originally, though "mru" is still
a basic technique, along with markov modeling for text strings (which encodes
a little faster than lz77), ...
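roughly, the mru part works something like this (just a sketch, not the
actual code; the table size and the 0x00 escape byte are made up for
illustration):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MRU_MAX 255

static char *mru_tab[MRU_MAX];
static int   mru_cnt = 0;

static void mru_push(const char *s)
{
    if (mru_cnt < MRU_MAX)
        mru_cnt++;
    else
        free(mru_tab[MRU_MAX - 1]);         /* drop the least recently used */
    memmove(mru_tab + 1, mru_tab, (mru_cnt - 1) * sizeof(char *));
    mru_tab[0] = strdup(s);
}

static void emit_tag(FILE *out, const char *tag)
{
    int i;

    for (i = 0; i < mru_cnt; i++) {
        if (!strcmp(mru_tab[i], tag)) {
            char *t = mru_tab[i];
            fputc(i + 1, out);              /* hit: a single index byte */
            memmove(mru_tab + 1, mru_tab, i * sizeof(char *));
            mru_tab[0] = t;                 /* move to front */
            return;
        }
    }

    fputc(0, out);                          /* miss: escape + literal string */
    fwrite(tag, 1, strlen(tag) + 1, out);
    mru_push(tag);
}

so a tag seen recently costs one byte, and the literal string only shows up
the first time (or after it falls out of the table).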
the whole reader/writer setup is presently about 900 loc of c source
(initially, it was closer to about 300 loc).
intended uses:
offline binary file storage (ad-hoc file formats, being intermixed with
other binary formats, ...);
working as a motivator for my use of xml in more places;
cases where the parse-trees "are" the internal representation, and otherwise
I was serializing to other forms;
...
non-intended uses:
particularly large files (well, more on the grounds of worrying about memory
use, though I don't see why not if a sax-like api is created);
over the network (because it is non-standard, after all...);
"interchangeable" files (I also spec'ed mappings for the features to textual
xml, which is intended to be used for any kind of human viewing or
interchange);
the case where the internal structure is not xml (really, why not just use a
specialized format then?...).
in general, the format does not use any "external" means to try to save
space or deal with structure (eg: dtd's or schemas); instead, it relies on a
"context" which is built dynamically during reading or writing to eliminate
common patterns (much as typical data compressors do, though with the term
"context" being rather vague here, eg, the symbol counts/probabilities used
in adaptive huffman or arithmetic coding, or the windows used in lz77 and
friends). in my case, the context includes the mru's and the data related to
markov modeling/prediction.
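eg, the context might hold something like this (the names and sizes here are
made up for illustration, not from the actual code):

#define MRU_MAX    255
#define MARKOV_CTX 256  /* one predictor table per preceding byte */

struct dtx_context {
    /* mru tables, kept separately per kind of string */
    char *mru_tags[MRU_MAX];
    char *mru_attr[MRU_MAX];
    char *mru_ns  [MRU_MAX];
    int   n_tags, n_attr, n_ns;

    /* order-1 markov counts for text strings:
       counts[prev][cur] is bumped each time 'cur' follows 'prev' */
    unsigned short counts[MARKOV_CTX][256];
};

the point being that the reader and writer build the same tables in lockstep,
so nothing like this ever has to be stored in the file itself.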
binary payloads are a feature, among other things.
also vaguely being considered is some possible means of making parts of the
files "indexable". this would mean, eg, either clearing or locking the
context so that each indexed chunk can be read correctly on its own, and, of
course, providing an index mechanism. an alternative: for the
toplevel/indexable structures, use a clearly different representation (eg:
modified IFF, a purely inline/verbose format, ...).
<foo xmlns:dtx="uuid:78422bcf-661a-470d-bcd0-cb592cb0f783">
    <bar>
        <baz dtx:type="binary.base64">
            Afg5G8a5Kw...
        </baz>
    </bar>
</foo>
the whole magical namespace thingy...
attributes in this namespace cause special behavior on the part of the
parser.
otherwise, there is xml-schema, but, of course, I am not using that.
likewise, this namespace will be used for representing/controlling some of
the funky features in the binary variant (binary nodes, ...).
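eg, the parser side might trap these something like (just a sketch; the node
type and node_add_attr() here are stand-ins for whatever the real dom-ish api
is, not my actual code):

#include <string.h>

#define DTX_NS "uuid:78422bcf-661a-470d-bcd0-cb592cb0f783"

#define NODE_BINARY  0x01   /* text content is a base64 binary payload */
#define NODE_INDEXED 0x02   /* toplevel whose children get indexed */

struct xml_node { int flags; /* name, attrs, children elided */ };

void node_add_attr(struct xml_node *n, const char *ns,
                   const char *key, const char *val);

static void handle_attr(struct xml_node *node, const char *ns,
                        const char *key, const char *val)
{
    if (ns && !strcmp(ns, DTX_NS)) {
        if (!strcmp(key, "type") && !strcmp(val, "binary.base64")) {
            node->flags |= NODE_BINARY;
            return;
        }
        if (!strcmp(key, "index") && !strcmp(val, "true")) {
            node->flags |= NODE_INDEXED;
            return;
        }
    }
    node_add_attr(node, ns, key, val);   /* anything else is a plain attribute */
}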
for an indexable toplevel it could look like:
<foo dtx:index="true" xmlns:dtx="uuid:78422bcf-661a-470d-bcd0-cb592cb0f783">
    <bar>...</bar>
    <bar>...</bar>
    <bar>...</bar>
    <bar>...</bar>
    ...
</foo>
for the structure I could make something like:
0x1F 0x01 <STR ns> <STR tag> <ATTRLST2>
    <U32 szdata> <BYTE data[szdata]>
    <U32 szindex> <BYTE index[szindex]>

ATTRLST2 = (<STR ns> <STR key> <STR value> <ATTRLST2>) | (0 0 0)
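reading the attribute list back is then basically (a sketch; read_str() and
the reader/node types here are stand-ins, not my actual api):

struct dtx_reader;
struct xml_node;

char *read_str(struct dtx_reader *rd);    /* returns "" for a 0 entry */
void  node_add_attr(struct xml_node *n, const char *ns,
                    const char *key, const char *val);

static void read_attrlst2(struct dtx_reader *rd, struct xml_node *node)
{
    for (;;) {
        char *ns  = read_str(rd);
        char *key = read_str(rd);
        char *val = read_str(rd);

        if (!*ns && !*key && !*val)       /* hit the (0 0 0) terminator */
            break;
        node_add_attr(node, ns, key, val);
    }
}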
each indexed item would be as before, but with an initial-state context and
possibly a length prefix.
the index would be a large array of offsets. U32's could be used for the
sizes and offsets mainly because they allow seeking around and filling them
in later.
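the "filling them in later" part is the usual trick of writing a placeholder
and backpatching it, eg (sketch only; assumes fixed-width, host-endian
fields):

#include <stdio.h>
#include <stdint.h>

static long begin_sized(FILE *out)
{
    long pos = ftell(out);
    uint32_t zero = 0;
    fwrite(&zero, 4, 1, out);      /* placeholder for szdata */
    return pos;
}

static void end_sized(FILE *out, long pos)
{
    long end = ftell(out);
    uint32_t sz = (uint32_t)(end - (pos + 4));
    fseek(out, pos, SEEK_SET);
    fwrite(&sz, 4, 1, out);        /* backpatch the real size */
    fseek(out, end, SEEK_SET);
}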
alternatively, there are other ways an index could be implemented, eg,
allowing dynamic change of file contents (eg: something b-tree like),
dunno...
an IFF based format is also possible:
FORM sbxe {
    LIST nidx {
        tag {...}
        data {nodes...}
        idx {indices...}
    }
    ...
}
my current files could probably be wrapped, eg, as:
FORM sbxe {
    data {...}
}
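eg, a chunk writer along these lines would do for the wrapping (sketch;
big-endian sizes and even padding as in classic IFF):

#include <stdio.h>
#include <stdint.h>

static void write_be32(FILE *out, uint32_t v)
{
    fputc((v >> 24) & 0xFF, out);
    fputc((v >> 16) & 0xFF, out);
    fputc((v >>  8) & 0xFF, out);
    fputc( v        & 0xFF, out);
}

/* a chunk is a 4-byte id, a big-endian 32-bit size, the payload,
   and a pad byte if the size is odd */
static void write_chunk(FILE *out, const char *id,
                        const void *data, uint32_t size)
{
    fwrite(id, 1, 4, out);
    write_be32(out, size);
    fwrite(data, 1, size, out);
    if (size & 1)
        fputc(0, out);
}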
but, I am sure, probably no one cares anyways.