continuing on a kind of permathread I guess, oh well.
late last night, an idea popped up for a fairly simplistic binary xml
encoding, which aims to:
be smaller than textual xml;
not be significantly more complicated than textual xml.
before going to sleep, I beat together the basic idea.
this morning, I messed with the spec a little.
I am not sure if I will do anything with this. I may implement it, at
least for my own uses (basic data storage type stuff). amusingly enough,
xml is more often used as an internal representation of data than an
external one in my projects...
I just thought maybe people here might be interested.
any thoughts or comments?
spec dump:
---
Simplistic Binary XML Encoding
Goals:
Does not require a complicated encoder or decoder;
Does not involve a separate compression/decompression pass.
Does Not Attempt:
Large data sets or random access;
Decent compression;
Complete representation of XML features.
The encoding is viewed as a stream of bytes.
Strings are encoded as ASCII or UTF-8, and with '\0' as a terminator.
Files will begin with the string "SBXE".
Later versions may alter the string to reflect the version, or include extra
data after this string.
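A minimal sketch of the string and header rules above, in Python. The helper names (encode_string, decode_string, header) are made up for illustration, and rejecting embedded NUL characters is an assumption, since the spec does not say what to do with them.

```python
from typing import Tuple

def encode_string(s: str) -> bytes:
    """Encode a string as UTF-8 followed by a '\\0' terminator."""
    data = s.encode("utf-8")          # ASCII is a subset of UTF-8
    if b"\x00" in data:
        # Assumption: an embedded NUL would be read as the terminator.
        raise ValueError("embedded NUL cannot be represented")
    return data + b"\x00"

def decode_string(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a terminated string at pos; return it and the next position."""
    end = buf.index(b"\x00", pos)     # raises ValueError if unterminated
    return buf[pos:end].decode("utf-8"), end + 1

def header() -> bytes:
    """The file prefix: the string "SBXE", terminated like any other."""
    return encode_string("SBXE")
```

For instance, header() + encode_string("foo") gives the first nine bytes of a file whose first string is "foo".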
0x00: general purpose ending marker
0x01..0x1F: Special, Reserved
0x20..0x3E: Namespace Prefix MRU
0x3F: Namespace String
0x40..0x7E: Opening Tag/Attr MRU
0x7F: Opening Tag/Attr String
0x80..0xFE: Text MRU
0xFF: Text String
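The byte ranges above can be turned into a small dispatch function. This is only a sketch: the kind names and the slot numbering (offset from the start of each range) are illustrative choices, not part of the spec.

```python
def classify(b: int):
    """Map a lead byte to (kind, slot); slot is None for non-MRU bytes."""
    if b == 0x00:
        return ("end", None)            # general purpose ending marker
    if b <= 0x1F:
        return ("reserved", None)       # 0x01..0x1F
    if b <= 0x3E:
        return ("ns_mru", b - 0x20)     # 31 namespace prefix slots
    if b == 0x3F:
        return ("ns_string", None)
    if b <= 0x7E:
        return ("tag_mru", b - 0x40)    # 63 tag/attr slots
    if b == 0x7F:
        return ("tag_string", None)
    if b <= 0xFE:
        return ("text_mru", b - 0x80)   # 127 text slots
    return ("text_string", None)        # 0xFF
```

Read this way, the three MRU lists hold at most 31, 63 and 127 entries respectively.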
Node
[<NS>] <TAG> <ATTR*> 0 <BODY*> 0
Attr
[<NS>] <TAG> <TEXT*>
Body
<NODE>|<TEXT>
Text may be represented as a run of raw strings and MRU references.
A single text string should be limited to 255 bytes or fewer.
MRU Scheme
Whenever a string is about to be encoded, the encoder checks whether it was
encoded recently. If so, a reference to its slot in the MRU list is encoded
instead, and that value is moved to the front.
Otherwise, the new string is encoded directly and added to the front of the
list.
Higher numbers mean more recent matches, so things shift in the direction of
lower numbers. Upon shifting off the end a string is essentially forgotten.
Tags and Attributes will have the same space in the encoding, but will refer
to different MRU lists.
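One possible reading of the MRU scheme, sketched in Python. The class and method names are hypothetical. It assumes the most recent string always sits in the highest slot of its range (0x7E for tags, for example), even while the list is only partly filled, and that a decoder applies the same move-to-front step as the encoder.

```python
class MRU:
    """Move-to-front list mapped onto a byte range of `size` slots."""

    def __init__(self, base: int, size: int):
        self.base = base      # first byte of the range, e.g. 0x40
        self.size = size      # number of slots, e.g. 63
        self.items = []       # items[0] is the most recent string

    def lookup(self, s: str):
        """Encoder side: return the reference byte for s, or None."""
        if s not in self.items:
            return None
        k = self.items.index(s)
        self.items.insert(0, self.items.pop(k))   # move hit to the front
        return self.base + self.size - 1 - k      # higher byte = more recent

    def add(self, s: str):
        """Record a string that was just encoded directly."""
        self.items.insert(0, s)
        del self.items[self.size:]                # oldest entry is forgotten

    def resolve(self, byte: int) -> str:
        """Decoder side: turn a reference byte back into its string."""
        k = self.base + self.size - 1 - byte
        s = self.items.pop(k)                     # IndexError if slot is empty
        self.items.insert(0, s)
        return s
```

Since tags and attributes share the 0x40..0x7E range but use different lists, an implementation along these lines would keep two MRU(0x40, 63) instances and pick one based on position in the stream.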
--
the mru lists would be built from the linear contents of the file, in the
order the strings appear.
ok, I don't have any good examples.
<foo><bar>baz</bar><bar baz="baz"/></foo>
41 bytes
'SBXE\0' 0x7F 'foo\0' 0x00 0x7F 'bar\0' 0x00 0xFF 'baz\0' 0x00
0x7E 0x7F 'baz\0' 0xFE 0x00 0x00 0x00
33 bytes (28 absent the prefix).
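a rough sketch of an encoder for documents like this one, in Python. it follows the Node/Attr/Body grammar literally, skips namespaces, and assumes every string is well under 255 bytes. the tuple layout (tag, attrs, body) and all names here are made up for illustration. read this way, a single 0x00 closes the first bar's body, and the example document comes out at 33 bytes including the 5-byte prefix.

```python
class MRU:
    """Move-to-front list; the top slot always holds the newest string."""
    def __init__(self, base, size):
        self.base, self.size, self.items = base, size, []  # items[0] newest

    def lookup(self, s):
        if s not in self.items:
            return None
        k = self.items.index(s)
        self.items.insert(0, self.items.pop(k))   # move hit to the front
        return self.base + self.size - 1 - k

    def add(self, s):
        self.items.insert(0, s)
        del self.items[self.size:]                # forget the oldest entry


class Encoder:
    def __init__(self):
        self.tags = MRU(0x40, 63)     # element names
        self.attrs = MRU(0x40, 63)    # attribute names: same bytes, own list
        self.text = MRU(0x80, 127)    # text and attribute values

    def _emit(self, mru, s):
        ref = mru.lookup(s)
        if ref is not None:
            return bytes([ref])                   # one-byte MRU reference
        mru.add(s)
        literal = mru.base + mru.size             # 0x7F or 0xFF
        return bytes([literal]) + s.encode("utf-8") + b"\x00"

    def node(self, tag, attrs, body):
        out = self._emit(self.tags, tag)
        for name, value in attrs:
            out += self._emit(self.attrs, name) + self._emit(self.text, value)
        out += b"\x00"                            # end of attributes
        for child in body:
            if isinstance(child, str):
                out += self._emit(self.text, child)
            else:
                out += self.node(*child)
        return out + b"\x00"                      # end of body


def encode(root):
    return b"SBXE\x00" + Encoder().node(*root)


# <foo><bar>baz</bar><bar baz="baz"/></foo>
doc = ("foo", [], [("bar", [], ["baz"]),
                   ("bar", [("baz", "baz")], [])])
encoded = encode(doc)
print(len(encoded), encoded.hex(" "))
```

the second bar becomes the single byte 0x7E and the attribute value becomes 0xFE, which is where the savings over the text form come from.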
longer examples would probably do a little better.