Hi all,
Have you looked into BinX regarding binary XML formats? As far as I can
tell (and from only a brief knowledge of binx) your idea is fairly
similar in purpose (maybe in implementation, but I don't know enough to
say).
Just thought I'd mention it in case it can help to prevent needless
duplication of work,
Cheers
Rich
--------------------------------
Richard Bruin
PhD Student
Department of Earth Sciences
University of Cambridge
On Wed, 2004-11-17 at 16:44, David Lieberman AWDSF wrote:
> Neat!
>
> Good ideas, all. (if you ask me)
>
> It'll be interesting to see when you're finished.
>
> David Lieberman
> http://www.awdsf.com
>
>
> -----Original Message-----
> From: cr88192 [mailto:cr88192@hotmail.com]
> Sent: Wednesday, November 17, 2004 6:55 AM
> To: xml-dev@lists.xml.org
> Subject: [xml-dev] hello, new to list, thoughts
>
> ok, I was led here because, as far as I can tell, the group
> gmane.text.xml.devel is a mirror of this list (given that a reference to
> this list is appended to the messages there).
> I apologize if this is not the case.
> pardon if I am being a troll, I am new here.
>
> I am mostly requesting general comments, eg, on things that could be
> improved in my idea. I am not expecting anyone really to take me seriously.
>
>
> ok. so I have been recently doing something roughly along the lines of a
> binary xml (not exactly, but I have designed it such that a basic subset of
> xml maps fairly well, and doesn't give up any real info in the conversion).
> namespace definitions, however, are similar but not exactly the same, so
> converting them would be left to the conversion tool.
>
> I tried to retain what I felt to be the general spirit and semantics of xml,
> as imo they seem quite decent for data (though, difficult to fully recognize
> as such or reason about). imo, they are better than a plain tree. I had
> considered stripping down the semantics at some points, but some things
> seemed interdependent (one needs rules and complexity in some places to
> grant freedoms in others, and one needs to figure out where to be strict and
> where to be lax...).
>
> why?
> personally, I don't feel that within xml's core domains (network
> communications, messages, "documents", ...) a binary variant would be
> particularly helpful, however:
> my interests lie more in data storage (for what xml works well for, I use
> that);
> formats like riff, ebml, ... leave a lot to be desired imo wrt semantics;
> textual xml is not that great for data storage imo, eg, one can't skip over
> data or jump around in the same way one can with, say, riff, and imo base64
> coding would not be very good for cases when size is a priority, or if much
> of the data is large binary chunks.
>
> in a kind of ideological frenzy, I beat something together, and have spent a
> while refining it.
>
> I also have a basic implementation (not yet online, I have been a bit behind
> on this kind of thing recently).
> not much coding has been done recently (largely I am stuck fiddling
> with the details of the design, among other things not related to this).
>
> things like file size or processor overhead are not high priorities here, I
> just try to save space when possible, and avoid wasting too much processor
> time. my results thus far have shown the generated files to be slightly
> smaller than the input xml (presently lacking any kind of content string
> compression, though tag compression is done). this could be viewed as a good
> sign I guess.
>
>
>
> here is a recent draft of the spec:
> ----
> XLIFF (0.1.1):
> Partially fueled by an argument about EBML, I came up with this.
> It has little to do with LIFF, but, hell, I am not really using LIFF (it
> ended up too close to RIFF and too generally ugly...).
> This format shows itself to be kind of a pain to code up, but this is not
> entirely unexpected.
>
> Cleaning up spec some from 10-29 version, minor alterations.
> Stripping out container stuff, as it clutters the spec and doesn't really
> make sense in this context anyways (keeping the old version as the idea may
> be adaptable to a different format).
>
> Goals:
> A binary format with similar flexibility to XML (X);
> Support for large files and datasets (L);
> Sort of like RIFF and IFF (IFF).
>
> This will be a TLV format with attributes and namespaces similar to XML.
> It will use a tag dictionary to help reduce the total file size.
> It should be acceptable for random access and "big chunks of data" style
> uses (like RIFF, IFF, and EBML).
>
>
> Numbers:
> The MSB (bit 7) of each byte serves to indicate the presence of following
> (higher order) bytes.
>
> 0xxxxxxx, -64..63
> 1xxxxxxx 0xxxxxxx, -8192..8191
> 1xxxxxxx 1xxxxxxx 0xxxxxxx, -1048576..1048575 (roughly -1M..1M)
> ..
>
> Values are in Low-High order and with 2s complement encoding.
> As a result, the sign is implicitly contained in bit 6 of the last byte.
>
> The maximum value of a number depends on the implementation, and an
> implementation is allowed to refuse overly large numbers.
> However, I will spec that the limit should be at least 32 bits (ie, a 5 byte,
> 35 bit number with the upper 4 bits either all 0 or all 1).
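
A minimal C sketch of the number encoding described above, assuming a 32-bit value limit. The names xliff_encode_number and xliff_decode_number are made up for illustration and are not taken from the XLIFFW/XLIFFR code quoted further down; this is one reading of the draft, not the reference behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLIFF_MAX_NUMBER_BYTES 5   /* 5 * 7 = 35 bits, enough for 32-bit values */

/* Writes v low-order group first, 7 bits per byte; bit 7 marks "more bytes
 * follow". Returns the number of bytes written (1..5). */
size_t xliff_encode_number(int32_t v, uint8_t *buf)
{
    size_t n = 0;
    assert(buf != NULL);
    for (;;) {
        uint8_t low = (uint8_t)((uint32_t)v & 0x7F);
        /* floor(v / 128) without right-shifting a negative value */
        v = (v >= 0) ? (v >> 7) : ~((~v) >> 7);
        /* stop once the remaining bits are pure sign extension of bit 6 */
        if ((v == 0 && !(low & 0x40)) || (v == -1 && (low & 0x40))) {
            buf[n++] = low;
            return n;
        }
        buf[n++] = (uint8_t)(low | 0x80);
    }
}

/* Reads one number from buf[0..len). Returns the bytes consumed, or 0 if the
 * input is truncated, longer than 5 bytes, or does not fit in 32 bits. */
size_t xliff_decode_number(const uint8_t *buf, size_t len, int32_t *out)
{
    uint64_t acc = 0;
    unsigned shift = 0;
    size_t i;
    assert(buf != NULL && out != NULL);
    for (i = 0; i < len && i < XLIFF_MAX_NUMBER_BYTES; i++) {
        acc |= (uint64_t)(buf[i] & 0x7F) << shift;
        shift += 7;
        if (!(buf[i] & 0x80)) {
            int64_t v = (int64_t)acc;
            if (buf[i] & 0x40)              /* sign lives in bit 6 of last byte */
                v -= (int64_t)1 << shift;
            if (v < INT32_MIN || v > INT32_MAX)
                return 0;                   /* upper 4 of 35 bits not uniform */
            *out = (int32_t)v;
            return i + 1;
        }
    }
    return 0;
}
```

With this reading, 63 encodes as the single byte 0x3F, -64 as 0x40, and 64 needs two bytes (0xC0 0x00), which matches the ranges listed in the draft.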
>
>
> Strings:
> {
> Number len; //length if >0, dictionary index if <0, empty string if 0
> if(len>0)byte str[len];
> }
>
> Strings may be indexed in a dictionary. The exact semantics for dictionaries
> will depend on context.
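
A small C sketch of how a reader might interpret the leading Number of a String. The draft does not say how a negative length maps to a dictionary slot, so the mapping used here (-1 is slot 0, -2 is slot 1, ...) is purely an assumption, as are the type and function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { XSTR_EMPTY, XSTR_INLINE, XSTR_DICT } xstr_kind;

typedef struct {
    xstr_kind kind;
    size_t length;      /* XSTR_INLINE: number of raw bytes that follow */
    size_t dict_index;  /* XSTR_DICT: slot in the dictionary (assumed mapping) */
} xstr_header;

/* Classifies an already decoded String length field. */
void xliff_string_header(int32_t len, xstr_header *h)
{
    assert(h != NULL);
    h->kind = XSTR_EMPTY;
    h->length = 0;
    h->dict_index = 0;
    if (len > 0) {
        h->kind = XSTR_INLINE;
        h->length = (size_t)len;
    } else if (len < 0) {
        h->kind = XSTR_DICT;
        /* ASSUMPTION: -1 -> slot 0, -2 -> slot 1, ...; not stated in the draft */
        h->dict_index = (size_t)(-(len + 1));
    }
}
```

Only the inline case is followed by string bytes in the stream; the other two cases consist of the Number alone.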
>
> Node:
> {
> String ns;
> String tag;
> Number alen;
> if(alen>0)byte attr[alen];
> Number dlen;
> if(dlen>0)byte data[dlen]; //the contents depend on the ns and tag
> }
>
> Attr:
> {
> String ns;
> String tag;
> Number dlen;
> if(dlen>0)byte data[dlen];
> }
>
> In tags and attributes, negative chunk lengths are reserved.
>
> Implicit XLIFF attributes are allowed in nodes without restrictions as they
> are not generally viewed as part of the content.
> Duplicate attributes are not allowed.
> Attributes may not contain either tags or other attributes (other
> non-nestable structures will generally be allowed). Attributes should be
> kept small in both size and number.
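
Since the stated motivation is being able to skip over data, here is a hedged C sketch of stepping past a whole Node without looking inside it, based only on the Node layout above. The function names are hypothetical, and the small number reader is a local helper for this sketch (35 bit limit), not anything from the test app.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one Number (7 bits per byte, low group first, sign in bit 6 of the
 * last byte). Returns 0 on success, -1 on truncated or over-long input. */
static int read_number(const uint8_t *p, size_t len, size_t *pos, int64_t *out)
{
    uint64_t acc = 0;
    unsigned shift = 0;
    while (*pos < len && shift < 35) {
        uint8_t b = p[(*pos)++];
        acc |= (uint64_t)(b & 0x7F) << shift;
        shift += 7;
        if (!(b & 0x80)) {
            *out = (int64_t)acc;
            if (b & 0x40)
                *out -= (int64_t)1 << shift;
            return 0;
        }
    }
    return -1;
}

/* Advances *pos by n bytes if that stays inside the buffer. */
static int skip_bytes(size_t len, size_t *pos, int64_t n)
{
    if (n < 0 || (uint64_t)n > len - *pos)
        return -1;
    *pos += (size_t)n;
    return 0;
}

/* A String is a Number, followed by raw bytes only when the Number is > 0. */
static int skip_string(const uint8_t *p, size_t len, size_t *pos)
{
    int64_t n;
    if (read_number(p, len, pos, &n) != 0)
        return -1;
    return n > 0 ? skip_bytes(len, pos, n) : 0;
}

/* Returns the offset just past the node starting at pos, or 0 if malformed. */
size_t xliff_skip_node(const uint8_t *p, size_t len, size_t pos)
{
    int64_t n;
    int field;
    assert(p != NULL && pos <= len);
    if (skip_string(p, len, &pos) != 0 || skip_string(p, len, &pos) != 0)
        return 0;                            /* ns, then tag */
    for (field = 0; field < 2; field++) {    /* alen + attr, then dlen + data */
        if (read_number(p, len, &pos, &n) != 0 || n < 0)
            return 0;                        /* negative lengths are reserved */
        if (skip_bytes(len, &pos, n) != 0)
            return 0;
    }
    return pos;
}
```

A node always occupies at least four bytes, so a return value of 0 cannot be confused with a valid offset.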
>
>
>
> Tag Dictionary:
> There will be a dictionary responsible for namespaces, tags, and attribute
> names:
> This dictionary will behave similarly to a stack;
> Any new strings (those encoded directly and not already present) are added
> to the end of the current dictionary level;
> On descent into a node, the dictionary is retained from the parent, creating
> a new level;
> On exit from a node, any strings added in that level are removed (making it
> as if the descent had not occurred).
>
> The use of a dictionary allows denser packing (due to, eg, tags being 1 or 2
> bytes).
> Fairly dense packing might require building the entire dictionary upfront,
> but an encoder can have less dense packing, eg, by just encoding strings
> directly.
> The need for upfront dictionaries for packing to work well is related to
> ideas for allowing faster processing by not having to descend into subnodes
> to build an up-to-date dictionary, and also to allow random access in some
> cases.
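
The stack-like behaviour of the tag dictionary could be sketched in C roughly as follows. The xdict type, its fixed capacities, and the function names are invented for this example, and the strings are borrowed pointers that the caller keeps alive; the real writer may do this quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XDICT_MAX_STRINGS 256   /* arbitrary limits for the sketch */
#define XDICT_MAX_DEPTH   64

typedef struct {
    const char *strings[XDICT_MAX_STRINGS];
    size_t count;                    /* strings currently visible */
    size_t marks[XDICT_MAX_DEPTH];   /* value of count at each node entry */
    size_t depth;
} xdict;

void xdict_init(xdict *d)
{
    assert(d != NULL);
    d->count = 0;
    d->depth = 0;
}

/* Descend into a node: the parent's strings stay visible, a new level starts. */
int xdict_enter(xdict *d)
{
    assert(d != NULL);
    if (d->depth == XDICT_MAX_DEPTH)
        return -1;
    d->marks[d->depth++] = d->count;
    return 0;
}

/* Leave a node: drop everything added since the matching xdict_enter. */
void xdict_leave(xdict *d)
{
    assert(d != NULL && d->depth > 0);
    d->count = d->marks[--d->depth];
}

/* Returns the index of s, appending it to the current level if it is not
 * already present. Returns -1 when the dictionary is full. */
long xdict_intern(xdict *d, const char *s)
{
    size_t i;
    assert(d != NULL && s != NULL);
    for (i = 0; i < d->count; i++)
        if (strcmp(d->strings[i], s) == 0)
            return (long)i;
    if (d->count == XDICT_MAX_STRINGS)
        return -1;
    d->strings[d->count] = s;
    return (long)d->count++;
}
```

Because leaving a node just resets the count, indices handed out inside a node can be reused by a later sibling, which is what makes the exit look as if the descent had not happened.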
>
> Body Dictionaries:
> Each namespace will also have a "body dictionary", which may be used for
> compressing content strings in a content specific manner. The maintenance
> of these dictionaries is largely left to the format in question (however,
> they will implicitly pop off anything added within a node).
>
> The format is a tree, with the default toplevel tag flagging the format.
>
> Special tags could exist for maintenance purposes (eg: adding a basic set of
> common strings to the dictionary, ...).
> Like XML, some special tags may exist prior to the root to define things
> (the base dictionary, ...).
>
>
> Namespaces:
> The empty string namespace is the "default" namespace.
>
> Except for the builtin XLIFF namespaces (default, XLIFF, ...), namespaces
> are to be declared prior to use.
> Formats are given control over how namespaces are used/defined.
>
> Namespaces refer to several URIs:
> the Type URI, which defines the physical type of the container (eg: XML).
> the Namespace URI, which defines the semantic type of the container (eg: an
> XML Namespace).
>
> Any namespaces beginning with "XLIFF" are reserved for use by XLIFF.
> Further:
> XLIFF.*: basic XLIFF namespaces, failure to understand tags/attributes
> should cause failure;
> XLIFF.S.*: semantic XLIFF namespaces, these are allowed to be ignored, but
> should be preserved;
> XLIFF.O.*: optional XLIFF namespaces, these may be ignored or stripped off
> without affecting content.
>
> There will be a basic "XLIFF" namespace, and being unable to parse tags in
> this namespace will be viewed as an error (this namespace will handle things
> which may change the format of subsequent data, affect dictionaries, ...).
> XLIFF attributes are required to be understood before attempting to parse
> the contents of a node.
>
> "XLIFF.S" will be used for semantic XLIFF tags, failure to understand them
> will not compromise decoding of the format.
>
> "XLIFF.O" will be used for optional XLIFF tags, failure to understand or
> removal of them will not compromise decoding of the format.
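
The three classes of reserved namespace could be told apart with something like the C sketch below. The enum and function names are hypothetical. It treats "XLIFF.S" and "XLIFF.O" (exactly, or followed by a dot) as semantic and optional, and anything else beginning with "XLIFF" as must-understand, which is my reading of the rules above rather than something the draft spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    XNS_FOREIGN,    /* not reserved by XLIFF */
    XNS_REQUIRED,   /* must be understood, otherwise fail */
    XNS_SEMANTIC,   /* may be ignored, but should be preserved */
    XNS_OPTIONAL    /* may be ignored or stripped */
} xns_class;

/* True if ns equals base or continues it with ".something". */
static int is_or_under(const char *ns, const char *base)
{
    size_t n = strlen(base);
    return strncmp(ns, base, n) == 0 && (ns[n] == '\0' || ns[n] == '.');
}

xns_class xliff_classify_namespace(const char *ns)
{
    assert(ns != NULL);
    if (strncmp(ns, "XLIFF", 5) != 0)
        return XNS_FOREIGN;
    if (is_or_under(ns, "XLIFF.S"))
        return XNS_SEMANTIC;
    if (is_or_under(ns, "XLIFF.O"))
        return XNS_OPTIONAL;
    /* everything else starting with "XLIFF" is reserved and must be understood */
    return XNS_REQUIRED;
}
```

A decoder would fail on an unknown tag only in the XNS_REQUIRED case, and would carry unknown XNS_SEMANTIC tags through untouched.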
>
> "XLIFF.NS" could be a namespace for namespace declarations (like in XML).
> eg, "XLIFF.NS:foo" as an attribute could declare a foo namespace.
> the content of these tags could be an array of pairs of strings, eg:
> "TypeURI", "xliff:foo_container", "NSURI", "xliff:bar_ns".
>
>
> XLIFF:Header
> A tag required at the start of an XLIFF file, serving to mark it as a valid
> XLIFF file and to give general info about the file.
>
> XLIFF:TypeName Header Attribute, gives a general "file type name" used for
> identifying the type (apart from examining the contents or namespaces). It
> is encoded as a raw string.
>
> XLIFF:HeaderFlags Header Attribute
> Contains a number marking various flags for a file. Unknown flags may be
> ignored.
> 1&=dictionary is static within the file.
>
> Other attributes may be found in the header besides those related to XLIFF.
> An example would be custom or format specific tags.
>
>
> XLIFF:DictStrings Tag
> Defines a glob of strings to be added to the tag dictionary.
> This may be used for reducing the number of occurrences of some common tag
> which may only occur in sublevels or such.
>
> XLIFF:NodeFlags Attribute
> Contains a number marking various flags for a node. Unknown flags may be
> ignored.
> 1&=this node is compound;
> 2&=dictionary is static within this node.
>
> XLIFF.O:JUNK Tag/Attribute
> Marks a space as being "junk", thus allowing leaving some space for new
> tags, attributes, or padding.
>
>
>
> XML in XLIFF
>
> There are 2 ways to do XML in XLIFF:
> A unified document (all the content, or at least the toplevel, is XML);
> A mixed document (the toplevel is not necessarily XML).
>
> In the unified document case the toplevel tag is 'XML' (with at least the
> default namespace declared as being XML), which may contain any xml header
> tags and the xml root.
> Namespace declarations are to be converted to XLIFF style.
>
> In the mixed document case, at least the basic xml namespaces are to be
> declared in the file toplevel (along probably with others).
> The nstype for XML is 'xliff:binxml'.
>
> String Globs
> All textual data will be represented by a number of strings stuck end to
> end.
> These will use a "textglob dictionary", which will follow the same rules as
> that for the tag dictionary.
>
> Attribute data is defined as a glob of strings.
>
> An empty tag value flags a glob of textual data. The body for this is a
> string glob.
>
> --
>
>
> here is a small fragment from the test app:
> ----
> int EncodeXMLNode(XLIFFW_Context *ctx, NetParse_Node *node)
> {
>     NetParse_Attr *acur;
>     NetParse_Node *ncur;
>
>     /* text nodes become a tag with empty ns and name holding a string glob */
>     if(node->text)
>     {
>         XLIFFW_BeginTag(ctx, "", "");
>         XLIFFW_BeginAttrs(ctx);
>         XLIFFW_EndAttrs(ctx);
>
>         XLIFFW_BeginBody(ctx);
>         XLIFFW_WriteString(ctx, node->text);
>         XLIFFW_EndBody(ctx);
>         XLIFFW_EndTag(ctx);
>
>         return(0);
>     }
>
>     XLIFFW_BeginTag(ctx, node->ns, node->key);
>     XLIFFW_BeginAttrs(ctx);
>
>     /* copy each xml attribute across */
>     acur=node->attr;
>     while(acur)
>     {
>         XLIFFW_BeginAttr(ctx, acur->ns, acur->key);
>         XLIFFW_WriteString(ctx, acur->value);
>         XLIFFW_EndAttr(ctx);
>         acur=acur->next;
>     }
>
>     /* nodes with children are flagged as compound */
>     if(node->first)XLIFFW_NodeFlagsAttr(ctx, XLIFF_NFL_COMPOUND);
>     XLIFFW_EndAttrs(ctx);
>
>     /* recurse over the children */
>     XLIFFW_BeginBody(ctx);
>     ncur=node->first;
>     while(ncur)
>     {
>         EncodeXMLNode(ctx, ncur);
>         ncur=ncur->next;
>     }
>
>     XLIFFW_EndBody(ctx);
>     XLIFFW_EndTag(ctx);
>
>     return(0);
> }
> --
>
> and another:
> ----
>
> n=NetParse_XML_LoadFile("form0.xml");
>
> wctx=XLIFFW_OpenWrite("test0.xliff");
>
> XLIFFW_WriteHeader(wctx, "xliff:test", 3);
>
> // XLIFFW_BeginDictStrings(wctx);
> // WriteXMLDictNode(wctx, n);
> // XLIFFW_EndDictStrings(wctx);
>
> XLIFFW_BeginTag(wctx, "", "XML");
> XLIFFW_BeginAttrs(wctx);
> XLIFFW_BindNSAttr(wctx, "", "xliff:binxml", "");
> XLIFFW_NodeFlagsAttr(wctx, XLIFF_NFL_COMPOUND);
> XLIFFW_EndAttrs(wctx);
>
> XLIFFW_BeginBody(wctx);
>
> EncodeXMLNode(wctx, n);
>
> XLIFFW_EndBody(wctx);
> XLIFFW_EndTag(wctx);
>
> XLIFFW_WriteEOF(wctx);
> XLIFFW_DestroyContext(wctx);
>
> rctx=XLIFFR_OpenRead("test0.xliff");
> DumpNodes(rctx);
>
> XLIFFR_DestroyContext(rctx);
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://www.oasis-open.org/mlmanage/index.php>
>
>
>