xml-dev - RE: [xml-dev] hello, new to list, thoughts

To: cr88192@hotmail.com
Subject: RE: [xml-dev] hello, new to list, thoughts
From: Richard Bruin <rbru03@esc.cam.ac.uk>
Date: 17 Nov 2004 17:29:36 +0000
Cc: xml-dev@lists.xml.org
In-reply-to: <200411171648.iAHGmQx29167@server1.primalspace.net>
Organization: Department of Earth Sciences
References: <200411171648.iAHGmQx29167@server1.primalspace.net>
Hi all,

Have you looked into BinX regarding binary XML formats? As far as I can
tell (and from only a brief knowledge of BinX) your idea is fairly
similar in purpose (maybe in implementation too, but I don't know enough
to say).

Just thought I'd mention it in case it can help to prevent needless
duplication of work,

Cheers

Rich

--------------------------------
Richard Bruin
PhD Student
Department of Earth Sciences
University of Cambridge

On Wed, 2004-11-17 at 16:44, David Lieberman AWDSF wrote:
> Neat!
> 
> Good ideas, all. (if you ask me) 
> 
> It'll be interesting to see when you're finished.
> 
> David Lieberman
> http://www.awdsf.com
> 
> 
> -----Original Message-----
> From: cr88192 [mailto:cr88192@hotmail.com] 
> Sent: Wednesday, November 17, 2004 6:55 AM
> To: xml-dev@lists.xml.org
> Subject: [xml-dev] hello, new to list, thoughts
> 
> ok, I was led here because, as far as I can tell, the group
> gmane.text.xml.devel is a mirror of this list (given that a reference to
> this list is appended to the messages there).
> I apologize if this is not the case.
> pardon if I am being a troll, I am new here.
> 
> I am mostly requesting general comments, eg, on things that could be 
> improved in my idea. I am not expecting anyone really to take me seriously.
> 
> 
> ok. so I have been recently doing something roughly along the lines of a 
> binary xml (not exactly, but I have designed it such that basic subset of 
> xml maps fairly well, and doesn't give up any real info in the conversion). 
> namespace definitions, however, are similar but not exactly the same, so 
> this would be left to the conversion tool.
> 
> I tried to retain what I felt to be the general spirit and semantics of xml,
> as imo they seem quite decent for data (though, difficult to fully recognize
> as such or reason about). imo, they are better than a plain tree. I had
> considered stripping down the semantics at some points, but some things
> seemed interdependent (one needs rules and complexity in some places to
> grant freedoms in others, and one needs to figure out where to be strict and
> where to be lax...).
> 
> why?
> personally, I don't feel that within xml's core domains (network 
> communications, messages, "documents", ...) a binary variant would be 
> particularly helpful, however:
> my interests lie more in data storage (for what xml works well for, I use
> that);
> formats like riff, ebml, ... leave a lot to be desired imo wrt semantics;
> textual xml is not that great for data storage imo, eg, one can't skip over
> data or jump around in the same way one can with, say, riff, and imo base64
> coding would not be very good for cases when size is a priority, or if much
> of the data is large binary chunks.
> 
> in a kind of ideological frenzy, I beat something together, and have spent a
> while refining it.
> 
> I also have a basic implementation (not yet online, I have been a bit behind
> on this kind of thing recently).
> not much coding has been done recently (largely I am stuck fiddling
> with the details of the design, among other things not related to this).
> 
> things like file size or processor overhead are not high priorities here, I 
> just try to save space when possible, and avoid wasting too much processor 
> time. my results thus far have shown the generated files to be slightly 
> smaller than the input xml (presently lacking any kind of content string 
> compression, though tag compression is done). this could be viewed as a good
> sign I guess.
> 
> 
> 
> here is a recent draft of the spec:
> ----
> XLIFF (0.1.1):
> Partially fueled by an argument about EBML, I came up with this.
> It has little to do with LIFF, but, hell, I am not really using LIFF (it 
> ended up too close to RIFF and too generally ugly...).
> This format shows itself to be kind of a pain to code up, but this is not 
> entirely unexpected.
> 
> Cleaning up spec some from 10-29 version, minor alterations.
> Stripping out container stuff, as it clutters the spec and doesn't really 
> make sense in this context anyways (keeping the old version as the idea may 
> be adaptable to a different format).
> 
> Goals:
> A binary format with similar flexibility to XML (X);
> Support for large files and datasets (L);
> Sort of like RIFF and IFF (IFF).
> 
> This will be a TLV format with attributes and namespaces similar to XML.
> It will use a tag dictionary to help reduce the total file size.
> It should be acceptable for random access and "big chunks of data" style 
> uses (like RIFF, IFF, and EBML).
> 
> 
> Numbers:
> The MSB (bit 7) serves to indicate the presence of following (higher order)
> bytes.
> 
> 0xxxxxxx, -64..63
> 1xxxxxxx 0xxxxxxx, -8192..8191
> 1xxxxxxx 1xxxxxxx 0xxxxxxx, -1048576..1048575
> ..
> 
> Values are in Low-High order and with 2s complement encoding.
> As a result, the sign is implicitly contained in bit 6 of the last byte.
> 
> The maximum value of a number depends on the implementation, and an 
> implementation is allowed to refuse overly large numbers.
> However, I will spec that the limit should be at least 32 bits (a 35 bit 
> number with the upper 4 bits either all 0 or 1).
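
A minimal sketch of the number encoding described above, assuming it matches what is usually called signed LEB128 (7 data bits per byte, low-order group first, bit 7 as the continuation flag, sign taken from bit 6 of the last byte). The function names and the fixed 32-bit value type are assumptions made for illustration; they are not taken from the draft or the test app.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a 32-bit value as an XLIFF-style Number.
 * Writes at most 5 bytes to out and returns the number of bytes written. */
size_t xliff_encode_number(int32_t value, uint8_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (;;) {
        int32_t low = value & 0x7F;      /* low 7 bits, 0..127 */
        value = (value - low) / 128;     /* exact division; avoids shifting a negative value */
        /* Stop once the remaining bits are pure sign extension of bit 6. */
        if ((value == 0 && !(low & 0x40)) || (value == -1 && (low & 0x40))) {
            out[n++] = (uint8_t)low;
            return n;
        }
        out[n++] = (uint8_t)(low | 0x80); /* more bytes follow */
    }
}

/* Hypothetical helper: decode a Number of up to 5 bytes.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or longer than 5 bytes. It does not check that the unused upper bits
 * of a 5-byte number are a valid sign extension. */
size_t xliff_decode_number(const uint8_t *in, size_t len, int32_t *value)
{
    uint32_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;

    assert(in != NULL && value != NULL);
    do {
        if (n >= len || shift >= 35)
            return 0;
        byte = in[n++];
        result |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);

    if (shift < 32 && (byte & 0x40))
        result |= ~(uint32_t)0 << shift; /* sign-extend from bit 6 of the last byte */

    /* Converting back to int32_t assumes the usual two's complement wraparound. */
    *value = (int32_t)result;
    return n;
}
```

With this sketch, 63 and -64 each fit in one byte, while 64 and -8192 need two, which agrees with the ranges listed in the draft.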
> 
> 
> Strings:
> {
> Number len; //length if >0, dictionary index if <0, empty string if 0
> if(len>0)byte str[len];
> }
> 
> Strings may be indexed in a dictionary. The exact semantics for dictionaries
> will depend on context.
> 
> Node:
> {
> String ns;
> String tag;
> Number alen;
> if(alen>0)byte attr[alen];
> Number dlen;
> if(dlen>0)byte data[dlen]; //the contents depend on the ns and tag
> }
> 
> Attr:
> {
> String ns;
> String tag;
> Number dlen;
> if(dlen>0)byte data[dlen];
> }
> 
> In tags and attributes, negative chunk lengths are reserved.
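
Because every variable-size field in a Node carries its own length, a reader can step over a whole node without interpreting it, which is what the random-access goal relies on. A rough sketch, assuming the Number encoding described under "Numbers" and a buffer that holds the node in memory; xliff_node_size and its helpers are hypothetical names, not functions from the test app.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one Number (7 bits per byte, low group first, bit 7 = "more").
 * Returns bytes consumed, or 0 on truncated or over-long input. */
static size_t read_number(const uint8_t *p, size_t len, int32_t *out)
{
    uint32_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;

    do {
        if (n >= len || shift >= 35)
            return 0;
        byte = p[n++];
        result |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    if (shift < 32 && (byte & 0x40))
        result |= ~(uint32_t)0 << shift; /* sign-extend from bit 6 */
    *out = (int32_t)result;
    return n;
}

/* Skip one length-prefixed field. A String may carry a negative length
 * (a dictionary index, so no inline bytes follow); for attr and data
 * chunks a negative length is reserved and treated as an error here. */
static size_t skip_field(const uint8_t *p, size_t len, int allow_index)
{
    int32_t flen;
    size_t n = read_number(p, len, &flen);

    if (n == 0)
        return 0;
    if (flen < 0)
        return allow_index ? n : 0;
    if ((size_t)flen > len - n)
        return 0;                        /* field runs past the buffer */
    return n + (size_t)flen;
}

/* Total encoded size of the Node starting at p, or 0 on malformed input. */
size_t xliff_node_size(const uint8_t *p, size_t len)
{
    /* Field order from the draft: ns, tag (Strings), attr, data (chunks). */
    static const int allow_index[4] = { 1, 1, 0, 0 };
    size_t pos = 0;
    int i;

    assert(p != NULL);
    for (i = 0; i < 4; i++) {
        size_t n = skip_field(p + pos, len - pos, allow_index[i]);
        if (n == 0)
            return 0;
        pos += n;
    }
    return pos;
}
```

A reader scanning siblings would add the returned size to its current offset to reach the next node, never touching the attribute or data bytes in between.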
> 
> Implicit XLIFF attributes are allowed in nodes without restrictions as they 
> are not generally viewed as part of the content.
> Duplicate attributes are not allowed.
> Attributes may not contain either tags or other attributes (other 
> non-nestable structures will generally be allowed). Attributes should be 
> kept small in both size and number.
> 
> 
> 
> Tag Dictionary:
> There will be a dictionary responsible for namespaces, tags, and attribute 
> names:
> This dictionary will behave similarly to a stack;
> Any new strings (those encoded directly and not already present) are added
> to the end of the current dictionary level;
> On descent into a node, the dictionary is retained from the parent, creating
> a new level;
> Any new strings are added to the end of the current dictionary level;
> On exit from a node, any strings added in that level are removed (making it
> as if the descent had not occurred).
> 
> The use of a dictionary allows denser packing (due to, eg, tags being 1 or 2
> bytes).
> Fairly dense packing might require building the entire dictionary upfront, 
> but an encoder can have less dense packing, eg, by just encoding strings 
> directly.
> The need for upfront dictionaries for packing to work well is related to 
> ideas for allowing faster processing by not having to descend into subnodes 
> to build an up-to-date dictionary, and also to allow random access in some 
> cases.
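
The stack-like behaviour described above could be modelled roughly as follows. This is only a sketch: the fixed capacities, the TagDict type, and the tagdict_* names are invented for illustration, and the mapping from negative string lengths to dictionary slots is left out because the draft does not pin it down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAGDICT_MAX_STRINGS 256   /* arbitrary capacity for the sketch */
#define TAGDICT_MAX_LEVELS   64

typedef struct {
    const char *strings[TAGDICT_MAX_STRINGS]; /* borrowed pointers, not copies */
    size_t count;                             /* strings currently visible */
    size_t level_start[TAGDICT_MAX_LEVELS];   /* count saved at each descent */
    size_t depth;
} TagDict;

void tagdict_init(TagDict *d)
{
    d->count = 0;
    d->depth = 0;
}

/* Index of s, or -1 if it is not visible at the current level. */
int tagdict_find(const TagDict *d, const char *s)
{
    size_t i;

    for (i = 0; i < d->count; i++)
        if (strcmp(d->strings[i], s) == 0)
            return (int)i;
    return -1;
}

/* Add s unless it is already present; returns its index, or -1 when full. */
int tagdict_add(TagDict *d, const char *s)
{
    int idx = tagdict_find(d, s);

    if (idx >= 0)
        return idx;
    if (d->count == TAGDICT_MAX_STRINGS)
        return -1;
    d->strings[d->count] = s;
    return (int)d->count++;
}

/* Descend into a node: the parent's strings stay visible. */
void tagdict_enter(TagDict *d)
{
    assert(d->depth < TAGDICT_MAX_LEVELS);
    d->level_start[d->depth++] = d->count;
}

/* Leave a node: drop every string added since the matching enter. */
void tagdict_leave(TagDict *d)
{
    assert(d->depth > 0);
    d->count = d->level_start[--d->depth];
}
```

Restoring a saved count on exit makes leaving a node constant-time, and it gives the "as if the descent had not occurred" behaviour without tracking individual strings.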
> 
> Body Dictionaries:
> Each namespace will also have a "body dictionary", which may be used for
> compressing content strings in a content specific manner. The maintenance
> of these dictionaries is largely left to the format in question (however,
> they will implicitly pop off anything added within a node).
> 
> The format is a tree, with the default toplevel tag flagging the format.
> 
> Special tags could exist for maintenance purposes (eg: adding a basic set of
> common strings to the dictionary, ...).
> Like XML, some special tags may exist prior to the root to define things 
> (the base dictionary, ...).
> 
> 
> Namespaces:
> The empty string namespace is the "default" namespace.
> 
> Except builtin XLIFF namespaces (default, XLIFF, ...) namespaces are to be 
> declared prior to use.
> Formats are given control over how namespaces are used/defined.
> 
> Namespaces refer to two URIs:
> the Type URI, which defines the physical type of the container (eg: XML).
> the Namespace URI, which defines the semantic type of the container (eg: an 
> XML Namespace).
> 
> Any namespaces beginning with "XLIFF" are reserved for use by XLIFF.
> Further:
> XLIFF.*: basic XLIFF namespaces, failure to understand tags/attributes 
> should cause failure;
> XLIFF.S.*: semantic XLIFF namespaces, these are allowed to be ignored, but 
> should be preserved;
> XLIFF.O.*: optional XLIFF namespaces, these may be ignored or stripped off
> without affecting content.
> 
> There will be a basic "XLIFF" namespace, and being unable to parse tags in
> this namespace will be viewed as an error (this namespace will handle things
> which may change the format of subsequent data, affect dictionaries, ...).
> XLIFF attributes are required to be understood before attempting to parse
> the contents of a node.
> 
> "XLIFF.S" will be used for semantic XLIFF tags, failure to understand them 
> will not compromise decoding of the format.
> 
> "XLIFF.O" will be used for optional XLIFF tags, failure to understand or 
> removal of them will not compromise decoding of the format.
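
One way a reader might act on these rules is to classify a namespace before deciding whether an unknown tag is fatal. The sketch below is an illustration under assumptions: the enum, the function names, and the choice to treat every other reserved "XLIFF..." name (including "XLIFF.NS") as must-understand are not taken from the draft.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    XLIFF_NS_USER,      /* not reserved by XLIFF */
    XLIFF_NS_BASIC,     /* XLIFF, XLIFF.*: must be understood */
    XLIFF_NS_SEMANTIC,  /* XLIFF.S, XLIFF.S.*: may be ignored, must be preserved */
    XLIFF_NS_OPTIONAL   /* XLIFF.O, XLIFF.O.*: may be ignored or stripped */
} XliffNsClass;

/* True if ns equals prefix or continues it with ".something". */
static int ns_is_or_under(const char *ns, const char *prefix)
{
    size_t n = strlen(prefix);

    return strncmp(ns, prefix, n) == 0 && (ns[n] == '\0' || ns[n] == '.');
}

/* Hypothetical classifier for a namespace name. */
XliffNsClass xliff_classify_ns(const char *ns)
{
    assert(ns != NULL);
    if (strncmp(ns, "XLIFF", 5) != 0)
        return XLIFF_NS_USER;
    if (ns_is_or_under(ns, "XLIFF.S"))
        return XLIFF_NS_SEMANTIC;
    if (ns_is_or_under(ns, "XLIFF.O"))
        return XLIFF_NS_OPTIONAL;
    return XLIFF_NS_BASIC;   /* anything else beginning with "XLIFF" */
}
```

A decoder could then abort on an unknown tag only when the result is XLIFF_NS_BASIC, and copy unknown XLIFF_NS_SEMANTIC nodes through untouched.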
> 
> "XLIFF.NS" could be a namespace for namespace declarations (like in XML).
> eg, "XLIFF.NS:foo" as an attribute could declare a foo namespace.
> the content of these tags could be an array of pairs of strings, eg:
> "TypeURI", "xliff:foo_container", "NSURI", "xliff:bar_ns".
> 
> 
> XLIFF:Header
> A tag required at the start of an XLIFF file serving to mark it as a valid
> XLIFF file, and to give general info about the file.
> 
> XLIFF:TypeName Header Attribute, gives a general "file type name" used for
> identifying the type (apart from examining the contents or namespaces). It
> is encoded as a raw string.
> 
> XLIFF:HeaderFlags Header Attribute
> Contains a number marking various flags for a file. Unknown flags may be 
> ignored.
> 1&=dictionary is static within the file.
> 
> Other attributes may be found in the header besides those related to XLIFF. 
> An example would be custom or format specific tags.
> 
> 
> XLIFF:DictStrings Tag
> Defines a glob of strings to be added to the tag dictionary.
> This may be used for reducing the number of occurrences of some common tag
> which may only occur in sublevels or such.
> 
> XLIFF:NodeFlags Attribute
> Contains a number marking various flags for a node. Unknown flags may be 
> ignored.
> 1&=this node is compound;
> 2&=dictionary is static within this node.
> 
> XLIFF.O:JUNK Tag/Attribute
> Marks a space as being "junk", thus allowing leaving some space for new 
> tags, attributes, or padding.
> 
> 
> 
> XML in XLIFF
> 
> There are 2 ways to do XML in XLIFF:
>  A unified document (all the content, or at least the toplevel, is XML);
>  A mixed document (the toplevel is not necessarily XML).
> 
> In the unified document case the toplevel tag is 'XML' (with at least the 
> default namespace declared as being XML), which may contain any xml header 
> tags and the xml root.
> Namespace declarations are to be converted to XLIFF style.
> 
> In the mixed document case, at least the basic xml namespaces are to be 
> declared in the file toplevel (along probably with others).
> The nstype for XML is 'xliff:binxml'.
> 
> String Globs
> All textual data will be represented by a number of strings stuck end to 
> end.
> These will use a "textglob dictionary", which will follow the same rules as 
> that for the tag dictionary.
> 
> Attribute data is defined as a glob of strings.
> 
> An empty tag value flags a glob of textual data. The body for this is a 
> string glob.
> 
> --
> 
> 
> here is a small fragment from the test app:
> ----
> int EncodeXMLNode(XLIFFW_Context *ctx, NetParse_Node *node)
> {
>  NetParse_Attr *acur;
>  NetParse_Node *ncur;
> 
>  if(node->text)
>  {
>   XLIFFW_BeginTag(ctx, "", "");
>   XLIFFW_BeginAttrs(ctx);
>   XLIFFW_EndAttrs(ctx);
> 
>   XLIFFW_BeginBody(ctx);
>   XLIFFW_WriteString(ctx, node->text);
>   XLIFFW_EndBody(ctx);
>   XLIFFW_EndTag(ctx);
> 
>   return(0);
>  }
> 
>  XLIFFW_BeginTag(ctx, node->ns, node->key);
>  XLIFFW_BeginAttrs(ctx);
> 
>  acur=node->attr;
>  while(acur)
>  {
>   XLIFFW_BeginAttr(ctx, acur->ns, acur->key);
>   XLIFFW_WriteString(ctx, acur->value);
>   XLIFFW_EndAttr(ctx);
>   acur=acur->next;
>  }
> 
>  if(node->first)XLIFFW_NodeFlagsAttr(ctx, XLIFF_NFL_COMPOUND);
>  XLIFFW_EndAttrs(ctx);
> 
>  XLIFFW_BeginBody(ctx);
>  ncur=node->first;
>  while(ncur)
>  {
>   EncodeXMLNode(ctx, ncur);
>   ncur=ncur->next;
>  }
> 
>  XLIFFW_EndBody(ctx);
>  XLIFFW_EndTag(ctx);
> 
>  return(0);
> }
> --
> 
> and another:
> ----
> 
>  n=NetParse_XML_LoadFile("form0.xml");
> 
>  wctx=XLIFFW_OpenWrite("test0.xliff");
> 
>  XLIFFW_WriteHeader(wctx, "xliff:test", 3);
> 
> // XLIFFW_BeginDictStrings(wctx);
> // WriteXMLDictNode(wctx, n);
> // XLIFFW_EndDictStrings(wctx);
> 
>  XLIFFW_BeginTag(wctx, "", "XML");
>  XLIFFW_BeginAttrs(wctx);
>  XLIFFW_BindNSAttr(wctx, "", "xliff:binxml", "");
>  XLIFFW_NodeFlagsAttr(wctx, XLIFF_NFL_COMPOUND);
>  XLIFFW_EndAttrs(wctx);
> 
>  XLIFFW_BeginBody(wctx);
> 
>  EncodeXMLNode(wctx, n);
> 
>  XLIFFW_EndBody(wctx);
>  XLIFFW_EndTag(wctx);
> 
>  XLIFFW_WriteEOF(wctx);
>  XLIFFW_DestroyContext(wctx);
> 
>  rctx=XLIFFR_OpenRead("test0.xliff");
>  DumpNodes(rctx);
> 
>  XLIFFR_DestroyContext(rctx);
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://www.oasis-open.org/mlmanage/index.php>
> 
> 
> 