Re: [xml-dev] Random Access XML

From: John Cowan <cowan@mercury.ccil.org>
To: rjelliffe <rjelliffe@allette.com.au>
Date: Wed, 23 Feb 2011 17:17:11 -0500

rjelliffe scripsit:

> Yes, if people are happy to keep comments and PIs after the prolog, I
> don't mind. (But I thought James' idea was to reduce the number of
> different node types in the parse tree, because multiple node types
> apparently freak programmers out?)

Well, you know the zero-one-infinite rule.  Without PIs, you need only
element nodes; with PIs, you need document nodes, element nodes, and
PI nodes.  That's triple the number of node types.

MicroLark normally reports an error if a PI appears.  If you set the PI
feature to true, its push and pull parsers will report PIs, though the
tree builder will still ignore them.  Only PIs that look like well-formed
start-tags (except for the question marks) are allowed, which covers
things like xml-stylesheet and xml-model.  XML declarations are still
disallowed.
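
To make the "looks like a start-tag" rule concrete, here is a rough sketch
in Python.  It is not MicroLark's code: the function name, the regular
expressions, and the ASCII-only name pattern are all assumptions made for
illustration, and duplicate attribute names are not checked.

```python
import re

# Simplified, ASCII-only approximation of an XML Name (an assumption;
# real XML names allow many more characters).
_NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
# One attribute: whitespace, name, '=', then a quoted value without '<'.
_ATTR = rf'\s+{_NAME}\s*=\s*(?:"[^<"]*"|\'[^<\']*\')'
# A PI whose body is nothing but start-tag-style attributes.
_PI = re.compile(rf"<\?({_NAME})((?:{_ATTR})*)\s*\?>")


def pi_looks_like_start_tag(pi: str) -> bool:
    """Return True if the PI has the shape <?name attr="value" ...?>."""
    match = _PI.fullmatch(pi)
    if match is None:
        return False
    # XML declarations stay disallowed, so reject the target "xml".
    return match.group(1).lower() != "xml"
```

With this, `<?xml-stylesheet href="a.xsl" type="text/xsl"?>` passes, while
`<?php echo 1; ?>` and `<?xml version="1.0"?>` are rejected.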

>> The only reason [MicroXML] doesn't ban > in attribute values is that
>> they are required for compatibility with Canonical XML.
>
> Oh, is that a requirement?

No, but it's convenient because it means that XML->MicroXML converters
already exist in the form of XML canonicalizers.
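
As a small illustration, Python's standard library (3.8 and later) ships
a C14N 2.0 canonicalizer, which is a later version of Canonical XML than
the one current in this thread.  The wrapper name below is made up, and
the sketch does not check that the output meets every MicroXML
constraint; it only shows the XML declaration being dropped and a literal
> appearing in an attribute value.

```python
import xml.etree.ElementTree as ET


def to_canonical(xml_text: str) -> str:
    """Return the C14N 2.0 canonical form of an XML document string."""
    # ET.canonicalize returns the result as a string when no output
    # file object is supplied.
    return ET.canonicalize(xml_text)


if __name__ == "__main__":
    source = '<?xml version="1.0"?><a b="x&gt;y">1 &gt; 0</a>'
    print(to_canonical(source))
```

Note that the canonical form writes > literally inside the attribute
value but keeps it escaped as &gt; in text content.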

> (I think using non-ascii characters for token separators won't
> get any traction, unless encodings are restricted to UTF-*.  [...])

MicroXML limits the encoding to UTF-8, with ASCII as a degenerate case.

> BTW, the idea of using paths in names to allow random access is not new
> or mine. IIRC the Dynatext readers indexed their SGML into a
> one-element-per-line format, with a long path name at the beginning of
> each line. This allowed fast contextual searches using normal
> line-oriented text matching. I think Steve DeRose had the patent on
> this, but I'd think it would be expired by now.

A good thing, since I have such a script not for indexing but for
pipelining: it produces lines of the form "path\tvalue" for every element
path in a document, where "value" is the XPath value of the element.
There's a switch to allow paths ending in "/@foo" as well.
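
For anyone who wants to try the idea, here is a minimal Python sketch of
that output format.  It is not my actual script: the function name and
the with_attrs parameter are invented for this example, paths carry no
positional predicates, and values containing tabs or newlines are not
escaped, so they would break the one-line-per-path format.

```python
import xml.etree.ElementTree as ET


def path_lines(xml_text, with_attrs=False):
    """Return "path<TAB>value" lines, one per element, in document order."""
    root = ET.fromstring(xml_text)

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        # The XPath string-value of an element is the concatenation of
        # all its descendant text nodes.
        value = "".join(elem.itertext())
        yield path + "\t" + value
        if with_attrs:
            # Hypothetical stand-in for the "/@foo" switch.
            for name, attr_value in elem.attrib.items():
                yield path + "/@" + name + "\t" + attr_value
        for child in elem:
            yield from walk(child, path)

    return list(walk(root, ""))


if __name__ == "__main__":
    for line in path_lines('<a x="1"><b>hi</b><c>yo</c></a>', with_attrs=True):
        print(line)
```

The output can then be fed straight into grep, cut, or awk, which is the
whole point of the line-per-path format.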

-- 
All Gaul is divided into three parts: the part          John Cowan
that cooks with lard and goose fat, the part            http://ccil.org/~cowan
that cooks with olive oil, and the part that            cowan@ccil.org
cooks with butter. --David Chessler
