Re: [xml-dev] Random Access XML


From: rjelliffe <rjelliffe@allette.com.au>
To: <xml-dev@lists.xml.org>
Date: Sun, 20 Feb 2011 12:31:33 +1100

 On Sat, 19 Feb 2011 15:36:46 -0500, John Cowan <cowan@mercury.ccil.org> 
 wrote:
> rjelliffe scripsit:
>
>> 1) For a start, we need to be able to know whether "<" "</" and ">" are
>> tag delimiters without knowing context. So we must ban direct use of "<"
>> and ">" in attributes and also get rid of CDATA sections. We should get
>> rid of comments and PIs too, for the same reasons. (Actually, we only
>> need to ban comments and PIs from after the first start tag. For other
>> reasons, we might like to treat the first start-tag and before it
>> specially.)
>
> Of course, random < is already banned everywhere, so if you ban > in
> character content as well as attribute values, you get full reversibility:
> each of <, </, <?, <!--, >, />, and --> is guaranteed to be the open or
> close delimiter of a markup construct.

 Yes, if people are happy to keep comments and PIs after the prolog, I
 don't mind. (But I thought James' idea was to reduce the number of
 different node types in the parse tree, because multiple node types
 apparently freak programmers out?)
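
 A minimal sketch of what that reversibility buys, assuming a profile
 with no CDATA sections and no raw "<" or ">" in character content or
 attribute values. From an arbitrary offset, looking back for the nearest
 "<" and ">" is enough to tell whether we landed inside a tag or inside
 text. The function name locate() is made up for illustration, and it
 works on a decoded string for simplicity.

```python
def locate(doc: str, offset: int):
    """Classify an arbitrary offset as falling inside a tag or inside text.

    Assumes every "<" opens a markup construct and every ">" closes one,
    which only holds under the restricted profile discussed in this thread.
    """
    lt = doc.rfind("<", 0, offset + 1)   # nearest "<" at or before offset
    gt = doc.rfind(">", 0, offset)       # nearest ">" strictly before offset
    if lt > gt:
        # The last delimiter seen was an opener, so we are inside markup.
        end = doc.index(">", offset)
        return ("tag", doc[lt:end + 1])
    # Otherwise we are in character content between two tags.
    nxt = doc.find("<", offset)
    return ("text", doc[gt + 1: nxt if nxt != -1 else len(doc)])


doc = '<a><b x="1">hello</b></a>'
print(locate(doc, 7))    # an offset inside the <b ...> start tag
print(locate(doc, 14))   # an offset inside the character content
```

 The same scan should work on raw UTF-8 bytes, since "<" and ">" never
 occur inside a multi-byte sequence.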

> MicroXML already bans > in character content so that it doesn't have to
> special-case ]]>, as required for full XML compatibility.  The only reason
> it doesn't ban > in attribute values is that they are required for
> compatibility with Canonical XML.

 Oh, is that a requirement?

>> 3) The generic identifier would have to be more like an XPath.
>
> This could be achieved by convention, using a legal but rarely
> employed delimiter like U+00B7 MIDDLE DOT, or any of the vast number of
> delimiters allowed by XML 1.0 Fifth Edition.

 Yes, let's make the 5th edition useful! :-)  Using special characters ad
 hoc in names may be bad, but using them as systematic delimiters could
 be good.  (I think using non-ASCII characters for token separators won't
 get any traction unless encodings are restricted to UTF-*. Or allow a
 built-in entity reference for the delimiter chosen.)

 For the sake of argument, say we use ‣ [triangle], e.g.
 <book‣section‣personalName>, which is like a breadcrumb-bar notation.  A
 SAX processor for Random Access XML would plug in after a normal SAX
 parser and replace element names like 'book‣section‣personalName' or
 'section‣personalName' with 'personalName'. (I.e. report back just the
 element name--the last item. If sections only appear in books, then the
 start tags <book‣section‣personalName> and <section‣personalName> should
 yield the same infoset.)
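
 A minimal sketch of such a filter, using Python's xml.sax and
 XMLFilterBase. The class and function names (PathNameFilter,
 NameCollector, local_names) are invented for illustration. The sketch
 uses U+00B7 MIDDLE DOT as the separator rather than ‣, on the assumption
 that a stock parser has to accept the names: as far as I can tell ‣ is
 not a name character in XML 1.0, whereas MIDDLE DOT is.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase

SEP = "\u00b7"  # MIDDLE DOT; an assumed stand-in for the path separator


class PathNameFilter(XMLFilterBase):
    """Report only the last path step of each element name downstream."""

    def startElement(self, name, attrs):
        super().startElement(name.rsplit(SEP, 1)[-1], attrs)

    def endElement(self, name):
        super().endElement(name.rsplit(SEP, 1)[-1])


class NameCollector(xml.sax.ContentHandler):
    """Downstream handler that just records the names it is told about."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def local_names(xml_text):
    """Parse xml_text through the filter and return the names reported."""
    collector = NameCollector()
    filt = PathNameFilter(xml.sax.make_parser())
    filt.setContentHandler(collector)
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return collector.names


doc = ("<book><book·section><book·section·personalName>Rick"
       "</book·section·personalName></book·section></book>")
print(local_names(doc))
```

 Full-path and partial-path names come out the same downstream, which is
 the point: the application never sees the path prefix.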

 If we wanted to reduce name lengths, we could allow simple wildcards or
 an ellipsis too, e.g. <b…‣s…‣personalName>

 Cheers
 Rick Jelliffe

 BTW, the idea of using paths in names to allow random access is not new
 or mine. IIRC the DynaText readers indexed their SGML into a
 one-element-per-line format, with a long path name at the beginning of
 each line. This allowed fast contextual searches using normal
 line-oriented text matching. I think Steve DeRose had the patent on
 this, but I'd think it would be expired by now.
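
 To illustrate the general idea only (this is not DynaText's actual
 format, which I am not reproducing here), a sketch that flattens a
 document into one line per element, each prefixed by its full path, so
 that a contextual search becomes a plain regular-expression match over
 lines. The names path_lines() and grep(), the "/" separator and the tab
 between path and text are all assumptions.

```python
import re
import xml.etree.ElementTree as ET


def path_lines(xml_text, sep="/"):
    """Return one 'path<TAB>text' line per element, in document order.

    Only the text directly following each start tag is kept; tail text
    after child elements is ignored to keep the sketch short.
    """
    lines = []

    def walk(elem, prefix):
        path = prefix + sep + elem.tag if prefix else elem.tag
        text = " ".join((elem.text or "").split())  # collapse whitespace
        lines.append(f"{path}\t{text}")
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return lines


def grep(lines, pattern):
    """Line-oriented contextual search: keep lines matching the pattern."""
    return [line for line in lines if re.search(pattern, line)]


doc = ("<book><section><personalName>Rick</personalName>"
       "<title>Intro</title></section></book>")
for line in grep(path_lines(doc), r"section/personalName\t"):
    print(line)
```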

Follow-Ups:
- Re: [xml-dev] Random Access XML
  - From: John Cowan <cowan@mercury.ccil.org>
- Re: [xml-dev] Random Access XML
  - From: Dave Pawson <davep@dpawson.co.uk>

References:
- Random Access XML
  - From: rjelliffe <rjelliffe@allette.com.au>
- Re: [xml-dev] Random Access XML
  - From: John Cowan <cowan@mercury.ccil.org>
