Re: [xml-dev] Random Access XML
- From: rjelliffe <rjelliffe@allette.com.au>
- To: <xml-dev@lists.xml.org>
- Date: Sun, 20 Feb 2011 12:31:33 +1100
On Sat, 19 Feb 2011 15:36:46 -0500, John Cowan <cowan@mercury.ccil.org>
wrote:
> rjelliffe scripsit:
>
>> 1) For a start, we need to be able to know whether "<" "</" and ">" are
>> tag delimiters without knowing context. So we must ban direct use of "<"
>> and ">" in attributes and also get rid of CDATA sections. We should get
>> rid of comments and PIs too, for the same reasons. (Actually, we only
>> need to ban comments and PIs from after the first start tag. For other
>> reasons, we might like to treat the first start-tag and before it
>> specially.)
>
> Of course, random < is already banned everywhere, so if you ban > in
> character content as well as attribute values, you get full reversibility:
> each of <, </, <?, <!--, >, />, and --> is guaranteed to be the open or
> close delimiter of a markup construct.
Yes, if people are happy to keep comments and PIs after the prolog, I
don't mind. (But I thought James' idea was to reduce the number of
different node types in the parse tree, because multiple node types
apparently freak programmers out?)
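To make the reversibility point concrete, here is a minimal Python sketch of what it buys you. It assumes a document in which "<" and ">" never occur unescaped in character content or attribute values, and in which there are no CDATA sections (comments and PIs are ignored here for simplicity). The function names `inside_tag` and `next_tag` are made up for illustration. Under those assumptions, a reader dropped at an arbitrary offset can tell whether it is inside a tag and find the next tag using plain string searches, with no parser state.

```python
def inside_tag(text, offset):
    """Return True if text[offset] lies within a tag (delimiters included).

    Valid only under the stated assumption that every '<' and '>' in the
    document is a markup delimiter.
    """
    last_open = text.rfind("<", 0, offset + 1)   # nearest '<' at or before offset
    last_close = text.rfind(">", 0, offset)      # nearest '>' strictly before offset
    return last_open > last_close


def next_tag(text, offset):
    """Return (start, end) of the first tag beginning at or after offset,
    or None if there is no complete tag left."""
    start = text.find("<", offset)
    if start == -1:
        return None
    end = text.find(">", start)
    if end == -1:
        return None
    return start, end + 1


if __name__ == "__main__":
    doc = '<a><b x="1">hi</b></a>'
    span = next_tag(doc, 12)          # start searching from inside the text "hi"
    print(inside_tag(doc, 5), doc[span[0]:span[1]])
```

In full XML the same scan would be unsafe, because a ">" inside an attribute value or a "<" inside a CDATA section would be mistaken for a delimiter.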
> MicroXML already bans > in character content so that it doesn't have to
> special-case ]]>, as required for full XML compatibility. The only reason
> it doesn't ban > in attribute values is that they are required for
> compatibility with Canonical XML.
Oh, is that a requirement?
>> 3) The generic identifier would have to be more like an XPath.
>
> This could be achieved by convention, using a legal but rarely
> employed delimiter like U+00B7 MIDDLE DOT, or any of the vast number of
> delimiters allowed by XML 1.0 Fifth Edition.
Yes, let's make the 5th edition useful! :-) Using special characters ad
hoc in names may be bad, but using them for systematic delimiters could
be good. (I think using non-ASCII characters for token separators won't
get any traction, unless encodings are restricted to UTF-*. Or allow a
built-in entity reference for the delimiter chosen.)
For the sake of argument, say we use ‣ [triangle], e.g.
<book‣section‣personalName>, which is like a breadcrumb-bar notation. A
SAX processor for Random Access XML would plug in after a normal SAX parser
and replace element names like 'book‣section‣personalName' or
'section‣personalName' with 'personalName'. (I.e. report back just the
element name, the last item in the path. If sections only appear in books,
then the start tags <book‣section‣personalName> and
<section‣personalName> should not alter the infoset.)
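A rough sketch of such a filter, using Python's standard `xml.sax` module. The class and function names (`PathNameFilter`, `local_name`, `NameCollector`, `element_names`) are invented for illustration. The sketch uses U+00B7 MIDDLE DOT, the delimiter suggested in the quoted message, rather than the triangle: as far as I can tell U+2023 does not fall in the NameChar ranges of XML 1.0, so a conforming parser would reject it before the filter ever saw the name. It also assumes namespace processing is off (the `xml.sax` default), so only `startElement` and `endElement` need handling.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase

DELIM = "\u00b7"  # MIDDLE DOT, assumed here as the path delimiter


def local_name(path_name, delim=DELIM):
    """Return the last step of a path-style element name."""
    return path_name.rsplit(delim, 1)[-1]


class PathNameFilter(XMLFilterBase):
    """Sits after an ordinary SAX parser and reports only the last path
    step as the element name, leaving attributes and content untouched."""

    def startElement(self, name, attrs):
        super().startElement(local_name(name), attrs)

    def endElement(self, name):
        super().endElement(local_name(name))


class NameCollector(xml.sax.ContentHandler):
    """Toy downstream handler: records the element names it is given."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(xml_bytes):
    """Parse xml_bytes through the filter and return the names seen."""
    collector = NameCollector()
    reader = PathNameFilter(xml.sax.make_parser())
    reader.setContentHandler(collector)
    reader.parse(io.BytesIO(xml_bytes))
    return collector.names


if __name__ == "__main__":
    doc = (
        "<book><book\u00b7section>"
        "<book\u00b7section\u00b7personalName>Rick"
        "</book\u00b7section\u00b7personalName>"
        "</book\u00b7section></book>"
    )
    print(element_names(doc.encode("utf-8")))
```

Downstream code sees the same element names whether the tags carry the full path, a partial path, or no path at all, which is the sense in which the path prefix does not alter the infoset.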
If we wanted to reduce name lengths, we could allow simple wildcards or
ellipses too, e.g. <b…‣s…‣personalName>
Cheers
Rick Jelliffe
BTW, the idea of using paths in names to allow random access is not new
or mine. IIRC the DynaText readers indexed their SGML into a
one-element-per-line format, with a long path name at the beginning of each
line. This allowed fast contextual searches using normal line-oriented text
matching. I think Steve DeRose had the patent on this, but I'd think it
would have expired by now.