On Fri, Apr 09, 2004 at 09:32:33AM -0400, Stephen E. Beller wrote:
> Following up on Ram's questions, if one was to cherry-pick, what kinds of
> situations would benefit from:
> 1. A document that is not written in a tagged tree structure?
> 2. A less verbose (streamlined) way to structure content, which (a) is
> easily read by humans and (b) comes in a single version that is optimized
> for rapid processing/rendering?
> 3. The ability to parse into multiple documents?
Not all documents are in fact ordered sequences of nested hierarchies.
If I select from the middle of one paragraph to the middle of another
in a word processor, and say "italics", I don't think of that as a
tree operation involving an arbitrary number of spans of "italic"
Another example: some texts have both poetic structure of verses and
lines, and also a discursive structure in which people speak, and
which we also wish to capture.
Consider the following quote from Psalm 108 [NEB]:
 God has spoken from his sanctuary:
`I will go up now and measure out Shechem;
I will divide the valley of Succoth into plots;
 Gilead and Manasseh are mine;
Ephraim is my helmet, Judah my sceptre;
 Moab is my wash-bowl, I fling my shoes at Edom;
Philistia is the target of my anger.'
<v n="7"><l>God has spoken from his sanctuary:</l>
<l><q>I will go up now and measure out Shechem;</l>
<l>I will divide the valley of Succoth into plots;</l></v>
<v n="8"><l>Gilead and Manasseh are mine;</l>
<l>Ephraim is my helmet, Judah my sceptre;</l></v>
<v n="9"><l>Moab is my wash-bowl, I fling my shoes at Edom;</l>
<l>Philistia is the target of my anger.</q></l></v>
This seems reasonable -- I can use XPath to find all q elements,
for example. But I have a q element that spans a verse, and this is
not well-formed XML.
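To make the problem concrete, here is a minimal sketch that hands the markup above to an XML parser. It uses Python's standard xml.etree.ElementTree; the psalm wrapper element, the is_well_formed helper and the LINES_ONLY variant (the same text with the q tags dropped) are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the example above. The <psalm> root element is an
# assumption, added so that the only well-formedness problem left is
# the q element overlapping the l and v elements.
OVERLAPPING = """<psalm>
<v n="7"><l>God has spoken from his sanctuary:</l>
<l><q>I will go up now and measure out Shechem;</l>
<l>I will divide the valley of Succoth into plots;</l></v>
<v n="8"><l>Gilead and Manasseh are mine;</l>
<l>Ephraim is my helmet, Judah my sceptre;</l></v>
<v n="9"><l>Moab is my wash-bowl, I fling my shoes at Edom;</l>
<l>Philistia is the target of my anger.</q></l></v>
</psalm>"""

# Hypothetical variant: the same text with the q tags removed, so that
# only the verse/line hierarchy remains.
LINES_ONLY = OVERLAPPING.replace("<q>", "").replace("</q>", "")


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(OVERLAPPING))  # the q element crosses l and v boundaries
print(is_well_formed(LINES_ONLY))   # a single hierarchy of verses and lines
```

The parser gives up at the first close tag that does not match the most recently opened element, so a q that starts in one line and ends in another never reaches XPath at all.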
The two most common compromises are to abandon one or other of the
textual forms or to consider one secondary.
Abandoning one might mean leaving off the line and verse tags, if
one was more interested in capturing speech.
Making one secondary might mean making the verse and line markers into
empty elements. An alternative (supported by the Text Encoding
Initiative) is to make a linked list, something like this (to use
the same fictional markup as before):
<v><l><q id="q1" next="q2">I will ... Shechem</q></l>
<l><q id="q2" next="q3" prev="q1">I will divide....</q></l></v>
and so on. Watch out for a weird limitation of XML:
you can only have one ID-valued attribute per element, and if you
want to be able to transform this markup so that the quotes are primary
and the lines are empty markers or secondary spans, you'll likely
want a second id-valued marker at the start of each text stream and
on each verse and line that's not manipulated by the splitting into
I throw my shoes at rigid limitations of hierarchy. Or I would,
if I had any shoes :-)
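
For the linked-list encoding, a consumer has to stitch the quote back together by following the next attributes. The following is only a sketch under stated assumptions: collect_chain is a hypothetical helper, the fragment is written with a matching q close tag and prev="q1" (my reading of what was intended), and q3 lies outside the fragment, so the walk simply stops there.

```python
import xml.etree.ElementTree as ET

# The two-line fragment in the same fictional markup. The elided text
# ("..." and "....") is kept as elided; q3 is not part of the fragment.
FRAGMENT = """<v><l><q id="q1" next="q2">I will ... Shechem</q></l>
<l><q id="q2" next="q3" prev="q1">I will divide....</q></l></v>"""


def collect_chain(root, start_id):
    """Follow next pointers from start_id and return the text of each q."""
    # Index only the q elements that actually carry an id.
    by_id = {q.get("id"): q for q in root.iter("q") if q.get("id")}
    parts = []
    seen = set()
    current = by_id.get(start_id)
    # Stop on a dangling pointer or on a cycle in the chain.
    while current is not None and current.get("id") not in seen:
        seen.add(current.get("id"))
        parts.append(current.text or "")
        current = by_id.get(current.get("next"))
    return parts


print(collect_chain(ET.fromstring(FRAGMENT), "q1"))
```

The verse and line hierarchy stays well-formed, and the quotation survives as a chain of fragments that a later transformation can reassemble.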
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/