xml-dev - Re: Whitespace

From: Peter@ursus.demon.co.uk (Peter Murray-Rust)
To: xml-dev@ic.ac.uk
Date: Tue, 26 Aug 1997 17:11:17 GMT
There is clearly a wide spectrum of opinion on this - and everyone is being very
helpful and patient.  I think I see where (at least some of) the differences 
lie and hope this is helpful:

In message <v03007803b028a0302c58@[205.181.197.114]> dgd@cs.bu.edu (David G. Durand) writes:
> 
> I'm afraid that I must ask what these are to be used for. I used to think
> that this was a problem, and now I don't see how we really need these
> declarations. They only seem to be relevant for typesetting, and if

I think this highlights that what we are doing is going through a learning
process and David (and others) have already been through this :-). It took
several months for XML-WG to arrive at the present position (there were 
intermediate drafts which included munging of various sorts). [It reminds me of 
a story of a very famous physicist (I forget who) who, when asked to justify 
an equation in a lecture, stated it was trivial, then looked at it in silence 
for 15 mins, and then reiterated 'Yes, it is trivial'.] 

The problem we have is not a technical one, but a variety of human perceptions
and preconceptions.

We agree that:
	1. this is NOT a parser concern, and all whitespace is passed to the
		application.
	2. it is always *possible* to create an XML document in which no 
		non-significant whitespace appears.
	3. the XML-WG, in its wisdom, has found it useful to allow authors
		to pass the attribute XML-SPACE="DEFAULT" to the application.

I believe that (2) is David's position, which is logical and consistent. If
(2) is universally applied then I can see no value in (3), since (3) suggests
that there is value in passing non-significant whitespace to the application
and processing it in some application-dependent way. If we are processing
whitespace by stylesheet, then isn't DEFAULT irrelevant? My problem is
probably mainly that, after *much* debate, (3) has been included in the spec
and I don't see what it is for.

[David suggests that one reason to add whitespace is that it should appear in
the final typeset version - this makes it significant (though I suspect that
some people would prefer to pass explicit markup).  Personally I do not wish
to do this.]

As David says, it is possible to produce an XML document with no line-ends
and no other non-significant whitespace. If additional whitespace (e.g. 
for paragraphs) is to be included in the processed document, then it can
either be explicitly included as markup, or deduced from markup through
stylesheets or other methods.

The reasons I can see that non-significant whitespace is contained in XML 
documents are:
	- the documents are produced to be human-readable
	- the authoring/editing tools used introduce non-significant whitespace
	- non-significant whitespace is required to allow various tools to
		process the documents 
	- humans edit the XML documents

I can conceive of a time (perhaps 2 years hence) when there are a wide variety
of XML authoring tools and when the HTML community is educated about XML. In 
that state, perhaps, documents will always be created without non-significant
whitespace. Then, perhaps, we shall have a non-problem.

At present we have (at least) the viewpoints:
	- whitespace matters and authors must define precisely what they want
		in a document. The SGML community can understand and manage
		whitespace. If newcomers find it difficult, they'll have to
		learn the rules, or use proper tools.
	- most of the people who will want to use XML will graduate from HTML.
		This has 'taught' them that whitespace is not significant and
		gets normalised somewhere. They will start creating XML by 
		analogy with HTML. XML will not succeed unless we can
		offer some support for this transitional period.

As is fairly obvious, I take the second viewpoint.  I am trying to 'sell' CML
to a community which has never heard of SGML, but knows about HTML. I cannot
sell them files which they can't read (because they have no line breaks) or
force them to understand where space conventions differ from HTML.  Remember
that many XML files are going to be authored by people who never go near an
SGML tool - the molecular community will probably use C programs.

So - David asks for examples :-)

I want to be able to state that these two XML documents are to be interpreted 
to give identical results:

<FOO><META DC.AUTHOR="foo"/><META DC.TITLE="baz"/><BAR B="b"/></FOO>

and

<FOO>
  <META DC.AUTHOR="foo"/>
  <META DC.TITLE="baz"/>
  <BAR B="b"/>
</FOO>
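
To make the requirement concrete, here is a minimal sketch in Python of an
application-level rule under which the two documents above compare equal. It
is an illustration only, not anything defined by the XML draft: the function
names strip_ignorable and canonical are hypothetical, and the rule assumes the
application already knows that FOO has element-only content (it would be wrong
for mixed content, where whitespace between elements may matter).

```python
import xml.etree.ElementTree as ET

def strip_ignorable(elem):
    """Drop whitespace-only text around child elements, in place.

    Assumption: every element that has children is element-only content,
    so whitespace-only text next to those children carries no meaning.
    """
    if len(elem):  # only elements that contain child elements
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
            strip_ignorable(child)
    return elem

def canonical(xml_text):
    """Serialise a document after removing the ignorable whitespace."""
    root = strip_ignorable(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")

compact = '<FOO><META DC.AUTHOR="foo"/><META DC.TITLE="baz"/><BAR B="b"/></FOO>'
pretty = """<FOO>
  <META DC.AUTHOR="foo"/>
  <META DC.TITLE="baz"/>
  <BAR B="b"/>
</FOO>"""

# The parser hands over all whitespace; the rule above is what makes
# the prettyprinted and the single-line forms equivalent.
print(canonical(compact) == canonical(pretty))
```

The point of the sketch is that the equivalence lives in the application's
rule, not in the parser, which is why I would like the rule to have a name.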

Almost everyone who posts **examples** of XML files shows them prettyprinted
in some fashion.  No-one posts 1000 character lines to this list, or to
XML-SIG - they wouldn't be popular! So the impression is probably universal
outside the XML experts that XML files can be prettyprinted ad lib. 

I would like to preserve this prettyprinting - I suspect this is a major
motive for trying to see some way forward here.

A second example could be the one that I posted earlier:
<PARA> We took
<VAR TYPE="float">23.02+02</VAR>
<UNIT>gram</UNIT>
water
</PARA>

This clearly contains 'text' and my community is conditioned to reading 
this in the same way as HTML (i.e. that the line-ends are normalised to 
a single space). It seems to me that this is likely to be valuable in many
applications and that interoperability and code re-use would be greatly
helped by giving it a label and a set of rules. As I have said more than once,
I would like to avoid having to develop both my own rules and my own code.
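
As a rough sketch of the HTML-like reading I have in mind, the following
Python collapses every run of whitespace (including line-ends) in the PARA
example to a single space. The function name html_style_text is hypothetical,
and the rule itself is only my assumption of what such a labelled convention
might say; it also flattens the markup to plain text purely to show the effect.

```python
import re
import xml.etree.ElementTree as ET

def html_style_text(elem):
    """Join all character data in elem, collapsing whitespace runs to one space."""
    raw = "".join(elem.itertext())
    return re.sub(r"\s+", " ", raw).strip()

para = ET.fromstring("""<PARA> We took
<VAR TYPE="float">23.02+02</VAR>
<UNIT>gram</UNIT>
water
</PARA>""")

print(html_style_text(para))  # We took 23.02+02 gram water
```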

I have a fear (and I think it is shared by my community) that data within a
document can be changed by changing a stylesheet.  The *meaning* of the
(HTML) file below differs according to whether the line-end is normalised to a 
space or not:

<P> I saw a <B>black</B>
<B>bird</B>
</P>

Since stylesheets can be (and will be) imposed by people other than the 
author (publishers, browsers, readers, etc.) there is a danger that
stylesheet-imposed WS processing can change meaning. Of course you can argue that the 
author above should have taken greater trouble to create an unambiguous 
text, but this is the way that I expect many newcomers to XML to approach it.
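
The danger can be shown with a small Python sketch that applies two different
line-end policies to the fragment above. The function render and its line_end
parameter are hypothetical names standing in for whatever a stylesheet or
application might do; neither policy is claimed to be what any real tool does.

```python
import re
import xml.etree.ElementTree as ET

doc_text = """<P> I saw a <B>black</B>
<B>bird</B>
</P>"""

def render(xml_text, line_end):
    """Flatten the text, replacing each line-end with the given string."""
    raw = "".join(ET.fromstring(xml_text).itertext())
    return re.sub(r"[ \t]*\n[ \t]*", line_end, raw).strip()

print(render(doc_text, " "))  # I saw a black bird
print(render(doc_text, ""))   # I saw a blackbird
```

The same bytes give a black bird under one policy and a blackbird under the
other, which is exactly the change of meaning I am worried about.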

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)