
   RE: tabs to indent for pretty-printing (is it correct?)

  • From: Paul Grosso <pgrosso@arbortext.com>
  • To: xml-dev@lists.xml.org
  • Date: Tue, 31 Oct 2000 12:13:38 -0600

At 14:48 2000 10 31 +0100, David Valera [and D.Megginson] wrote:
>>  > If I read the XML spec correctly, adding tabs and spaces to increase
>>  > readability is often done, but it is not intended to be saved as such
>>  > (significant whitespace excluded of course).
>>
>> That's application-specific; i.e. the parser passes all of the
>> characters (including whitespace) to the application, then the
>> application decides what is and isn't significant for its purposes.
>
>This means that if I open an XML file with an (XML)editor, add an element
>and save it back again, it is possible that not only the element I inserted
>is added, but also the whitespaces that the application 'decides' to
>include.
>
>I am asking this because I have opened numerous XML files in XMLeditors and
>without changing the content I was surprised to see all kinds of tabs and
>spaces added to the saved document (just to make it easy readable).

This is actually a somewhat interesting and less-than-trivial issue.

As DavidM points out, there are XML processors (which should obey
the XML 1.0 spec) and everything else--including XML editors--which
are applications and can therefore exhibit application-specific behavior.

The issue you raise here is really "what changes to my document does
my editing environment consider to be 'insignificant' (i.e., things
that it will do 'on its own')."  And that depends on the main goal of
the editor.

If you want total control over every bit of your document, use an
editor that lets you get down to the bit-level of your file.  Even
"ascii" editors do some interpretation (e.g., at the character encoding
level).  Many XML editors provide a higher level interpretation (e.g.,
they interpret the markup and provide GUIs to interact with it).

It is common for such XML editors to "change" your document in ways
other than the specific edits you requested.  This is because such
editors make an internal representation of your document that maintains
only some of the information represented by your input document and then
re-serializes that information later.  Distinctions that such an editor 
considers "insignificant" in terms of the information content of your 
document may not be retained.  Among the distinctions most commonly lost
are the specifics of line breaks and white space.  [Others include the
order of attributes, whether a ' or " was used to delimit a literal,
whether &apos; or ' was used within text (where either would be allowable),
and so forth.]
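
A minimal sketch of that round trip, using Python's standard
xml.etree.ElementTree as a stand-in for an editor's internal
representation (an assumption made for illustration; real editors
use their own models, and the sample element is made up).  The tree
keeps the attribute values and the text, but not which quote
character or which spelling of the apostrophe the input used:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: single-quoted attribute, &apos; in the text.
source = """<note lang='en' id="n1">it&apos;s here</note>"""

root = ET.fromstring(source)                   # build the internal tree
saved = ET.tostring(root, encoding="unicode")  # re-serialize it

print(source)
print(saved)

# The information content survives the round trip...
reparsed = ET.fromstring(saved)
print(reparsed.attrib == root.attrib, reparsed.text == root.text)

# ...but the serialized form is not the same as the input.
print(saved == source)
```

Which distinctions are dropped depends entirely on the tool; this
only shows that a parse-then-save cycle need not reproduce the input
exactly.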

Some white space is clearly insignificant per the XML 1.0 spec:  that
within markup, and that which gets normalized out of attributes per
the spec.  An XML editor that adds or deletes such insignificant space
is not changing the information content of your document as defined by
the XML 1.0 spec.  Of course, you might still wish it didn't make such
changes.  As far as you are concerned, that "insignificant" white space
may be "information" you want to preserve.  One person's insignificant
stuff is another person's valuable information.  

But most people would agree that it's fine for an XML editor to add/delete
whitespace that is deemed insignificant by the XML 1.0 spec.  (If this
doesn't satisfy you, you'll need to edit your documents using another tool.)
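
To illustrate the two kinds of insignificant space mentioned above,
here is a small sketch using Python's xml.etree.ElementTree (my
choice of parser for the example; the element and attribute names
are made up).  The two inputs differ only in white space inside the
tag and in a tab versus a space inside an attribute value, and the
parser reports the same thing for both:

```python
import xml.etree.ElementTree as ET

# Same element written two ways.  The first has a literal tab inside the
# attribute value; the second spreads the tag over several lines.
tight = '<item id="7" label="two\twords"/>'
loose = '<item\n    id = "7"\n    label="two words"   />'

a = ET.fromstring(tight)
b = ET.fromstring(loose)

# White space within markup is not reported at all, and the tab in the
# attribute value is normalized to a space, so both parse to the same data.
print(a.attrib)
print(b.attrib)
print(a.tag == b.tag and a.attrib == b.attrib)
```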

Now many XML editors are optimized for a specific task such as the
authoring and editing of documents that will be published eventually
either as [X]HTML or as composed documents, and these optimizations
often drive what additional whitespace these editor applications 
consider to be "insignificant".

Since most composition processes do their own line breaking within
runs of character data (except in "preformatted" or "verbatim" regions),
the specifics of line breaks in the input file (except in such
special regions) are insignificant to them.  Therefore, it is reasonable for
XML editors optimized for such applications to turn input file line ends
into spaces and some spaces into line ends upon writing back out the file.
(What would not be quite as reasonable for such editors is to introduce
white space where none exists and where compliant XML processors would
consider it significant.)  
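
As a rough sketch of the "reasonable" behavior, the hypothetical
helper below (rewrap is my own name, not taken from any product)
re-breaks a run of character data using Python's textwrap module.
It only trades existing white space for a space or a line end; it
never breaks inside a word, and it adds leading or trailing space
only where the input already had some:

```python
import textwrap

def rewrap(text, width=40):
    """Re-break character data without creating new white space positions."""
    # Remember whether the run started or ended with white space.
    lead = " " if text[:1].isspace() else ""
    trail = " " if text[-1:].isspace() else ""
    # Collapse every existing white space run to one space, then re-break.
    body = textwrap.fill(" ".join(text.split()), width=width,
                         break_long_words=False, break_on_hyphens=False)
    return lead + body + trail

sample = "one two\nthree   four five six seven eight nine ten eleven twelve"
out = rewrap(sample, width=20)
print(out)

# The words and their order are unchanged; only the white space between
# them differs.
print(out.split() == sample.split())
```

This assumes the text is not in a preformatted or verbatim region,
where line breaks would have to be left alone.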

HTML browsers appear to have a set of "rules" for ignoring leading white
space on an input line (and perhaps other places), despite the fact that 
HTML is defined as an application of SGML in which some such space would be
significant.  XML editors optimized to create [X]HTML sometimes treat
such space as insignificant.  In fact, such XML editors sometimes add
such "indenting" space to "pretty-print" the resulting file.  The 
resulting file still displays "properly" in HTML browsers.

The problem is that such "indenting" space is significant as far as
XML processors are concerned--specifically, it is space that would and 
should be significant to an XML composition system.  So, if you edit
your XML in an editor that "indents" and then run your XML through a
composition system, you may well get unwanted space in your output.
(Specifically, space inserted in mixed content, such as in the 
indented example:
  <h2 align="left">
  The h2 heading text
  </h2>
is significant and will cause the h2 text to look like it isn't aligned
correctly because it will start with a significant space when it is
properly composed.) 
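
A short sketch of what a parser hands to the application for the
indented h2 above, again using Python's xml.etree.ElementTree as
the example processor (any conforming XML processor should pass the
same characters along):

```python
import xml.etree.ElementTree as ET

indented = '<h2 align="left">\n  The h2 heading text\n  </h2>'
compact = '<h2 align="left">The h2 heading text</h2>'

indented_text = ET.fromstring(indented).text
compact_text = ET.fromstring(compact).text

# repr() makes the line ends and indenting spaces visible.
print(repr(indented_text))
print(repr(compact_text))

# The two elements do not carry the same character data.
print(indented_text == compact_text)
```

The indented version arrives with a line end and two spaces before
the heading text, and it is up to the downstream application whether
to compose that as space.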

If you find such "indenting" behavior on the part of an XML editor
inappropriate, you might see if there is a switch/mode in the product
that allows you to disable it.  If not, then you will need to find
another way to edit your documents.

paul