Re: [xml-dev] Difference between "normalize" and "canonicalize"?
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: <xml-dev@lists.xml.org>
- Date: Wed, 25 Feb 2009 08:11:41 -0500
Personally, I see "normalization" as changing the information into
something that is common, while "canonicalization" is representing
something in a common way without changing it.
When line-ending sequences are normalized, they are changed into new
values without their old values being retained. On DOS and Mac and
mainframe systems, different line-ending sequences are changed to the
line-feed character. Once you have the line-feed character there is
no going back to what it was. If there was a bare line-feed in the DOS
file, there is no distinguishing that authored line-feed from a
line-feed produced by normalizing a line-ending sequence.
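To illustrate, here is a minimal Python sketch of that kind of line-end normalization. The function name normalize_line_endings is made up for this example, and it only handles the CR LF and lone CR cases, not any mainframe-specific characters. It is not meant to reproduce any particular parser's code.

```python
def normalize_line_endings(text: str) -> str:
    """Turn CR LF pairs and lone CR characters into a single LF."""
    # Replace CR LF first so that the pair does not become two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Two different authored inputs: the second line of the DOS text ends
# with a bare line-feed that the author typed deliberately.
dos_text = "first\r\nsecond\nthird\r\n"
unix_text = "first\nsecond\nthird\n"

dos_result = normalize_line_endings(dos_text)
unix_result = normalize_line_endings(unix_text)
```

Both inputs produce the same string, so from the result alone there is no telling which line-ending sequences were in the original.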
The normalize-space() function changes a sequence of white-space
characters into a single space. The information is changing and you
can't undo it once you have the normalized string. There's no way to
go back to an arbitrary sequence of white-space characters.
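A rough Python approximation of normalize-space() shows the same loss. The helper name normalize_space is hypothetical, and the sketch assumes the XML definition of white-space (space, tab, carriage return, line feed). Note that the XPath function also strips leading and trailing white-space, which the sketch mimics.

```python
import re


def normalize_space(text: str) -> str:
    """Collapse runs of XML white-space to one space and trim both ends."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" ")


# Two inputs with different white-space sequences between the words.
first = normalize_space("  a \t\n b  ")
second = normalize_space("a     b")
```

Both calls return the same normalized string, and nothing in that string records which sequence of white-space characters was there before.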
The normalize-unicode() function changes a character without the
ability to go back. Using NFKC normalization on U+1E9B (long s with
dot above) creates U+1E61 (s with dot above) and you can't go back
because you've changed the Latin character that is the basis of the
Unicode character from a long s to a simple s.
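The same NFKC result can be observed outside XPath. This small sketch uses Python's standard unicodedata module rather than normalize-unicode(), purely as an illustration; the variable names are my own.

```python
import unicodedata

long_s_with_dot = "\u1e9b"   # LATIN SMALL LETTER LONG S WITH DOT ABOVE
plain_s_with_dot = "\u1e61"  # LATIN SMALL LETTER S WITH DOT ABOVE

# NFKC applies the compatibility mapping from long s to s.
from_long_s = unicodedata.normalize("NFKC", long_s_with_dot)
from_plain_s = unicodedata.normalize("NFKC", plain_s_with_dot)
```

Both inputs normalize to U+1E61, so the normalized text no longer says whether the author wrote a long s or a simple s.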
On the other hand, canonicalization doesn't change the information,
or the meaning of the information, it merely makes assumptions about
how that information is presented or organized. One can then recover
another arbitrary representation or organization again without
changing the meaning. Consider empty elements: they can be written
either as "<abc/>" or "<abc></abc>" and the meaning of the two
is identical. In an XML processor you cannot distinguish between the
two. However, when not using an XML processor you need a common
representation of an empty element, so that two users who see the
same empty element write it out in the same canonical form and other
processes see the same information from their perspective. But the
information hasn't changed at all.
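As a small illustration of the empty-element case, the sketch below uses the canonicalize() function in Python's xml.etree.ElementTree module (available from Python 3.8, implementing Canonical XML 2.0). Treat it as one possible canonicalizer among several, not as the tool anyone in this thread had in mind.

```python
import xml.etree.ElementTree as ET

short_form = "<abc/>"
long_form = "<abc></abc>"

# canonicalize() returns the canonical serialization as a string
# when no output file is supplied.
canonical_short = ET.canonicalize(short_form)
canonical_long = ET.canonicalize(long_form)

# Parsing either serialization yields the same element: no text, no children.
parsed_short = ET.fromstring(short_form)
parsed_long = ET.fromstring(long_form)
```

Both serializations come out as the start-tag, end-tag pair, while the parsed element is the same either way: only the written convention differs, not the information.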
So I see normalization as destructive and canonicalization as not destructive.
Normalizing information creates a common form without necessarily
being able to recover the original form because the information is
being changed.
Canonicalizing information creates a common form merely by convention
and one could then change that to another alternate form simply by
following a different convention without changing the information.
So I personally don't consider the two terms the same.
But I also don't think they are always consistently applied with such
nuance and I wouldn't be surprised to find some users of the terms
interchanging them. But when I'm given the choice I perceive a distinction.
I hope this helps.
. . . . . . . . . . . Ken
At 2009-02-25 06:38 -0500, Costello, Roger L. wrote:
>Hi Folks,
>
>Consider these two sentences:
>
>
>1. When an XML parser reads in an XML document it normalizes all
>line breaks to \n.
>
>2. A canonicalizer tool will canonicalize empty elements to
>start-tag, end-tag pairs.
>
>
>Both "normalize" and "canonicalize" seem to mean:
>
> Put into a standard form.
>
>Do they in fact mean the same thing? If so, why have two terms? Why
>not have just one term?
>
>/Roger
--
XQuery/XSLT training in Prague, CZ 2009-03 http://www.xmlprague.cz
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson: http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview: http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/x/
Male Cancer Awareness Nov'07 http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal