xml-dev - Re: [xml-dev] Detection of non-Unicode characters

To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Detection of non-Unicode characters
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Tue, 27 Aug 2002 02:25:51 +1000
References: <3D66A823.2060109@textuality.com> <4DBDB4044ABED31183C000508BA0E97F040ABF38@fcpostal.frictionless.com> <3D66A823.2060109@textuality.com> <4.3.2.7.2.20020826100147.00c101d0@mail.webgeek.com>

From: "Ann Navarro" <ann@webgeek.com>

> I just ran into this myself, with a styled apostrophe character -- which 
> was only reported as a problem by XML Spy 4.4 upon opening the 1.2MB XML 
> file (character was: Â (0xC2), ' (0x92)).


I expect we will see more of this problem unless the C1 controls (U+0080-U+009F)
are banned from direct use in XML. The trouble is that transcoders do not fail when
they meet unexpected characters, so nothing stops your XML from being polluted: once
the data is corrupted, it may still look like good data. For more on this issue,
see http://www.topologi.com/public/XML_Naming_Rules.html
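
To make the failure concrete, here is a minimal Python sketch of how the byte
pair Ann reported (0xC2 0x92) can arise. It assumes the original text was
Windows-1252 but was labelled ISO-8859-1 somewhere along the way; that is one
plausible route, not a diagnosis of her file. The helper find_c1_controls is
a hypothetical name for illustration.

```python
import re

# C1 control range U+0080-U+009F, as discussed above.
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")

def find_c1_controls(text):
    """Return (offset, 'U+XXXX') for each C1 control character in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in C1_CONTROLS.finditer(text)]

raw = b"it\x92s"                   # Windows-1252 bytes; 0x92 is a curly apostrophe
wrong = raw.decode("iso-8859-1")   # mislabelled as Latin-1: succeeds silently
right = raw.decode("cp1252")       # correct label yields U+2019

print(wrong.encode("utf-8"))       # b'it\xc2\x92s'
print(find_c1_controls(wrong))     # [(2, 'U+0092')]
print(find_c1_controls(right))     # []
```

The mislabelled decode raises no error, and the resulting U+0092 is a legal
XML 1.0 character, so a parser will not complain either.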

...
> A tool that would quickly locate these kinds of things would be enormously 
> helpful (I'd certainly buy a copy if it were commercial/shareware).

You may care to look at my company's new editor for XML and SGML:
the Topologi Collaborative Markup Editor. See
 http://www.topologi.com/

We'll be posting the real announcement in a day or two; you can download it
for evaluation now.

When you open a file, an "Incoming Text Conditioning" box comes up. In the
"Whitespace" tab you can set it to:
  * detect control characters, or characters above a certain code point
  * give a warning, or replace the character with a PI containing the code point,
so that you can figure out what is going wrong and where.
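
The two options above can be sketched roughly as follows in Python. This is
not the editor's code: the function name condition_text, the max_code_point
parameter, and the PI target "char" are all made up for illustration.

```python
def condition_text(text, max_code_point=0x7E, replace=False, pi_target="char"):
    """Flag control characters and characters above max_code_point.

    Returns (new_text, warnings). With replace=True, each flagged character
    is replaced by a processing instruction carrying its code point.
    pi_target is a hypothetical PI name, not the one the editor uses.
    """
    out, warnings = [], []
    line, col = 1, 1
    for ch in text:
        cp = ord(ch)
        # C0 controls other than tab/LF/CR, plus DEL and the C1 range.
        is_control = (cp < 0x20 and ch not in "\t\n\r") or 0x7F <= cp <= 0x9F
        if is_control or cp > max_code_point:
            warnings.append(f"line {line}, column {col}: U+{cp:04X}")
            out.append(f"<?{pi_target} U+{cp:04X}?>" if replace else ch)
        else:
            out.append(ch)
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1
    return "".join(out), warnings

fixed, notes = condition_text("<p>it\x92s</p>", max_code_point=0xFFFF, replace=True)
print(fixed)   # <p>it<?char U+0092?>s</p>
print(notes)   # ['line 1, column 6: U+0092']
```

A PI is only allowed in content, not inside a tag or an attribute value, so
a real tool would need to know where in the markup the character sits before
substituting.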

Also, it displays the Unicode code point of the character at the current caret
position, so you can see what is going on even when the font doesn't have a glyph
for a character. It will give warnings for many kinds of encoding errors, and sorts
its available encodings in three ways (by platform, by language, and by IANA name)
for easier selection. It performs Unicode normalization on the way in and the
way out, and during cut-and-paste.
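
For readers without the editor, both ideas can be approximated with Python's
standard unicodedata module. This is only a sketch: describe_char is a
hypothetical helper, and NFC is an assumption, since the message does not say
which normalization form the editor applies.

```python
import unicodedata

def describe_char(text, index):
    """Report the code point (and name, if any) of the character at index."""
    ch = text[index]
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

decomposed = "cafe\u0301"                            # 'e' + combining acute
composed = unicodedata.normalize("NFC", decomposed)  # NFC chosen as an assumption

print(describe_char(decomposed, 4))   # U+0301 COMBINING ACUTE ACCENT
print(describe_char(composed, 3))     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe_char("\x92", 0))       # U+0092 <unnamed>
```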

Cheers
Rick Jelliffe
