From: "Ann Navarro" <ann@webgeek.com>
> I just ran into this myself, with a styled apostrophe character -- which
> was only reported as a problem by XML Spy 4.4 upon opening the 1.2MB XML
> file (character was: Â (0xC2), ' (0x92)).
I expect we will see more of this problem, unless the C1 controls (U+0080-U+009F)
are banned from direct use in XML. The trouble is that transcoders do not fail when
they find strange characters. Nothing stops your XML from being polluted, because
once corrupted data is in, it may look like good data. For more on this issue,
see http://www.topologi.com/public/XML_Naming_Rules.html
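
To make the failure mode concrete, here is a minimal Python sketch of how a 0xC2 0x92 pair like the one reported above can arise. It assumes the apostrophe started life as byte 0x92 in Windows-1252 and was transcoded under an ISO-8859-1 label; that is a guess about the cause, not something the report confirms. The helper name find_c1_controls is made up for illustration.

```python
# Byte 0x92 is a right single quotation mark in Windows-1252,
# but the C1 control U+0092 in ISO-8859-1.
raw = b"don\x92t"

# Assumption: the data was labelled ISO-8859-1. Decoding succeeds silently,
# because every byte value is a legal ISO-8859-1 character.
mislabelled = raw.decode("iso-8859-1")

# Re-encoding as UTF-8 also succeeds; U+0092 becomes the bytes 0xC2 0x92.
utf8_bytes = mislabelled.encode("utf-8")


def find_c1_controls(text):
    """Return (offset, code point) pairs for characters in U+0080-U+009F."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if 0x80 <= ord(ch) <= 0x9F]


# Decoding with the label the data probably needed gives a real apostrophe.
intended = raw.decode("cp1252")
```

Scanning the decoded text for the C1 range is cheap, and in practice a hit usually means a mislabelled Windows code page rather than a deliberate control character.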
...
> A tool that would quickly locate these kinds of things would be enormously
> helpful (I'd certainly buy a copy if it were commercial/shareware).
You may care to look at my company's new editor for XML and SGML:
the Topologi Collaborative Markup Editor. See
http://www.topologi.com/
We'll be posting the real announcement in a day or two; you can download it
for evaluation now.
When you open a file, an "Incoming Text Conditioning" box comes up. In the
"Whitespace" tab you can set it to:
* detect control characters or characters above a certain code point
* give a warning or replace the character with a PI containing the code point,
so you can figure out what is going wrong and where it is.
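
The two options above can be sketched roughly as follows in Python. This is not how the editor is implemented; the function names, the max_code_point parameter and the PI target "codepoint" are all hypothetical, chosen only for illustration. The sketch is also naive about context: it replaces characters wherever they occur, and a PI is not allowed inside a tag or an attribute value.

```python
def is_suspect(ch, max_code_point=None):
    """True for C0/C1 controls and DEL (tab, LF and CR are allowed),
    or for any character above max_code_point when one is given."""
    cp = ord(ch)
    if ch in "\t\n\r":
        return False
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return True
    return max_code_point is not None and cp > max_code_point


def condition(text, max_code_point=None, replace=False):
    """Return (text, warnings); warnings are (line number, "U+XXXX") pairs.

    With replace=True each suspect character becomes a processing
    instruction carrying its code point (hypothetical PI target).
    """
    out = []
    warnings = []
    line = 1
    for ch in text:
        if is_suspect(ch, max_code_point):
            label = f"U+{ord(ch):04X}"
            warnings.append((line, label))
            out.append(f"<?codepoint {label}?>" if replace else ch)
        else:
            out.append(ch)
        if ch == "\n":
            line += 1
    return "".join(out), warnings
```

For example, condition(text, replace=True) leaves a visible marker at each suspect position, while condition(text, max_code_point=0x7F) only reports every non-ASCII character along with its line number.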
Also, it displays the Unicode code for the current caret position, so you can
see what is going on even when the font doesn't have a glyph for a character.
It will give warnings for many kinds of encoding errors, and sorts its available
encodings in three ways (by platform, by language, and by IANA name)
for easier selection. It performs Unicode normalization on the way in and the
way out, and during cut-and-paste.
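
For anyone who wants the same kind of information without the editor, a minimal Python sketch using the standard unicodedata module follows. The function names are made up, and the choice of NFC is an assumption; nothing above says which normalization form the editor applies.

```python
import unicodedata


def describe_char(text, offset):
    """Report the code point (and name, when Unicode defines one) at an offset."""
    ch = text[offset]
    name = unicodedata.name(ch, "<no name>")  # control characters have no name
    return f"U+{ord(ch):04X} {name}"


def normalize_text(text):
    """Normalize to NFC (an assumed form) so that equivalent sequences match."""
    return unicodedata.normalize("NFC", text)
```

Calling describe_char on a character the font cannot draw still gives its code point, which is usually enough to work out which encoding went wrong.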
Cheers
Rick Jelliffe