xml-dev - Re: [xml-dev] Detection of non-Unicode characters

To: Mark Feblowitz <mfeblowitz@frictionless.com>
Subject: Re: [xml-dev] Detection of non-Unicode characters
From: Tim Bray <tbray@textuality.com>
Date: Fri, 23 Aug 2002 14:24:51 -0700
Cc: xml-dev@lists.xml.org
References: <4DBDB4044ABED31183C000508BA0E97F040ABF38@fcpostal.frictionless.com>
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.1b) Gecko/20020722

Mark Feblowitz wrote:
> We've gotten ourselves in a slight muddle. We've copied Word documentation
> into (many) xs:annotation blocks in our UTF-8 .xsd files (there are around
> 300 files). In the process, we have apparently brought along some
> non-Unicode characters. This is not tolerated equally well by all tools.

I love that last sentence.  Your problem is probably subtly different 
from what you stated, which might even make a difference in the solution. 
It could be the case that the file has bytes that are not actually a 
UTF-8 encoding of any character; for example, the hex sequence 0AC0C0 
cannot possibly occur in UTF-8, because the byte C0 is never legal there.

I'm not aware of any command-line tools for catching this, but you could 
write your own in C in a couple of hours with a copy of the UTF-8 rules 
handy; it wouldn't be XML-specific.
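
As a rough illustration of such a checker, here is a minimal sketch in 
Python rather than C. It leans on Python's built-in strict UTF-8 decoder 
instead of hand-coding the rules, and the function name 
invalid_utf8_offsets is made up for this example. It reports the byte 
offset at which each undecodable sequence starts.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return the byte offsets where undecodable UTF-8 sequences start."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first malformed sequence.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            offsets.append(pos + err.start)
            pos += err.end
    return offsets


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_utf8.py file1.xsd file2.xsd
    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            bad = invalid_utf8_offsets(handle.read())
        if bad:
            print(name, "invalid UTF-8 at byte offsets", bad)
```

Run over the 300 or so .xsd files, this would list only the files that 
contain bytes which are not UTF-8 at all; it says nothing about whether 
the decoded characters are legal in XML.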

A second possible problem is that the UTF-8 is good but it encodes 
Unicode characters that aren't allowed in XML, for example &#x01;.  Any 
decent XML parser should catch this and give you helpful error messages. 
If you have expat around on your system (and a lot of people do these 
days), "xmlwf [filename here]" will do the trick.  D'oh, now that I think 
about it, I bet xmlwf (or an equivalent) would catch the UTF-8 breakage 
too.  -Tim
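
For anyone without xmlwf to hand, the same kind of check can be sketched 
with Python's standard expat binding, xml.parsers.expat. This is an 
illustration under that assumption, not a replacement for xmlwf, and the 
function name check_well_formed is invented for the example. It returns 
None for a well-formed document, or the line, column, and expat message 
of the first error.

```python
import xml.parsers.expat


def check_well_formed(data: bytes):
    """Return None if expat accepts the document, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, message)
    return None


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_xml.py file1.xsd file2.xsd
    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            problem = check_well_formed(handle.read())
        if problem is not None:
            print(name, "line %d, column %d: %s" % problem)
```

Because expat decodes the bytes itself, a check like this should flag 
both kinds of trouble: byte sequences that are not valid UTF-8, and 
characters or character references such as &#x01; that XML forbids.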
