   Re: [xml-dev] Detection of non-Unicode characters


From: "Mark Feblowitz" <mfeblowitz@frictionless.com>

> We've gotten ourselves in a slight muddle. We've copied Word documentation
> into (many) xs:annotation blocks in our UTF-8 .xsd files (there are around
> 300 files). In the process, we have apparently brought along some
> non-Unicode characters. This is not tolerated equally well by all tools.
> 
> Is there a convenient means of scanning .xsd files to locate non-Unicode
> characters? I'm looking for something like a Windows command line filter.
> 
> Any idea where I can find such a beast?

If you have a programmer on tap, you would probably be better off
writing a quick C (or Python or Perl) program to do this.
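For illustration, here is a minimal Python sketch of such a scanner. The function name and reporting format are my own; it assumes the files are nominally UTF-8 and reports the offset of each byte sequence that fails to decode:

```python
# Minimal sketch: report byte offsets where a nominally UTF-8 file
# contains sequences that do not decode as UTF-8.
import sys

def find_bad_bytes(path):
    data = open(path, "rb").read()
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # the rest of the file is clean
        except UnicodeDecodeError as e:
            bad.append(pos + e.start)  # absolute offset of the bad byte
            pos += e.end               # skip past it and keep scanning
    return bad

if __name__ == "__main__":
    for name in sys.argv[1:]:
        for offset in find_bad_bytes(name):
            print("%s: invalid byte at offset %d" % (name, offset))
```

Run over all the .xsd files, this gives you the locations to inspect before deciding on a fix.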

It is a state machine with two states, S1 and S2, and two transitions:
T1 from S1 to S2, and T2 from S2 to S1.

In S1, read a byte, write it out, and append it to a byte buffer.
When the buffer ends with "<xs:annotation", take T1 and clear the
byte buffer.

In S2, read a byte, translate it to UTF-8, and write out the
resulting bytes. When you find "</xs:annotation", take T2 and clear
the byte buffer.

In all probability (unless you have East Asian annotations, or
UTF-16 annotations) your bogus text is encoded in ISO 8859-1,
MacRoman, or CP1252, which are all single-byte encodings.  So the
translation is quite easy.
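A sketch of that state machine, assuming CP1252 annotation bodies. For brevity it matches the tags with find() over a whole in-memory buffer rather than byte at a time, but the S1/S2 behaviour is the same; the tag strings and the helper name are my own:

```python
# Two-state filter: copy bytes verbatim outside annotations (S1),
# transcode CP1252 -> UTF-8 inside them (S2). Assumes the tag
# strings appear literally in the byte stream.
START = b"<xs:annotation"
END = b"</xs:annotation"

def transcode(data):
    out = bytearray()
    pos = 0
    while True:
        # S1: copy verbatim until the opening tag (transition T1)
        i = data.find(START, pos)
        if i < 0:
            out += data[pos:]          # no more annotations
            return bytes(out)
        i += len(START)
        out += data[pos:i]
        # S2: transcode until the closing tag (transition T2)
        j = data.find(END, i)
        if j < 0:
            j = len(data)
        out += data[i:j].decode("cp1252").encode("utf-8")
        pos = j
```

Used as a filter, this reads the file as bytes, rewrites only the annotation bodies, and leaves everything else untouched.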

But before doing this, confirm the encodings used in
the XML documents and the Word fragments. Under no circumstances
should you read the document in as XML: that will surely
corrupt the data further, and you may not be able to go back.

If you use Java, read everything as bytes, not as characters,
because reading characters will cause transcoding
and therefore corruption.

If the data is already sitting in a data structure in a program, then
serialize it out so that each xs:annotation goes into a different
entity with the appropriate encoding header.  External entities
can all be in different encodings.  Then just parse
the document as normal XML and the parser will take care
of this for you: XML already provides these facilities to
cope. If your XML parser does not handle external entity
references properly, get rid of it and switch to
professional-quality tools.
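For example, the master document might declare each annotation body as an external entity (the file names here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xs:schema [
  <!ENTITY ann1 SYSTEM "ann1.ent">
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:documentation>&ann1;</xs:documentation>
  </xs:annotation>
</xs:schema>
```

Each entity file then begins with its own text declaration, e.g. `<?xml encoding="windows-1252"?>`, followed by the pasted Word text in its native encoding; a conforming parser transcodes it on input.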

Finally, if you don't have a programmer on tap, then use
the same tool you used to plonk the Word documentation
into the xs:annotations, and cut and paste the annotations into
their own entities, with the correct encoding headers.  This
is tedious but low-tech.

Cheers
Rick Jelliffe







 


Copyright 2001 XML.org. This site is hosted by OASIS