


   Re: Identifying the encoding of a document

  • From: Rick JELLIFFE <ricko@geotempo.com>
  • To: xml-dev@xml.org
  • Date: Mon, 07 Aug 2000 19:21:19 +0800

Lisa Retief wrote:
> > Do you mean you need to detect in some documents which encoding they
> > use?
> Yes - I need to do this programmatically - sometimes the user has not
> specified an encoding and I would prefer not to default to something if I
> can figure it out.

If you can figure out three things about the text, it will help narrow
your choices:

 1) Do you know the language (really, the script) used in the documents
    by some external mechanism?  In particular, is it Latin-based
    or something else?

 2) Is the encoding ASCII-family, EBCDIC-family, or something exotic?
    A simple way to check is to open the document in a vanilla
    text editor: if the places where you expect ASCII characters
    (a-z A-Z 0-9 simple punctuation) are readable, that will help
    you figure it out.

 3) What locale was it created in (or what locale were the tools)?
    E.g. was it made in Japan, or by or for Japanese?

When you know all three of these things, you will often have only one or
two main choices.
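
The editor check in point 2 can be roughly automated. The following is
only a minimal sketch: the function name guess_family, the choice of
cp037 as a representative EBCDIC code page, and the 0.6 threshold are
all assumptions made for illustration, not an established method.

```python
def guess_family(data: bytes) -> str:
    """Crude guess: ASCII-family, EBCDIC-family, or unknown."""

    def readable_ratio(text: str) -> float:
        # Share of characters that are plain ASCII letters, digits,
        # whitespace or simple punctuation.
        if not text:
            return 0.0
        simple = " \t\r\n.,;:!?<>/=\"'-"
        ok = sum(
            1 for ch in text
            if ch.isascii() and (ch.isalnum() or ch in simple)
        )
        return ok / len(text)

    # latin-1 maps every byte to a character, so it shows how the bytes
    # read under an ASCII-compatible interpretation.
    ascii_score = readable_ratio(data.decode("latin-1"))
    # cp037 is one common EBCDIC code page (an assumption; others exist).
    ebcdic_score = readable_ratio(data.decode("cp037"))

    # Arbitrary cut-off: if neither reading is mostly readable, give up.
    if max(ascii_score, ebcdic_score) < 0.6:
        return "unknown"
    if ascii_score > ebcdic_score:
        return "ascii-family"
    if ebcdic_score > ascii_score:
        return "ebcdic-family"
    return "unknown"
```

This only separates the broad families. It says nothing about which
member of the family is in use, which is where the language and locale
questions come in.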

There are automated programs available too: many encodings have a
distinct signature that can be detected this way. Some (such as the ISO
8859-n encodings) may not have distinct signatures, but if your documents
provide a large enough sample it is possible to use statistical
techniques to figure out the characters.
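
As an illustration of the signature idea, here is a small sketch that
looks for a Unicode byte order mark and, failing that, checks whether
the data contains non-ASCII bytes that form valid UTF-8. The function
name sniff_signature is hypothetical, and the sketch deliberately does
not attempt the statistical techniques needed for the ISO 8859-n
encodings.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32-LE byte order mark
# begins with the same two bytes as the UTF-16-LE one.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes) -> Optional[str]:
    """Return an encoding name if the bytes carry a clear signature."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    # Pure 7-bit data is ambiguous (ASCII, ISO 2022 and others), so
    # report no signature rather than guess.
    if data.isascii():
        return None
    # Non-ASCII bytes that happen to form valid UTF-8 are a strong hint.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

A result of None means "no signature found", not "unknown encoding":
at that point you are back to the three questions above or to a
statistical detector.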

This is of course a skill which you don't need if you are using XML: all
documents must be labelled with explicit information about the encoding
used. Authors and programmers are not used to the discipline of doing
this, but it is the only thing that can work reliably: guesswork isn't
good enough.
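
For XML, the label in question is the encoding declaration at the top
of the document. A rough sketch of reading it from raw bytes follows;
the name declared_encoding and the 200-byte window are assumptions, and
the sketch only handles ASCII-compatible encodings with no byte order
mark. A real XML parser does this job properly.

```python
import re
from typing import Optional

# Matches <?xml ... encoding="NAME" or encoding='NAME' at the very
# start of the data. [^>]* keeps the search inside the declaration.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, if any."""
    # The declaration must come first, so a short prefix is enough.
    match = _DECL.match(data[:200])
    if match is None:
        return None
    return match.group(1).decode("ascii")
```

For example, declared_encoding(b'<?xml version="1.0"
encoding="ISO-8859-1"?><a/>') gives "ISO-8859-1". When there is no
declaration and no byte order mark, XML 1.0 says the document is to be
read as UTF-8.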

> > Or which encoding is best to use when generating XML documents for
> > different locales?

> I am interested in this question too, as I need to advise clients and 
> users of the application I am developing about this.

It is prudent to limit yourself to international or national encodings:
stay away from encodings that are regional or vendor-specific (e.g.
Microsoft's "ANSI", Macintosh "MacRoman", or IBM's EBCDIC family). You
may find it useful in the short run to converge on UTF-8: many text
conversion programs can help with this, such as GNU iconv and IBM's
International Components for Unicode (ICU).
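
If a scripting language is closer to hand than iconv, the same
conversion is a few lines. This is a minimal sketch using Python's
built-in codecs; the function name to_utf8 is hypothetical, and the
source encoding has to be known (or declared) beforehand.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    # Strict decoding: invalid bytes raise UnicodeDecodeError instead
    # of being silently replaced, so a wrong guess is noticed early.
    text = data.decode(source_encoding)
    return text.encode("utf-8")
```

For instance, to_utf8(raw, "cp1252") or to_utf8(raw, "mac_roman")
handles the two vendor encodings mentioned above. If the data is an XML
document, remember to update or remove its encoding declaration
afterwards, since the old label would no longer be true.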

Rick Jelliffe



Copyright 2001 XML.org. This site is hosted by OASIS