xml-dev - Re: Identifying the encoding of a document

From: Rick JELLIFFE <ricko@geotempo.com>
To: xml-dev@xml.org
Date: Mon, 07 Aug 2000 19:21:19 +0800

Lisa Retief wrote:
> 
> > Do you mean you need to detect in some documents which encoding they
> > use?
> 
> Yes - I need to do this programmatically - sometimes the user has not
> specified an encoding and I would prefer not to default to something if I
> can figure it out.

If you can figure out three things about the text, it will help narrow
your choices:

 1) Do you know the language (really, the script) used in the documents
    by some external mechanism?  In particular, is it a Latin-based
    script or something else?

 2) Is the encoding ASCII-family, EBCDIC-family, or something exotic?
    A simple way to check is to open the document in a vanilla ASCII
    text editor: if the places where you expect ASCII characters
    (a-z, A-Z, 0-9, simple punctuation) come out readable, the
    encoding is ASCII-family; if they come out as garbage, it is not.

 3) What locale was it created in (or what locale were the tools)?
    E.g. was it made in Japan, or by or for Japanese?

When you know all three of these things, you will often have only one
or two main choices.

There are automated programs available too: many encodings have a
distinct signature that can be detected automatically. Some (such as
the ISO 8859-n encodings) may not have distinct signatures, but if your
documents provide a large enough sample it is possible to use
statistical techniques to figure out the characters.
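
As a rough illustration of the signature approach, here is a minimal
Python sketch. The function name sniff_encoding is made up for this
example, and it only recognises byte-order marks, pure ASCII, and
well-formed UTF-8. It makes no attempt at the statistical guessing you
would need to tell the ISO 8859-n encodings apart.

```python
import codecs

# Byte-order-mark signatures. The UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes):
    """Return a Python codec name if the bytes carry a clear signature,
    otherwise None (meaning: ask a human or fall back to statistics)."""
    if not data:
        return None
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Text in a legacy 8-bit encoding is rarely valid UTF-8 by accident.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

A None result is the interesting case: that is where the three
questions above, or a statistical detector, have to take over.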

This is of course a skill which you don't need if you are using XML:
every document that is not in UTF-8 or UTF-16 must be labelled with
explicit information about the encoding used. Authors and programmers
are not used to the discipline of doing this, but it is the only thing
that can work reliably: guesswork isn't good enough.
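
For what it is worth, reading that label back is straightforward when
the encoding is ASCII-compatible. The following Python sketch is only
an assumption-laden illustration: declared_encoding is a hypothetical
helper, it looks only at the XML declaration at the very start of the
document, and it will not work for UTF-16 or EBCDIC documents, where
the declaration itself is not stored as ASCII bytes.

```python
import re

# Matches an XML declaration at the start of the data and captures the
# value of its encoding pseudo-attribute.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if the
    document does not declare one."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip a UTF-8 byte-order mark
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else None
```

A result of None means the document carries no label of its own, in
which case an XML processor, absent external information, assumes
UTF-8 or UTF-16.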

> > Or which encoding is best to use when generating XML documents for
> > different locales?

> I am interested in this question too, as I need to advise clients and 
> users of the application I am developing about this.

It is prudent to limit yourself to international or national encodings:
stay away from encodings that are regional or vendor-specific (e.g.
Microsoft's "ANSI", Macintosh "MacRoman", or IBM's EBCDIC family). You
may find it useful in the short run to converge on UTF-8: there are many
text conversion programs that can help with this, such as GNU iconv and
IBM's International Components for Unicode (ICU).
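
If you would rather do the conversion inside your own application, the
idea can be sketched in a few lines of Python. The name to_utf8 is
invented for this example, and it assumes you already know the source
encoding, which is the whole point of the discussion above.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8.

    Decoding is strict, so bytes that are invalid in the source
    encoding raise UnicodeDecodeError instead of being silently
    replaced."""
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Example: the Latin-1 byte 0xE9 (e with acute) becomes two UTF-8 bytes.
converted = to_utf8(b"caf\xe9", "iso-8859-1")
```

The equivalent with GNU iconv would be something like
"iconv -f ISO-8859-1 -t UTF-8 input.xml > output.xml", where the file
names are placeholders. Either way, remember to update the encoding
declaration in the document to match.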

Rick Jelliffe

