Martin Olsson wrote:
> --- QUESTION 2
> XML files can use different character encodings including UNICODE and
> normal ascii text files. An XML parser must know what encoding is used
> before it starts to process the file, loading a UNICODE file is very
> different from loading a normal text file. The parser can obviously not
> first read the encoding attribute of the XML declaration which is the
> first line of the XML file and then load the file.
On the contrary, the XML declaration is entirely in ASCII except for a
possible byte order mark (BOM), so the processor can distinguish 8-bit
from 16-bit encodings by looking at the BOM and the bytes of "<?xml",
and then read the encoding declaration, knowing that it is in ASCII.
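
To make that concrete, here is a rough sketch in Python of the kind of
sniffing I mean. The function name and the 200-byte lookahead are my own
choices for illustration, not anything a particular parser uses. It only
covers the 8-bit vs. 16-bit cases discussed here and ignores UTF-32 and
EBCDIC, which a real processor would also have to consider.

```python
import re

# The encoding declaration itself is plain ASCII, so in an ASCII-compatible
# 8-bit encoding it can be matched directly on the raw bytes.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes.

    Simplified illustration: handles a BOM, BOM-less UTF-16, and
    ASCII-compatible 8-bit encodings only.
    """
    # 1. A byte order mark settles the question immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"

    # 2. No BOM: the bytes of "<?" reveal a 16-bit encoding and its byte order.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"

    # 3. Otherwise assume an ASCII-compatible 8-bit encoding and read the
    #    declared name out of the XML declaration.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")

    # 4. No BOM and no encoding declaration: XML defaults to UTF-8.
    return "utf-8"
```

For example, sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
returns "ISO-8859-1" without the parser having to try a list of encodings.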
> ... Should the XML parser use a brute force approach and try all of these?
The only problem would come if the actual encoding does not match the
declared encoding (I am leaving aside those cases where the processor
knows the encoding by some other means). The processor is not expected
to sort out such discrepancies.
XML is one of the few formats out there that can handle multiple
encodings and Unicode decently, and much of this is due to the XML
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)