UTF-8, BOM [Was: nextml]


From: Tony Graham <Tony.Graham@MenteithConsulting.com>
To: xml-dev@lists.xml.org
Date: Fri, 10 Dec 2010 08:37:09 +0000

On Thu, Dec 09 2010 05:56:24 +0000, liam@w3.org wrote:
> On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
...
>> One more mini-addition: would it be possible to have parsers ignore the 
>> BOM at the start of a UTF-8 file?  Some editors seem to insist on 
>> creating them, they are allowed by the UTF-8 spec, and probably ought to 
>> be considered external to the actual file content.  Also, maybe if we're 

The definition of the BOM/ZWNBSP, the role of the BOM with UTF-8, and the
prominence of UTF-8 in the Unicode Standard have all changed over time with
successive versions of the Unicode Standard [2].  The discussion of
detecting character encoding has also changed over time in successive
editions of XML 1.0.

You could revisit the handling of UTF-8 and the BOM on the basis that
much has changed since the first XML 1.0 spec.
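
To make the request concrete, here is a minimal Python sketch of what
"ignoring the BOM" could look like at the byte level.  The helper name
strip_utf8_bom is made up for illustration; it is not part of any XML
parser's API, and real parsers may handle this differently.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b'<?xml version="1.0"?><doc/>'

# A plain UTF-8 decode keeps the BOM as the character U+FEFF.
print(repr(raw.decode("utf-8")[0]))

# Python's "utf-8-sig" codec drops a leading BOM while decoding.
print(raw.decode("utf-8-sig")[:5])

# The byte-level helper does the same before any decoding happens.
print(strip_utf8_bom(raw)[:5])
```

Stripping at the byte level keeps the decision separate from decoding,
which matters if the bytes are handed to a parser that does its own
encoding detection.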

>> going to allow multiple root elements we could also allow whitespace in 
>> the prolog?   People often put it there, and it seems like something 
>> that could be tolerated easily enough.
>
> I have always felt it was a bug in the XML spec that the XML declaration
> becomes a regular processing instruction if there's a blank line in
> front of it.

Requiring the XML declaration at the very start of the file makes it
usable as a file signature for the OS.  (If "<?xml" seems a bit much,
try EPUB, where you have to read the first 50+ bytes of a Zip archive
file [1].)
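
As a rough illustration of signature sniffing, the Python sketch below
checks the first bytes of a file for "<?xml" and for the EPUB layout
(a Zip archive whose first entry is an uncompressed "mimetype" file
containing "application/epub+zip").  The function names sniff and
make_minimal_epub_bytes are hypothetical, and the XML check assumes an
ASCII-compatible encoding with no BOM.

```python
import io
import struct
import zipfile

def sniff(head: bytes) -> str:
    """Guess a file type from its leading bytes (illustrative sketch only)."""
    if head.startswith(b"<?xml"):
        return "xml"
    # A Zip local file header is 30 bytes and starts with "PK\x03\x04".
    if head[:4] == b"PK\x03\x04" and len(head) >= 30:
        # File name length and extra field length sit at offsets 26 and 28.
        name_len, extra_len = struct.unpack_from("<HH", head, 26)
        name = head[30:30 + name_len]
        start = 30 + name_len + extra_len
        if name == b"mimetype" and head[start:start + 20] == b"application/epub+zip":
            return "epub"
        return "zip"
    return "unknown"

def make_minimal_epub_bytes() -> bytes:
    """Build an in-memory Zip whose first entry is a stored 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("mimetype", "application/epub+zip")
    return buf.getvalue()

print(sniff(b'<?xml version="1.0"?><doc/>'))    # declaration at byte 0
print(sniff(b'\n<?xml version="1.0"?><doc/>'))  # blank line first: no signature
print(sniff(make_minimal_epub_bytes()[:64]))    # needs 58 bytes of the archive
```

The second call shows the point of the rule: once anything precedes
"<?xml", a simple prefix check no longer recognises the file.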

...
>> On restriction to UTF-8 (16 if we insist, but really do folks store 
>> *files* as UTF-16?)
>
> Yes. Frequently.
>
>> : is this really a problem for non-western 
>> languages?
>
> If you manufacture memory and hard drives, then utf-8 is truly
> delightful in countries where most characters will be 3 or more
> bytes/octets in length in utf-8.

Liam's roundabout way of saying YMMV.
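
To put rough numbers on the storage point, this small Python sketch
compares encoded sizes for an ASCII string and a Japanese one.  The
sample strings and the helper name encoded_sizes are mine, chosen only
for illustration; sizes for real documents depend on how much ASCII
markup they contain.

```python
def encoded_sizes(text: str) -> dict:
    """Return the character count and the UTF-8 and UTF-16 byte counts.

    "utf-16-le" is used so that no BOM is added to the byte count.
    """
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

print(encoded_sizes("hello"))   # ASCII: 1 byte per character in UTF-8
print(encoded_sizes("日本語"))  # CJK: 3 bytes per character in UTF-8
```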

> It's also a common misconception that Unicode is a 16-bit character set;
> it defines more than 65536 characters, and "surrogate pairs" in
> languages like Java make utf16 as complex as utf8; processing characters

UTF-8 is probably easier, since it has no surrogate pairs.

> in either utf-8 or ucs-32 are the most common choices outside the Java
> world as far as I can tell.
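
For anyone who has not met surrogate pairs, here is a short Python
sketch showing how one character outside the Basic Multilingual Plane
is represented in each encoding form.  U+1D11E (MUSICAL SYMBOL G CLEF)
is just an example I picked, and utf16_code_units is a hypothetical
helper name.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

clef = "\U0001D11E"  # one character, outside the BMP

# UTF-16 needs two code units (a surrogate pair) for this character.
print([hex(u) for u in utf16_code_units(clef)])

# UTF-8 uses a single four-byte sequence and has no surrogates.
print(len(clef.encode("utf-8")))

# UTF-32 uses exactly one 32-bit code unit per character.
print(len(clef.encode("utf-32-be")) // 4)
```

Either way the code has to cope with variable-length sequences; the
difference is that in UTF-16 the multi-unit case is rare enough to be
forgotten.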

Regards,


Tony Graham                         Tony.Graham@MenteithConsulting.com
Director                                  W3C XSL FO SG Invited Expert
Menteith Consulting Ltd                               XML Guild member
XML, XSL and XSLT consulting, programming and training
Registered Office: 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
Registered in Ireland - No. 428599   http://www.menteithconsulting.com
  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
xmlroff XSL Formatter                               http://xmlroff.org
xslide Emacs mode                  http://www.menteith.com/wiki/xslide
Unicode: A Primer                               urn:isbn:0-7645-4625-2


[1] Section 4 in http://www.idpf.org/ocf/ocf1.0/download/ocf10.htm
[2] http://inasmuch.as/2007/10/03/bom-in-utf-8-good-bad-or-ugly/

References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: [xml-dev] nextml
  - From: Michael Sokolov <sokolov@ifactory.com>
- Re: [xml-dev] nextml
  - From: Liam R E Quin <liam@w3.org>
