UTF-8, BOM [Was: nextml]
- From: Tony Graham <Tony.Graham@MenteithConsulting.com>
- To: xml-dev@lists.xml.org
- Date: Fri, 10 Dec 2010 08:37:09 +0000
On Thu, Dec 09 2010 05:56:24 +0000, liam@w3.org wrote:
> On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
...
>> One more mini-addition: would it be possible to have parsers ignore the
>> BOM at the start of a UTF-8 file? Some editors seem to insist on
>> creating them, they are allowed by the UTF-8 spec, and probably ought to
>> be considered external to the actual file content. Also, maybe if we're
The definition of the BOM/ZWNBSP, the role of the BOM with UTF-8, and the
prominence of UTF-8 in the Unicode Standard have all changed over time with
successive versions of the Unicode Standard [2]. The discussion of
detecting character encoding has also changed over time in successive
editions of XML 1.0.
You could justify reviewing how XML treats UTF-8 and the BOM on the basis
that much has changed since the first XML 1.0 spec.
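As a minimal sketch of what "ignoring the BOM" could look like at the byte
level (Python; the function name strip_utf8_bom is made up for illustration
and is not taken from any XML parser or from the XML spec):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical document as some editors would save it: BOM first, then XML.
doc = codecs.BOM_UTF8 + b'<?xml version="1.0"?><doc/>'
print(strip_utf8_bom(doc)[:5])  # b'<?xml'
```

Python's "utf-8-sig" codec does the same thing while decoding, so
doc.decode("utf-8-sig") yields text that starts with "<?xml" rather than
with U+FEFF.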
>> going to allow multiple root elements we could also allow whitespace in
>> the prolog? People often put it there, and it seems like something
>> that could be tolerated easily enough.
>
> I have always felt it was a bug in the XML spec that the XML declaration
> becomes a regular processing instruction if there's a blank line in
> front of it.
Requiring the XML declaration at the very start of the file makes it
usable as a file signature for the OS. (If "<?xml" seems a bit much, try
EPUB, where you have to read the first 50+ bytes of a Zip archive file [1].)
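A rough sketch of that kind of signature check, assuming an ASCII-compatible
encoding such as UTF-8 (Python; starts_with_xml_declaration is a made-up
name, and tolerating a UTF-8 BOM in front of the signature is this sketch's
own choice, not something a particular OS does):

```python
import codecs

XML_SIGNATURE = b"<?xml"
XML_WHITESPACE = (b" ", b"\t", b"\r", b"\n")


def starts_with_xml_declaration(data: bytes) -> bool:
    """True if data begins with '<?xml' followed by whitespace."""
    # Assumption: a UTF-8 BOM before the signature is skipped.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # The byte after '<?xml' must be whitespace, so that
    # '<?xml-stylesheet ...?>' is not mistaken for the declaration.
    return data.startswith(XML_SIGNATURE) and data[5:6] in XML_WHITESPACE


print(starts_with_xml_declaration(b'<?xml version="1.0"?><doc/>'))    # True
print(starts_with_xml_declaration(b'\n<?xml version="1.0"?><doc/>'))  # False
```

The second call shows the case Liam describes: with a blank line in front,
the bytes no longer match the signature.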
...
>> On restriction to UTF-8 (16 if we insist, but really do folks store
>> *files* as UTF-16?)
>
> Yes. Frequently.
>
>> : is this really a problem for non-western
>> languages?
>
> If you manufacture memory and hard drives, then utf-8 is truly
> delightful in countries where most characters will be 3 or more
> bytes/octets in length in utf-8.
Liam's roundabout way of saying YMMV.
> It's also a common misconception that Unicode is a 16-bit character set;
> it defines more than 65536 characters, and "surrogate pairs" in
> languages like Java make utf16 as complex as utf8; processing characters
Easier, probably, since you don't have surrogate pairs in UTF-8.
> in either utf-8 or ucs-32 are the most common choices outside the Java
> world as far as I can tell.
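To put rough numbers on the storage and surrogate-pair points above, here is
a small sketch (Python; encoded_sizes is a made-up helper, and the three
sample characters are arbitrary choices for illustration):

```python
def encoded_sizes(text: str) -> dict:
    """Count code points, UTF-8 bytes, and UTF-16 bytes and code units."""
    utf16 = text.encode("utf-16-le")  # "-le" variant, so no BOM is added
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(utf16),
        "utf16_units": len(utf16) // 2,
    }


samples = (
    "A",           # U+0041: 1 byte in UTF-8, 2 bytes in UTF-16
    "\u4e2d",      # U+4E2D (CJK): 3 bytes in UTF-8, 2 bytes in UTF-16
    "\U0001D11E",  # U+1D11E (outside the BMP): 4 bytes in both encodings
)
for sample in samples:
    print(repr(sample), encoded_sizes(sample))
```

The last sample is a single code point that needs two UTF-16 code units,
that is, a surrogate pair.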
Regards,
Tony Graham                         Tony.Graham@MenteithConsulting.com
Director                                  W3C XSL FO SG Invited Expert
Menteith Consulting Ltd                               XML Guild member
XML, XSL and XSLT consulting, programming and training
Registered Office: 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
Registered in Ireland - No. 428599   http://www.menteithconsulting.com
  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
xmlroff XSL Formatter                               http://xmlroff.org
xslide Emacs mode                  http://www.menteith.com/wiki/xslide
Unicode: A Primer                               urn:isbn:0-7645-4625-2
[1] Section 4 in http://www.idpf.org/ocf/ocf1.0/download/ocf10.htm
[3] http://inasmuch.as/2007/10/03/bom-in-utf-8-good-bad-or-ugly/