Re: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

From: Chris Maloney <voldrani@gmail.com>
To: David Lee <dlee@calldei.com>
Date: Sat, 29 Dec 2012 22:55:45 -0500

When a BOM occurs inside a file or stream of text, it's supposed to be
treated as a zero-width non-breaking space; i.e., a "no-op" character.
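
To make that concrete, here is a minimal Python sketch (an illustration
only; the sample bytes are made up) of what happens when two BOM-prefixed
UTF-8 streams are joined byte-for-byte.  It relies on the standard
"utf-8" and "utf-8-sig" codecs: the first keeps every U+FEFF as a
character, the second drops only one at the very start of the input.

```python
import codecs

# Two "files" that each start with a UTF-8 BOM, joined byte-for-byte.
data = codecs.BOM_UTF8 + b"first\n" + codecs.BOM_UTF8 + b"second\n"

# The plain "utf-8" codec keeps every U+FEFF as an ordinary character.
plain = data.decode("utf-8")

# "utf-8-sig" drops only a BOM at the very start of the input.
sig = data.decode("utf-8-sig")

bom_count_plain = plain.count("\ufeff")  # leading BOM plus the interior one
bom_count_sig = sig.count("\ufeff")      # only the interior one remains
```

Either way the interior U+FEFF survives decoding, so whatever consumes
the text still has to decide how to treat it.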

But you're right that this is another compatibility gotcha.  The
most recent time I was bitten by BOMs was when using a JavaScript
minifier that first concatenated a bunch of JS files together; it was
not happy about the BOMs that ended up in the middle of that stream
of code.  At the time, I think I found a place in the ECMAScript
standard that suggested the BOM was legal and should be considered
whitespace, but I just looked again and can't find it.
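
One workaround for that kind of concatenation step, sketched in Python
under my own assumptions (the function name concat_sources and the
sample inputs are hypothetical, and this is not how any particular
minifier does it), is to strip a leading UTF-8 BOM from each input
before joining:

```python
import codecs

def concat_sources(chunks):
    """Join byte chunks, dropping a leading UTF-8 BOM from each one."""
    parts = []
    for chunk in chunks:
        if chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        parts.append(chunk)
    # A newline between inputs keeps the last token of one file from
    # running into the first token of the next.
    return b"\n".join(parts)

# Hypothetical inputs standing in for two BOM-prefixed JS files.
files = [codecs.BOM_UTF8 + b"var a = 1;", codecs.BOM_UTF8 + b"var b = 2;"]
combined = concat_sources(files)
```

The result contains no BOM at all; a tool that wants one on its output
would have to add it back once, at the front.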

Still, it seems to me that for multilingual text-based documents
(Markdown, to take one example), the benefits of using a BOM are
significant in most cases.
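
As a rough illustration of that benefit, here is a small Python sketch
of how a reader could use a leading BOM to pick a decoder.  The name
sniff_encoding and the UTF-8 fallback are my own assumptions, not taken
from any particular tool; the codec names are the standard Python ones,
each of which consumes the BOM while decoding.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data, default="utf-8"):
    """Return a codec name based on a leading BOM, else the default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: the caller is back to guessing or out-of-band information.
    return default

sample = codecs.BOM_UTF8 + "déjà vu".encode("utf-8")
text = sample.decode(sniff_encoding(sample))
```

Without the BOM, the function can only fall back to a default, which is
exactly the ambiguity a BOM is meant to remove.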


On Sat, Dec 29, 2012 at 9:53 PM, David Lee <dlee@calldei.com> wrote:
> I'm curious ...
> Considering that UTF16 is a dangerous file format,  (I agree it is ... )
> For people who use languages which have predominantly non-latin codepoints ...
> Is UTF8 actually worse than UTF32  - file size wise ?
> And does it matter much ?
>
> When Java was introduced with 16 bit chars I remember the huge debate about how wasteful that was ... but now rarely hear it,
> (except that handling > 16 bit codepoint chars is still difficult).
>
> What about UTF8 vs UTF32 ?
>
> There definitely is an advantage to a fixed byte-per-char format ... But if someone had the Iron Fist to Declare "Thou Shalt Use ..."
> Would UTF8 be that bad ?   Consider that very often when filesize is an issue compression is used ... so the "raw" file size is not nearly as important as it used to be.
>
> As for BOMs ... I personally am not fond of them.   On first glance they seem great ... like the "File Types" of Yore ... (which thank goodness Unix got rid of ...)
>
> But the problem with BOMs IMHO, like file types,  ... is that they assume that you are dealing with files, and/or that all sequences of bytes have a known start ... aka "The Beginning",  where you would put a BOM.   I suggest that is a historical oddity, and/or too small a subset of real use for it to be practical to count on.  What about, say, blob records in a database ? Streams of data with no beginning or end ?
> I don't think any convention that requires you to have read "the Beginning" will consistently work with text ...
> XML suffers with this assumption as well with the XML declaration declaring the encoding.
> That only works when you have an entire document to look at.    Until we can come up with a universal encoding format we have to suffer with out-of-band information to inform a decoder.
>
>
> -David
>
>
>
> -----Original Message-----
> From: Chris Maloney [mailto:voldrani@gmail.com]
> Sent: Saturday, December 29, 2012 9:27 PM
> To: Costello, Roger L.
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
>
> Roger wrote:
>
>> I would advocate using UTF-8 exclusively
>
> That's what I do with my own files, and what I advocate whenever I have any input to design decisions, but as Liam and others have said, it's not practical to expect everyone to adopt this convention.
>
> What I really want to know is, when can we start freely using BOMs in UTF-8?  I really like this idea, because it is a simple, easy way for a text file to "declare" that it is in UTF-8, and eliminate the ambiguity when the text files are passed around.  Unfortunately, a lot of software, especially on Linux, still chokes on these.
>
> On a slightly different topic (UTF-16), this discussion reminded me of something else I read a while back, a technical note from the Unicode Consortium advocating the use of UTF-16 for internal processing (as opposed to file interchange):
> http://unicode.org/notes/tn12/tn12-1.html.  On the other hand, I just found from a Google search this recent thread on StackExchange, where several people argue that UTF-16 should be considered harmful:
> http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful.
> I guess the debate will rage on, but interoperability, on the whole, does seem to be getting better.
>
> Chris
>
>
>
>
> On Sat, Dec 29, 2012 at 2:36 PM, Costello, Roger L. <costello@mitre.org> wrote:
>> Hi Folks,
>>
>> I spoke with George Cristian Bina from oXygen XML and he gave me the scoop on how things work inside oXygen.
>>
>> George told me to do this:
>>
>> 1. Create an iso-8859-1 encoded XML file.
>>
>> 2. Using a hex editor, change encoding="iso-8859-1" to encoding="utf-8"
>>
>> 3. Drag and drop the file into oXygen.
>>
>> 4. oXygen will generate an encoding exception:
>>
>>     Cannot open the specified file. Got a character
>>     encoding exception [snip]
>>
>> Next, here is something George told me. It is mind-blowing:
>>
>>     If you have an iso-8859-1 encoded XML file loaded into oXygen
>>     and change encoding="iso-8859-1" to encoding="utf-8" then
>>     oXygen will automatically change the encoding of every character
>>     in the document to UTF-8.
>>
>> Wow!
>>
>> That is so fantastic, I jumped out of my chair when I read it.
>>
>> I just received this additional information from George:
>>
>>     Please note that the encoding is important only when the file is loaded
>>     and saved. When the file is loaded the bytes are converted to characters
>>     and then the application works only with characters. When the file is
>>     saved then those characters need to be converted to bytes and the
>>     encoding used will be determined from the XML header with a default to
>>     UTF-8 if no encoding can be detected.
>>
>> /Roger
>>
>> _______________________________________________________________________
>>
>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS to
>> support XML implementation and development. To minimize spam in the
>> archives, you must subscribe before posting.
>>
>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>> subscribe: xml-dev-subscribe@lists.xml.org List archive:
>> http://lists.xml.org/archives/xml-dev/
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>>
>

