Re: [xml-dev] nextml

From: Liam R E Quin <liam@w3.org>
To: Michael Sokolov <sokolov@ifactory.com>
Date: Thu, 09 Dec 2010 00:56:24 -0500

On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
> I want to be able to record the 
> position of elements as byte offsets in an original source file and use 
> those to extract well-formed fragments as text 

You can only do this if you check that no-one else has touched the file
since you last made your index.  The "XML Promise" (as I call it) is that
any XML tool is licensed to process any XML document.
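
A rough sketch of that check, in Python. The helper names build_index and
extract are made up for illustration and do not come from any existing
tool; the idea is simply to store a digest of the file alongside the byte
offsets and refuse to use the offsets once the digest no longer matches.

```python
import hashlib


def build_index(data: bytes, spans: dict) -> dict:
    """Store byte offsets together with a digest of the bytes they refer to."""
    return {"sha256": hashlib.sha256(data).hexdigest(), "spans": dict(spans)}


def extract(data: bytes, index: dict, name: str) -> bytes:
    """Return the indexed fragment, or fail if the file changed since indexing."""
    if hashlib.sha256(data).hexdigest() != index["sha256"]:
        raise ValueError("file changed since it was indexed; offsets are stale")
    start, end = index["spans"][name]
    return data[start:end]


original = b"<doc>\r\n<p>one</p>\r\n<p>two</p>\r\n</doc>"
fragment = b"<p>two</p>"
start = original.index(fragment)
index = build_index(original, {"second": (start, start + len(fragment))})

print(extract(original, index, "second"))

# Another tool rewrites the line endings: still the same XML document,
# but every offset after the first line is now wrong.
rewritten = original.replace(b"\r\n", b"\n")
try:
    extract(rewritten, index, "second")
except ValueError as err:
    print("refused:", err)
```

A modification time would be cheaper to compare than a digest, but it is
easier to fool.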

> (think extracting 
> snippets and highlighting in search results).  This can't be done 
> reliably in a SAX or StaX handler if the parser alters text in a 
> non-reversible manner: you can make a guess if you know what the 
> original line endings were, but if they're mixed all bets are off.  
> Currently one has to use HTML parsers for this.

HTML parsers also normalise line endings though, no? Both HTML and XML
inherit some of this from SGML.
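
On the XML side the normalisation is easy to observe. A small sketch using
Python's expat-based ElementTree as one example parser (the sample
document is made up): all three line-ending conventions come out as a
single line feed, so the original bytes cannot be recovered from what the
parser reports.

```python
import xml.etree.ElementTree as ET

# One element containing CRLF, a bare CR and a bare LF.
source = b"<p>one\r\ntwo\rthree\nfour</p>"
text = ET.fromstring(source).text

print(repr(text))
print(len(source) - len(b"<p></p>"), "content bytes in,", len(text), "characters out")
```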

> One more mini-addition: would it be possible to have parsers ignore the 
> BOM at the start of a UTF-8 file?  Some editors seem to insist on 
> creating them, they are allowed by the UTF-8 spec, and probably ought to 
> be considered external to the actual file content.  Also, maybe if we're 
> going to allow multiple root elements we could also allow whitespace in 
> the prolog?   People often put it there, and it seems like something 
> that could be tolerated easily enough.

I have always felt it was a bug in the XML spec that the XML declaration
becomes a regular processing instruction if there's a blank line in
front of it.
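
For what it's worth, here is a small check of how one parser treats such
input, again using Python's expat-based ElementTree purely as an example;
other parsers may behave differently. The helper name parses is
hypothetical. This particular parser rejects the document with the
leading blank line outright.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the example parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


clean = b'<?xml version="1.0"?>\n<doc/>'
blank_first = b'\n<?xml version="1.0"?>\n<doc/>'

print("declaration first:", parses(clean))
print("blank line first: ", parses(blank_first))
```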

> Yeah, I disagree about entities (and therefore DTDs).  Let me try to 
> explain why, briefly, and then I promise to stop whining about it. The 
> problem w/DTDs (and entity decls defined in them) as I see it is they 
> introduce a dependence on an external file.

They don't have to - you can put everything in one file.

>  If entities were defined by 
> the standard (and built in to parsers), or were required to be defined 
> inline, that would remove my objections.

You can't really define &productName; in the XML spec to be, say,
"Internet Explorer 12.1" :-), and the Unicode long character names are
all in English, which is obviously not OK.  The ISO SGML entities are
insane.  You're right that a goal of XInclude was to reduce the need for
entities; there are still places where they're used and XInclude can't
be, e.g.
    href="&server;&docroot;intro/chapter6.xml"
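
A minimal sketch of that case with everything in one file, using Python's
ElementTree as the example parser. The entity values and the element names
book and link are invented for illustration; only the href pattern comes
from the example above.

```python
import xml.etree.ElementTree as ET

# Entities declared in the internal DTD subset, so no external file is needed.
doc = b"""<!DOCTYPE book [
  <!ENTITY server "http://example.org/">
  <!ENTITY docroot "docs/">
]>
<book><link href="&server;&docroot;intro/chapter6.xml"/></book>"""

root = ET.fromstring(doc)
href = root.find("link").get("href")
print(href)
```

The parser expands both entities inside the attribute value, which is the
part XInclude has no equivalent for.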


> On restriction to UTF-8 (16 if we insist, but really do folks store 
> *files* as UTF-16?)

Yes. Frequently.

> : is this really a problem for non-western 
> languages?

If you manufacture memory and hard drives, then utf-8 is truly
delightful in countries where most characters will be 3 or more
bytes/octets in length in utf-8.
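
To put rough numbers on that, a short Python sketch comparing encoded
sizes. The sample strings are arbitrary, and encoded_sizes is a
hypothetical helper name.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, ignoring any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {"English": "hello", "Japanese": "こんにちは"}
for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {len(text)} characters, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

Each of the five Japanese characters takes three bytes in UTF-8 and two in
UTF-16, while the ASCII text goes the other way. Real XML documents also
contain ASCII markup, so the overall ratio depends on the mix.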

It's also a common misconception that Unicode is a 16-bit character set;
it defines more than 65536 characters, and "surrogate pairs" in
languages like Java make utf-16 as complex as utf-8; processing characters
in either utf-8 or utf-32 (ucs-4) is the most common choice outside the
Java world as far as I can tell.
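
A small Python sketch of the surrogate-pair point, using U+1D11E (MUSICAL
SYMBOL G CLEF) as an arbitrary example of a character beyond the first
65536. The helper name utf16_units is hypothetical.

```python
import struct


def utf16_units(ch: str) -> tuple:
    """Return the 16-bit code units used to encode ch in UTF-16."""
    raw = ch.encode("utf-16-le")
    return struct.unpack(f"<{len(raw) // 2}H", raw)


clef = "\U0001D11E"
units = utf16_units(clef)

print("code point:  ", hex(ord(clef)))
print("UTF-16 units:", [hex(u) for u in units])
print("UTF-8 bytes: ", len(clef.encode("utf-8")))
print("UTF-32 bytes:", len(clef.encode("utf-32-le")))
```

One character takes two 16-bit units in UTF-16, so code that counts or
indexes by unit has the same variable-width problem that UTF-8 has.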

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org

Follow-Ups:
- UTF-8, BOM [Was: nextml]
  - From: Tony Graham <Tony.Graham@MenteithConsulting.com>
- Re: [xml-dev] nextml
  - From: James Clark <jjc@jclark.com>

References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: [xml-dev] nextml
  - From: Michael Sokolov <sokolov@ifactory.com>