Re: [xml-dev] nextml
- From: Liam R E Quin <liam@w3.org>
- To: Michael Sokolov <sokolov@ifactory.com>
- Date: Thu, 09 Dec 2010 00:56:24 -0500
On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
> I want to be able to record the
> position of elements as byte offsets in an original source file and use
> those to extract well-formed fragments as text
You can only do this if you check no-one else touched the file since you
last made your index. The "XML Promise" (as I call it) is that any XML
tool is licensed to process any XML document.
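One way to make that check concrete is to store a digest of the file next to the offset index and refuse to slice if it no longer matches. This is only a sketch: the names `file_digest` and `extract_fragment` are made up for illustration, and SHA-256 is an assumption (any reliable change detector would do).

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def extract_fragment(path, start, end, digest_at_index_time):
    """Return bytes [start, end) only if the file is unchanged since indexing."""
    if file_digest(path) != digest_at_index_time:
        raise ValueError("source file changed since it was indexed")
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

The offsets are only meaningful against the exact bytes that were indexed, so the digest has to be recorded at the same moment the index is built.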
> (think extracting
> snippets and highlighting in search results). This can't be done
> reliably in a SAX or StaX handler if the parser alters text in a
> non-reversible manner: you can make a guess if you know what the
> original line endings were, but if they're mixed all bets are off.
> Currently one has to use HTML parsers for this.
HTML parsers also normalise line endings though, no? Both HTML and XML
inherit some of this from SGML.
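To illustrate why normalised text can't be mapped back to source offsets, here is a small sketch. The function `normalise_line_endings` is a hypothetical stand-in for what a parser does internally (XML 1.0 section 2.11 turns CRLF and lone CR into LF); it is not any parser's actual API.

```python
def normalise_line_endings(text):
    """Mimic XML end-of-line handling: CRLF and lone CR both become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Two different source files: one with mixed line endings, one with LF only.
mixed = "<a>\r\n<b/>\n<c/>\r</a>"
unix = "<a>\n<b/>\n<c/>\n</a>"

# After normalisation the application sees identical text...
same_text = normalise_line_endings(mixed) == normalise_line_endings(unix)

# ...but <c/> sat at different offsets in the two originals.
offset_in_mixed = mixed.index("<c/>")
offset_in_unix = unix.index("<c/>")
```

Since both inputs collapse to the same string, nothing downstream of the parser can tell which original offset was the right one.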
> One more mini-addition: would it be possible to have parsers ignore the
> BOM at the start of a UTF-8 file? Some editors seem to insist on
> creating them, they are allowed by the UTF-8 spec, and probably ought to
> be considered external to the actual file content. Also, maybe if we're
> going to allow multiple root elements we could also allow whitespace in
> the prolog? People often put it there, and it seems like something
> that could be tolerated easily enough.
I have always felt it was a bug in the XML spec that the XML declaration
becomes a regular processing instruction if there's a blank line in
front of it.
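Until parsers are more forgiving, the tolerance asked for above can be approximated by cleaning the byte stream before parsing. A rough sketch, assuming UTF-8 input; `strip_leading_noise` is a made-up helper name, and of course stripping bytes shifts any byte offsets recorded against the original file.

```python
import codecs
import xml.etree.ElementTree as ET


def strip_leading_noise(data):
    """Drop a UTF-8 BOM and any whitespace before the first markup."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.lstrip(b" \t\r\n")


raw = codecs.BOM_UTF8 + b"\n\n<?xml version='1.0' encoding='UTF-8'?>\n<a/>"
root = ET.fromstring(strip_leading_noise(raw))
```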
> Yeah, I disagree about entities (and therefore DTDs). Let me try to
> explain why, briefly, and then I promise to stop whining about it. The
> problem w/DTDs (and entity decls defined in them) as I see it is they
> introduce a dependence on an external file.
They don't have to - you can put everything in one file.
> If entities were defined by
> the standard (and built in to parsers), or were required to be defined
> inline, that would remove my objections.
You can't really define &productName; in the XML spec to be, say,
"Internet Explorer 12.1" :-), and the Unicode long character names are
all in English, which is obviously not OK. The ISO SGML entities are
insane. You're right that a goal of XInclude was to reduce the need for
entities; there are still places where they're used and XInclude can't
be, e.g.
href="&server;&docroot;intro/chapter6.xml"
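As a sketch of both points (entities declared inline in the same file, and used inside an attribute value), the following parses a document with an internal DTD subset using Python's standard library. The entity values and the `doc`/`link` element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Everything lives in one file: the entity declarations sit in the
# internal subset of the DOCTYPE, with no external DTD involved.
doc = """<!DOCTYPE doc [
  <!ENTITY server "http://example.org/">
  <!ENTITY docroot "docs/">
]>
<doc><link href="&server;&docroot;intro/chapter6.xml"/></doc>"""

root = ET.fromstring(doc)
href = root.find("link").get("href")
```

Here `href` comes back with both entity references already expanded by the parser, which is the part XInclude has no equivalent for.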
> On restriction to UTF-8 (16 if we insist, but really do folks store
> *files* as UTF-16?)
Yes. Frequently.
> : is this really a problem for non-western
> languages?
If you manufacture memory and hard drives, then utf-8 is truly
delightful in countries where most characters will be 3 or more
bytes/octets in length in utf-8.
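The size difference is easy to measure. A quick sketch, using an arbitrary Japanese sample string chosen for illustration (all of its characters are in the Basic Multilingual Plane):

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


sample = "日本語のテキスト"  # 8 characters
sizes = encoded_sizes(sample)
```

For this sample UTF-8 needs 3 bytes per character against 2 for UTF-16, so the stored text is half as large again.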
It's also a common misconception that Unicode is a 16-bit character set;
it defines more than 65536 characters, and "surrogate pairs" in
languages like Java make UTF-16 as complex as UTF-8; processing characters
in either UTF-8 or UTF-32 (UCS-4) is the most common choice outside the Java
world as far as I can tell.
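A small sketch of the surrogate-pair point, using U+1D11E (MUSICAL SYMBOL G CLEF) as an example of a character beyond the first 65536; `utf16_code_units` is a hypothetical helper, not a library function.

```python
import struct


def utf16_code_units(ch):
    """Return the 16-bit code units used to encode ch in UTF-16."""
    raw = ch.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


clef = "\U0001D11E"
units = utf16_code_units(clef)          # two units: a surrogate pair
utf8_len = len(clef.encode("utf-8"))    # bytes in UTF-8
utf32_len = len(clef.encode("utf-32-be"))  # bytes in UTF-32
```

So UTF-16 is a variable-width encoding too: code that counts 16-bit units instead of characters gets this one wrong, just as byte-counting does in UTF-8.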
Liam
--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org