- From: Chris Lilley <chris@w3.org>
- To: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
- Date: Tue, 06 Apr 1999 14:44:04 +0200
MURATA Makoto wrote:
>
> Chris Lilley wrote:
> > The vast majority of content authors have *no control whatsoever* on
> > server configuration. This isn't 1993; assuming that the person who
> > wrote the content is also the person who administers the server is
> > totally unwarranted.
>
> To overcome this problem, Uchida-san is proposing a convention for WWW server
> configurations. His proposal is already used by some ISPs in Japan. It is
> available at:
>
> http://www.asahi-net.or.jp/~sd5a-ucd/docs/suffix_guideline_981106.txt
It's good to see a concrete proposal. On the other hand, relying on a
complex convention of filename suffixes is problematic:
- it either requires content negotiation to be enabled (something not
all servers can do) or it results in a multiplicity of URIs for the same
resource.
- it requires all content authoring applications to know about it and to
offer to save using this naming convention
- it duplicates (and may contradict) the XML encoding declaration
- the information stored in this way may be lost when saving local
copies to systems which do not allow double dots in filenames or which
have other restrictions. The XML encoding declaration, on the other
hand, is much more robust in the face of the multiplicity of file
systems in use.
The second condition is made harder because two alternative syntaxes are
proposed in this note - so a content author has to know which
convention is used on a particular server. Also, the note says that
these are only two of many possibilities.
An alternative method for achieving the same result is to use a filter
(this can be done in Apache and in Jigsaw) which automatically emits the
correct charset parameter based on reading the encoding declaration in
the XML instance. Such a filter can easily cache its results, so it need
not incur processing overhead on each request.
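As a rough illustration of the filter idea, here is a minimal Python sketch. It is not the Apache or Jigsaw mechanism; the function names `declared_encoding` and `content_type_for`, the 256-byte read, and the cache keyed on modification time are all assumptions made for the example. It also only recognises declarations written in an ASCII-compatible encoding, plus a UTF-16 byte order mark.

```python
import os
import re

# Hypothetical helper names; a sketch of the idea, not a server module.
# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

_cache = {}  # path -> (mtime, Content-Type value)


def declared_encoding(head: bytes) -> str:
    """Return the encoding implied by the first bytes of an XML entity."""
    if head.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'UTF-16'          # byte order mark present
    if head.startswith(b'\xef\xbb\xbf'):
        head = head[3:]          # skip a UTF-8 byte order mark
    m = _DECL.match(head)
    # With no declaration and no UTF-16 BOM, the XML Rec default is UTF-8.
    return m.group(1).decode('ascii') if m else 'UTF-8'


def content_type_for(path: str) -> str:
    """Derive the Content-Type header value, re-reading only if the file changed."""
    mtime = os.stat(path).st_mtime
    hit = _cache.get(path)
    if hit and hit[0] == mtime:
        return hit[1]
    with open(path, 'rb') as f:
        head = f.read(256)       # assumed to be enough to cover the declaration
    value = f'text/xml; charset={declared_encoding(head)}'
    _cache[path] = (mtime, value)
    return value
```

Because the result is stored against the file's modification time, the file is opened again only after it has been edited; otherwise each request costs a single stat call.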
Of course, this still requires work - for example, to ensure that it is
included in the standard Apache distribution; but it is easier than
trying to get the hundreds of authoring tools to support a couple of
naming conventions which may in any case be hard to deal with on some
platforms (platforms are still in use which have trouble with .html, for
example ;-)
> Chris Lilley wrote:
> >
> > But not necessarily everyone's favourite. It is a good choice for
> > Japanese, because Kanji use fewer bytes per character in UTF-16 than in
> > UTF-8.
> >
> > > (In the case that the charset is broken, autodetection of
> > > UTF-16 is very easy.
> >
> > But autodetection should not be required; users can label their
> > documents correctly.
>
> To me, the biggest advantage of UTF-16 is that UTF-16 XML documents can parse
> only as UTF-16. Even if the charset parameter is incorrect, UTF-16 XML documents
> do not parse incorrectly (and error recovery is very reliable).
I am wary of relying on error recovery. If it doesn't work well, then
there is reduced interoperability because of variation; if it does work
well, or seems to work well in some cases, then people just use it all
the time.
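For reference, the UTF-16 autodetection under discussion amounts to something like the following Python sketch. The function name `sniff_utf16` is hypothetical, and the sketch assumes the document either starts with a byte order mark or begins with "<?"; it does not try to tell a little-endian UTF-16 mark apart from a 32-bit encoding that begins with the same two bytes.

```python
def sniff_utf16(data: bytes):
    """Return 'UTF-16BE', 'UTF-16LE', or None, judging only by the first bytes."""
    # A byte order mark settles the question directly.
    if data.startswith(b'\xfe\xff'):
        return 'UTF-16BE'
    if data.startswith(b'\xff\xfe'):
        return 'UTF-16LE'
    # No mark: '<?' encoded as UTF-16 still shows a tell-tale null-byte pattern.
    if data.startswith(b'\x00<\x00?'):
        return 'UTF-16BE'
    if data.startswith(b'<\x00?\x00'):
        return 'UTF-16LE'
    return None  # not recognisably UTF-16; say nothing rather than guess
```

The check is cheap and hard to fool, which is the property being claimed for UTF-16; whether clients should depend on it in place of correct labelling is the separate question raised above.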
> Chris Lilley wrote:
> > On the other hand, if the RFC had been written as I suggested, saying
> > that a charset parameter overrode *if present* but that *if absent*, the
> > rules in the XML recommendation were followed, then you would need no
> > server reconfiguration and the rules to follow to have the encoding
> > information correctly conveyed to the client would have been a matter of
> > public record in the XML recommendation rather than private convention.
> > A big win for interoperability, if that had happened.
>
> At *IETF*, the default of the charset parameter for text/HTML *is* 8859-1.
Yes, which is different to the default for text/* - this demonstrates
that it is possible to give a more specific rule for a particular
registration. I gave an example of a particular rule for text/xml which
would have saved all this bother.
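The rule suggested for text/xml can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the XML-side detection is reduced to a byte order mark check plus a regular expression over an ASCII-compatible declaration, and the standard library's `email.message.Message` is used merely as a convenient Content-Type parser.

```python
import re
from email.message import Message

# Matches encoding="..." inside an XML declaration at the start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def charset_param(content_type: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_param('charset')


def encoding_from_xml(body: bytes) -> str:
    """Simplified stand-in for the XML Rec rules: BOM, then declaration, then UTF-8."""
    if body.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'UTF-16'
    m = _DECL.match(body)
    return m.group(1).decode('ascii') if m else 'UTF-8'


def effective_encoding(content_type: str, body: bytes) -> str:
    """charset overrides if present; if absent, fall back to the XML rules."""
    charset = charset_param(content_type)
    if charset:
        return charset
    return encoding_from_xml(body)
```

Under this rule `effective_encoding('text/xml', body)` gives the same answer as reading the same bytes from a local file, so an unlabelled remote document and a saved copy are treated alike.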
> You might want to change this first.
Why? It is XML we are speaking of here.
> It is going to be very difficult or
> impossible, since HTTP and MIME people will disagree.
I think you mean HTTP and mail (SMTP/IMAP/POP). MIME is used by both
email and HTTP.
> There has been a lot of discussion about this issue. None of your arguments
> are new to me. In fact, my original opinion was not so different from yours but
> I changed my mind during the discussion. For more about this, see the archive
> of the XML SIG (around April and May of 1998).
OK, I will check this out. I cannot of course discuss such material in
this forum, however. Perhaps you could post your technical reasons for
the change of direction here?
> > Murata-san, you asked why a W3C team person was criticising this RFC in
> > public. It is because the mission of W3C is to improve interoperability,
> > so it is my duty to do so.
>
> You might want to check what the W3C I18N WG has said to the XML CG. If
> W3C strongly recommends the use of the charset parameter, the world will
> change.
Sure, in the absence of any other indication, server-applied labelling
is certainly better than no labelling or guesswork. I have nothing
against the use of the charset parameter. But, if it is not present,
then the XML Rec says exactly what should happen; careful wording which
this RFC nullifies. Problems arise if an XML file is saved from the Web
to a local filesystem, perhaps for further editing; the MIME charset
information is lost. It could perhaps be stored in some way - but, there
is already a standard way - the XML encoding declaration.
And if the charset parameter is present, then it should say the same
thing as the encoding declaration. The best way to ensure this is to
treat the XML encoding declaration as the primary metadata resource and
to programmatically derive the charset parameter from this; greater
robustness is at once achieved and also harmonisation of the MIME and
XML labelling.
> XML is the last chance.
I agree, it is important to get it right.
> I am strongly advocating the use of the
> charset parameter in Japan whenever possible.
Great. On the other hand, you seem to be trying to do so by enforcing a
different default charset than that in the XML Recommendation, which
means that local files and remote files work differently; this is
clearly not desirable.
> On the other hand, if even a
> W3C team member does not respect the consensus, there is not much hope.
I think that last comment was beneath you, and would thank you to
restrict yourself to technical argument.
However, I will point out that it is the consensus of the XML 1.0
Recommendation that I am respecting - and that the RFC does not, by
altering the meaning of the default encoding. It could have been
harmonised with the XML Rec; it was not. Redundancy can be good; a
charset parameter and an XML encoding declaration that say the same
thing and work the same way, which is what I was suggesting, is good.
What you are suggesting, which is a charset parameter and an XML
encoding declaration that work in different ways, is clearly suboptimal.
--
Chris
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)