OASIS Mailing List Archives

   Re: [xml-dev] [Summary] Media type (MIME) of XML in MS Word? in Notepad?


On 2006-06-12 18:47:33 -0400 "Costello, Roger L." <costello@mitre.org> wrote:
> The Editor used to Create the XML Determines its MIME Type

Gah!  No!  No, no, no!

> Interestingly, you may have a document which contains XML and yet its
> MIME type may not be application/xml.
> 
> For example, take this simple XML:
> 
> <?xml version="1.0"?>
> 
> <root>
>       Blah
> </root>
> 
> and put it into Word (save it as a .doc file).  The MIME type is:
> 
>       application/msword

True as far as it goes, but that's because it's *not XML!*

Try this experiment.

Type the above in Word.

Save as .doc (default).

Open a DOS box (or whatever they call it these days) and say "type 
NameOfDocument.doc".

Does it *look* like XML?  No.  It violates the rules for XML, namely that 
the XML declaration *must* be the first thing encountered in the data.

Do this with any random proprietary-format tool you care to; same result.  
The fact that you are "quoting" an XML document inside some other document 
format does *not* make that format somehow magically become XML.  It's still 
what it is.
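
For anyone without a DOS box handy, the experiment can be roughed out in Python. This is only a sketch under assumptions: looks_like() is a hypothetical helper, and fake_doc is a stand-in (container signature, padding, then the typed text), not a real Word file, which arranges and encodes its text quite differently. The one hard fact used is the eight-byte signature that opens an OLE2 compound file, the container legacy .doc files are saved in.

```python
# Leading bytes of an OLE2 compound file, the container used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like(data: bytes) -> str:
    """Guess a format from the leading bytes only (hypothetical helper)."""
    if data.startswith(OLE2_MAGIC):
        return "ole2-container"
    if data.startswith(b"<?xml"):
        return "xml"
    return "unknown"


# The XML from the quoted message, as Notepad would save it.
xml_text = b'<?xml version="1.0"?>\n\n<root>\n      Blah\n</root>\n'

# Stand-in for the saved .doc: signature, padding, then the "quoted" text.
fake_doc = OLE2_MAGIC + b"\x00" * 504 + xml_text

print(looks_like(xml_text))  # xml
print(looks_like(fake_doc))  # ole2-container
```

The typed text is still somewhere inside fake_doc, but the XML declaration is no longer the first thing in the data, which is the whole point.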

Put that XML document into a cell in an Excel spreadsheet.  Do you *really* 
expect the .xls that you saved to be "XML"?

Here's something fun.  Type that stuff into OpenOffice.org's word processor 
and save in default format.  Is the result XML?  Well, no.  It's a 
compressed (.zip) directory.  Unzip it, and what's inside?  Hey, there's 
XML!  Only ... no, it isn't the XML you *typed*.  All that has been quoted 
(escaped) into CDATA.  But it *is* XML, only it's a different document type.
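
The same shape can be imitated in a few lines of Python. This is a sketch, not OpenOffice.org's actual file layout: the <doc> and <p> element names are made up for illustration, and a real saved document holds several files with far richer markup. It shows only the three points above: the saved bytes are a zip archive, there is XML inside, and the typed XML survives only as escaped text in a different document type.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# What the user typed into the word processor.
typed = '<?xml version="1.0"?>\n<root>\n      Blah\n</root>'

# Hypothetical wrapper markup; the typed text is escaped, not embedded as markup.
inner = '<?xml version="1.0"?>\n<doc><p>' + escape(typed) + "</p></doc>"

# "Save in default format": a compressed archive with the XML inside it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", inner)
saved = buf.getvalue()

print(saved[:2])  # b'PK', the zip signature, not an XML declaration

# Unzip it and parse what is inside.
with zipfile.ZipFile(io.BytesIO(saved)) as zf:
    tree = ET.fromstring(zf.read("content.xml"))

print(tree.tag)                      # doc (not root)
print(tree.find("p").text == typed)  # True: the typed XML is just text here
```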

> Conversely, if you put the same XML into Notepad, the MIME type is:
> 
>       application/xml

Bloody not if you accept the "suggestion" of notepad that it ought to have a 
".txt" extension.  Then it's text/plain.

> Why is that?  Why is it that if you put XML into one editor (Word) you
> get a MIME type that is specific to the editor, whereas if you put XML
> into another editor (Notepad) you get a MIME type that is independent
> of the editor?

Well, because you don't?

The "MIME type" of a document is not stored in a document.  A variety of 
heuristics may be applied to dynamically determine the MIME type; this was 
true of document formats even before MIME (see file(1)).  The commonest 
heuristic is the "extension", the bit that comes after the last dot in a 
filename, typically 1-4 characters (in the DOS world, always three 
characters).  In that heuristic, .doc maps to application/msword (even if 
it's *actually* an Excel spreadsheet), and .txt maps to text/plain (even if 
it's *really* a pkzip-compressed encrypted security analysis in a 
proprietary format) and .xml maps (as a rule) to application/xml.
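
As a sketch, the extension heuristic is nothing more than a table lookup on the filename. The table and the mime_from_name() helper below are hypothetical and cover only the three extensions mentioned here; the application/octet-stream fallback is an assumption, chosen because it is the conventional label for unidentified binary data.

```python
import os.path

# Hypothetical lookup table covering only the extensions discussed above.
EXTENSION_TO_MIME = {
    ".doc": "application/msword",
    ".txt": "text/plain",
    ".xml": "application/xml",
}


def mime_from_name(filename: str, default: str = "application/octet-stream") -> str:
    """Assign a MIME type from the filename alone; the content is never read."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_MIME.get(ext, default)


print(mime_from_name("root.xml"))             # application/xml
print(mime_from_name("root.txt"))             # text/plain
print(mime_from_name("really-an-excel.doc"))  # application/msword
```

The function never opens the file, so renaming root.xml to root.txt changes the answer without changing a single byte of the data.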

What happens to a Windows application if the MIME type doesn't match the 
extension?

Damn all.  Windows doesn't care about MIME types.

What happens to a Windows application if the data format doesn't match the 
extension?

Crash.  Hopefully, the application just refuses to read it, but it could 
crash, and given the general level of protection in the system, it could 
bring the system down.

What happens to a BeOS application if the MIME type doesn't match the 
extension or data format?

BeOS used MIME types in the file system, and preferred to trust them rather 
than extensions.  A decent application should have degraded gracefully.  In 
worst case, see above ("bring the system down").

Critically: a MIME type is *metadata*, it is a label placed on the data, it 
is not inherent in the data.  Data does not "have" a MIME type, it is 
*assigned* a MIME type (or not, if it isn't relevant, as for most 
applications running on Windows).  Windows cares about "file types" 
(extensions), not MIME types.  Web servers typically care about MIME types 
(although HTTP isn't a MIME-compliant protocol, but that's a different 
rant).  Browsers, consequently, usually care about MIME types.

I can write the above document in Word (ewwww, and wash my hands after), and 
save it as a .doc, and then instruct my webserver to deliver it as 
application/xml regardless of the extension, and a browser that receives it 
... will choke, because it *isn't XML*.  The webserver, not being Word, 
can't strip the cruft; the web browser, not being Word, gets confused when 
handed application/xml that doesn't start with an XML declaration.
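
A small Python sketch of that mismatch, under assumptions: doc_bytes is a stand-in for a saved .doc (the OLE2 container signature, padding, then the typed text), not a real Word file, and parses_as_xml() is a hypothetical helper wrapped around the standard library's XML parser. The label travels with the bytes but has no say in whether they parse.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(body: bytes) -> bool:
    """Return True if a standard XML parser accepts the bytes (hypothetical helper)."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


# The typed XML, exactly as a plain-text editor would save it.
xml_bytes = b'<?xml version="1.0"?>\n\n<root>\n      Blah\n</root>\n'

# Stand-in for the saved .doc: OLE2 signature, padding, then the typed text.
doc_bytes = bytes.fromhex("d0cf11e0a1b11ae1") + b"\x00" * 504 + xml_bytes

# (assigned label, actual bytes) pairs, as a server might deliver them.
responses = [
    ("application/xml", doc_bytes),  # labelled XML, is not XML
    ("text/plain", xml_bytes),       # labelled plain text, is XML
]

for label, body in responses:
    print(label, parses_as_xml(body))
# application/xml False
# text/plain True
```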

> The answer is this: when the XML is put into Word, the Word application
> wraps the XML with a bunch of Word-specific stuff (the wrapper stuff is
> not visible).

Oh, *yes* it is!  Unless, of course, you happen to be using one application, 
namely MS Word.

> Conversely, Notepad does not wrap the XML with anything.  The document
> is pure XML, it can be fed directly into an XML parser, and thus it has
> a MIME type of application/xml.

No, it isn't.  It's whatever MIME type you assign to it.  If you call it 
text/plain, it's text/plain.

Amy!
(in a ranting mood ... but the summary was misleading, I'm sorry)
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
There's someone in my head, but it's not me.
                 -- Pink Floyd





 
