Hi Folks,

Scenario: There is a file. What’s in the file? What kind of file is it? Who produced it? When? What kind of data does it hold? Is it safe to open? Where will you find answers to those questions?

Old school Unix used a stream-of-bytes metaphor for files: every file is just a sequence of bytes. Some authors refer to these as formatless files. Michael Kay points out that, in reality, files are not formatless; rather, their format is simply not known at some level of the system, and it is up to applications to determine the file’s format. Michael Kay wrote: “Applications are left to guess by making inferences from ...”

Liam pointed out that there is a Unix command called “file” which does a pretty decent job of inspecting files and figuring out what they are.
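To make that guessing-by-inference concrete, here is a rough Python sketch of the kind of magic-number check the “file” command performs. The signatures below and the file name mystery.bin are purely illustrative; the real command consults a much larger magic database and falls back to further heuristics when no signature matches.

    # A toy version of what `file` does: peek at the first bytes of a
    # file and match them against known "magic numbers".
    MAGIC = {
        b"%PDF-": "PDF document",
        b"\x89PNG\r\n\x1a\n": "PNG image",
        b"PK\x03\x04": "ZIP archive (also .docx, .jar, .epub, ...)",
        b"\x7fELF": "ELF executable",
        b"<?xml": "XML document (probably)",
    }

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(16)
        for magic, description in MAGIC.items():
            if head.startswith(magic):
                return description
        return "unknown -- more inference needed"

    # e.g. print(sniff("mystery.bin"))   # hypothetical file name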
There is a spectrum of “file knowingness.” At one end of the spectrum is old school Unix: a file is a stream of bytes. Nothing is known about the file. You need to sniff its content and make inferences.

What lies at the other end of the spectrum? How would you characterize that end of the spectrum? How about this characterization: we know virtually everything about a file. We know its character encoding. We know what application produced it. How long it is. When it was created. Where it was created. What kind of data it contains. What kind of applications can process it. Whether it is or isn’t safe to open.

Do you agree with that characterization? What else would you add? At which end of the spectrum do you want your files? Is one end of the spectrum better? Better in what way? Should we all strive to transition our files to one end of the spectrum?

Where does XML live on the spectrum? I suspect it lives somewhere in the middle. Michael Kay argues that XML doesn’t do a particularly good job of “file knowingness,” as he wrote: “Conventions like putting the encoding in a header or using ...”
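Michael's sentence is cut off above, but the encoding declaration is a good illustration of the awkwardness: the declaration that names the encoding is itself stored in that encoding, so a parser has to sniff the first bytes before it can even read the declaration. Here is a minimal, deliberately non-conforming Python sketch of that bootstrapping step (the file name example.xml is hypothetical, and BOM-less UTF-16 is ignored):

    # Guess an XML file's encoding from its first bytes: check for a
    # byte-order mark, then read the encoding declaration, assuming an
    # ASCII-compatible encoding just long enough to find it.
    import re

    def guess_xml_encoding(raw):
        if raw.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if raw.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if raw.startswith(b"\xfe\xff"):
            return "utf-16-be"
        head = raw[:200].decode("ascii", errors="replace")
        m = re.search(r'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
        return m.group(1).lower() if m else "utf-8"

    # e.g.:
    # with open("example.xml", "rb") as f:        # hypothetical file
    #     print(guess_xml_encoding(f.read(200)))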
How can we make better XML? How can we make better files?

/Roger