Re: [xml-dev] The illusion of simplicity and low cost in data design and computing

> 
> Identifying files by their format rather than by the piece of software
> that wrote them, on the other hand, implicitly assumes that the same
> information may be usable by more than one piece of software.
> 

The way we handled this on the late lamented VME operating system was that every file had metadata, and the metadata was very open-ended and extensible. As well as basic information about who created the file and when, and access information about which users could access it, the software that created a file could make the file private to that software (the equivalent of the dot-file convention on Unix: if a file is called .git or .filezilla, that's a strong signal to other applications to avoid touching the contents; though packing metadata into a naming convention is a pretty sure sign that something is amiss).

There could also be metadata about the file format which didn't constrain who could access the file, but told them what to expect in it: for example, that this is a hashed-random or indexed-sequential file with the key in the first 12 bytes; or that it contains executable code; or that it is a COBOL program, or XML. That doesn't prevent a third-party text editor from opening the file, but if the editor knows the file is a COBOL program, it can take advantage of that knowledge.
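As a rough sketch of what such an open-ended metadata record might look like (in Python, and with field names of my own invention rather than VME's actual catalogue entries):

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class FileMetadata:
        """VME-style open-ended file metadata (illustrative names only)."""
        owner: str                        # who created the file
        created: datetime                 # when it was created
        readers: list[str] = field(default_factory=list)  # who may read it
        private_to: Optional[str] = None  # if set, only this application
                                          # should touch the contents
        format_hints: dict[str, str] = field(default_factory=dict)

    meta = FileMetadata(
        owner="alice",
        created=datetime(1985, 6, 1),
        readers=["alice", "ops"],
        format_hints={
            "organisation": "indexed-sequential",
            "key-offset": "0",    # key in the first 12 bytes
            "key-length": "12",
        },
    )

The point of format_hints being a plain dictionary is the extensibility: new kinds of format information can be recorded without changing the record structure.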

There are some cases, I think, where it makes sense for an application to encapsulate data files so that access is only possible via that application: an obvious example is database files - you really don't want people messing with database files except via the DBMS software. There are other cases where it's fine for other applications to access the data, but it's helpful if they can find out something about what they are accessing: e.g. that the file is XML, or that it is encoded in UTF-8. Unix, I think, tends too much to the idea that there is only one format for all files, namely streams of (originally) ASCII characters, and you only have to look at the files underpinning a DBMS to see that this isn't true in real life.
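The nearest widely-deployed analogue today is probably extended attributes, which let you attach exactly this kind of advisory metadata to a file without resorting to naming conventions. A minimal sketch in Python (Linux-only; os.setxattr is not available everywhere, and the "user.format"/"user.encoding" attribute names are my own convention, not any standard):

    import os

    path = "report.xml"
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?><doc/>')

    # Attach advisory metadata to the file itself (needs a filesystem
    # with user xattr support, e.g. ext4).
    os.setxattr(path, "user.format", b"xml")
    os.setxattr(path, "user.encoding", b"utf-8")

    # Any other application can now ask what to expect before opening it.
    fmt = os.getxattr(path, "user.format").decode()
    enc = os.getxattr(path, "user.encoding").decode()
    print(f"{path}: format={fmt}, encoding={enc}")

Nothing enforces these attributes, of course, which is precisely the difference from the VME approach: they are hints, not access control.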

HTTP reflects the protocol designers' view of the world - if you send someone a stream of bytes, they are going to need to know something about it, so there are protocol headers carrying metadata such as media type and encoding to facilitate that. But the arrangement is broken, because the HTTP server is reading the data from a filesystem that holds no such knowledge, so the server has to guess.
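You can watch the guessing happen in the libraries HTTP servers lean on. Python's mimetypes module, for instance, maps a name to a media type using nothing but the file extension:

    import mimetypes

    # The server never looks inside the file; the name is all it has.
    for name in ["report.xml", "photo.jpeg", "data.csv", "mystery.dat"]:
        media_type, encoding = mimetypes.guess_type(name)
        print(f"{name}: {media_type}, {encoding}")

    # An unknown extension such as .dat yields (None, None), at which
    # point a server typically falls back to application/octet-stream.
    # Note that the returned encoding is a content encoding such as
    # gzip; the character encoding is not recoverable from the name.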

Perhaps this all feels too much like top-down design, and the future is with bottom-up. Don't try to attach metadata to a file to say it's a JPEG image; sniff the contents and recognise the characteristic bit patterns instead. As people have said, it works most of the time, and perhaps we have to become more comfortable with a world in which things work most of the time. I remember hardware designers pointing out that the hardware itself gives wrong answers occasionally. But yes, when it comes to getting into a self-driving car, I would prefer one designed to work all of the time, not just most of the time.
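A bottom-up sniffer is easy enough to sketch. The JPEG and PNG signatures below are the documented magic numbers, but the XML check shows why this is heuristic: the declaration it looks for is optional, so the sniffer is right most of the time and silent about the rest:

    def sniff(path: str) -> str:
        """Guess a file's format from its leading bytes."""
        with open(path, "rb") as f:
            head = f.read(16)
        if head.startswith(b"\xff\xd8\xff"):
            return "jpeg"        # JPEG SOI marker
        if head.startswith(b"\x89PNG\r\n\x1a\n"):
            return "png"         # PNG signature
        if head.lstrip().startswith(b"<?xml"):
            return "xml"         # probable: the XML declaration is optional
        return "unknown"

An XML document without a declaration, or one encoded in UTF-16, comes back as "unknown" - exactly the kind of most-of-the-time behaviour in question.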

Michael Kay
Saxonica

