Pretty much. That some sequence of bytes can be recognized by the parse rules for some encoding-and-character-set is necessary but not sufficient for the file to be 'text'. It could be an accident. We also have to know that it is supposed to contain characters as the initial layer.
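To make the "necessary but not sufficient" point concrete, here is a minimal sketch in Python. The helper name `decodes_cleanly` is hypothetical, and the PNG signature is just a convenient example of bytes that were never meant as characters:

```python
# The 8-byte PNG file signature: binary data by intent.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if every byte of `data` is consumed by a strict decode.

    This only checks that the parse rules accept the bytes. It cannot
    tell us whether the bytes were intended as characters.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Latin-1 assigns a character to every byte value, so it accepts anything.
accepted_by_latin1 = decodes_cleanly(PNG_SIGNATURE, "latin-1")
# UTF-8 rejects the signature because 0x89 cannot start a sequence.
accepted_by_utf8 = decodes_cleanly(PNG_SIGNATURE, "utf-8")
```

The Latin-1 check succeeds by accident, which is why passing such a test cannot by itself make a file 'text'.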
(For the BOM, I think they are characters that have been assigned a supplementary role. So not really an exception.)
For record-based storage à la VMS etc., an API may present a file as a virtual text file, but that does not mean the file itself should be considered text rather than binary: same as zip.
That small edge-case files may not carry enough information to make the call does not mean that the character of typical files is unclear.
Another example: an rtf file with only hex encoded images is a text file, because every byte maps to an intended character. But an rtf file with an embedded binary image should be considered a binary file, because those bytes are not first intended as characters: as the rtf 1.8 spec mentions, an rtf parser needs to understand \bin and lump the binary data together.
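A rough sketch of that distinction in Python, assuming only the basic RTF control-word syntax (a backslash, ASCII letters, an optional numeric parameter). The function name `has_raw_binary` is hypothetical, and this is not a full RTF parser: it stops at the first \bin with a positive count rather than skipping over the binary payload.

```python
def has_raw_binary(rtf: bytes) -> bool:
    """Return True if the RTF bytes contain a \\binN control word with N > 0."""
    i, n = 0, len(rtf)
    while i < n:
        if rtf[i:i + 1] != b"\\":
            i += 1
            continue
        i += 1  # step past the backslash
        start = i
        while i < n and rtf[i:i + 1].isalpha():
            i += 1
        word = rtf[start:i]
        if not word:
            # Control symbol such as \\ \{ \} or \'hh: skip one character.
            i += 1
            continue
        pstart = i
        if i < n and rtf[i:i + 1] == b"-":
            i += 1
        while i < n and rtf[i:i + 1].isdigit():
            i += 1
        param = rtf[pstart:i]
        if word == b"bin" and param not in (b"", b"-") and int(param) > 0:
            return True
    return False


# Image carried as hex digits: every byte is an intended character.
hex_only = rb"{\rtf1{\pict\pngblip 89504e47}}"
# Image carried as raw bytes after \bin: not characters first.
with_bin = rb"{\rtf1{\pict\pngblip\bin4 " + b"\x89PNG" + b"}}"
```

Under this sketch `hex_only` counts as text and `with_bin` as binary. An escaped backslash followed by the letters "bin4" is treated as ordinary text, not as the control word.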
Cheers
Rick
On Tue, Nov 26, 2013 at 9:37 AM, Rick Jelliffe <rjelliffe@allette.com.au> wrote:
I would say that a text file is one which, when sequentially read, has a simple transformation from the bytes to a sequence of characters in one or more character repertoires (lists), fully consuming all bytes with none remaining, except any file-termination codes. This transformation may, for example, be a direct mapping using the values of the bytes, or may involve mapping sequences of bytes to some other number (e.g. UTF-8), or may involve a simple state machine (e.g. ISO 2022), but surely nothing requiring a stack or random access. The result and initial objective of parsing the file is a single sequence of characters.
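As an illustration of those three kinds of transformation, here is a small Python sketch that feeds a decoder one byte at a time, in order, with no lookback. The function name `to_characters` is hypothetical, and the three codecs are simply the Python standard library's Latin-1, UTF-8 and ISO-2022-JP decoders, chosen as examples of a direct mapping, a multi-byte mapping and a simple state machine:

```python
import codecs


def to_characters(data: bytes, encoding: str) -> str:
    """Decode `data` strictly in a single sequential pass.

    Bytes are handed to an incremental decoder one at a time, so the
    decoder can only rely on its own small internal state. Any byte that
    cannot be consumed raises UnicodeDecodeError.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for i in range(len(data)):
        pieces.append(decoder.decode(data[i:i + 1]))
    # final=True makes the decoder complain about a truncated sequence.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


direct = to_characters("caf\u00e9".encode("latin-1"), "latin-1")
multibyte = to_characters("caf\u00e9 \u20ac".encode("utf-8"), "utf-8")
stateful = to_characters("a\u3042b".encode("iso2022_jp"), "iso2022_jp")
```

In each case the result is a single string and all input bytes are consumed, which is the property the definition asks for.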
You probably need to say something about BOMs. But it's the last sentence that's critical: something is only text if we intend to consume it as text.
I would say that a binary file, when used in distinction to "text file", is one which uses potentially more complex transformations, where the result and initial objective of parsing the file will be a data structure or event stream.
That is, a data structure other than a string, and an event stream other than a stream of character events.
Something like that.
The members of the English Church had ingenuously imagined up to that moment that it was possible to contain, in a frame of words, the subtle essence of their complicated doctrinal system, involving the mysteries of the Eternal and the Infinite on the one hand, and the elaborate adjustments of temporal government on the other. They did not understand that verbal definitions in such a case will only perform their functions so long as there is no dispute about the matters which they are intended to define: that is to say, so long as there is no need for them. For generations this had been the case with the Thirty-nine Articles. Their drift was clear enough; and nobody bothered over their exact meaning. But directly someone found it important to give them a new and untraditional interpretation, it appeared that they were a mass of ambiguity, and might be twisted into meaning very nearly anything that anybody liked. --Lytton Strachey, "Cardinal Manning"
GMail doesn't have rotating .sigs, but you can see mine at http://www.ccil.org/~cowan/signatures