By convention we normally restrict "binary" to files which are not interpretable as streams of characters. [John Cowan]
The word "text" is applied to files which are interpretable as streams of characters.
Of course any text file is also a binary file, since the class of text files is obtained from the class of binary files by applying restrictions. But it would be confusing to call a text file a binary file; it would be like calling a cat a mammal: correct but imprecise.
As a practical matter, is there *any* difference between text and binary
data *other* than the necessity of worrying about record-end handling in
the former, but not in the latter?
The problem with John's definition is that it begs the question, "What
does 'interpretation as a stream of characters' mean?" For many years,
the Wall Street Journal's editorial policy was to describe data sizes as
numbers of characters or, for large numbers, Encyclopedia Britannicas,
irrespective of whether the data were in fact characters. The policy
was justified, not only because of the erstwhile naivete of their reader
base, but also, I would argue, because the distinction between
interpretability-as-characters and the lack thereof is not clear.
Regular expressions, for example, are quite useful for detecting
patterns in purely numeric data streams. Does that make such streams
*text*?
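The point about regular expressions is easy to make concrete: a pattern matcher operates on bytes and is indifferent to whether those bytes "mean" characters. A minimal sketch (the comma-separated readings are an invented example):

```python
import re

# A "purely numeric" data stream, rendered as bytes.
stream = b"17,42,42,42,9,100,42"

# The regex detects a run of three identical readings -- pattern
# matching on data nobody would call prose.
match = re.search(rb"(\d+)(?:,\1){2}", stream)
print(match.group(1))  # b'42'
```

Whether the success of that search makes the stream "text" is exactly the question at issue.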
Charles Goldfarb used to say, "If there are bugs in a text-processing
system, at least one of them will have to do with record-end handling."
That rule-of-thumb is perhaps dated now,
but the absurd
human-productivity-diminishing differences between Unix, Microsoft, and
Apple record-end conventions are still very much with us.
It's sort of
like the railroad gauges that did (or did not) descend from the
distances between pairs of ancient chariot wheels; they are the echoes
of empires. (AFAIK, you still have to change trains, or at least
undercarriages, when crossing into Ukraine.)
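The three conventions in question are LF (Unix), CR LF (Microsoft), and bare CR (classic Mac OS). A sketch of the normalization dance that those differences still force on us:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Collapse CRLF (Microsoft), CR (classic Mac), and LF (Unix)
    record ends to a single LF.  Order matters: CRLF must be
    replaced before bare CR, or each CRLF would yield two LFs."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

print(normalize_newlines(b"a\r\nb\rc\n"))  # b'a\nb\nc\n'
```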
I had a recent argument/discussion with a co-worker about whether it is accurate (or useful) to consider UTF-8 "Text" ...
From the perspective of the archaic (but still implemented) "text mode" file open modifier ... valid UTF-8 "Text" was not reliably readable, because a Control-Z byte was interpreted as EOF and everything after it was silently truncated.
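The behavior in question: on Microsoft C runtimes, a stream opened without the "b" modifier stops reading at the first 0x1A (Ctrl-Z), an end-of-file marker inherited from CP/M. A simulation of that truncation (the real behavior lives in the C runtime, not in Python's own text mode):

```python
def dos_text_mode_read(data: bytes) -> bytes:
    """Simulate the legacy 'text mode' read: stop at the first
    Ctrl-Z (0x1A), the CP/M end-of-file marker."""
    eof = data.find(b"\x1a")
    return data if eof == -1 else data[:eof]

# 0x1A is a perfectly valid byte in UTF-8 data (it encodes the C0
# control U+001A), yet text mode discards everything after it.
utf8_data = "héllo".encode("utf-8") + b"\x1a" + "wörld".encode("utf-8")
print(dos_text_mode_read(utf8_data))  # only the bytes before 0x1a
```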
I suggest there is an "intent" or "desire" to categorize <gasp> "files" as "Text" or "Binary" but in reality the distinction is nearly or completely impossible to make accurately and without overlap.
Simple test case:
If a stream or file contains a single byte (hex code 0x20), is it "Text" or "Binary"?
Or, even more esoteric, is the empty stream text or binary?
If you cannot answer this definitively I suggest you cannot answer the general case definitively.
Pretending there is a clean categorization has practical value, but claiming that one can actually be made deterministically is folly.
I have heard that there are languages in whose script a single "character" represents a whole word.
Therefore, we need to be cautious
in using and interpreting the word "character" as ... well, a
character.
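Chinese is the usual example: a single code point can be a complete word, and how many bytes that "character" occupies depends entirely on the encoding. A small illustration:

```python
word = "猫"  # Chinese for "cat": one "character" that is a whole word
print(len(word))                      # 1 code point
print(len(word.encode("utf-8")))      # 3 bytes in UTF-8
print(len(word.encode("utf-16-le")))  # 2 bytes in UTF-16
```

So "number of characters" and "number of bytes" are different questions, and neither settles what the data *is*.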
And we could have languages where the "character" is something as
unusual as a musical note, or the representation of a phoneme in an
audio editor. And then there are the mathematical symbols, and, ... These things
aren't "text", but it would also be a gross flattening to call them
"binary".
Before the advent of the codex (book), Latin and Greek scripts were written on scrolls. Reading continuous script on a scroll was more akin to reading a musical score than reading text. The reader would typically have already memorized the text through an instructor, had memorized where the breaks were, and almost always read aloud, usually to an audience in a kind of reading performance, using the text as a cue sheet. Organizing the text to make it more rapidly ingestible (through punctuation) was not needed.