Re: [xml-dev] Binary versus Text

Sorry for the length of this consolidated reply. Tl;dr version: The difference between text and binary is a matter of intent, which is not always subjective. We insist on beginnings because we are individuals, though because we are many individuals, any point can be called central. When markup is old enough, we call it plain text.

On Sun, Nov 24, 2013 at 9:25 AM, Costello, Roger L. <costello@mitre.org> wrote:

By convention we normally restrict "binary" to files which are not interpretable as streams of characters. [John Cowan]

The word "text" is applied to files which are interpretable as streams of characters.

I should have said "not interpreted [by someone]" rather than "not interpretable", since every binary file is interpretable as a stream of characters (in at least some encoding) if you don't care about its meaning. That, plus the fact that most early encodings were small and simple, is what allowed the computing world to muddle along for so long without clearly distinguishing text from binary, except in the matter of line ends (see below).
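
A minimal Python sketch of the point about "some encoding" (the choice of Python, ISO 8859-1, and the sample bytes are illustrative assumptions, not anything from the thread). ISO 8859-1 assigns a character to every one of the 256 byte values, so any byte sequence decodes without complaint, whereas a stricter encoding such as UTF-8 rejects some sequences:

```python
# Arbitrary bytes: the 8-byte PNG signature, which nobody intends as text.
data = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# ISO 8859-1 (Latin-1) maps each byte value to exactly one character,
# so decoding never fails and the result round-trips exactly.
as_chars = data.decode("latin-1")
round_trip = as_chars.encode("latin-1")

# UTF-8 is stricter: 0x89 is a continuation byte and cannot begin a sequence.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(as_chars), round_trip == data, utf8_ok)
```

The decoded "characters" are meaningless, of course, which is exactly the sense in which the file is interpretable but not interpreted.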

Of course any text file is also a binary file, since the class of text files is obtained from the class of binary files by applying restrictions. But it would be confusing to call a text file a binary file; it would be like calling a cat a mammal: correct but imprecise.

That too wasn't very well worded: it would be like calling a cat a mammal in a context in which we divide animals of interest into cats and mammals: obviously "mammals" is short for "other mammals".

On Sun, Nov 24, 2013 at 10:39 AM, Steve Newcomb <srn@coolheads.com> wrote:

As a practical matter, is there *any* difference between text and binary
data *other* than the necessity of worrying about record-end handling in
the former, but not in the latter?

As a practical matter, is there any difference between words and pictures other than the necessity (in most cases) of dividing text into lines? Of course there is. There are a vast number of issues with text, of which encoding and line division are the simplest and easiest of all to solve. But you know that.

The problem with John's definition is that it begs the question, "What
does 'interpretation as a stream of characters' mean?" For many years,
the Wall Street Journal's editorial policy was to describe data sizes as
numbers of characters or, for large numbers, Encyclopedia Britannicas,
irrespective of whether the data were in fact characters. The policy
was justified, not only because of the erstwhile naivete of their reader
base, but also, I would argue, because the distinction between
interpretability-as-characters and the lack thereof is not clear.

Indeed, "interpretability" was a bad choice of words on my part, as I noted above.

Regular expressions, for example, are quite useful for detecting
patterns in purely numeric data streams. Does that make such streams
*text*?

I don't know what you mean by "purely numeric". Numbers are an abstract concept. They can be represented by numerals (e.g. "1234567"), in which case they are text. Or they can be represented in any of a vast number of binary formats, which I cannot exemplify in this email because it is textual.
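
Outside of email the contrast is easy to show. The following is only a sketch, assuming Python and picking a 4-byte big-endian unsigned integer as one arbitrary example of a binary format among the many possible:

```python
import struct

n = 1234567

# As numerals: seven ASCII characters, readable in any text editor.
as_text = str(n).encode("ascii")

# As one possible binary format: a 4-byte big-endian unsigned integer.
as_binary = struct.pack(">I", n)

print(as_text)          # the numerals, as bytes
print(as_binary.hex())  # the binary form, shown in hex so that it is printable

# Both representations decode back to the same abstract number.
from_text = int(as_text.decode("ascii"))
from_binary = struct.unpack(">I", as_binary)[0]
print(from_text == from_binary == n)
```

Note that the binary form has to be dumped as hexadecimal numerals before it can be shown here, which is the point of the remark above.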

Charles Goldfarb used to say, "If there are bugs in a text-processing
system, at least one of them will have to do with record-end handling."
That rule-of-thumb is perhaps dated now,

I think it would refer to encoding now.

but the absurd
human-productivity-diminishing differences between Unix, Microsoft, and
Apple record-end conventions are still very much with us.

Historically, it descends from the difference between the Model 33 Teletype and the Model 37. On the former, CR only returned the printing head to the left margin and LF was required to feed the paper upwards; on the latter, LF did both jobs. Bell Labs people had access to the spiffy Model 37, so their OS employed LF alone, whereas folks in the outside world mostly had Model 33s (I cut my teeth on one), so the DEC OSes used CR+LF. That legacy passed through CP/M to MS-DOS and Windows.

It's sort of
like the railroad gauges that did (or did not) descend from the
distances between pairs of ancient chariot wheels; they are the echoes
of empires. (AFAIK, you still have to change trains, or at least
undercarriages, when crossing into Ukraine.)

The Russian Empire did that on purpose to make it hard to invade them using their own train tracks. The Unix Empire wanted simplicity of internal processing (so only one character for a newline), whereas the DEC Empire wanted simplicity of I/O (so that text shipped directly to a teletype would Just Work).
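
For anyone who has to cross those borders, a minimal sketch of the usual workaround, assuming Python and a hypothetical helper named normalize_newlines (the bare-CR case covers the classic Mac OS convention):

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF and bare CR line ends to bare LF."""
    # Order matters: handle CR+LF first so it does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


dec_style = b"one\r\ntwo\r\n"    # CR+LF, as on DEC systems, CP/M, MS-DOS, Windows
unix_style = b"one\ntwo\n"       # LF alone
old_mac_style = b"one\rtwo\r"    # CR alone

for sample in (dec_style, unix_style, old_mac_style):
    print(normalize_newlines(sample) == unix_style)
```

This is safe only for data you intend as text; applied to binary data, the same replacement silently corrupts it.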

On Sun, Nov 24, 2013 at 10:49 AM, David Lee <dlee@calldei.com> wrote:

I had a recent argument/discussion with a co-worker about if it is accurate (or useful) to consider UTF8 "Text" ...
From the perspective of the archaic (but still implemented) "text mode" file open modifier ... valid UTF8 "Text" was not readable because the control-Z was interpreted as EOF and data would be truncated.

Only if an actual ^Z character appeared in the UTF-8, which would be no different from it appearing in a pure ASCII file. UTF-8 does not introduce spurious 0x1A bytes into the binary representation.

You might make a better case that UTF-16 is "not text" for such purposes.
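
A small sketch of the difference, assuming Python and using U+011A (LATIN CAPITAL LETTER E WITH CARON) as my own example character. In UTF-8 every byte of a multi-byte sequence is 0x80 or above, so a 0x1A byte can only come from U+001A itself; in UTF-16 the byte turns up as half of an ordinary letter:

```python
CTRL_Z = 0x1A

sample = "\u011a"  # one letter, with no U+001A anywhere in the text

utf8 = sample.encode("utf-8")       # two bytes, both >= 0x80
utf16 = sample.encode("utf-16-le")  # two bytes: 0x1A 0x01

utf8_has_ctrl_z = CTRL_Z in utf8
utf16_has_ctrl_z = CTRL_Z in utf16

print(utf8.hex(), utf8_has_ctrl_z)
print(utf16.hex(), utf16_has_ctrl_z)
```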

I suggest there is an "intent" or "desire" to categorize <gasp> "files" as "Text" or "Binary" but in reality the distinction is nearly or completely impossible to make accurately and without overlap.

Yes, it is a distinction of intention.

Simple test case:

If a stream or file contains a single byte (hex code 0x20) is it "Text" or "Binary"?
Or even more esoteric, is the empty stream text or binary?

It is text if you intend to treat it so, and binary if you don't.

If you cannot answer this definitively I suggest you cannot answer the general case definitively.
To pretend there is a clean categorization has value, but to claim that one actually exists deterministically is folly.

That is equivalent to saying all intent is subjective, but there are a variety of definitions (as always) of objective intent. The differences between murder and manslaughter (culpable homicide), between assault and accident, lie in intent. But when prosecuting people for murder, we rely on objective evidence to establish their intent; we don't just ask them.

On Sun, Nov 24, 2013 at 11:00 AM, Dimitre Novatchev <dnovatchev@gmail.com> wrote:

I have heard that there are languages, in whose alphabet a single
"character" represents a whole word.

That is so: for example, the character & represents the word "and" in the English language. In other languages, it may represent the words "y" or "et" or "und" or "и". In Chinese, which is probably what you are thinking of, some characters represent whole words, such as 一, which represents the word "yī", meaning "one". But in the typical case Chinese characters represent meaningful syllables. Thus the word for China, 中国, contains two characters, both because it represents the two syllables "zhōng" and "guó", and because it represents the two semantic units "middle" and "country".

Why "middle"? Because the Chinese saw themselves as the only civilized people, in the middle of the barbarians to the north, south, east, and west, thus conceptually in the middle of the world.

Similarly, we text folk treat all non-text formats as barbarian, though there are thousands of them, all very different from one another.

There is a cross-link here to another current thread, for indeed any point on the Earth's surface may be considered the middle of the world, just as any element in a network of semantic relationships may be treated as the root element, and any piece of information is the center of all knowledge, for there are conceptual links leading everywhere. Physically, we all have exactly one point of view (in the literal sense), but there are seven billion available points of view, and potentially many more. That, I think, is why human beings, being embodied intellects and not bodiless angels, seem to always insist on having a unique starting point in our conceptual networks.

Therefore, we need to be cautious
in using and interpreting the word "character" as ... well, a
character.

Letters are a prototypical sort of character, but by no means the only kind. Some are so different from letters that Unicode does not hesitate to call Chinese characters "letters". Indeed, the great majority of letters in Unicode 6.3 are Chinese characters: 77421 (85%) are, and only 14104 (15%) are not.
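
The classification can be checked with Python's standard unicodedata module; this is only a sketch, and the sample characters are my own choice. General categories beginning with "L" are letters, and "Lo" is "Letter, other":

```python
import unicodedata

samples = ("a", "\u4e2d", "\u56fd", "&", "1")  # a, 中, 国, &, 1

# Map each sample character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat, "letter" if cat.startswith("L") else "not a letter")
```

I have deliberately not recomputed the totals here, since the counts depend on which version of the Unicode database a given Python ships with.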

And we could have languages where the "character" is something so
unusual as a musical note, or the representation of a phoneme in an
audio-editor. And the mathematical symbols, and, ... These things
aren't "text", but it would also be gross flattening down to call them
"binary".

Music and mathematics are textual, but they are not plain text: they are fancy text, because of their irreducibly two-dimensional character. Fortunately, we have a well-established invention, about a thousand years old, for reducing fancy text to plain text: it's called "markup".

Horizontal and vertical whitespace are the oldest kinds of markup, and they are so well established that we rarely think of them as such and treat them as plain text, at least up to a point. But it wasn't always so. This paragraph from Wikipedia's article on "scriptio continua" (writing without whitespace) shows how things used to be:

Before the advent of the codex (book), Latin and Greek script was written on scrolls. Reading continuous script on a scroll was more akin to reading a musical score than reading text. The reader would typically already have memorized the text through an instructor, had memorized where the breaks were, and the reader almost always read aloud, usually to an audience in a kind of reading performance, using the text as a cue sheet. Organizing the text to make it more rapidly ingested (through punctuation) was not needed.

Indeed, punctuation too was originally a kind of markup, and it has likewise been taken into the text. So are symbols, though they are more akin to entity markup than to element markup.

I could go on forever, but that's the nature of knowledge: see above. I've already pruned several digressions.

--
GMail doesn't have rotating .sigs, but you can see mine at http://www.ccil.org/~cowan/signatures