OASIS Mailing List Archives

RE: Seek your help in compiling a list of "facts about base64-encoded data"

Thank you David and Eliot. Outstanding comments!


I updated the list, incorporating the comments from David and Eliot. See below. Does anyone else have comments?  /Roger
Here are some things to know about base64:
1.       Base64 encoding is specified in RFC 4648.
2.       Base64-encoded data is plain text. 
3.       There are several base64 alphabets:
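a.       The standard base64 alphabet consists of these 64 ASCII characters: A-Z, a-z, 0-9, +, and /. The equals symbol ( = ) is not part of the alphabet; it is used only as a padding character at the end of the encoded text.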
b.       In the URL and Filename safe base64 alphabet, the plus symbol ( + ) is replaced with a minus sign ( - ) and the forward slash symbol ( / ) is replaced with an underscore symbol ( _ ).  
c.       Other base64 alphabets are called non-standard, or custom, base64 alphabets.
d.       As a common extension, base64 can also contain arbitrary whitespace, which is ignored.
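To make the difference between the two alphabets concrete, here is a minimal sketch using Python's standard `base64` module. The three input bytes are an assumption chosen only because their encoding happens to use both `+` and `/`. The last line relies on Python's default (non-strict) decoder, which discards characters outside the alphabet, whitespace included; other decoders may behave differently.

```python
import base64

# Three bytes whose 6-bit groups are 62, 62, 63, 63, so the
# encoded text uses exactly the two characters that differ
# between the standard and the URL-safe alphabets.
raw = bytes([0xFB, 0xEF, 0xFF])

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard)  # standard alphabet
print(url_safe)  # URL and Filename safe alphabet

# Python's default decoder ignores characters outside the alphabet,
# so embedded whitespace does not change the decoded bytes.
with_whitespace = "++ \n//"
print(base64.b64decode(with_whitespace) == raw)
```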
4.       Any type of file, from plain text to binary executable, can be base64-encoded. 
a.       David Carlisle: That is true for some definition of "file" and "executable". You are, I think, assuming a filesystem that stores files as streams with a character as end of line marker, not a system using fixed length records. Also, you don't normally encode an executable or other status in a base64-encoded string.
5.       Base64-encoding enables binary objects to be transported using text-based protocols, such as SMTP.
6.       People have created regular expressions that specify the pattern of base64 text. It is possible for text to match the regular expression and yet not be base64. That is, text that appears to be base64 may not be.
a.       David Carlisle: It isn't clear what you mean by "not base64". You can write a regex that matches text just if it can be decoded, but of course you can't tell if any string was generated by base64 encoding another string. I might have just typed aGVsbG8gd29ybGQ=  as a random string of characters, or I may have base64-encoded some other string. You can't tell unless I tell you, and even then you can't tell if I'm telling the truth.
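David's point can be illustrated with a short sketch. The regular expression below is one possible pattern for padded, standard-alphabet base64 without whitespace; it is an assumption for illustration and is not taken from RFC 4648. The name `looks_like_base64` is likewise hypothetical.

```python
import base64
import re

# One possible pattern (an assumption): groups of four alphabet
# characters, optionally followed by one padded final group.
BASE64_PATTERN = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def looks_like_base64(text: str) -> bool:
    """Return True if the text could be decoded as base64."""
    return BASE64_PATTERN.fullmatch(text) is not None


for text in ["aGVsbG8gd29ybGQ=", "word", "hello"]:
    if looks_like_base64(text):
        # A match only says the text CAN be decoded, not that
        # anyone produced it by encoding something.
        print(text, "->", base64.b64decode(text, validate=True))
    else:
        print(text, "-> does not match")
```

The ordinary English word "word" matches the pattern and decodes to three bytes, even though nobody base64-encoded anything to produce it. The pattern answers "can this be decoded?" and nothing more.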
7.       External information must be provided to tell whether text is base64.
8.       If external information says that text is base64, the external information might be incorrect, either by accident or by intent.
9.       Performing base64 decoding on data that is not base64 might cause harm. 
a.       David Carlisle: I think "causes harm" is the wrong thing to say. If the data isn't base64-encoded data then it might not decode at all. If it does decode, then it may or may not decode to the intended data, but that in and of itself doesn't cause harm. If you misuse any data then something may go wrong, but that's not really related to base64.
b.       Eliot Kimber: I do wonder what is intended by “causing harm”?
c.       The implication seems to be some sort of malicious attack through data injection by disguising the data as “plain text”. But if that’s a possibility, it’s not the fault of base64 in particular but in the use of a data format that requires decoding. I’ve often circumvented mail system restrictions on specific file types by zipping them or double-zipping them. But even then it’s ultimately the responsibility of the receiving system not to trust any data it gets from any source.
d.       That is, if a malicious agent were to send you base64 data you could decode it to a new byte sequence without danger—the mere presence of the bytes in some storage context cannot, by itself, cause harm. The potential harm would be if you then gave those bytes to some process that, in the act of consuming them, caused harm, e.g., treating the decoded bytes as an executable and running it.
e.       If the data that was encoded could cause harm, for example by exploiting a vulnerability in a graphic renderer or media player, the fact that it was base64-encoded cannot be blamed—the fault is with the creator of the original data and the implementer of the vulnerable consumer of the data and maybe with the user’s lack of caution or misplaced trust in the data’s source.
f.        But it’s impossible to see how the use of base64, or any other encoding, could, by itself, be held to blame for any harm caused by the decoded data.
g.       Is there a way to perform base64 decoding that is guaranteed to never cause harm?
   i.    David Carlisle: No. You could decode it and pipe it to /dev/null, but if the information you discarded was "run now", ignoring that may be harmful. Using, misusing, or ignoring data can cause harm, whether or not base64 encoding is involved.
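As a sketch of the "it might not decode at all" case, the following uses the `validate=True` option of Python's `base64.b64decode`, which rejects characters outside the standard alphabet instead of silently discarding them. The helper name `try_strict_decode` is hypothetical. Note that a successful decode only yields bytes; what a program then does with those bytes is a separate question.

```python
import base64
from typing import Optional


def try_strict_decode(text: str) -> Optional[bytes]:
    """Decode strictly; return None if the text is not valid base64."""
    try:
        return base64.b64decode(text, validate=True)
    except ValueError:
        # binascii.Error (bad characters, bad padding) is a subclass
        # of ValueError, as is the error raised for non-ASCII input.
        return None


print(try_strict_decode("aGVsbG8gd29ybGQ="))  # decodes
print(try_strict_decode("not base64!"))       # characters outside the alphabet
print(try_strict_decode("hello"))             # impossible length
```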
10.   There is nothing in base64-encoded data which identifies the media type of the data. External information must be provided to tell the media type. Without external information, the media type must be discovered (if possible).
a.       David Carlisle: The same is true of most plain text files, base64-encoded or not.
11.   If there is external information about the media type of the data, the external information might be incorrect, either by accident or by intent.
12.   Processing an object, assuming it is of media type A when it is actually of media type B, might cause harm. 
13.   Decoding base64 text is a trivial task. 
14.   Data that is base64-encoded cannot be directly viewed, used, or inspected.
a.       David Carlisle: Any data on a computer needs to be decoded. If you have a JPEG image or a base64-encoded JPEG image, then in either case you need to interpret the bytes in the data as an image. The software needed is slightly different but I don't think there's any conceptual difference. 
b.       Eliot Kimber: I think the point Roger is trying to make is that from the standpoint of communicating data to a system that is expected to be able to process it in some way, if you provide a stream of bytes that is a JPEG image then the consumer of that stream will be able to process it directly, assuming it is otherwise able to recognize JPEG byte sequences and decode them (e.g., a browser with a built-in JPEG renderer). If you provide a sequence of characters representing the base64 encoding of the same image, then another decoding step is first required.
c.       It’s also important that the JPEG byte sequence is self-identifying as being JPEG data (using the magic numbers associated with well-known media types)—a receiver of arbitrary byte sequences can at least distinguish those that claim to be of a particular type by use of magic numbers from those that don’t.
d.       By contrast, a sequence of characters that might be a base64 encoding of something is not self-describing as being base64, and is also not self-describing as to the nature of the data it encodes. Thinking about it now, it would definitely be convenient if base64 encoding had something that self-identified it as base64-encoded data and further indicated what the original data was, if it can be meaningfully identified. For example, it might be useful to know that the base64 data encoded a JPEG and not a PNG.
e.       Of course, once the base64 encoded data is decoded it is no different from any other byte sequence that might be received. The fact that it had been base64 encoded is not really interesting.
15.   Compared to data that is not encoded, viewing/using/inspecting base64-encoded data requires an additional step: decode and then view/use/inspect. 
a.       David Carlisle: Shrug. Getting from the bytes in a file to some human understandable information is a black box in most cases. If the box is a bit longer, so be it.
16.   Without external information about the media type of the data that is base64-encoded, there are two additional steps to viewing/using/inspecting base64-encoded data: decode, determine the media type (if possible), and then view/use/inspect.
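The "decode, determine the media type (if possible), then use" sequence can be sketched as follows. The signature table is a small, illustrative assumption covering a few well-known magic numbers; a real detector would need many more entries, and the function name `sniff_media_type` is hypothetical. A match only means the bytes claim to be of that type.

```python
import base64
from typing import Optional

# A few well-known leading byte signatures (far from exhaustive).
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
}


def sniff_media_type(encoded: str) -> Optional[str]:
    """Step 1: decode. Step 2: guess the media type from magic numbers."""
    decoded = base64.b64decode(encoded, validate=True)
    for signature, media_type in MAGIC_NUMBERS.items():
        if decoded.startswith(signature):
            return media_type
    return None  # the media type could not be discovered


# Made-up sample: a PNG signature followed by filler bytes.
png_like = base64.b64encode(b"\x89PNG\r\n\x1a\n" + bytes(8)).decode("ascii")
print(sniff_media_type(png_like))
print(sniff_media_type("aGVsbG8gd29ybGQ="))  # plain text has no magic number
```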
17.   Text formats such as XML and JSON cannot carry binary data. If binary data must be carried by a text format, the binary data can be base64-encoded, thus generating plain text, and then the base64-encoded plain text can be carried by the text format.
a.       David Carlisle: True, although it's not saying much. It can be base64-encoded but there are lots of other ways as well: either other text encodings or non-text encodings such as using elements to replace control characters not allowed in XML.
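A minimal sketch of carrying binary data in JSON via base64, using only Python's standard library. The field names `name` and `data` and the file name are assumptions made up for this example; as David notes, base64 is only one of several possible approaches.

```python
import base64
import json

# Arbitrary binary data, including byte values that are not valid text.
payload = bytes(range(256))

# Encode to base64 text, then carry that text as a JSON string.
document = json.dumps({
    "name": "blob.bin",  # hypothetical file name
    "data": base64.b64encode(payload).decode("ascii"),
})

# The receiver parses the JSON and decodes the string back to bytes.
restored = base64.b64decode(json.loads(document)["data"], validate=True)
print(restored == payload)
```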
18.   The size of base64-encoded data is roughly 4/3 the size of the original data.
a.       David Carlisle: That is true only when each character to be encoded is one byte. The relative size of the base64-encoded string depends on the relative size of a character. For example in UTF-16 encoded XML, base64-encoded data is 8/3 the size of the original, as each of the characters used in the base64 encoding takes two bytes.
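The arithmetic behind the 4/3 and 8/3 figures can be checked with a short sketch. The 3000-byte input is an arbitrary assumption, and `encoded_length` is a hypothetical helper giving the length of padded base64 output in characters (every 3 input bytes, or a final partial group, become 4 characters).

```python
import base64


def encoded_length(n_bytes: int) -> int:
    """Number of characters in padded base64 for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


raw = bytes(3000)  # 3000 zero bytes; the content does not affect the size
encoded = base64.b64encode(raw).decode("ascii")

chars = len(encoded)                               # characters of base64 text
utf8_bytes = len(encoded.encode("utf-8"))          # one byte per character
utf16_bytes = len(encoded.encode("utf-16-le"))     # two bytes per character

print(chars / len(raw))        # ratio in characters
print(utf8_bytes / len(raw))   # ratio when stored as UTF-8
print(utf16_bytes / len(raw))  # ratio when stored as UTF-16
```

For short inputs, padding makes the ratio worse than 4/3: a single byte still encodes to four characters.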

Copyright 1993-2007 XML.org. This site is hosted by OASIS