XML.orgXML.org
FOCUS AREAS |XML-DEV |XML.org DAILY NEWSLINK |REGISTRY |RESOURCES |ABOUT
OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
Computational Power Required Of XML Languages And The InsecurityThereof

Hi Folks,

 

It is important to know the complexity of the XML language that you are deploying. Its complexity dictates how much computational power is required to process members of the language. The more computational power required, the greater the attack surface. Use the minimum amount of computational power necessary. Certain languages can be combined and the resulting language does not require more computational power. Favor those languages.

Definition: An XML language is the set of XML instance documents that are acceptable.

For example, you may define a Bookstore XML language as the set of all instance documents that have Store as the root element and its content is zero or more Book elements. Here is a member of the language:

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
</Store>

That document has a Store element containing two Book elements. An XML document that contains three (or four or five or …) Book elements is also a member of the language. A document that contains, say, cellphone elements is not a member of the language.

The next few sections give examples of XML languages that require increasingly greater computational power.

Regular Languages

A regular language is one where members of the language can be accepted by a machine that simply transitions through a finite set of states (i.e., there is a finite state machine, FSM, for the language). Here is an example of a regular language: the content of the Store element is zero or more Book elements followed by zero or more Magazine elements.

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
</Store>

Deterministic Context-Free Languages

Context-free languages are more complex than regular languages. They require more computational power. To determine membership requires an additional capability – a stack. A FSM is not powerful enough. As a machine processes the input it needs to keep a record what it has encountered. Here is an example of a deterministic context-free language: the content of the Store element is zero or more Book elements followed by zero or more Magazine elements and the number of Book and Magazine elements must be equal. The XML document in the previous section is not a member of this language since there are two Book elements and three Magazine elements. Here is a member of the language:

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
</Store>

Perhaps you can see why a stack is needed: the machine pushes each Book element onto a stack and for each Magazine element it pops the stack; there must be one Book on the stack when the last Magazine is encountered.

When there is a dependency between the number of elements in two sets, then the language is context-free. In the example there is a dependency between the number of Book elements and the number of Magazine elements. When there is no dependency, then the language is regular. In the prior section there was no dependency between the number of Book and Magazine elements.

Non-Deterministic Context-Free Languages

In the last section no guessing was required; once the machine encountered the first Magazine it switched from pushing to popping the stack. For some languages, however, the machine can't tell when it has reached the stack-popping point and it has to guess. Non-determinism means that guessing is needed. Non-deterministic context-free languages require a more powerful machine than that required for deterministic context-free languages. It requires the ability to correctly guess when to start popping the stack. Here is an example of a non-deterministic context-free language: the content of the Store element is a mix of Book and Magazine elements followed by the same mix of Book and Magazine elements in reverse order. Here is a member of the language:

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Book>...</Book>
   
<Book>...</Book>
</Store>

There is Book, Book, Magazine, Book, and then the reverse: Book, Magazine, Book, Book. You can probably see that a machine to accept members of this language would push, push, push onto the stack and then at some point it guesses that it is time to start popping the stack. Such a machine is more powerful than the machine required in the preceding section.

Non-deterministic context-free languages require more computational power than deterministic context-free languages.

Recursive Languages

Recursive languages are still more complex. They require even more computational power. To determine if something is a member of a recursive language requires a more powerful memory – a memory that the machine can use to store things onto, move left and right, and read from. Here is an example of a recursive language: the content of the Store element contains zero or more Book elements followed by zero or more Magazine elements followed by zero or more DVD elements and the number of Book, Magazine, and DVD elements must be equal. Here is a member of the language:

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
   
<DVD>...</DVD>
   
<DVD>...</DVD>
</Store>

Perhaps you can see that a stack is insufficient – by using only a stack the machine can determine that the number of Book and Magazine elements are equal, but it is unable to determine that there are an equal number of DVD elements. With a strip of memory the machine can write each Book sequentially onto consecutive cells of the memory, moving right with each Book; when the Magazines are encountered the machine reverses direction and overwrites each Book with Magazine; finally when the DVDs are encountered the machine reverses direction again and overwrites the Magazines.

When there is a dependency between the number of elements in three sets, then the language is recursive.

Recursively Enumerable Languages

We've arrived at the most complex languages. The same machinery – memory – is used with these languages. However, a machine that processes these languages may go into an infinite look when trying to decide if something is a member of the language. Here is an example of a recursively enumerable language: the content of each Section element is one or more Title elements followed by an optional Section element and each Section element must have the same content as its parent element, when present. Here is member of the language:

<Section>
   
<Title>...</Title>
</Section>

The machine will halt and accept the input.

This also is a member of the language:

<Section>
   
<Title>...</Title>
   
<Title>...</Title>
</Section>

Again, the machine halts and accepts the input.

Is the following a member of the language?

<Section>
   
<Title>...</Title>
   
<Section>...</Section>
</Section>

In order for it to be a member, the child Section element must have the same content as its parent:

<Section>
   
<Title>...</Title>
   
<Section>
       
<Title>...</Title>
       
<Section>...</Section>
   
</Section>
</Section>

But now the grandchild Section element must have the same content as the child Section. Do you see the infinite loop? The machine never halts on this input. It cannot decide if the XML is a member of the language.

Closure of Regular Languages

Regular languages can be combined – by unions, concatenations, Kleene star, and intersections – and the result is guaranteed to be a regular language. For example, the regular language (Book*, Magazine*) can be concatenated with a (Music*) regular language:

<Store>
   
<Book>...</Book>
   
<Book>...</Book>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
   
<Magazine>...</Magazine>
   
<Music>...</Music>
</Store>

The resulting language (Book*, Magazine*, Music*) is also a regular language.

The ability to combine languages without the resulting language requiring more computational power is very important.

Computational Power and Insecurity

Increased computational power means an increased attack surface.  Experts advise against deploying input languages that require more computational power than a deterministic context-free language. (See the Dartmouth paper referenced in the next section). There is much more to say about this; see the next section.

Use Minimal Computational Power

This is a fantastic paper:

 

                The Halting Problem of Network Stack Insecurity

 

http://www.cs.dartmouth.edu/~sergey/langsec/papers/Sassaman.pdf

 

I'd like to tell you about one of the exciting things in the paper. It can be summarized by this principle:

 

                Principle 1: Starve the Turing Beast - Request

                and Grant Minimal Computational Power

 

Explanation

Do not use more computational power than is needed. An implementation that provides more than the minimal computational strength necessary should be considered broken.

 

Below is an example of an insecurity that is the result of an implementation which is given extra computational strength. Thanks to Jonathan Cranford for explaining this to me.

 

Regular Expressions

Some regular expression engines take an exponential amount of time to decide whether a string matches a regular expression (regex). For example, with this regex:

 

                (a+)+

 

some regex engines churn for hours trying to decide if the following string is invalid:

 

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa1

 

Attackers exploit this to carry out Regular expression Denial of Service (ReDoS) attacks.

 

So what causes regex engines to descend into exponential evaluation times?

 

Short answer: those regex engines were given too much computational power.

 

Long answer: Those regex engines were extended to support back references, which requires additional computational power, which requires back tracking, which results in exponential evaluation times.

 

Explanation

Example of a regex with a back reference:

 

                (ab*)c\1de

 

The \1 is called a back reference – it refers back to the stuff in parentheses.

 

That regex is interpreted like this: A string matches the regex if the string has an 'a' followed by zero or more 'b' followed by 'c', then the next part of the string must match the string generated from the parenthesized part, then 'd' and finally 'e'. Here is a matching string:

 

                abbcabbde

 

Supporting back references requires more computational power than if the regex engine only has to support a regular language.

 

Back references require the power of a non-deterministic finite state automaton (NDFSA).

 

NDFSA are frequently implemented using back tracking.

 

Back tracking is the cause of the exponential time. Here’s how a back tracking regex engine works: "In evaluating the string I took this path and it failed so I'll backup and try another path. Oh, that didn’t work, so I'll back up and try still another path. And on and on."

 

The number of paths a back tracking regex engine must explore grows exponentially with the length of the string to be evaluated.

 

So back tracking results in exponential times for certain regular expressions.

 

And back tracking is the result of using more computational power.

 

And the additional computational power is the result of adding more functionality.

 

Lesson Learned: An implementation should have no more computational power than it needs.

 

/Roger



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 1993-2007 XML.org. This site is hosted by OASIS