The Impact of Regular Expression Denial of Service (ReDoS) in Practice

Published in

InfoSec Write-ups

7 min readNov 9, 2018

This is a brief for the research paper The Impact of Regular Expression Denial of Service (REDOS) in Practice: an Empirical Study at the Ecosystem Scale, published at ESEC/FSE 2018. I was the first author; Christy Coghlan, Francisco Servant, and Dongyoon Lee rounded out the team.

If you want an introduction to ReDoS, see my ReDoS Cheat Sheet.

Summary

Developers use regular expressions (regexes) for many purposes. Sometimes this can lead to security vulnerabilities through Regular Expression Denial of Service (ReDoS). Regexes are often used in web servers in client-facing contexts, as an input validation step. For example, a regex is a nice way to answer the question “Does the string I just got from the client look like an email?” When used in this manner, it is important that regex matches run quickly on untrusted input.

Unfortunately, as prior research has shown, some regex matches have polynomial or even exponential complexity in the length of the input. This bad behavior requires that the program execute a “ReDoS regex” on untrusted input. Though untrusted input is pretty common in the web domain, how common are ReDoS regexes?

In this project we studied the incidence of ReDoS regexes in practice. In short, we:

Extracted about 400,000 unique regexes from about 500,000 npm and pypi modules;
Determined how many of these regexes were ReDoS regexes (~1%)* (see follow-up at the bottom);
Measured their complexity (most are quadratic and start to perform slowly at around 10K characters on a typical desktop machine);
Labeled common families of regexes (we have hundreds of different examples of regexes that identify emails!);
Explored the effectiveness of heuristics to identify ReDoS regexes (lots of false positives); and
Studied how developers fix ReDoS regexes (trim, revise, or replace).

Introduction

Regular expressions (regexes) are a handy tool for text processing. People use them for things like input validation. For example, the question “Does the string the user entered look like an email?” can be answered using a regex.

Since regexes are generally deemed pretty useful, they are built into most modern programming languages. Each programming language has its own regex engine that accepts a regex and a string and decides whether or not the regex matches the string.

As it turns out, it can be expensive to determine whether or not a regex matches a string. Regex matches in most programming languages have worst-case exponential complexity and can also have polynomial complexity (quadratic, cubic, etc.).

Regular Expression Denial of Service (ReDoS)

When one of these expensive regex matches can be triggered by a malicious party, it may cause Regular Expression Denial of Service (ReDoS). If computational resources can be diverted to an expensive regex match instead of to legitimate users, this will deny service to legitimate users.

This type of expensive behavior has been discussed in the academic community since the 1970s, and as far as I know its relevance to web security issues was first noted in 2003. However, there have been no large-scale studies of ReDoS in the wild and I had always thought of it more as an academic parlor trick than as a serious problem in practice. However, during my Sense of Time project I was surprised to find that about 10% of the security vulnerabilities reported in npm were ReDoS vulnerabilities. This made me wonder if ReDoS might be a serious problem in practice.

ReDoS regex example 1: Exponential usernames from Microsoft

During some exploratory work we found this regex in a Microsoft project. It is used to match Windows usernames.

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/

In many languages, this regex is exponential on the input ‘aaa…aaa!’. Here’s an example you can try in Node.js:

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/.exec('a'.repeat(100) + '!')

ReDoS regex example 2: Quadratic emails in Django

We found this regex in Django. It was used to match emails.

/^\S+@\S+\.\S+$/

In many languages, this regex is quadratic on the input ‘@@@…@@@’. Here’s an example you can try in Node.js:

/^\S+@\S+\.\S+$/.exec('@'.repeat(100000))

Why do these regex matches take so long?

Most regex engines use a naive backtracking algorithm that works something like this:

Match characters like you would expect.
Any time you have a choice, try one path. If it doesn’t work, try the other one later.
If you find a match, return success.

This algorithm means that if there are more than a linear number of possible paths to try, then the regex engine will take more than linear time.

Here is an example regex with worst-case exponential behavior. It is a simplified version of the Microsoft regex for usernames shown above.

Railroad diagram for /^(?:a+)+$/. Generated by https://regexper.com

Let’s think about its behavior on the string “aaa!”.

Every time the regex engine sees an “a” it can take either the inner loop or the outer loop from the “a” vertex.
This means that every “a” it sees will double the number of paths it will try on a mismatch: “all of the paths it was already going to try, followed by either the inner loop or the outer loop on this particular ‘a’”.
The final “!” ensures that the regex engine will not find a match along any of these paths.
So with three “a”s we have 2*2*2 paths to try. If we add another “a” we will have (2*2*2)*2 paths to try. For inputs of this form the behavior will be exponential in the number of “a”s.

Collecting regexes

As I mentioned earlier, I was suspicious about the possibility of lots of ReDoS regexes in the wild after my previous study on denial of service attacks in npm modules. To check how common ReDoS regexes are in the wild, we started out by collecting all of the regexes we could get our hands on.

This figure shows how we collected regexes for our analysis.

Get the list of all modules in the npm (JavaScript) and pypi (Python) registries.
Keep the ones that have repositories we can clone.
Statically extract regexes from JavaScript (npm) and Python (pypi) files. We found that regexes were used pretty widely: 45% of the npm modules we studied, and 35% of the pypi modules we studied, included at least one regex.
Reduce to (giant) lists of unique regexes.

Identifying ReDoS Regexes

This figure shows how we identified ReDoS regexes from our giant lists of regexes (“corpuses”).

Apply ReDoS regex detectors (see here for drivers).
Dynamically validate the regexes they think are dangerous in the language of interest (npm → Node.js; pypi → Python).
Determine how many modules contained those regexes in their GitHub/etc. project.

We found similar results in npm and pypi.

Prominent places we found ReDoS regexes

We found ReDoS regexes (eek!) in:

MongoDB
Hapi
Django

We also found them (double eek!) in:

Node.js core libraries
Python core libraries

Happily, all of these have now been fixed.

How vulnerable are ReDoS regexes?

We calculated the time it took the regex engines to perform regex matches on these ReDoS regexes, for varying-sized inputs. Most of the regexes had quadratic complexity. A few had exponential or cubic.

Again, we found similar results in npm and pypi.

Degree of vulnerability of ReDoS regexes

How do people fix ReDoS regexes?

We wanted to understand how people fix ReDoS regexes after they identify them in their code. We looked at the fixes for the 37 ReDoS vulnerabilities reported in the CVE and Snyk.io databases.

Developers used one of three strategies:

Trim Don’t touch the regex. Just trim the input so it’s not too long, thus bounding the worst-case behavior. This works OK provided that (1) legitimate input shouldn’t be “too long” (e.g. 500 characters?); and (2) the worst-case behavior is not exponential.
Revise Tweak the regex so it is no longer vulnerable to ReDoS but it still matches the same language. Or similar. Be careful — we saw several cases where a revised regex became safe to one input but vulnerable to a new style of input.
Replace Replace the regex with string operations.

We then sent vulnerability disclosures to 284 projects describing possible ReDoS regexes within them. 48 of the projects responded with a fix. “Revise” remained the most popular option. Interestingly, many of these revised regexes did not match the same set of strings that the original regex did.

Conclusions

We believe our study shows that

ReDoS is a real problem in practice, and
The practitioner community needs better tools to detect and repair ReDoS regexes.

More information

The full paper is available here. It has more details and extra content.
The PowerPoint slides are here.
The research artefact is on Zenodo here. It includes the ~400K unique regexes we identified in npm and pypi modules, in case you happen to have a need for a giant corpus of regexes.
Our drivers for the ReDoS regex detector ensemble are maintained on GitHub here.

Update — 11/7/2019

During the Lingua Franca project (blog writeup here), we improved the super-linear regex detectors to eliminate false negatives. Using the improved detectors, we found that up to 10% of regexes exhibit worst-case super-linear behaviour. That’s an order of magnitude more than we reported in this paper!

Follow Infosec Write-ups for more such awesome write-ups.

InfoSec Write-ups

A collection of write-ups from the best hackers in the world on topics ranging from bug bounties and CTFs to vulnhub…

medium.com

InfoSec Write-ups

The Impact of Regular Expression Denial of Service (ReDoS) in Practice

Summary

Introduction

Regular Expression Denial of Service (ReDoS)

ReDoS regex example 1: Exponential usernames from Microsoft

ReDoS regex example 2: Quadratic emails in Django

Why do these regex matches take so long?

Collecting regexes

Identifying ReDoS Regexes

Prominent places we found ReDoS regexes

How vulnerable are ReDoS regexes?

How do people fix ReDoS regexes?

Conclusions

More information

Update — 11/7/2019

InfoSec Write-ups

A collection of write-ups from the best hackers in the world on topics ranging from bug bounties and CTFs to vulnhub…

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in InfoSec Write-ups

Written by James Davis

No responses yet