March 25, 2016

# How Random are URIs?

The web is an interesting place. It’s full of systems, devices, applications, resources, and other miscellaneous entities that play some part in its operation or use it on a regular basis. URIs are the identifiers or names we assign to these entities to give them meaning and enable interoperability. Sometimes these names are fairly deterministic and predictable, e.g., the URI one might request to get my latest public events on Github would be:

https://api.github.com/users/chris-wood/events


Other times, URIs contain seemingly randomly or unpredictable information such as a unique nonce generated by a client application, encrypted value required by the server, or just some binary data. Without peering into how URIs are generated or used, can we assess the amount of randomness (or lack thereof) contained in these URIs? This is the question we set out to answer in an ongoing research problem. In this post, I’ll describe how we solved this particular problem.

# Assessing URI Randomness

Consider the following simplified URI:

www.github.com/chris-wood/


Let’s rewrite this as follows:

/com/google/chris-wood


Now let me pose the following question. What if we considered this URI as an array of individual components (i.e., one separated by the ‘/’ character)? Given a list of N URIs of length M, we can treat this as a matrix of N rows and M columns. Each entry of which is a component of a URI.

Let’s now ask the following questions:

1. What is the amount of randomness in each component of these URIs?
2. How much information about the i-th component is leaked by the previous (i - 1) components?

In this context, we use Shannon’s notion of entropy as a measure of randomness. This is defined as follows:

\begin{align} H(X) = -\sum_{x \in X}P(X = x)\log(P(X = x)). \end{align}

The chain rule is useful to quantify the amount of entropy that is lost when new information is presented. Formally, the entropy of $X$, if conditioned on the entropy of $Y$, can decrease by at most the entropy in $Y$, i.e., \begin{align} H(X|Y) = H(X,Y) - H(Y), \end{align}

where $H(X,Y)$ is the joint entropy of $X$ and $Y$ defined as

\begin{align} H(X,Y) = -\sum_{x \in X}\sum_{y \in Y}\Pr(X = x, Y = y)\log(\Pr(X = x, Y = y)). \end{align}

If we treat each column of the URI component as a set of samples we can then compute the entropy of each component. Using this, we can compute the conditional entropy of the i-th component given the previous (i - 1) components.

# Computing Entropy

The last step we need to complete in our query for URI randomness is to actually compute the entropy of a set given a list of samples. Given the definition presented earlier, this is a fairly straightforward task. We just need to do the following:

1. Compute the PMF of the set by counting the number of occurrences of each element and then dividing it by the total number of elements in the set.
2. Use the PMF to compute the entropy by looping over every element in the set.

Simple enough. The code to do this is below.

Although correct, this code runs pretty slow a large input. Its exact complexity is linear in the size of the sample space. We could improve this code by parallelizing the PMF or entropy computation steps above. However, since our goal is to compute the entropy for each URI component, it’s probably better to parallelize at a higher level. Specifically, instead of computing the entropy based on the 1st component, and then the 1st and 2nd components, and so on, we can compute the entropy of all combinations in parallel. The code to do this is below. Notice that the only change is to use a ProcessPoolExecutor to walk over the range of columns.

# Performance Comparison

Did we actually do any better by computing the entropy in parallel? Let’s run some tests to find out. I downloaded the set of Cisco-generated URIs from icn-names.net/ and took a large enough sample to make for a visible comparison. I then parsed list of URIs to create the matrix of components. Finally, I ran this matrix through the entropy calculation for each column (component). I used the sequential and parallel versions of the program included above. For a file containing 1,000,000 URIs, the sequential version took approximately 13.4s whereas the parallel version took 4.2s. So, yes, we did better.

# Randomness Results

Let’s finish my actually answering the question. How much better did we do? The plot below shows the actual data. For this data set containing ~1.5GB of plaintext URIs, the joint entropy hovers in [9,14]. Conversely, the conditional entropy is very low (and sometimes negative due to oddities in the data). This means that a URI prefix reveals a great deal of information about the suffix. So, to sum up: names are not good sources of randomness. But that makes sense, of course, since names are supposed to be meaningful and predictable. They are supposed to simplify our lives by assigning simple meaning to complex things. Purely random names would offer very little utility to the human user.