Posted on:

The web is an interesting place. It's full of systems, devices, applications, resources, and other miscellaneous entities that play some part in its operation or use it on a regular basis. URIs are the identifiers or names we assign to these entities to give them meaning and enable interoperability. Sometimes these names are fairly deterministic and predictable, e.g., the URI one might request to get my latest public events on Github would be:

```
https://api.github.com/users/chris-wood/events
```

Other times, URIs contain seemingly randomly or unpredictable information such as a unique nonce generated by a client application, encrypted value required by the server, or just some binary data. Without peering into how URIs are generated or used, can we assess the amount of randomness (or lack thereof) contained in these URIs? This is the question we set out to answer in an ongoing research problem. In this post, I'll describe how we solved this particular problem.

# Assessing URI Randomness #

Consider the following simplified URI:

```
www.github.com/chris-wood/
```

Let's rewrite this as follows:

```
/com/google/chris-wood
```

Now let me pose the following question. What if we considered this URI as an array of individual components (i.e., one separated by the '/' character)? Given a list of N URIs of length M, we can treat this as a matrix of N rows and M columns. Each entry of which is a component of a URI.

Let's now ask the following questions:

- What is the amount of randomness in each component of these URIs?
- How much information about the i-th component is leaked by the previous (i - 1) components?

In this context, we use Shannon's notion of entropy as a measure of randomness. This is defined as follows:

$$ \begin{align} H(X) = -\sum_{x \in X}P(X = x)\log(P(X = x)). \end{align} $$

The chain rule is useful to quantify the amount of entropy that is lost when new information is presented. Formally, the entropy of $X$, if conditioned on the entropy of $Y$, can decrease by at most the entropy in $Y$, i.e., $$ \begin{align} H(X|Y) = H(X,Y) - H(Y), \end{align} $$

where $H(X,Y)$ is the joint entropy of $X$ and $Y$ defined as

$$ \begin{align} H(X,Y) = -\sum_{x \in X}\sum_{y \in Y}\Pr(X = x, Y = y)\log(\Pr(X = x, Y = y)). \end{align} $$

If we treat each column of the URI component as a set of samples we can then compute the entropy of each component. Using this, we can compute the conditional entropy of the i-th component given the previous (i - 1) components.

# Computing Entropy #

The last step we need to complete in our query for URI randomness is to actually compute the entropy of a set given a list of samples. Given the definition presented earlier, this is a fairly straightforward task. We just need to do the following:

- Compute the PMF of the set by counting the number of occurrences of each element and then dividing it by the total number of elements in the set.
- Use the PMF to compute the entropy by looping over every element in the set.

Simple enough. The code to do this is here.
Although correct, this code runs pretty slow a large input. Its exact
complexity is linear in the size of the sample space. We could improve this code
by parallelizing the PMF or entropy computation steps above. However, since
our goal is to compute the entropy for each URI component, it's probably better
to parallelize at a higher level. Specifically, instead of computing the entropy
based on the 1st component, and then the 1st and 2nd components, and so on, we
can compute the entropy of all combinations in parallel. The code to do this
is here.
Notice that the only change is to use a `ProcessPoolExecutor`

to walk over the range of columns.

# Performance Comparison #

Did we actually do any better by computing the entropy in parallel? Let's run
some tests to find out. I downloaded the set of Cisco-generated URIs from
icn-names.net/ and took a large enough sample to make for a
visible comparison. I then parsed list of URIs to create the matrix of components.
Finally, I ran this matrix through the entropy calculation for *each* column
(component). I used the sequential and parallel versions of the program included
above. For a file containing 1,000,000 URIs, the sequential version took approximately
13.4s whereas the parallel version took 4.2s. So, yes, we did better.

# Randomness Results #

Let's finish my actually answering the question. How much better did we do? The plot below shows the actual data. For this data set containing ~1.5GB of plaintext URIs, the joint entropy hovers in [9,14]. Conversely, the conditional entropy is very low (and sometimes negative due to oddities in the data). This means that a URI prefix reveals a great deal of information about the suffix.

So, to sum up: names are not good sources of randomness. But that makes sense, of course, since names are supposed to be meaningful and predictable. They are supposed to simplify our lives by assigning simple meaning to complex things. Purely random names would offer very little utility to the human user.

More posts:

- Next: Stars, Bars, and Names
- Previous: What's in a Name?