It’s time to return to our photo album. In the previous installment, we used Rust to build a simple command-line tool that scans our picture library and calculates a 64-bit perceptual hash of every image. We left with an open question: How can we use those hashes to find similar photos in our album?

One intuitive solution would be to put the images' hashes into a data structure that keeps data sorted and easy to search; a B-tree, for instance. Given a previously unseen hash, the data structure would efficiently tell us which stored hashes are closest to it.

Unfortunately, treating perceptual hashes like numbers won’t get us far. Similarity and distance in the space of hashes have little to do with the concept of difference we know from arithmetic. Two similar hashes might be very far from one another when interpreted as one-dimensional numbers. Consider `0x8001`. It’s only one bit away from both `0x8003` and `0x0001`, but it’s much greater than the latter number. We have to try something else.

Let’s return to what similarity means in terms of perceptual hashes. The more two images look alike, the fewer positions there are on which the binary representations of their hashes differ. What we’re describing here is the Hamming distance, and we can implement it as follows.
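
Here is a minimal sketch; the `Hash` and `Distance` aliases are my assumptions for this post’s 64-bit hashes:

```rust
/// A 64-bit perceptual hash.
type Hash = u64;
/// A number of differing bit positions, between 0 and 64.
type Distance = u32;

/// The Hamming distance between two hashes: XOR yields a 1 exactly at
/// the positions where the inputs disagree, and we count those ones.
fn distance(a: Hash, b: Hash) -> Distance {
    (a ^ b).count_ones()
}
```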

Note that our `distance` function is a valid metric. This means that for arbitrary hashes `a`, `b`, and `c` the function satisfies the following conditions:

1. `distance(a, b) >= 0`,
2. `distance(a, b) == 0` if and only if `a == b`,
3. `distance(a, b) == distance(b, a)`, and
4. `distance(a, b) <= distance(a, c) + distance(c, b)`.

The last property is particularly interesting. It’s called the triangle inequality, and we will return to it shortly. For now, let’s talk about a particular kind of tree.

The BK-tree is a data structure described 45 years ago by Burkhard and Keller. The trees we typically work with are optimized to efficiently look up a given key. A BK-tree, on the other hand, can tell us which keys are most similar to a given one without a costly exhaustive search.

Our BK-tree is either empty, or it is a node.
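
In Rust, this maps naturally to an enum (a sketch; the variant names are my assumption, with `Node` defined below):

```rust
enum BKTree {
    /// A tree containing no hashes at all.
    Empty,
    /// A non-empty tree is represented by its root node.
    NonEmpty(Node),
}
```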

A node contains a hash and non-empty child nodes — its subtrees. There is an important constraint: all hashes in a subtree have exactly the same distance to the hash in the parent node. We’ll illustrate it with an example shortly.

The distance between two different hashes must be between 1 and 64. Therefore, each node can have up to 64 child nodes; one per possible distance value. We model the children as a dictionary `BTreeMap<Distance, Node>`. All hashes in the child node stored at key `k` must have distance `k` to the hash in the parent.
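
A struct along these lines captures that description:

```rust
use std::collections::BTreeMap;

struct Node {
    /// The hash stored in this node.
    hash: Hash,
    /// Subtrees keyed by their distance to `hash`: every hash inside
    /// `children[k]` is exactly `k` bits away from `hash`.
    children: BTreeMap<Distance, Node>,
}
```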

Let’s look at an example. Below, we see a node storing hash `0x2af7`. It has two child nodes. All hashes in the `sub_one` node have a distance of exactly 1 from `0x2af7`. Analogously, all hashes in the `sub_six` node have a distance of exactly 6 from `0x2af7`.
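
Sketched as a literal, with `sub_one` and `sub_six` standing in for the two subtrees:

```rust
// Illustration only: `sub_one` and `sub_six` are placeholders for
// subtrees whose hashes are exactly 1 and 6 bits away from 0x2af7.
let node = Node {
    hash: 0x2af7,
    children: BTreeMap::from([(1, sub_one), (6, sub_six)]),
};
```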

A node storing a single hash can be built using the following function.
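
```rust
impl Node {
    /// A leaf node: a single hash with no children yet.
    fn single(hash: Hash) -> Node {
        Node {
            hash,
            children: BTreeMap::new(),
        }
    }
}
```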

Let’s consider adding elements to our tree. The insertion function must differentiate between empty and non-empty trees.
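
A sketch of that function:

```rust
impl BKTree {
    fn insert(&mut self, new: Hash) {
        match self {
            // An empty tree becomes a tree with a single hash.
            BKTree::Empty => *self = BKTree::NonEmpty(Node::single(new)),
            // Otherwise, delegate to the root node.
            BKTree::NonEmpty(node) => node.insert(new),
        }
    }
}
```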

In the case of an `Empty`, we replace it with a tree holding a single hash. Otherwise, we call the following `insert` function on the nested node.
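
```rust
impl Node {
    // A reconstruction from the description below; the original code
    // may differ in details.
    fn insert(&mut self, new: Hash) {
        if new == self.hash {
            return; // the hash is already in the tree
        }
        let new_dist = distance(self.hash, new);
        // Insert `new` into the child stored under `new_dist`, creating
        // the child with `single` if it is absent. In that case, the
        // recursive call hits the equality check above and returns.
        self.children
            .entry(new_dist)
            .or_insert_with(|| Node::single(new))
            .insert(new);
    }
}
```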

If the `new` hash is equal to the value stored in `self`, we’re done inserting. Otherwise, we calculate `new_dist` as the distance between the hashes. Following the constraint guiding the structure of the tree, we insert `new` into the child node stored under the `new_dist` key. If the child node is absent, we create one using `single`.

Now, it’s time to query our tree. We want to define a `find` function with two arguments:

• the `needle` hash to look up in the tree, and
• our distance tolerance.

The function should return all hashes in the tree whose distance to the sought hash is within the given tolerance.

In the case of an `Empty`, there is nothing we can return. Otherwise, we call `find` on the nested node. For now, let’s make it visit all the children and exhaustively search the tree.
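
A sketch of both levels; collecting matches into a `Vec` is my assumption:

```rust
impl BKTree {
    fn find(&self, needle: Hash, tolerance: Distance) -> Vec<Hash> {
        match self {
            BKTree::Empty => Vec::new(),
            BKTree::NonEmpty(node) => {
                let mut result = Vec::new();
                node.find(needle, tolerance, &mut result);
                result
            }
        }
    }
}

impl Node {
    fn find(&self, needle: Hash, tolerance: Distance, result: &mut Vec<Hash>) {
        // Report this node's hash if it is close enough to the needle.
        if distance(self.hash, needle) <= tolerance {
            result.push(self.hash);
        }
        // Exhaustive for now: recurse into every child unconditionally.
        for child in self.children.values() {
            child.find(needle, tolerance, result);
        }
    }
}
```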

Finally, to create an empty `BKTree` we simply have to return `Empty`.
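
```rust
impl BKTree {
    fn new() -> BKTree {
        BKTree::Empty
    }
}
```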

We can now check whether our tree passes a small test case.
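
Something along these lines; the hash values are made up for the example:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finds_hashes_within_tolerance() {
        let mut tree = BKTree::new();
        for hash in [0x0000, 0x0001, 0x0003, 0xffff] {
            tree.insert(hash);
        }
        // 0x0000 and 0x0003 are one bit away from 0x0001; 0xffff is
        // fifteen bits away and must not be reported.
        let mut found = tree.find(0x0001, 1);
        found.sort();
        assert_eq!(found, vec![0x0000, 0x0001, 0x0003]);
    }
}
```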

Running it with `cargo test` shows no errors. So far, so good, but our `find` function is still searching through the entire tree. Is there a way of pruning some subtrees and visiting fewer nodes?

Let’s return to the triangle inequality. The `distance` function is a valid metric, so for arbitrary hashes `a`, `b`, and `c`, the following inequality always holds:
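
`distance(a, b) <= distance(a, c) + distance(c, b)`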

In the `Node::find` function we’re attempting to find `needle`. Let’s consider one of the subtrees in `children`. The subtree at key `dist` has all of its hashes `dist`-distant from `self.hash`. We can apply the triangle inequality to every hash `sub_hash` stored in the subtree `children[dist]`:
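
`distance(needle, self.hash) <= distance(needle, sub_hash) + distance(sub_hash, self.hash)`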

The distance between `needle` and `self.hash` is `needle_dist`. The distance between every `sub_hash` and `self.hash` is `dist`. We can simplify:
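
`needle_dist <= distance(needle, sub_hash) + dist`

Rearranged, this gives:

`needle_dist - dist <= distance(needle, sub_hash)`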

We now have a lower bound for distances between `needle` and all hashes in the subtree.

Given our `tolerance`, we can only find some similar hashes in the `children[dist]` subtree if:
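
`distance(needle, sub_hash) <= tolerance`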

Putting it all together, we end up with the following condition for visiting the subtree:
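
`needle_dist - dist <= tolerance`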

By adding it as an `if` statement to the `for` loop in `find`, we can prune some nodes during our search. We can do better, though. Let’s apply the triangle inequality again, this time to a different edge:
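
`distance(sub_hash, self.hash) <= distance(sub_hash, needle) + distance(needle, self.hash)`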

We plug in our variables:
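
`dist <= distance(needle, sub_hash) + needle_dist`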

Just like above, given that `distance(needle, sub_hash)` must be less than or equal to `tolerance`, we obtain:
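
`dist - needle_dist <= tolerance`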

We end up with two inequalities helping us to prune subtrees during our search:
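
1. `needle_dist - dist <= tolerance`,
2. `dist - needle_dist <= tolerance`.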

We can interpret this as a condition that `dist` must belong to the interval:
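
`[needle_dist - tolerance, needle_dist + tolerance]`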

Since we represent all the values above using unsigned integers, subtraction poses a risk of underflow. To prevent this, we add `dist` and `needle_dist` to both sides of the first and the second inequality, respectively. The result is:
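
1. `needle_dist <= tolerance + dist`,
2. `dist <= tolerance + needle_dist`.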

Let’s add them as a condition to the `find` function. The resulting code looks as follows.
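
```rust
impl Node {
    // Node::find with pruning; a sketch consistent with the derivation
    // above, though the original code may differ in details.
    fn find(&self, needle: Hash, tolerance: Distance, result: &mut Vec<Hash>) {
        let needle_dist = distance(self.hash, needle);
        if needle_dist <= tolerance {
            result.push(self.hash);
        }
        for (&dist, child) in &self.children {
            // Visit a subtree only if it may contain hashes within
            // `tolerance` of `needle`. Both checks are written without
            // subtraction to avoid unsigned underflow.
            if needle_dist <= tolerance + dist && dist <= tolerance + needle_dist {
                child.find(needle, tolerance, result);
            }
        }
    }
}
```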

That’s it. The condition we added in the `for` loop allows us to prune a lot of nodes without ever visiting them. The lower the value of our tolerance, the more nodes we will ignore. As we showed above, the subtrees we skip cannot contain hashes within our tolerance.

Our tree implementation is ready. All we need now is a bit of extra code to handle I/O interactions. See the source code in its entirety on GitHub.