It’s time to return to our photo album. In the previous installment, we used Rust to build a simple command-line tool that scans our picture library and calculates a 64-bit perceptual hash of every image. We left off with an open question: how can we use those hashes to find similar photos in our album?

One intuitive solution would be to put the images' hashes into a data structure which keeps data sorted and easy to search; a B-tree, for instance. Given a previously unseen hash, the data structure would efficiently tell us which hashes are the closest to the one given in the query.

Unfortunately, treating perceptual hashes like numbers won’t get us far. Similarity and distance in the space of hashes have little to do with the notion of difference we know from arithmetic. Two similar hashes might be very far from one another when interpreted as one-dimensional numbers. Consider 0x8001. It’s only one bit away from both 0x8003 and 0x0001, but it’s much greater than the latter number. We have to try something else.
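
To make the point concrete, here is a quick standalone check of the example above (using 16-bit values for brevity; this snippet is purely illustrative and not part of our tool):

#[test]
fn test_one_bit_apart_but_numerically_distant() {
  let (a, b, c) = (0x8001u16, 0x8003u16, 0x0001u16);
  // Each pair differs in exactly one bit position.
  assert_eq!((a ^ b).count_ones(), 1);
  assert_eq!((a ^ c).count_ones(), 1);
  // Yet as numbers, a is only 2 away from b
  // and 32768 away from c.
  assert_eq!(b - a, 2);
  assert_eq!(a - c, 0x8000);
}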

Let’s return to what similarity means in terms of perceptual hashes. The more two images look alike, the fewer positions there are on which the binary representations of their hashes differ. What we’re describing here is the Hamming distance, and we can implement it as follows.

type Hash = u64;
type Distance = u32;

fn distance(a: Hash, b: Hash) -> Distance {
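  // XOR sets a bit wherever the two hashes differ;
  // counting the set bits yields the Hamming distance.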
  (a ^ b).count_ones()
}

Note that our distance function is a valid metric. This means that for arbitrary hashes a, b, and c the function satisfies the following conditions (spot-checked by the test below):

  1. distance(a, b) >= 0,
  2. distance(a, b) == 0 if and only if a == b,
  3. distance(a, b) == distance(b, a), and
  4. distance(a, b) <= distance(a, c) + distance(c, b).
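
Here is a minimal spot-check of all four conditions over a handful of arbitrary sample hashes (not a proof, but a useful sanity test):

#[test]
fn test_distance_is_a_metric() {
  let hashes: [Hash; 4] =
    [0x0000, 0x8001, 0x2af7, 0xffff_ffff_ffff_ffff];
  for &a in &hashes {
    for &b in &hashes {
      // Condition 1 holds trivially: Distance is unsigned.
      // Conditions 2 and 3:
      assert_eq!(distance(a, b) == 0, a == b);
      assert_eq!(distance(a, b), distance(b, a));
      // Condition 4, the triangle inequality:
      for &c in &hashes {
        assert!(
          distance(a, b) <= distance(a, c) + distance(c, b)
        );
      }
    }
  }
}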

The last property is particularly interesting. It’s called the triangle inequality, and we will return to it shortly. For now, let’s talk about a particular kind of tree.

Triangle inequality. d(a, b) ≤ d(a, c) + d(c, b)

A BK-tree is a data structure described 45 years ago by Burkhard and Keller. The trees we typically work with are optimized to look up a given key efficiently. A BK-tree, on the other hand, can tell us which keys are most similar to a given one, without a costly exhaustive search.

Our BK-tree is either empty, or it is a node.

enum BKTree {
  Empty,
  NonEmpty(Node),
}

A node contains a hash and non-empty child nodes — its subtrees. There is an important constraint: all hashes in a subtree have exactly the same distance to the hash in the parent node. We’ll illustrate it with an example shortly.

use std::collections::BTreeMap;

struct Node {
  hash: Hash,
  children: BTreeMap<Distance, Node>,
}

The distance between two different hashes must be between 1 and 64. Therefore, each node can have up to 64 child nodes; one per possible distance value. We model it as a dictionary BTreeMap<Distance, Node>. All hashes in a child node at a key k must have distance k to the hash in the parent.

Visualisation of a simple tree with two child nodes.

Let’s look at an example. Below, we see a node storing hash 0x2af7. It has two child nodes. All hashes in the sub_one node have a distance of exactly 1 from 0x2af7. Analogously, all hashes in the sub_six node have a distance of exactly 6 from 0x2af7.

Node {
    hash: 0x2af7,
    children: {
        1: sub_one,
        6: sub_six,
    }
}

A node storing a single hash can be built using the following function.

impl Node {
  fn single(hash: Hash) -> Node {
    Node {
      hash,
      children: BTreeMap::new(),
    }
  }
}

Let’s consider adding elements to our tree. The insert function must distinguish between empty and non-empty trees.

impl BKTree {
  fn insert(&mut self, new: Hash) {
    use BKTree::*;
    match *self {
      Empty => {
        *self = NonEmpty(Node::single(new))
      },
      NonEmpty(ref mut node) => {
        node.insert(new)
      },
    }
  }
}

In the case of an Empty tree, we replace it with a tree storing a single hash. Otherwise, we call the following insert function on the nested node.

impl Node {
  fn insert(&mut self, new: Hash) {
    if self.hash != new {
      let new_dist = distance(new, self.hash);
      self.children
        .entry(new_dist)
        .or_insert_with(|| Node::single(new))
        .insert(new)
    }
  }
}

If the new hash is equal to the value stored in self, we’re done inserting. Otherwise, we calculate new_dist as the distance between the hashes. Following the constraint guiding the structure of the tree, we insert new into the child node stored under the new_dist key. If the child node is absent, we create one using single.
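
To make the recursion concrete: inserting 0x2af7, then 0x6af7 (distance 1 from the root), then 0x6bf6 (distance 3 from the root) produces the following tree, written in the same notation as the earlier example.

Node {
    hash: 0x2af7,
    children: {
        1: Node { hash: 0x6af7, children: {} },
        3: Node { hash: 0x6bf6, children: {} },
    }
}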

Now, it’s time to query our tree. We want to define a find function with two arguments:

  1. needle, the hash we are searching for, and
  2. tolerance, the maximum acceptable distance.

The function should return all hashes in the tree whose distance to the sought hash is within the given tolerance.

impl BKTree {
  fn find(
    &self,
    needle: Hash,
    tolerance: Distance
  ) -> Vec<Hash> {
    use BKTree::*;
    match self {
      &Empty => Vec::new(),
      &NonEmpty(ref node) => {
        node.find(needle, tolerance)
      },
    }
  }
}

In the case of an Empty tree, there is nothing we can return. Otherwise, we call find on the nested node. For now, let’s make it visit all the children and search the tree exhaustively.

impl Node {
  fn find(
    &self,
    needle: Hash,
    tolerance: Distance
  ) -> Vec<Hash> {
    // We create a vector to accumulate results.
    let mut result = Vec::new();
    // We calculate the distance between the
    // sought hash and the one stored in `self`.
    // If it's within our tolerance we add it
    // to `results`.
    let needle_dist = distance(self.hash, needle);
    if needle_dist <= tolerance {
      result.push(self.hash)
    }
    // Then we invoke `find` recursively on all
    // children and accumulate results.
    for (_, subtree) in &self.children {
      let found = subtree.find(needle, tolerance);
      for hash in found {
        result.push(hash)
      }
    }
    result
  }
}

Finally, to create an empty BKTree we simply have to return Empty.

impl BKTree {
  fn new() -> BKTree {
    BKTree::Empty
  }
}

We can now check whether our tree passes a small test case.

#[test]
fn test_insertion_and_finding() {
  let mut bk = BKTree::new();
  bk.insert(0x2af7);
  bk.insert(0x6af7);
  for &(needle, dist, ref expected) in &[
    (0x6af7, 0, vec![0x6af7]),
    (0x6bf7, 1, vec![0x6af7]),
    (0x6bf7, 2, vec![0x2af7, 0x6af7]),
    (0x6bf6, 1, vec![]),
    (0x6bf6, 2, vec![0x6af7]),
    (0x6bf6, 3, vec![0x2af7, 0x6af7]),
  ] {
    let found = bk.find(needle, dist);
    assert_eq!(&found, expected);
  }
}

Running it with cargo test shows no errors. So far, so good, but our find function is still searching through the entire tree. Is there a way of pruning some subtrees and visiting fewer nodes?

Triangle inequality in the space of perceptual hashes.

Let’s return to the triangle inequality. The distance function is a valid metric, so for arbitrary hashes a, b, and c the inequality distance(a, b) <= distance(a, c) + distance(c, b) always holds.

In the Node::find function we’re attempting to find needle. Let’s consider one of the subtrees in children. The subtree at key dist has all of its hashes dist-distant from self.hash. We can apply the triangle inequality to every hash sub_hash stored in the subtree children[dist]:

distance(needle, self.hash) <= distance(needle, sub_hash)
                             + distance(sub_hash, self.hash)

The distance between needle and self.hash is needle_dist. The distance between every sub_hash and self.hash is dist. We can simplify:

needle_dist <= distance(needle, sub_hash) + dist

We now have a lower bound for distances between needle and all hashes in the subtree.

distance(needle, sub_hash) >= needle_dist - dist

Given our tolerance, we can only find some similar hashes in the children[dist] subtree if:

tolerance >= distance(needle, sub_hash)

Putting it all together, we end up with the following condition for visiting the subtree.

tolerance >= needle_dist - dist

By adding it as an if statement in the for loop in find, we can prune some nodes during our search. We can do better, though. Let’s apply the triangle inequality again, this time to a different edge.

distance(self.hash, sub_hash) <= distance(self.hash, needle)
                               + distance(needle, sub_hash)

We plug in our variables.

dist <= needle_dist + distance(needle, sub_hash)

Just like above, given that distance(needle, sub_hash) must be less than or equal to tolerance, we obtain:

tolerance >= dist - needle_dist

We end up with two inequalities helping us to prune subtrees during our search:

tolerance >= needle_dist - dist &&
tolerance >= dist - needle_dist

We can interpret it as a condition that dist must belong to the interval:

[needle_dist - tolerance, needle_dist + tolerance].

For example, if needle_dist is 5 and our tolerance is 2, only the subtrees stored under the keys 3 through 7 can possibly contain a match.

Since we represent all of the values above as unsigned integers, subtraction poses a risk of underflow. To prevent it, we add dist to both sides of the first inequality and needle_dist to both sides of the second. The result is:

dist + tolerance >= needle_dist &&
needle_dist + tolerance >= dist

Let’s add them as a condition to the find function. The resulting code looks as follows.

impl Node {
  fn find(
    &self,
    needle: Hash,
    tolerance: Distance
  ) -> Vec<Hash> {
    // We create a vector to accumulate results.
    let mut result = Vec::new();
    // We calculate the distance between the
    // sought hash and the one stored in `self`.
    // If it's within our tolerance we add it
    // to `results`.
    let needle_dist = distance(self.hash, needle);
    if needle_dist <= tolerance {
      result.push(self.hash)
    }
    // Then we invoke `find` recursively on all
    // children which satisfy our condition.
    for (&dist, tree) in &self.children {
      if dist + tolerance >= needle_dist &&
         needle_dist + tolerance >= dist {
        let found = tree.find(needle, tolerance);
        for hash in found {
          result.push(hash)
        }
      }
    }
    result
  }
}

That’s it. The condition we added in the for loop allows us to prune many subtrees without ever visiting them. The lower the tolerance, the more nodes we skip. As we showed above, the pruned subtrees cannot contain hashes within our tolerance.

Our tree implementation is ready. All we need now is a bit of extra code to handle I/O interactions. See the source code in its entirety on GitHub.
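
As a rough sketch of how the pieces fit together, a minimal driver could look like the following; the hash values here are placeholders standing in for hashes computed from real images.

fn main() {
  // In the real tool these hashes come from
  // scanning the photo album.
  let hashes: Vec<Hash> = vec![0x2af7, 0x6af7, 0x6bf6];
  let mut tree = BKTree::new();
  for hash in hashes {
    tree.insert(hash);
  }
  // Print every stored hash within 2 bits of the query.
  for hash in tree.find(0x6bf7, 2) {
    println!("{:#06x}", hash);
  }
}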

I used the BK-tree to scan my photo album for potential duplicates. Some results were exactly what I hoped for. In the GIF above you see a comparison of two images mentioned towards the end of the previous blog post. The similarity is obvious and pHash captures it; the Hamming distance between their hashes is 10.

But there were some results which I found quite surprising. Below you see two photos I took two years apart. Strikingly, the Hamming distance between their hashes is also equal to 10. It’s fascinating that the pHash algorithm was able to preserve the similarity of their composition.

Let’s see how many more surprising similarities I’ll be able to find in my archive in the future. Luckily, even if my album continues to grow and gets scattered across several backup disks, Rust, pHash, and BK-trees will allow me to quickly find what I’m looking for.