One intuitive solution would be to put the images' hashes into a data structure which keeps data sorted and easy to search; a B-tree, for instance. Given a previously unseen hash, the data structure would efficiently tell us which hashes are the closest to the one given in the query.
Unfortunately, treating perceptual hashes like numbers won’t get us far.
Similarity and distance in the space of hashes have little to
do with the concept of difference we know from arithmetics.
Two similar hashes might be very far from one another when interpreted as
It’s only one bit away from both
0x0001, but it’s much greater
than the latter number.
We have to try something else.
Let’s return to what similarity means in terms of perceptual hashes. The more two images look alike, the fewer positions there are on which the binary representations of their hashes differ. What we’re describing here is the Hamming distance, and we can implement it as follows.
Note that our
distance function is a valid metric.
This means that for arbitrary hashes
c the function satisfies the
distance(a, b) >= 0,
distance(a, b) == 0if and only if
a == b,
distance(a, b) == distance(b, a), and
distance(a, b) <= distance(a, c) + distance(c, b).
The last property is particularly interesting. It’s called the triangle inequality, and we will return to it shortly. For now, let’s talk about a particular kind of tree.
BK-tree is a data structure described 45 years ago by Burkhard and Keller. Trees we typically work with are optimized to efficiently look up a given key. A BK-tree on the other hand can tell us what keys are most similar to the given one without a costly exhaustive search.
Our BK-tree is either empty, or it is a node.
A node contains a hash and non-empty child nodes — its subtrees. There is an important constraint: all hashes in a subtree have exactly the same distance to the hash in the parent node. We’ll illustrate it with an example shortly.
The distance between two different hashes must be between 1 and 64.
Therefore, each node can have up to 64 child nodes; one per possible
We model it as a dictionary
All hashes in a child node at a key k must have distance k to the hash in
Let’s look at an example.
Below, we see a node storing hash
It has two child nodes.
All hashes in the
sub_one node have a distance of exactly 1 from
Analogously, all hashes in the
sub_six node have a distance of exactly 6 from
A node storing a single hash can be built using the following function.
Let’s consider adding elements to our tree. The inserting function must differentiate between empty and non-empty trees.
In case of an
Empty, we replace it with a tree with a single hash.
Otherwise, we call the following
insert function on the nested node.
new hash is equal to the value stored in
self, we’re done inserting.
Otherwise, we calculate
new_dist as the distance between the hashes.
Following the constraint guiding the structure of the tree, we insert
the child node stored under the
If the child node is absent, we create one using
Now, it’s time to query our tree.
We want to define a
find function with two arguments:
needlehash to look up in the tree, and
- our distance tolerance.
The function should return all hashes in the tree whose distance to the sought hash is within the given tolerance.
In case of an
Empty, there is nothing we can return.
Otherwise we call
find on the nested node.
For now, let’s make it visit all the children and exhaustively search the tree.
Finally, to create an empty
BKTree we simply have to return
We can now check whether our tree passes a small test case.
Running it with
cargo test shows no errors.
So far, so good, but our
find function is still searching through the entire
Is there a way of pruning some subtrees and visiting fewer nodes?
Let’s return to the triangle inequality.
distance function is a valid metric, so for arbitrary hashes
c, the inequality is always true.
Node::find function we’re attempting to find
Let’s consider one of the subtrees in
The subtree at key
dist has all of its hashes
We can apply the triangle inequality to every hash
sub_hash stored in the
The distance between
The distance between every
We can simplify:
We now have a lower bound for distances between
needle and all hashes in the
tolerance, we can only find some similar hashes in the
children[dist] subtree if:
By putting it together we end up with a following condition for visit the subtree.
By adding it as an
if statement in the
for loop in
find we can prune some nodes during our search.
We can do better though.
Let’s apply the triangle inequality again, this time to a different edge.
We plug our variables.
Just like above, given that
distance(needle, sub_hash) must be lesser or equal
tolerance, we obtain:
We end up with two inequalities helping us to prune subtrees during our search:
We can interpret it as a condition that
dist must belong to the
Since we represent all the values above using unsigned integers,
subtraction poses a risk of an underflow.
To prevent this we add
needle_dist to both sides
of the first and the second inequality respectively.
The result is:
Let’s add them as a condition to the
The resulting code looks as follows.
The condition we added in the
for loop allows us to prune a lot of nodes
without ever visiting them.
The lower the value of our tolerance the more nodes we will ignore.
As we showed above, they cannot contain hashes within our tolerance.
Our tree implementation is ready. All we need now is a bit of extra code to handle I/O interactions. See the source code in its entirety on GitHub.
I used the BK-tree to scan my photo album for potential duplicates. Some results were exactly what I hoped for. In the GIF above you see a comparison of two images mentioned towards the end of the previous blog post. The similarity is obvious and pHash captures it; the Hamming distance between their hashes is 10.
But there were some results which I found quite surprising. Below you see two photos I took two years apart. Strikingly, the Hamming distance between their hashes is also equal to 10. It’s fascinating that the pHash algorithm was able to preserve the similarity of their composition.
Let’s see how many more surprising similarities I’ll be able to find in my archive in the future. Luckily, even if my album continues to grow and gets scattered across several backup disks, Rust, pHash, and BK-trees will allow me to quickly find what I’m looking for.