Perceptual hashing is an interesting technique for processing media files. I stumbled upon it not long ago and find it very exciting. Just like cryptographic hashing algorithms such as the SHA family, perceptual hashing functions generate short hash values for given input files. They work on media content, such as images, videos and audio clips.

Perceptual and cryptographic hashing differ in an interesting way. In case of the latter, minor changes to input result in an entirely different hash value. Take a look at the example below; changing just a single letter yields a completely different output.

$ echo hello | openssl sha1
f572d396fae9206628714fb2ce00f72e94f2258f
$ echo hallo | openssl sha1
56ac1c08fa5479fd57c4a5c65861c4ed3ed93ff8

This is an important property. Thanks to it, we can easily tell whether two files are different by simply comparing their cryptographic hashes. The rsync algorithm, Git commit identifiers and Bitcoin mining are just some applications of this property.

Perceptual hashing offers us a reversed guarantee. Minor changes to its input result in hashes which are similar to one another. Specifically, a bitwise exclusive OR of hashes of two similar inputs will have most bits unset. Perceptual hashes preserve similarity of its inputs.

pHash is an example of a perceptual hashing algorithm. It’s published as an open source library with a C API. Let’s try it out by building a simple command-line tool for calculating perceptual hashes of images. To make matters interesting we’ll do it with Rust; a language featured on this blog before, most recently when Kofi partially automated airing the office.

pHash isn’t available in most popular package managers, but we can install it straight from the source. We’ll focus on images, so after satisfying the library’s dependencies all we need is:

$ ./configure --disable-audio-hash \
              --disable-video-hash
$ make -C src
$ sudo make install -C src

With the library in place, let’s use Cargo to create a Rust project.

$ cargo new --bin phash

To interact with the C API of pHash we’ll need some type aliases available in the libc crate. Let’s add it to the list of dependencies in Cargo.toml

[package]
name = "phash"
version = "0.1.0"
authors = ["Jan Stępień"]

[dependencies]
libc = "0.2.36"

The function we’re interested in is ph_dct_imagehash. Here’s its declaration from the pHash.h header file.

int ph_dct_imagehash(const char *file, ulong64 &hash);

To call the function we need to declare its interface in an extern block. We have to name the library the function is implemented in; otherwise the linker wouldn’t be able to resolve it.

extern crate libc;
use libc::{uint64_t, c_char, c_int};

#[link(name = "pHash")]
extern {
  fn ph_dct_imagehash(
    file: *const c_char,
    hash: *mut uint64_t
  ) -> c_int;
}

We can now define a function adapting the C API to Rust. For one, instead of using an integer value to signal an error or a success we can model it with the Option type.

use std::ffi::CString;

fn imagehash(file: &str) -> Option<u64> {
  let mut hash: u64 = 0;
  let cstr = CString::new(file).unwrap();
  unsafe {
    let ptr = cstr.as_ptr();
    match ph_dct_imagehash(ptr, &mut hash) {
      0 => Some(hash),
      _ => None,
    }
  }
}

We want our tool to take file names as its arguments and pass them to imagehash. It should print hexadecimal hashes of all successfully processed images and report all errors.

use std::env;

fn main() {
  env::args().skip(1).for_each(|file| {
    match imagehash(&file) {
      Some(hash) => println!("{:016x} {}", hash, file),
      None => eprintln!("Failed to hash {}", file),
    }
  })
}

We can now compile our tool with cargo build and let it process some files. Just as expected, hashes of two photos I took recently in short succession reflect their similarity; they differ just on 10 out of 64 bits.

$ target/debug/phash IMG_4867.jpg IMG_4868.jpg
35066a79990aa7e7 IMG_4867.jpg
15066e39914e17f7 IMG_4868.jpg

Now we can process our entire photo library and generate a perceptual hash of every image. Using find we invoke our tool on all JPEG files we’ve got in ~/pictures.

$ find ~/pictures -iname '*.jpg' \
   -exec target/debug/phash {} +

Notice that we’re processing one file at a time, but the task itself is pleasingly parallelizable. Using find, we can generate a null-separated list of files and pipe them to xargs. We instruct the latter to run four parallel instances of phash passing 50 files to each one.

$ find ~/pictures -iname '*.jpg' -print0 \
   | xargs -0 -P 4 -n 50 target/debug/phash

Alternatively, we can implement parallel processing within phash by using Rayon, a data-parallelism library for Rust. Let’s add rayon = "0.9.0" to the dependency list in Cargo.toml and modify our main function. We need to accumulate all command-line arguments into a single vector. Afterwards, we create a parallel iterator over the vector. The rest remains unaltered. Parallelism comes nearly for free.

extern crate rayon;
use rayon::prelude::*;

fn main() {
  let files: Vec<_> = env::args().skip(1).collect();
  files.par_iter().for_each(|file| {
    match imagehash(&file) {
      Some(hash) => println!("{:016x} {}", hash, file),
      None => eprintln!("Failed to hash {}", file),
    }
  })
}

Thanks to Rayon, our tool will use all cores of our machine to their fullest. We don’t even need to specify the level of parallelism explicitly.

Our job here is done. It took us just a screenful of code to implement what we wanted, multi-threaded processing included. Feel free to review the source code in its entirety on GitHub.

A task arising quite naturally at this point is similarity search. Given hashes of all those photos can we tell if a given picture resembles something in our library? Because — I could swear — it does look oddly familiar…

That’s a question I’ll address in a follow-up post. Brace yourself for more Rust, this time with algorithms and data structures. Until next time!

TAGS