Blockchain Mining: Embarrassingly Parallel?

In the previous post, we looked at how to use Rust for the task of mining new blocks, in the context of building our own blockchain. We showed two implementations, one of them imperative, using mutable state, the other one functional and immutable, using Rust’s iterators.

It was satisfying to see that both implementations were similarly fast. However shouldn’t it be possible to be even faster? Both approaches we have seen are single-threaded. At first glance, the problem of mining a new block, i.e. of finding a proof value for which the hash of the block starts with six zeros, sounds embarrassingly parallel. Shouldn’t it be possible to divide the task of finding a valid proof value among multiple threads, so that the remaining cores in our machine are not just idling around?

In this blog post, we are going to look at three different approaches, all of them using multiple threads, and we’re going to compare their runtime performances with each other and with that of the two single-threaded solutions.

Deterministic results with parallel iterators

In our first attempt to parallelize the work of finding a valid proof, we aim for deterministic results. This means that the proof that should be used should always be the same for a given block, and the same one that would have been used in the two single-threaded solutions.

To do that, we are going to use a crate called Rayon, a popular work-stealing parallelism library for Rust. Among other things, it provides parallel iterators. Their API is almost identical to Rust’s standard iterators, but under the hood, the data of the iterator is processed in parallel, by multiple threads, whenever possible.

Here is our implementation using Rayon’s parallel iterators:

extern crate rayon;

use self::rayon::prelude::*;

impl Block {
    pub fn mine_with_parallel_iterator_find_first(block_candidate: &Block, prefix: &str) -> Block {
        (0..u64::max_value()).into_par_iter().map(|proof| Block {
            index: block_candidate.index,
            timestamp: block_candidate.timestamp,
            proof: proof,
            transactions: block_candidate.transactions.clone(),
            previous_block_hash: block_candidate.previous_block_hash.clone(),
        }).find_first(|b| Self::valid(&Self::hash(b), prefix)).unwrap()
    }
}

Observe how similar this implementation is to the single-threaded one based on iterators that was presented in the previous blog post. Again, we map from a numerical proof value to a new Block. There are two notable differences though: For one, we cannot use a half-open range with Rayon iterators, so we use an upper bound of u64::max_value(). Moreover, Rayon’s parallel iterators do not have a find method. Instead, we use the find_first consumer, which returns the first result for which the predicate passed to it returns true. This has the same semantics as using find on standard iterators.

Let’s do some benchmarking for this implementation, using the measure function we defined in the previous blog post:

let previous_block = Block::genesis();
measure("test_mine_with_parallel_iterator_find_first", || {
    let block = Block::new(1524480511, Vec::new(), &previous_block);
    format!("{:?}", Block::mine_with_parallel_iterator_find_first(&block, "000000").proof)
});

Here is the result on my ancient MacBook Pro (Retina, 13-inch, Late 2012, 2,9 GHz Intel Core i7, four cores):

test_mine_with_parallel_iterator_find_first:
mean:	18885 ms/iter
median:	18760 ms/iter
min:	18547 ms/iter
max:	19629 ms/iter
stddev:	334

Ouch! That’s about two times slower than the single-threaded versions from the previous blog post. Maybe there is too much coordination overhead, and the individual tasks (computing and validating a single hash) are too small for the use of multiple threads to have any benefit.

Non-deterministic results with parallel iterators

It’s not strictly necessary for the parallelized version to pick exactly the same proof that our single-threaded versions would have picked. Can we improve the performance if we loosen this constraint and accept any proof value leading to a valid hash? After all, Rayon’s parallel iterators will not have to maintain the ordering. Instead, each thread can process a big chunk of the items of the whole iterator without any coordination with the others. As soon as one of them finds a block with a valid hash, we’re done.

impl Block {
    pub fn mine_with_parallel_iterator_find_any(block_candidate: &Block, prefix: &str) -> Block {
        (0..u64::max_value()).into_par_iter().map(|proof| Block {
            index: block_candidate.index,
            timestamp: block_candidate.timestamp,
            proof: proof,
            transactions: block_candidate.transactions.clone(),
            previous_block_hash: block_candidate.previous_block_hash.clone(),
        }).find_any(|b| Self::valid(&Self::hash(b), prefix)).unwrap()
    }
}

The implementation looks almost identical. The only difference is that this time, we call the find_any method instead of find_first.

Again, let’s do some benchmarking:

let previous_block = Block::genesis();
measure("test_mine_with_parallel_iterator_find_any", || {
    let block = Block::new(1524480511, Vec::new(), &previous_block);
    format!("{:?}", Block::mine_with_parallel_iterator_find_any(&block, "000000").proof)
});

This time, the result looks like this:

test_mine_with_parallel_iterator_find_any:
mean:	5719 ms/iter
median:	5682 ms/iter
min:	5583 ms/iter
max:	5891 ms/iter
stddev:	114

Using four cores, we are now almost twice as fast as with the single-threaded versions.

Low-level parallelism with channels

Rayon’s parallel iterators are pretty amazing because they give us a very familiar, high-level and declarative API, taking care of all the cumbersome details of parallelising our work for us.

However, I was curious to see if using a more low-level approach could give us an additional performance benefit. We’re going to spawn four threads (one for each core on my machine) and take care of assigning the work each of them is supposed to do ourselves. The idea is that each thread knows beforehand which proof values it needs to validate. Each thread will maintain its own mutable Block, on which the proof value will be incremented until a valid hash is found or until it detects it’s not supposed to run any more. Once a valid hash is found, the main thread has to be notified about it.

This is by far the most complex solution to our problem, and it involves a few tools from Rust’s concurrency toolbox:

an AtomicBool so that the threads know whether to keep running
Arc (atomic reference count) so that we can safely send references to our AtomicBool to multiple threads
threads, which we can spawn with the, well, spawn function
channels, so that a valid hash can be communicated from the worker threads to the main thread

Our implementation will be somewhat less succinct than the previous ones, so let’s start with a skeleton that looks like this:

use std::thread::spawn;
use std::sync::mpsc::channel;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

impl Block {
    pub fn mine_with_channels(block_candidate: &Block, prefix: &str) -> Block {
        // to be implemented    
    }
}

The following code snippets are all to be interpreted as being pasted into the mine_with_channels implementation.

let num_threads: usize = 4;

let keep_running = Arc::new(AtomicBool::new(true));
let (sender, receiver) = channel();

let mut handles = Vec::with_capacity(num_threads);

First, we hardcode the number of threads to four, to keep the example simple. You would probably want to make this configurable or try to detect the number of cores in a real application. Next, we define the keep_running flag as an AtomicBool wrapped in an Arc.

Calling the channel function creates a new channel, but what the function returns is a tuple of the sender and receiver. The function takes a type parameter, but we are relying on Rust’s powerful type inference. Since we are going to send values of type Block to the sender later in the function, the Rust compiler knows that these will be a Sender<Block> and a Receiver<Block>.

Finally, we create mutable vector that is supposed to contain the handles to the threads we are going to create a bit later. Again, due to type inference and what will come later, handles will automatically be inferred to be of type Vec<JoinHandle>.

Next up, we need to spawn our worker threads and tell them what to do:

for thread_id in 0..num_threads {
    let keep_running_ref = keep_running.clone();
    let mut block = block_candidate.clone();
    let prefix = prefix.clone().to_string();
    block.proof = thread_id as u64;
    let sender = sender.clone();
    let handle = spawn(move || {
        while keep_running_ref.load(Ordering::SeqCst) && !Self::valid(&Self::hash(&block), &prefix) {
            block.proof += num_threads as u64
        }
        sender.send(block.clone()).unwrap();
        ()
    });
    handles.push(handle);
}

Remember that keep_running is an Arc<AtomicBool>. When we call clone on it, this does not actually copy the AtomicBool. Instead, it increments the reference count maintained by the Arc.

We also need to create a new Block and prepare it for the respective thread. We will use the thread id, a value from 0 to 3, as the initial proof value.

Finally, before spawning the thread, we clone the sender returned by the channel function, because what we need is one channel with one receiver and multiple senders.

The closure we pass to the spawn function is what gets executed on the new thread. We check whether we are still supposed to run and whether the block candidate is invalid. As long as that is true, we increment the proof value. This time, however, we do not increment it by 1, but by the number of threads – each thread is responsible for a specific set of proof values, and it knows which ones those are without any coordination with other threads. Once this while loop is finished, we send a clone of the current state of our block to the channel. This is also what helps the Rust compiler infer the type of the sender to be Sender<Block>. After sending the Block, the thread will stop running.

For each thread we create, the spawn function returns a JoinHandle. We will need this later in our main thread to wait for all worker threads to be finished:

let block = receiver.recv().unwrap();
keep_running.store(false, Ordering::SeqCst);

for handle in handles {
    handle.join().unwrap();
}

block

In the main thread, we call recv() on our receiver in order to block until one of the worker threads has found a valid hash. Once we have received a Block, we switch the state of the keep_running flag to false. Next, we have to wait for all worker threads to finish, which should happen instantaneously, given that we have already received our Block. We do this by calling the join method on the respective handle. Finally, we can return the received block.

Now, let’s see if all this low-level code was really worth it and do some benchmarking:

let previous_block = Block::genesis();
measure("test_mine_with_channels", || {
    let block = Block::new(1524480511, Vec::new(), &previous_block);
    format!("{:?}", Block::mine_with_channels(&block, "000000").proof)
});

And this is the result:

test_mine_with_channels:
mean:	4594 ms/iter
median:	4535 ms/iter
min:	4408 ms/iter
max:	4965 ms/iter
stddev:	169

Nice! We are twice as fast as the single-threaded solutions, and indeed a bit faster than the optimized parallel iterator implementation.

Summary

Parallelising your work can be very easy in Rust. If you are already using iterators, you can keep the expressiveness and readability of your code by simply moving to parallel iterators. However, even for a task that seems to be embarrassingly parallel, you may be surprised by the results. Our naive approach was actually way slower than the single-threaded ones.

Apparently, it’s possible to squeeze out a little bit of additional performance by using more low-level concurrency constructs – at the cost of readability. Whether that’s worth it, is something that needs to be decided for each individual use case.

Also, it’s interesting to see that we’re only able to speed things up by a factor of two using four cores. That’s Amdahl’s Law striking once again.

Finally, the benchmarking approach taken here can certainly be improved. For example, it would probably be better to test each of the algorithms with a broad range of different blocks instead of only a single one.

You can find the complete example code for this and the previous blog post on GitHub:

https://github.com/innoq/rusty-blockchain

Blog-Post