It was satisfying to see that both implementations were similarly fast. However shouldn’t it be possible to be even faster? Both approaches we have seen are single-threaded. At first glance, the problem of mining a new block, i.e. of finding a proof value for which the hash of the block starts with six zeros, sounds embarrassingly parallel. Shouldn’t it be possible to divide the task of finding a valid proof value among multiple threads, so that the remaining cores in our machine are not just idling around?
In this blog post, we are going to look at three different approaches, all of them using multiple threads, and we’re going to compare their runtime performances with each other and with that of the two single-threaded solutions.
Deterministic results with parallel iterators
In our first attempt to parallelize the work of finding a valid proof, we aim for deterministic results. This means that the proof that should be used should always be the same for a given block, and the same one that would have been used in the two single-threaded solutions.
To do that, we are going to use a crate called Rayon, a popular work-stealing parallelism library for Rust. Among other things, it provides parallel iterators. Their API is almost identical to Rust’s standard iterators, but under the hood, the data of the iterator is processed in parallel, by multiple threads, whenever possible.
Here is our implementation using Rayon’s parallel iterators:
Observe how similar this implementation is to the single-threaded one based on iterators that was presented in the previous blog post. Again, we map from a numerical proof value to a new
Block. There are two notable differences though: For one, we cannot use a half-open range with Rayon iterators, so we use an upper bound of
u64::max_value(). Moreover, Rayon’s parallel iterators do not have a
find method. Instead, we use the
find_first consumer, which returns the first result for which the predicate passed to it returns
true. This has the same semantics as using
find on standard iterators.
Let’s do some benchmarking for this implementation, using the
measure function we defined in the previous blog post:
Here is the result on my ancient MacBook Pro (Retina, 13-inch, Late 2012, 2,9 GHz Intel Core i7, four cores):
Ouch! That’s about two times slower than the single-threaded versions from the previous blog post. Maybe there is too much coordination overhead, and the individual tasks (computing and validating a single hash) are too small for the use of multiple threads to have any benefit.
Non-deterministic results with parallel iterators
It’s not strictly necessary for the parallelized version to pick exactly the same proof that our single-threaded versions would have picked. Can we improve the performance if we loosen this constraint and accept any proof value leading to a valid hash? After all, Rayon’s parallel iterators will not have to maintain the ordering. Instead, each thread can process a big chunk of the items of the whole iterator without any coordination with the others. As soon as one of them finds a block with a valid hash, we’re done.
The implementation looks almost identical. The only difference is that this time, we call the
find_any method instead of
Again, let’s do some benchmarking:
This time, the result looks like this:
Using four cores, we are now almost twice as fast as with the single-threaded versions.
Low-level parallelism with channels
Rayon’s parallel iterators are pretty amazing because they give us a very familiar, high-level and declarative API, taking care of all the cumbersome details of parallelising our work for us.
However, I was curious to see if using a more low-level approach could give us an additional performance benefit. We’re going to spawn four threads (one for each core on my machine) and take care of assigning the work each of them is supposed to do ourselves. The idea is that each thread knows beforehand which proof values it needs to validate. Each thread will maintain its own mutable
Block, on which the proof value will be incremented until a valid hash is found or until it detects it’s not supposed to run any more. Once a valid hash is found, the main thread has to be notified about it.
This is by far the most complex solution to our problem, and it involves a few tools from Rust’s concurrency toolbox:
AtomicBoolso that the threads know whether to keep running
Arc(atomic reference count) so that we can safely send references to our
AtomicBoolto multiple threads
- threads, which we can spawn with the, well,
- channels, so that a valid hash can be communicated from the worker threads to the main thread
Our implementation will be somewhat less succinct than the previous ones, so let’s start with a skeleton that looks like this:
The following code snippets are all to be interpreted as being pasted into the
First, we hardcode the number of threads to four, to keep the example simple. You would probably want to make this configurable or try to detect the number of cores in a real application. Next, we define the
keep_running flag as an
AtomicBool wrapped in an
channel function creates a new channel, but what the function returns is a tuple of the sender and receiver. The function takes a type parameter, but we are relying on Rust’s powerful type inference. Since we are going to send values of type
Block to the sender later in the function, the Rust compiler knows that these will be a
Sender<Block> and a
Finally, we create mutable vector that is supposed to contain the handles to the threads we are going to create a bit later. Again, due to type inference and what will come later,
handles will automatically be inferred to be of type
Next up, we need to spawn our worker threads and tell them what to do:
keep_running is an
Arc<AtomicBool>. When we call
clone on it, this does not actually copy the
AtomicBool. Instead, it increments the reference count maintained by the
We also need to create a new
Block and prepare it for the respective thread. We will use the thread id, a value from 0 to 3, as the initial proof value.
Finally, before spawning the thread, we clone the
sender returned by the
channel function, because what we need is one channel with one receiver and multiple senders.
The closure we pass to the
spawn function is what gets executed on the new thread. We check whether we are still supposed to run and whether the block candidate is invalid. As long as that is true, we increment the proof value. This time, however, we do not increment it by 1, but by the number of threads – each thread is responsible for a specific set of proof values, and it knows which ones those are without any coordination with other threads. Once this while loop is finished, we send a clone of the current state of our block to the channel. This is also what helps the Rust compiler infer the type of the sender to be
Sender<Block>. After sending the
Block, the thread will stop running.
For each thread we create, the
spawn function returns a
JoinHandle. We will need this later in our main thread to wait for all worker threads to be finished:
In the main thread, we call
recv() on our
receiver in order to block until one of the worker threads has found a valid hash. Once we have received a
Block, we switch the state of the
keep_running flag to
false. Next, we have to wait for all worker threads to finish, which should happen instantaneously, given that we have already received our
Block. We do this by calling the
join method on the respective handle. Finally, we can return the received block.
Now, let’s see if all this low-level code was really worth it and do some benchmarking:
And this is the result:
Nice! We are twice as fast as the single-threaded solutions, and indeed a bit faster than the optimized parallel iterator implementation.
Parallelising your work can be very easy in Rust. If you are already using iterators, you can keep the expressiveness and readability of your code by simply moving to parallel iterators. However, even for a task that seems to be embarrassingly parallel, you may be surprised by the results. Our naive approach was actually way slower than the single-threaded ones.
Apparently, it’s possible to squeeze out a little bit of additional performance by using more low-level concurrency constructs – at the cost of readability. Whether that’s worth it, is something that needs to be decided for each individual use case.
Also, it’s interesting to see that we’re only able to speed things up by a factor of two using four cores. That’s Amdahl’s Law striking once again.
Finally, the benchmarking approach taken here can certainly be improved. For example, it would probably be better to test each of the algorithms with a broad range of different blocks instead of only a single one.
You can find the complete example code for this and the previous blog post on GitHub: