Demilade Sonuga's blog


regalloc III

2024-10-02

Nobody likes waiting for code to finish compiling. Today, Rust's Cranelift backend can be used to make debug builds faster, but there's still room for improvement. I spent the last few months working on a register allocator which, when used with Cranelift, speeds up the Rust compiler by up to 18%.

I've finally completed the allocator and my PR has been merged. It ended up being pretty similar to the one described in the previous post; a description of the final state of the algorithm can be found here.

Of course, performance engineering is a complex topic that can't be fully described with a single number. The rest of this post goes into some detail about the performance measurements and how to replicate them.

Measuring Performance

If there's one thing I've learned throughout this period, it's that measuring performance is hard.

Rust Compiler Benchmarks

Here's a snapshot of running the Rust compiler benchmarks, comparing the speed of the compiler using Cranelift with the main Ion allocator vs Fastalloc:

Ion vs Fastalloc No of Instructions

The picture shows the performance improvements measured with the no. of instructions metric for the primary benchmarks. Some parts of the primary benchmarks failed because my computer doesn't support LTO.

Anyways, the measurements show an improvement of 0.3%-18%.

This is a snapshot of the improvements measured on the no. of cycles metric:

Ion vs Fastalloc No of Cycles

The performance improvements are pretty similar.

Then this is a snapshot of the performance changes measured on the wall time metric:

Ion vs Fastalloc Wall Time

This shows wildly varying changes, some of which are good and some of which are not. Apparently, wall time is a "noisy" metric for this, meaning that it fluctuates a lot depending on a lot of irrelevant things, and some kind of special setup is required to get better measurements.

Anyways, the no. of cycles & no. of instructions are more stable and reliable.
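One rough way to put a number on "noisy" is to run the same benchmark several times and compare the spread of each metric relative to its mean (the coefficient of variation). The sketch below is only an illustration of that idea: the function name and the sample values are made up, and they are not measurements from this post.

```rust
/// Coefficient of variation (sample standard deviation / mean).
/// Returns None with fewer than two samples or when the mean is zero.
fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    // Sample variance, dividing by n - 1 (Bessel's correction).
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean)
}

fn main() {
    // Hypothetical numbers for illustration only; NOT measurements from this post.
    let wall_time_secs = [1.92, 2.31, 1.75, 2.60, 2.05];
    let instruction_counts = [1.0000e9, 1.0002e9, 0.9999e9, 1.0001e9, 1.0000e9];

    for (name, samples) in [
        ("wall time", &wall_time_secs[..]),
        ("instructions", &instruction_counts[..]),
    ] {
        match coefficient_of_variation(samples) {
            Some(cv) => println!("{name}: {:.2}% run-to-run variation", cv * 100.0),
            None => println!("{name}: not enough data"),
        }
    }
}
```

A metric whose run-to-run variation is larger than the improvement being measured can't really confirm or refute that improvement, which is why the instruction and cycle counts are the ones to look at here.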

Sightglass Benchmarks

Sightglass is a benchmarking tool for Wasmtime & Cranelift. Besides compile times, it also measures instantiation and execution time.

The only metric I used is the no. of cycles.

Compilation is faster by 1.07x-5x:

Ion vs Fastalloc Sightglass Compilation

On most benchmarks, there's no difference in instantiation.

Ion vs Fastalloc Sightglass Instantiation

On a few, instantiation is faster by 1.03x-1.13x:

Ion vs Fastalloc Sightglass Instantiation

And execution is slower by 1.06x-7.50x:

Ion vs Fastalloc Sightglass Execution

Profiling Results

To check the difference in time spent on register allocation alone, I compiled some of the Sightglass wasm files with Wasmtime using Ion and Fastalloc and compared the profiling results with Samply.

Compiling the Libsodium-Core6 benchmark with Ion:

Compiling Libsodium-Core6 With Wasmtime-Ion

This shows about 2000 samples collected during register allocation.

With Fastalloc:

Compiling Libsodium-Core6 With Wasmtime-Fastalloc

This shows about 290 samples collected during register allocation.

Going by the sample counts, Fastalloc is roughly 6x-7x faster at register allocation for this benchmark.
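The estimate is just the ratio of the two sample counts, assuming both profiles were recorded at the same sampling rate, so that samples are proportional to time. A minimal sketch of that arithmetic, with a hypothetical helper name and the approximate counts quoted above:

```rust
/// Ratio of baseline samples to new samples; None if `new_samples` is zero.
fn speedup(baseline_samples: u64, new_samples: u64) -> Option<f64> {
    if new_samples == 0 {
        return None;
    }
    Some(baseline_samples as f64 / new_samples as f64)
}

fn main() {
    // Approximate sample counts read off the two profiles above.
    let ion_samples = 2000;
    let fastalloc_samples = 290;

    if let Some(ratio) = speedup(ion_samples, fastalloc_samples) {
        println!("Ion / Fastalloc sample ratio: {ratio:.1}");
    }
}
```

Since both counts are rounded, the result is a ballpark figure and not a precise measurement.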

Reproduce

If you're interested in reproducing the measurements:

Sightglass Benchmarks

  1. Clone Sightglass, regalloc2, and Wasmtime:
git clone https://github.com/bytecodealliance/sightglass.git
git clone https://github.com/bytecodealliance/regalloc2.git
git clone https://github.com/bytecodealliance/wasmtime.git
  2. The commits I used while running Sightglass, Wasmtime, and regalloc2 are e5003d5684fc615c52d6b9b571d15dd17644e58e, 7cc466abbeeb8de042d4ba055b8711955018dd56, and 0130fee8dc2e4896e01500f8d6552cf5b241268c, respectively, so you might want to switch to them:

In the Sightglass directory:

git switch --detach e5003d5684fc615c52d6b9b571d15dd17644e58e

In the regalloc2 directory:

git switch --detach 0130fee8dc2e4896e01500f8d6552cf5b241268c

And in the Wasmtime directory:

git switch --detach 7cc466abbeeb8de042d4ba055b8711955018dd56
  3. Run this in the Wasmtime directory:
cargo build --release -p wasmtime-bench-api
cp target/release/libwasmtime_bench_api.so /tmp/ion.so
  4. Now, to use Fastalloc, you have to modify the code a bit because Cranelift isn't yet using a version of regalloc2 that contains it.

In Cargo.toml, change the regalloc2 dependency to the one you cloned:

regalloc2 = "0.10.2" # Delete this
regalloc2 = { path = "<path/to/regalloc2-clone>" } # Add this

Next, you have to set the regalloc2 option to use Fastalloc:

In cranelift/codegen/src/machinst/compile.rs, around line 59:

let mut options = RegallocOptions::default();
options.verbose_log = b.flags().regalloc_verbose_logs();
options.algorithm = regalloc2::Algorithm::Fastalloc; // Add this line.
  5. And compile:
cargo build --release -p wasmtime-bench-api
cp target/release/libwasmtime_bench_api.so /tmp/fastalloc.so
  6. Finally, run the benchmark command in the Sightglass directory:
cd <path/to/sightglass>
cargo run -- \
    benchmark \
    --engine /tmp/ion.so \
    --engine /tmp/fastalloc.so \
    -- \
    benchmarks/all.suite

Profiling

  1. You need Sightglass, regalloc2, and Wasmtime for this. Follow steps 1 and 2 for running the Sightglass benchmarks.

  2. Select a .wasm file from <path/to/sightglass>/benchmarks for the benchmark you want to profile. For example, to profile Spidermonkey, you need <path/to/sightglass>/benchmarks/spidermonkey/benchmark.wasm.

  3. Next, install Samply by following the directions in its readme: https://github.com/mstange/samply.

  4. To run Samply, you need to set the kernel's perf_event_paranoid level to 1:

echo '1' | sudo tee /proc/sys/kernel/perf_event_paranoid

And to avoid weird mmap errors, set the perf_event_mlock_kb limit:

echo '10000' | sudo tee /proc/sys/kernel/perf_event_mlock_kb
  5. In the Wasmtime directory, run:
cargo build --release
samply record -r $(cat /proc/sys/kernel/perf_event_max_sample_rate) --iteration-count 40 target/release/wasmtime compile -C parallel-compilation=n  <selected-benchmark>.wasm

This will create a profile for Ion and start a server to display the results using the Firefox profiler.

For Fastalloc, apply step 4 for running the Sightglass benchmarks, then repeat the build and profile commands above.

Rust Compiler Benchmarks

  1. Clone Rust and rustc-perf:
git clone https://github.com/rust-lang/rust.git
git clone https://github.com/rust-lang/rustc-perf.git

The commits I used for my measurements of Rust and rustc-perf are c4f7176501a7d3c19c230b8c9111b2d39142f83a and 206c6b37ba22c625a7ccb09903f785b3d45992f3, respectively. You can switch to them using:

For Rust:

git switch --detach c4f7176501a7d3c19c230b8c9111b2d39142f83a

For rustc-perf:

git switch --detach 206c6b37ba22c625a7ccb09903f785b3d45992f3
  2. In the Rust directory, create a configuration file, config.toml, with:
[rust]
codegen-backends = ["llvm", "cranelift"]
  3. Build the sysroot:
./x build sysroot

This will output the sysroot in build/<target>/stage1.

  4. In the rustc-perf directory, run:
cargo build --release
cargo run --release --bin collector bench_local --backends Cranelift --id ion <path-to-sysroot>/bin/rustc
  5. Follow steps 1 and 2 for the Sightglass benchmarks to get the Wasmtime and regalloc2 repos, then do step 4 to get a version of Cranelift that uses the local regalloc2.

  6. To measure the performance with Fastalloc, you need to do some configuration and code modifications. In the Rust directory, in the compiler/rustc_codegen_cranelift/Cargo.toml file:

Change all the cranelift- dependencies to 0.113.0, because that's the version of the local Wasmtime checkout:

[dependencies]
# These have to be in sync with each other
cranelift-codegen = { version = "0.113.0", default-features = false, features = ["std", "unwind", "all-arch"] }
cranelift-frontend = { version = "0.113.0" }
cranelift-module = { version = "0.113.0" }
cranelift-native = { version = "0.113.0" }
cranelift-jit = { version = "0.113.0", optional = true }
cranelift-object = { version = "0.113.0" }

Uncomment the cranelift- dependencies under [patch.crates-io] and replace all ../wasmtime with the path to your local Wasmtime repo:

[patch.crates-io]
# Add these:
cranelift-codegen = { path = "<path/to/wasmtime>/cranelift/codegen", default-features = false, features = ["std", "unwind", "all-arch"] }
cranelift-frontend = { path = "<path/to/wasmtime>/cranelift/frontend" }
cranelift-module = { path = "<path/to/wasmtime>/cranelift/module" }
cranelift-native = { path = "<path/to/wasmtime>/cranelift/native" }
cranelift-jit = { path = "<path/to/wasmtime>/cranelift/jit", optional = true }
cranelift-object = { path = "<path/to/wasmtime>/cranelift/object" }
  7. Now, you have to do some more code modifications in the Wasmtime and regalloc2 repos. In each of these files:

<path/to/wasmtime>/cranelift/entity/src/lib.rs
<path/to/wasmtime>/cranelift/bforest/src/lib.rs
<path/to/wasmtime>/cranelift/isle/isle/src/lib.rs
<path/to/wasmtime>/cranelift/codegen/meta/src/lib.rs
<path/to/wasmtime>/cranelift/codegen/src/lib.rs
<path/to/wasmtime>/cranelift/frontend/src/lib.rs
<path/to/wasmtime>/cranelift/module/src/lib.rs
<path/to/regalloc2>/src/lib.rs

add the following after the top-level doc comments (there's probably a better way, but I didn't have the time to find it):
#![allow(explicit_outlives_requirements)]
#![allow(elided_lifetimes_in_paths)]

Then change the gimli dependency from 0.31.0 to 0.29.0 in Wasmtime's Cargo.toml:

gimli = { version = "0.31.0", default-features = false, features = ['read'] } # Remove this
gimli = { version = "0.29.0", default-features = false, features = ['read'] } # Add this

And rerun ./x build sysroot in the Rust directory to get a version of the compiler that uses Fastalloc.
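Adding the two `#![allow(...)]` attributes to eight files by hand is tedious, so here is a minimal sketch of a small helper that could do it. It is not part of Wasmtime or regalloc2: the function name is made up, and it assumes each `lib.rs` starts with a block of `//!` doc comments (if it doesn't, the attributes go at the very top of the file). The helper should be run before the rebuild.

```rust
use std::{env, fs, io};

// The two attributes from the step above.
const ATTRS: [&str; 2] = [
    "#![allow(explicit_outlives_requirements)]",
    "#![allow(elided_lifetimes_in_paths)]",
];

/// Returns `source` with `attrs` inserted right after the leading `//!` block.
/// Attributes that are already present on their own line are skipped,
/// so running the helper twice does not duplicate them.
fn insert_after_top_doc_comments(source: &str, attrs: &[&str]) -> String {
    let lines: Vec<&str> = source.lines().collect();
    // Index of the first line that is not part of the leading `//!` block.
    let split = lines
        .iter()
        .position(|line| !line.trim_start().starts_with("//!"))
        .unwrap_or(lines.len());
    let missing: Vec<&str> = attrs
        .iter()
        .copied()
        .filter(|attr| !lines.iter().any(|line| line.trim() == *attr))
        .collect();

    let mut out: Vec<&str> = Vec::with_capacity(lines.len() + missing.len());
    out.extend_from_slice(&lines[..split]);
    out.extend_from_slice(&missing);
    out.extend_from_slice(&lines[split..]);

    let mut result = out.join("\n");
    result.push('\n');
    result
}

fn main() -> io::Result<()> {
    // Every command-line argument is treated as a path to a lib.rs to patch in place.
    for path in env::args().skip(1) {
        let source = fs::read_to_string(&path)?;
        fs::write(&path, insert_after_top_doc_comments(&source, &ATTRS))?;
    }
    Ok(())
}
```

Compiled with rustc, the resulting binary would take the eight `lib.rs` paths listed above as arguments. It rewrites files in place, so it's worth checking `git diff` in both repos afterwards.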

  8. In the rustc-perf directory:
cargo run --release --bin collector bench_local --backends Cranelift --id fastalloc <path-to-sysroot>/bin/rustc
  9. Follow the steps in the rustc-perf site/README.md to build the site.

Finally:

./target/release/site results.db

This will start a web server that you can use to view the results. When you open the site, under the "Do another comparison" header, enter "ion" for before and "fastalloc" for after, then click the submit button:

Rustc Perf Site