regalloc I

Sometime late March, I was banging away at my keyboard, eyes glued to a screen, dawn to dusk (okay... almost). And, although that's how much of my life in the past twenty-something days of my holiday had been, this period was different because this time, I wasn't just working on my school project or a personal project. The artifact being willed to life was a quick and dirty prototype of the register allocation algorithm I had described in my Google Summer of Code proposal for The Rust Foundation.

After about three days of thinking and coding and thinking (oh, glorious!), I completed it and tested it out on a toy language created just for that purpose.

On May 1st, I received some good news that my proposal had been accepted, and so here I am.

The Project Aim

The Rust compiler mainly uses LLVM for code generation, but it can be swapped out. Cranelift is a compiler backend, like LLVM, but it's faster. In Cranelift's code generation process, there is a stage called register allocation, which takes a lot of time. The aim of my project is to write a faster register allocator that will speed up Cranelift and, hence, the Rust compiler when Cranelift is used as the backend.

Some Project Details

If you aren't familiar with the register allocation problem, there are lots of resources you can check out online. For an explanation of the problem and how it pertains specifically to Cranelift and regalloc2 (the crate used for register allocation in Cranelift), check out this post.

Quick recap: register allocation is essentially taking a program written for a machine with infinite registers and converting it to an equivalent program (one that does exactly what the original does) with a finite number of registers. This is an NP-Complete problem, meaning it's really hard to solve well. The hardness is not in the problem itself (which is quite simple) but in the constraints that the allocator has to stay within. In some instructions, some values must be on the stack. In some others, they must be in any register or a specific register.

The bulk of my job is to design and correctly implement a register allocation algorithm within the regalloc2 crate that can be made available as an option for fast debug compilation.

From the moment I sent my proposal, I sensed that balancing school stuff (projects, assignments, tests) and GSoC would be a serious exercise in self- and time-management, and my suspicions have turned out to be much truer than I expected them to be. Nevertheless, in the past two weeks, I've been able to make some progress.

The original algorithm I described in my proposal (lightly edited to remove some unnecessary stuff) is based on the SSRA algorithm, but extended to handle multiple basic blocks. The main concept behind the algorithm is also the same as the one used in LuaJIT (took a look at it, and it was a blob of C code with minimal docs) and the LLVM's fast register allocator (looked here too; C++, no docs). The main idea is that allocation is done in reverse.

When I sat and told myself it was time to start implementing this for real, my first task was to flesh out the algorithm to account for the different constraints that the allocator would have to accommodate.

In regalloc2, the registers that instructions operate on and their constraints are encapsulated within a struct called Operand. Some of the info it carries:

The virtual register being referenced
Is the register read (use) or written (def)?
The position: does the reading or writing happen before the instruction executes (early) or after the instruction executes (late)? And so on.

The now modified algorithm I'm implementing looks more like this:

main_entry()
    for each block in reversed(blocks)
        for each instruction in reversed(instructions(block))
            alloc_instruction(instruction)
        reload_at_begin()
    reverse_results()

alloc_instruction(instruction)
    if instruction is a branch
        alloc spillslots for the branch args
        put branch args in the spillslots
    for operand in late_operands(instruction)
        alloc_operand(operand)
    for operand in late_def_operands(instruction)
        freealloc(operand)
    for operand in early_operands(operands)
        alloc_operand(operand)
    for operand in early_def_operands(operands)
        freealloc(operand)

It takes into account the decoupling of the operand kind (use/def) and the position (early/late). If you're interested in taking a look at its current state, the implementation I'm currently working on is here.

Project Progress

The current progress I've made is for the initial correct implementation. I haven't started taking some types of constraints into account yet, not even to talk of optimization, which I'm saving for later (Since I typed that live_regs.clone(), I've had to keep telling myself: "Don't worry if you're cloning an entire hashmap after every single iteration. Let's make it work right, then we will make it fast(er)").

As for correctness, regalloc2 already has testing infrastructure in place (Thank you, Chris Fallin!). Currently, my day-to-day development is being guided by the wisdom of the almighty fuzzing oracle, which yesterday revealed unto me a bug in my current implementation that I spent about four hours wrangling with and have still not yet found the answer to (yet!).

You know, I suspect I'm going to be having some really good times with the fuzzer.

Next Steps

I'm going to first focus on correctly implementing the allocator. So, my next task from here will be to fix that bug (and all the others that will pop up after it). Then, I'll move on to making it go as fast as it can possibly go.

I'll also write another post diving into the nitty gritty of the algorithm.