regalloc II

In the previous post, I briefly described the project I'm working on. My aim with this one is to dive into the details of the project, the state it's currently in, and what I plan on doing next.

Register Allocation

As I'm working on a register allocator, the first step we must take before diving in is to understand what exactly is register allocation.

It is simply rewriting code that makes use of an arbitrary number of registers to code that makes use of only a fixed number and does exactly the same thing. That is, the dataflow in the original program must be exactly the same as the dataflow in the rewritten one.

For example, we could have the following code:

block 0:
    0. v0 = 0
    1. v1 = 1
    2. v2 = 2
    3. v3 = add v2, v1
    4. v4 = add v3, v0

The above makes use of 5 virtual registers: v0, v1, v2, v3, and v4. If the target machine had five or more registers, then we could simply do a one-to-one mapping of the virtual registers to physical ones and rewrite them as:

block 0:
    0. p0 = 0
    1. p1 = 1
    2. p2 = 2
    3. p3 = add p2, p1
    4. p4 = add p3, p0

Where pn is a physical register.

But, if the target machine has only 2 physical registers, then such a one-to-one mapping won't be possible. The allocator will have to somehow figure out how to shuffle the data between the registers and the stack such that the dataflow from value definitions to value uses remains the same as in the original program.

If we try to visualize the dataflow of the different values in this program, we'll have something like this:

block 0:                   v0  v1  v2  v3
    0. v0 = 0              |      
    1. v1 = 1              |   |
    2. v2 = 2              |   |   |    
    3. v3 = add v2, v1     |   |   |   |
    4. v4 = add v3, v0     |           |

The vertical lines represent liveranges, the range of instructions during which the virtual registers are live (defined and still going to be used). Looking at this representation, it's clear why a simple one-to-one mapping won't fly: at some points, there are more live values than physical registers. At instruction 3, v0, v1, and v2 are live, but only two registers are available on the machine. In such a scenario, we must enlist the help of the stack to remember values that we can't keep in registers for one reason or the other. In other words, we must spill some values to stack locations (spillslots).

One way of rewriting this program could proceed as follows:

v0 = 0 becomes p0 = 0. So, the physical register p0 now represents v0.
v1 = 1 becomes p1 = 1. The physical register p1 now represents v1.
At this point, the only two physical registers are already holding live values. To proceed with the allocation we need to evict a register: take an innocent value away from a register and relegate it to a spillslot. If we move v0 from p0 to stack0, then p0 will be free for us to use. v2 = 2 now becomes:

mov from p0 to stack0
p0 = 2

At this point, since all the values we need are already in registers, v3 = add v2, v1 becomes ? = add p0, p1. v3 has no register, but that isn't a problem because v2's liverange ends at instruction 4. Its liverange ends after its use, not after the instruction. Because of this, its register can be reused: p0 = add p0, p1. So, after this instruction's execution, v3 ends up in p0.
Now, we need v0 back. v1 is no longer live, so we can reuse its register here. v3 isn't going to be live after its use, so we can reuse its register to hold v4.

The instruction becomes:

mov from stack0 to p1
p0 = add p0, p1

And the program is rewritten to:

0. p0 = 0
1. p1 = 1
mov from p0 to stack0
2. p0 = 2
3. p0 = add p0, p1
mov from stack0 to p1
4. p0 = add p0, p1

In particular, the v0 value is first in p0, then it's moved to stack0, then it's moved to p1. All this shuffling must result in correct code: the value produced in v0's definition must be the same value used in all instructions where v0 is an operand.

Constraints

To make matters even harder for us, there are different constraints that specify what kind of registers must be used, or whether the value should be on the stack or in a specific register.

In the previous code, there is an assumption that all values are to be in registers and that all registers are the same, but that's not true in reality. There are different classes of registers that hold different types of values: integers, floating point numbers, and vectors. And some instructions prefer their operands in a specific register or on the stack.

A Brief Tour of regalloc2

regalloc2 is a crate (library) that carries out register allocation.

Its Philosophy of Life

The way regalloc2 views the world is very simple. There are functions. Functions consist of blocks. Blocks consist of instructions. Each instruction has a bunch of virtual register (vreg) definitions (defs) and uses with associated constraints.

Operands

Like many other things in computer science, abstraction is used to make life easier: the register allocator doesn't need to care about the semantics of the instructions - the way registers are defined and used (and their constraints) is the only info we need.

The virtual registers defined or used in an instruction, along with their constraints and other info, are encapsulated in the structure Operand.

Info in Operand includes:

The operand kind: is it the definition of a virtual register, or a use?
The operand position: does the use or definition of the virtual register take place before the instruction executes (early phase) or after the instruction executes (late phase)?
The register class: is it a float, an integer, or a vector?
The virtual register defined or used.
The constraints governing the allocation: should the allocation be a fixed register? Can it be anywhere? Can it be any register? Must it be on the stack? Or is it the reuse of another virtual register?

For a more in-depth look at the operand's info and its meaning, have a look at this.

Branching

regalloc2 also views functions in SSA form: every virtual register is defined exactly once. Because of control-flow branches, there will be some virtual registers whose values depend on the block it's coming from. regalloc2, rather than using Φ functions, uses block arguments to define those values.

For example:

block1(v1, v2)
    v3 = 0
    if cond, goto block1(v2, v1) else goto block2(v1, v3)

block2(v4, v5)
    v6 = add v4, v5
    goto block1(v6, v4)

Block 1 has block parameters v1 and v2. The values that they take on depend on the predecessor block that branched to it.

The values passed as arguments to block params, branch arguments, are used to define the block parameters.

Representations

Most structures are represented with 1-byte or 4-byte numbers with functions to interpret the bits.

Instructions, blocks, allocations, virtual registers, and operands are represented with structs that are just 4-byte numbers. While physical registers are represented with 1-byte numbers.

These structs also have indexes that uniquely identify an instance in an index space. For example, instructions in the entire function have indexes that range from 0 to n-1, where n is the number of instructions. The same goes for basic blocks.

Interface to The Outside World

regalloc2 provides a function run:

pub fn run<F: Function>(
    func: &F,
    env: &MachineEnv,
    options: &RegallocOptions,
) -> Result<Output, RegAllocError> {
    /* Others */
}

Function is a trait implemented by the client that provides information to the allocator about the blocks in the function and the instructions in the blocks and the operands used in the instructions and block predecessors and successors and some other info.

MachineEnv is a struct holding info about the resources available on the target machine, namely the physical registers available for each register class, scratch registers and registers to use as fixed stack slots:

pub struct MachineEnv {
    pub preferred_regs_by_class: [Vec<PReg>; 3],
    pub non_preferred_regs_by_class: [Vec<PReg>; 3],
    pub scratch_by_class: [Option<PReg>; 3],
    pub fixed_stack_slots: Vec<PReg>,
}

And RegallocOptions allows the client to set some preferences.

Output

The run function returns an Output:

pub struct Output {
    pub num_spillslots: usize,
    pub edits: Vec<(ProgPoint, Edit)>,
    pub allocs: Vec<Allocation>,
    pub inst_alloc_offsets: Vec<u32>,
    pub safepoint_slots: Vec<(ProgPoint, Allocation)>,
    pub debug_locations: Vec<(u32, ProgPoint, ProgPoint, Allocation)>,
    pub stats: ion::Stats,
}

The main fields of concern here are allocs, edits, and inst_alloc_offsets.

allocs is the main output that holds the final allocations for all operands in all instructions in the function. inst_alloc_offsets is essentially a map from an instruction's index number to the starting index for its operand allocations in allocs. Meanwhile, edits are the data-shuffling moves that are to be inserted at specific program points (before or after instructions).

An example:

block 0:
    0. v0 = 0
    1. v1 = 1
    2. v2 = 2
    3. v3 = add v2, v1
    4. v4 = add v3, v0

Instructions have indexes from 0 to 4.

Operands of instruction 0: [v0], instruction 1: [v1], instruction 2: [v2], instruction 3: [v1, v2, v3], instruction 4: [v0, v3, v4]. (I'm omitting the constraint info from the operands).

Rewritten:

0. p0 = 0
1. p1 = 1
mov from p0 to stack0
2. p0 = 2
3. p0 = add p0, p1
mov from stack0 to p1
4. p0 = add p0, p1

allocs: [reg(p0), reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0), reg(p0)]

inst_alloc_offsets: [0, 1, 2, 3, 6]

edits: [(before inst 2, move from p0 to stack0), (before inst 4, move from stack0 to p1)]

In this setup, the allocations for the first, second, and third operands of instruction 3 are allocs[inst_alloc_offsets[3]], allocs[inst_alloc_offsets[3] + 1], and allocs[inst_alloc_offsets[3] + 2], respectively.

The allocations for instruction i's operands are allocs[inst_alloc_offsets[i]..no_of_operands].

The edits are also guaranteed to be sorted by program point.

fastalloc

The algorithm I'm implementing is based on the one described here. I also drew some inspiration from LLVM's fast register allocator (although, it was mostly by visual osmosis. I haven't yet taken a serious look at it).

Usually, when liveranges are computed, they are done in reverse: from the last use to the definition. In register allocation algorithms, these liveranges are usually first computed before allocation takes place. Even the linear scan register allocation algorithm, although it performs allocation in linear time, first computes liveranges before allocation.

The core concept behind the algorithm is to perform allocation as liveranges are being computed, rather than pre-computing.

To extend it to handle multiple basic blocks, registers that outlive the blocks they're defined in are spilled to dedicated stack allocations that are unique to each of them. That way, whenever an instruction makes use of a virtual register not defined in that block, the value is simply loaded from the register's spillslot.

A pseudocode of it is as follows:

data:
    live_regs: {}
    vreg_allocs: {}
    freepregs: {}
    final_allocs: {}
    edits: {}

main
    initialize vreg_allocs to none for all virtual registers
    initialize freepregs to contain all available physical registers
    initialize live_regs to the emtpy set
    initialize final_allocs to none for all operands
    for each block in reversed(blocks(function))
        alloc_block(block)

alloc_block(block)
    for inst in reversed(insts(block))
        alloc_inst(block, inst)
    reload_at_begin(block)
    clear(live_regs)

alloc_inst(block, inst)
    if inst is branch
        add all branch arguments to live_regs
        insert edits to move branch args from temporaries to post-block locations
        insert edits to move from temporaries to block param spillslots
        insert edits to move from branch arg spillslots to temporaries
    for each operand in late_operands(inst)
        process_operand_allocation(inst, operand)
    for each operand in late_def_operands(inst)
        freealloc(vreg(operand))
    for each operand in early_operands(inst)
        process_operand_allocation(inst, operand)
    for each operand in early_def_operands(inst)
        freealloc(vreg(operand))

process_operand_allocation(inst, operand)
    insert(live_regs, vreg(operand))
    current allocation = vreg_allocs[vreg(operand)]
    if operand's current allocation isn't within constraints
        prev_alloc = current allocation
        alloc_operand(inst, operand)
        if prev_alloc is not none
            if operand is a definition
                insert edit to move from vreg(operand) to prev_alloc after inst
            else
                insert edit to move from vreg(operand) to prev_alloc before inst
    else
        final_allocs[(inst, operand)] = current allocation

alloc_operand(inst, operand)
    match operand constraint
        any => alloc_reg(inst, operand)
        reg => alloc_reg(inst, operand)
        stack, fixed-reg, reuse => not yet handling this
    final_allocs[(inst, operand)] = vreg_allocs[vreg(operand)]

alloc_reg(inst, operand)
    if freepregs is empty
        preg = evictreg()
    else
        preg = pop(freepregs)
    vreg_allocs[vreg(operand)] = preg

frealloc(vreg)
    alloc = vreg_allocs[vreg]
    match alloc kind
        reg => push(freepregs, preg)
        stack => do nothing
    vreg_allocs[vreg] = none
    remove(live_regs, vreg)

vreg_allocs is a map from any virtual register to their current allocation at any point in time during processing. freepregs is the list of all free physical registers. live_regs is the set of registers that are live during the processing of a block.

Invariants

Invariants that are maintained at all times:

A virtual register that outlives the block it was defined in will be in its dedicated spillslot by the end of the block.
At the end of a block, before edits are inserted to move values from branch arguments to block parameters spillslots, all branch arguments will be in their dedicated spillslots.
At the beginning of a block, all branch parameters and livein virtual registers will be in their dedicated spillslots.

Workings

The algorithm works as follows:

Blocks and their instructions are processed in reverse:

main
    initialize vreg_allocs to none for all virtual registers
    initialize freepregs to contain all available physical registers
    initialize live_regs to the emtpy set
    initialize final_allocs to none for all operands
    for each block in reversed(blocks(function))
        alloc_block(block)

alloc_block(block)
    for inst in reversed(insts(block))
        alloc_inst(block, inst)
    reload_at_begin(block)
    clear(live_regs)

During processing, vreg_allocs is used to hold the current allocations of registers. Whenever a use is encountered, the vreg must be live, so it's placed in live_regs. Since definitions mark the end of a liverange, vregs are removed from live_regs when their definition is encountered.

Vregs that are block parameters or liveins from predecessors, after the processing of a block, will still be in live_regs, because their definitions weren't encountered.

reload_at_begin ensures that invariant 3 is upheld: it inserts edits to move the liveins and block parameters from their dedicated spillslots to wherever they are expected to be at the block's beginning. The vreg's current allocation in vreg_alloc is updated to be its spillslot. Then, all registers are freed, and live_regs is emptied before proceeding to the processing of another block.

Instruction processing

alloc_inst(block, inst)
    if inst is branch
        add all branch arguments to live_regs
        insert edits to move branch args from temporaries to post-block locations
        insert edits to move from temporaries to block param spillslots
        insert edits to move from branch arg spillslots to temporaries
    for each operand in late_operands(inst)
        process_operand_allocation(inst, operand)
    for each operand in late_def_operands(inst)
        freealloc(vreg(operand))
    for each operand in early_operands(inst)
        process_operand_allocation(inst, operand)
    for each operand in early_def_operands(inst)
        freealloc(vreg(operand))

The early and late positions of operands state when a definition or use takes effect. In regalloc2, these positions are decoupled from uses and defs: it is possible to have a late use or early def. Because of this, late operands are processed first, with late def operands being freed. If a def takes place in the late phase of an instruction, then the register (or other allocation) it will end up using should be free in the early phase.

Early operands are processed next, with the early defs being freed last.

Branches are special cases because they may have branch arguments for the block parameters of their successor blocks. The edits inserted are to effect a parallel move from the branch argument spillslots to the successor block parameter spillslots.

The moves have to be parallel because it is possible for the block parameters of the successor block to be used as a branch argument.

For example:

block1(v1, v2)
    1. goto block2

block2
    2. goto block1(v2, v1)

According to the invariants, before instruction 2, v1 and v2 are in their spillslots, say stack0 and stack1, respectively. To process the block arguments, we must move their values from their spillslots to the block parameter spillslots. If edits are inserted to perform the moves directly, we'll have:

move from stack1 to stack0
move from stack0 to stack1

But this is a problem because, after the first move, the value of v2 has overwritten the value of v1 in v1's spillslot. So, v2's value will be passed as the two branch arguments to block1, which is incorrect.

The moves have to be done in parallel. The easiest way to do this is to just put everything in a separate temporary and then move from those temporaries to the destinations, the block param spillslots, which is what this does:

insert edits to move from temporaries to block param spillslots
insert edits to move from branch arg spillslots to temporaries

The first insertion:

insert edits to move non-block-param branch args from temporaries to post-block locations

has to be there because of liveout registers. That is, some registers that are defined in the block are used in successor blocks. Because allocation is done in reverse, those registers will have been in some allocation already. These edits move those liveouts to the allocations they're expected to be in.

In the final allocation output, the edits will actually look like:

insert edits to move from branch arg spillslots to temporaries
insert edits to move from temporaries to block param spillslots
insert edits to move branch args from temporaries to post-block locations

But the code has it reversed because the allocation proceeds in reverse.

Operand processing

process_operand_allocation(inst, operand)
    insert(live_regs, vreg(operand))
    current allocation = vreg_allocs[vreg(operand)]
    if operand's current allocation isn't within constraints
        prev_alloc = current allocation
        alloc_operand(inst, operand)
        if prev_alloc is not none
            if operand is a definition
                insert edit to move from vreg(operand) to prev_alloc after inst
            else
                insert edit to move from vreg(operand) to prev_alloc before inst
    else
        final_allocs[(inst, operand)] = current allocation

To process the operands individually, their current allocations are first checked. If the operand's virtual register is already allocated within the operand constraints, then the final allocation for that operand is set to the current allocation. If it's not, then another allocation is found, and a move is inserted to place the value back into the place it is expected to be after the instruction.

An Example

Consider the following code:

block0
    0. v0 = 0 [operand(late def, v0)]
    1. v1 = 2 [operand(late def, v1)]
    2. v2 = add v0, v1 [operand(early use, v0), operand(early use, v1), operand(late def, v2)]
    3. goto block1(v1, v2)

block1(v3, v4)
    4. v5 = add v3, v4 [operand(early use, v3), operand(early use, v4), operand(late def, v5)]
    5. ret []

In the above code, to the right of each instruction is a simplified view of how regalloc2 views the operands. For the purpose of this example, we'll assume that all the operand constraints say that they should be in any register.

First, we initialize:

live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, none, none, none}
edits: {}
vreg_spillslots: {} (to keep track of spillslots for virtual registers)
temp_spillslots: {} (to keep track of the temporaries)

And we start from the last instruction in block1:

Instruction 5

There are no operands here to process, so nothing is done.

Instruction 4

[operand(early use, v3), operand(early use, v4), operand(late def, v5)]

The late operands are processed first:

p0 is selected as v5's register. So, we have:

vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: p0}
freepregs: {p1}
final_allocs: {none, none, none, none, none, none, none, reg(p0)}
live_regs: {v5}

The late def operands are then freed: v5 is no longer live:

vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: none}
freepregs: {p0, p1}
live_regs: {}

The early operands are processed next:

p0 and p1 are selected as v3 and v4's registers:

vreg_allocs: {v0: none, v1: none, v2: none, v3: p0, v4: p1, v5: none}
freepregs: {}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
live_regs: {v3, v4}

Current state:

live_regs: {v3, v4}
vreg_allocs: {v0: none, v1: none, v2: none, v3: p0, v4: p1, v5: none}
freepregs: {}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {}
vreg_spillslots: {}
temp_spillslots: {}

End of block1

We've reached the end of block1's instructions, but v3 and v4 are still live. Now, we have to reload_at_begin and clear them from live_regs:

v3 and v4 don't have dedicated spillslots yet, so two are allocated for them:

vreg_spillslots: {v3: stack0, v4: stack1}

Now edits are inserted to move from the spillslots, which is where they should be now (by the invariants) to where they are expected to be before the first instruction.

edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0)}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none, v6: none}

And v3 and v4 are freed.

Current state:

live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1),
    (before inst 4, move from stack0 to p0)}
vreg_spillslots: {v3: stack0, v4: stack1}
temp_spillslots: {}

Instruction 3

This is a branch to block1 with arguments [v1, v2]. In block1, v3 -> v1 and v4 -> v2.

By the invariants, these branch arguments should be in their dedicated spillslots. So, we ought to insert instructions to move from their spillslots to temporaries and from temporaries to the spillslots of the successor block parameters.

v1 and v2 don't have dedicated spillslots yet, so we allocate for them. We also don't have enough temporaries yet, so we allocate those, too:

temp_spillslots: {stack2, stack3}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}

Now we insert the edits:

edits: {(before inst 5, move from stack1 to p1), (before inst 4, move from stack0 to p0),
    (before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
    (before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3)
}

And update the current vreg locations:

vreg_allocs: {v0: none, v1: stack4, v2: stack5, v3: stack0, v4: stack1, v5: none}

Current state:

live_regs: {v1, v2}
vreg_allocs: {v0: none, v1: stack4, v2: stack5, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
    (before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
    (before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}

Instruction 2

[operand(early use, v0), operand(early use, v1), operand(late def, v2)]

Late operands:

v2 is already in stack5, but it should be in a register to satisfy the constraint. p0 is allocated to v2 and an edit (after inst 2, move from p0 to stack5) is added to move v2 to where it's expected to be after instruction 2. The edit is for after the instruction, not before, because v2 is not defined before this instruction. Moving it before will be incorrect.

v2 is freed next (because it's a late def). Its liverange is over. And its current allocation is set to none. p0 is now free for allocation.

Early operands:

v0 is allocated to p0. v1 is already in stack4, which doesn't meet the operand constraints. p1 is allocated to v1 and an edit (before inst 2, move from p1 to stack4) is added.

Current state:

live_regs: {v0, v1}
vreg_allocs: {v0: p0, v1: p1, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {}
final_allocs: {none, none, reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
    (before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
    (before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
    (after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}

Instruction 1

[operand(late def, v1)]

This is a single def. And its already allocated within constraints, so no allocation needs to be done.

This also marks the end of v1's liverange, so it's freed now, leaving p1 for other use.

Current state:

live_regs: {v0}
vreg_allocs: {v0: p0, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p1}
final_allocs: {none, reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
    (before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
    (before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
    (after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}

Instruction 0

[operand(late def, v0)]

Same as before. v0 is already allocated within constraints, and this is the end of its liverange.

Final state:

live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {reg(p0), reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
    (before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
    (before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
    (after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}

Also, the edits were inserted in reverse, so they have to be reversed before presenting it to the outside world (because regalloc2 guarantees that edits are sorted in order of program points).

Rewriting With The Allocation Results

block0
    0. p0 = 0 {v0: p0}
    1. p1 = 2 {v1: p1}
    move from p1 to stack4
    move from p0 to stack5
    2. p0 = add p0, p1 {v0: p0, v1: p1, v2: v0}
    move from stack5 to stack3
    move from stack4 to stack2
    move from stack3 to stack1
    move from stack2 to stack0
    3. goto block1 {stack4: v1, stack5: v0, temporaries: [stack3, stack2]}

block1(v3, v4) {stack0: v3, stack1: v4}
    move from stack0 to p0
    move from stack1 to p1
    4. p0 = add p0, p1 {v3: p0, v4: p1, v5: p0}
    5. ret []

Concerns About Efficiency

Edits

After seeing the output of this algorithm on some sample inputs given by the fuzzer, I got very queasy seeing the sheer amount of edits required for simple programs. It screams inefficiency. The use of one temporary spillslot for every branch argument is simple. And that's good. But I'm not sure if the resulting code may be acceptable, even for debug builds. I'll find out more about this.

Even more so, I think I'll do a more serious study of LLVM's fast regalloc (and maybe other similar algorithms) to see if and how they handle similar cases.

Single Pass?

This algorithm is supposed to be single-pass. That's the whole point of doing allocation as liveranges are discovered. But because of some constraints arising from regalloc2's outward interface and representations, it's not so much.

In my current implementation, a bunch of heap allocations are first done in the beginning:

fn new(func: &'a F, env: &'a MachineEnv) -> Self {
    // ...Others
    Self {
        func,
        vreg_allocs: vec![Allocation::none(); func.num_vregs()],
        vreg_spillslots: vec![SpillSlot::invalid(); func.num_vregs()],
        live_vregs: HashSet::with_capacity(func.num_vregs()),
        freepregs: PartedByRegClass { items: regs.clone() },
        lrus: Lrus::new(
            regs[0].len(),
            regs[1].len(),
            regs[2].len()
        ),
        vreg_in_preg: PartedByRegClass { items: [
            vec![VReg::invalid(); regs[0].len()],
            vec![VReg::invalid(); regs[1].len()],
            vec![VReg::invalid(); regs[2].len()],
        ] },
        temp_spillslots: PartedByRegClass { items: [
            Vec::with_capacity(func.num_vregs()),
            Vec::with_capacity(func.num_vregs()),
            Vec::with_capacity(func.num_vregs()),
        ] },
        allocs: Allocs::new(func, env),
        edits: Vec::new(),
        num_spillslots: 0,
        stats: Stats::default(),
    }
}

Then the rest of the algorithm proceeds without any further allocations. The positions that register allocations in the final output will be placed in can be determined ahead of time.

For example, if we have:

block0
    v0 = 1
    v1 = 2
    v2 = add v0, v1
    v3 = 3

Then, from the way regalloc2 represents its output:

pub struct Output {
    // ...Others
    pub edits: Vec<(ProgPoint, Edit)>,
    pub allocs: Vec<Allocation>,
    pub inst_alloc_offsets: Vec<u32>,
    // ...Others
}

we know that allocs will be of length 6 (there are only six operands in the entire program), and inst_alloc_offsets will be [0, 1, 2, 5] because there are four instructions, and they have 1, 1, 3, and 1 operands, respectively.

Ideally, all allocation should be done upfront, and during the processing, all output produced is placed in their final output positions. Then, when the algorithm is done, it simply returns the output. No pre- or post-processing is needed.

But the Function trait, which is used to get info like the operands of an instruction and the number of virtual registers in the function and so on, has no info like "the number of operands in the function". So, to preallocate the entire allocs, all instructions in all the blocks have to be iterated over one by one to count the number of operands each of them has so that the allocs and inst_alloc_offsets can be preallocated.

The alternative to this iteration and preallocation is to push the allocations onto the allocs vector as the allocations are being made. But because regalloc2 guarantees that allocations will be in the order from first to last instruction, allocs will have to be reversed. Some post-processing still has to be done.

The same problem persists for edits. There (as far as I know) isn't any way to determine the final position an edit is supposed to occupy in the edits vector. This necessitates post-processing. When edits are added, they are added in reverse order (from the last program point to the first), because instructions are processed in reverse order. Because of this, edits is reversed after all blocks are processed.

These concerns will have to wait a little for now. What I have in mind is that when I've gotten the implementation to a point where it can handle all edge cases and constraints correctly, I'll experiment with the different pre- and post-processing approaches to see if they have any significant effect on performance. And if so, which is the fastest.

Structures Used

In my current implementation, I've used vectors to represent almost everything. The vectors are used as:

Maps from virtual registers to allocations or spillslots (leveraging the fact that virtual registers are represented as indexes).
Lists, which are mostly just pushed and popped.

The only thing that isn't represented by a vector is the live_vregs. It's represented by a hash set instead. And operations are performed on it for every operand.

I plan on experimenting with different representations to see which will be the most efficient.

Anyways, all this is still for the optimization phase. Right now, I'm focused on correctness.

Project Progress

My current implementation of the algorithm is here. I successfully fuzzed it overnight for 7 to 8 hours without errors. But it only handles some types of constraints, not all.

Next Steps

Extend the allocator to handle all the remaining constraints correctly.
Take a more serious look at LLVM's fast regalloc and see if ideas from there (or elsewhere) can be used to make the algorithm better.
Improve the algorithm to make better use of machine resources, like scratch registers and non-preferred registers.
Experiment with different implementations of the algorithm that differ in structures used for representations and methods of pre-processing and post-processing to see which combinations are more efficient.

I believe steps 1, 2, and 3 will be done in an intertwined manner, while step 4 will be done strictly after 1, 2, and 3 are complete.