Demilade Sonuga's blog
All postsregalloc II
In the previous post, I briefly described the project I'm working on. My aim with this one is to dive into the details of the project, the state it's currently in, and what I plan on doing next.
Register Allocation
As I'm working on a register allocator, the first step we must take before diving in is to understand what exactly is register allocation.
It is simply rewriting code that makes use of an arbitrary number of registers to code that makes use of only a fixed number and does exactly the same thing. That is, the dataflow in the original program must be exactly the same as the dataflow in the rewritten one.
For example, we could have the following code:
block 0:
0. v0 = 0
1. v1 = 1
2. v2 = 2
3. v3 = add v2, v1
4. v4 = add v3, v0
The above makes use of 5 virtual registers: v0
, v1
, v2
, v3
, and v4
. If the target machine had five or more
registers, then we could simply do a one-to-one mapping of the virtual registers to physical ones and
rewrite them as:
block 0:
0. p0 = 0
1. p1 = 1
2. p2 = 2
3. p3 = add p2, p1
4. p4 = add p3, p0
Where pn
is a physical register.
But, if the target machine has only 2 physical registers, then such a one-to-one mapping won't be possible. The allocator will have to somehow figure out how to shuffle the data between the registers and the stack such that the dataflow from value definitions to value uses remains the same as in the original program.
If we try to visualize the dataflow of the different values in this program, we'll have something like this:
block 0: v0 v1 v2 v3
0. v0 = 0 |
1. v1 = 1 | |
2. v2 = 2 | | |
3. v3 = add v2, v1 | | | |
4. v4 = add v3, v0 | |
The vertical lines represent liveranges, the range of instructions during which the virtual registers are live
(defined and still going to be used). Looking at this representation, it's clear why a simple one-to-one
mapping won't fly: at some points, there are more live values than physical registers.
At instruction 3, v0
, v1
, and v2
are live, but only two registers are available on the machine.
In such a scenario, we must enlist the help of the stack to remember values that we can't keep
in registers for one reason or the other. In other words, we must spill some values to stack locations
(spillslots).
One way of rewriting this program could proceed as follows:
v0 = 0
becomesp0 = 0
. So, the physical registerp0
now representsv0
.v1 = 1
becomesp1 = 1
. The physical registerp1
now representsv1
.- At this point, the only two physical registers are already holding live values. To proceed with the
allocation we need to evict a register: take an innocent value away from a register and relegate it
to a spillslot. If we move
v0
fromp0
tostack0
, thenp0
will be free for us to use.v2 = 2
now becomes:
mov from p0 to stack0
p0 = 2
- At this point, since all the values we need are already in registers,
v3 = add v2, v1
becomes? = add p0, p1
.v3
has no register, but that isn't a problem becausev2
's liverange ends at instruction 4. Its liverange ends after its use, not after the instruction. Because of this, its register can be reused:p0 = add p0, p1
. So, after this instruction's execution,v3
ends up inp0
. - Now, we need
v0
back.v1
is no longer live, so we can reuse its register here.v3
isn't going to be live after its use, so we can reuse its register to holdv4
.
The instruction becomes:
mov from stack0 to p1
p0 = add p0, p1
And the program is rewritten to:
0. p0 = 0
1. p1 = 1
mov from p0 to stack0
2. p0 = 2
3. p0 = add p0, p1
mov from stack0 to p1
4. p0 = add p0, p1
In particular, the v0
value is first in p0
, then it's moved to stack0
, then it's moved to p1
. All this
shuffling must result in correct code: the value produced in v0
's definition must be the same
value used in all instructions where v0
is an operand.
Constraints
To make matters even harder for us, there are different constraints that specify what kind of registers must be used, or whether the value should be on the stack or in a specific register.
In the previous code, there is an assumption that all values are to be in registers and that all registers are the same, but that's not true in reality. There are different classes of registers that hold different types of values: integers, floating point numbers, and vectors. And some instructions prefer their operands in a specific register or on the stack.
A Brief Tour of regalloc2
regalloc2
is a crate (library) that carries out register allocation.
Its Philosophy of Life
The way regalloc2
views the world is very simple. There are functions. Functions consist of blocks.
Blocks consist of instructions. Each instruction has a bunch of virtual register (vreg) definitions (defs)
and uses with associated constraints.
Operands
Like many other things in computer science, abstraction is used to make life easier: the register allocator doesn't need to care about the semantics of the instructions - the way registers are defined and used (and their constraints) is the only info we need.
The virtual registers defined or used in an instruction, along with their constraints and other info,
are encapsulated in the structure Operand
.
Info in Operand
includes:
- The operand kind: is it the definition of a virtual register, or a use?
- The operand position: does the use or definition of the virtual register take place before the instruction executes (early phase) or after the instruction executes (late phase)?
- The register class: is it a float, an integer, or a vector?
- The virtual register defined or used.
- The constraints governing the allocation: should the allocation be a fixed register? Can it be anywhere? Can it be any register? Must it be on the stack? Or is it the reuse of another virtual register?
For a more in-depth look at the operand's info and its meaning, have a look at this.
Branching
regalloc2
also views functions in SSA
form: every virtual register is defined exactly once. Because of control-flow branches, there will be some virtual
registers whose values depend on the block it's coming from. regalloc2
, rather than using Φ functions,
uses block arguments to define those values.
For example:
block1(v1, v2)
v3 = 0
if cond, goto block1(v2, v1) else goto block2(v1, v3)
block2(v4, v5)
v6 = add v4, v5
goto block1(v6, v4)
Block 1 has block parameters v1 and v2. The values that they take on depend on the predecessor block that branched to it.
The values passed as arguments to block params, branch arguments, are used to define the block parameters.
Representations
Most structures are represented with 1-byte or 4-byte numbers with functions to interpret the bits.
Instructions, blocks, allocations, virtual registers, and operands are represented with structs that are just 4-byte numbers. While physical registers are represented with 1-byte numbers.
These structs also have indexes that uniquely identify an instance in an index space. For example, instructions in the entire function have indexes that range from 0 to n-1, where n is the number of instructions. The same goes for basic blocks.
Interface to The Outside World
regalloc2
provides a function run
:
pub fn run<F: Function>(
func: &F,
env: &MachineEnv,
options: &RegallocOptions,
) -> Result<Output, RegAllocError> {
/* Others */
}
Function
is a trait implemented by the client that provides information to the allocator about the blocks
in the function and the instructions in the blocks and the operands used in the instructions and block
predecessors and successors and some other info.
MachineEnv
is a struct holding info about the resources available on the target machine, namely
the physical registers available for each register class, scratch registers and registers to use as fixed stack
slots:
pub struct MachineEnv {
pub preferred_regs_by_class: [Vec<PReg>; 3],
pub non_preferred_regs_by_class: [Vec<PReg>; 3],
pub scratch_by_class: [Option<PReg>; 3],
pub fixed_stack_slots: Vec<PReg>,
}
And RegallocOptions
allows the client to set some preferences.
Output
The run
function returns an Output
:
pub struct Output {
pub num_spillslots: usize,
pub edits: Vec<(ProgPoint, Edit)>,
pub allocs: Vec<Allocation>,
pub inst_alloc_offsets: Vec<u32>,
pub safepoint_slots: Vec<(ProgPoint, Allocation)>,
pub debug_locations: Vec<(u32, ProgPoint, ProgPoint, Allocation)>,
pub stats: ion::Stats,
}
The main fields of concern here are allocs
, edits
, and inst_alloc_offsets
.
allocs
is the main output that holds the final allocations for all operands in all instructions in the
function. inst_alloc_offsets
is essentially a map from an instruction's index number to the starting
index for its operand allocations in allocs
. Meanwhile, edits
are the data-shuffling moves that are to be inserted
at specific program points (before or after instructions).
An example:
block 0:
0. v0 = 0
1. v1 = 1
2. v2 = 2
3. v3 = add v2, v1
4. v4 = add v3, v0
Instructions have indexes from 0 to 4.
Operands of instruction 0: [v0
], instruction 1: [v1
], instruction 2: [v2
], instruction 3: [v1
, v2
, v3
],
instruction 4: [v0
, v3
, v4
]. (I'm omitting the constraint info from the operands).
Rewritten:
0. p0 = 0
1. p1 = 1
mov from p0 to stack0
2. p0 = 2
3. p0 = add p0, p1
mov from stack0 to p1
4. p0 = add p0, p1
allocs
: [reg(p0), reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0), reg(p0)]
inst_alloc_offsets
: [0, 1, 2, 3, 6]
edits
: [(before inst 2, move from p0
to stack0
), (before inst 4, move from stack0
to p1
)]
In this setup, the allocations for the first, second, and third operands of instruction 3 are
allocs[inst_alloc_offsets[3]]
, allocs[inst_alloc_offsets[3] + 1]
, and allocs[inst_alloc_offsets[3] + 2]
,
respectively.
The allocations for instruction i
's operands are allocs[inst_alloc_offsets[i]..no_of_operands]
.
The edits are also guaranteed to be sorted by program point.
fastalloc
The algorithm I'm implementing is based on the one described here. I also drew some inspiration from LLVM's fast register allocator (although, it was mostly by visual osmosis. I haven't yet taken a serious look at it).
Usually, when liveranges are computed, they are done in reverse: from the last use to the definition. In register allocation algorithms, these liveranges are usually first computed before allocation takes place. Even the linear scan register allocation algorithm, although it performs allocation in linear time, first computes liveranges before allocation.
The core concept behind the algorithm is to perform allocation as liveranges are being computed, rather than pre-computing.
To extend it to handle multiple basic blocks, registers that outlive the blocks they're defined in are spilled to dedicated stack allocations that are unique to each of them. That way, whenever an instruction makes use of a virtual register not defined in that block, the value is simply loaded from the register's spillslot.
A pseudocode of it is as follows:
data:
live_regs: {}
vreg_allocs: {}
freepregs: {}
final_allocs: {}
edits: {}
main
initialize vreg_allocs to none for all virtual registers
initialize freepregs to contain all available physical registers
initialize live_regs to the emtpy set
initialize final_allocs to none for all operands
for each block in reversed(blocks(function))
alloc_block(block)
alloc_block(block)
for inst in reversed(insts(block))
alloc_inst(block, inst)
reload_at_begin(block)
clear(live_regs)
alloc_inst(block, inst)
if inst is branch
add all branch arguments to live_regs
insert edits to move branch args from temporaries to post-block locations
insert edits to move from temporaries to block param spillslots
insert edits to move from branch arg spillslots to temporaries
for each operand in late_operands(inst)
process_operand_allocation(inst, operand)
for each operand in late_def_operands(inst)
freealloc(vreg(operand))
for each operand in early_operands(inst)
process_operand_allocation(inst, operand)
for each operand in early_def_operands(inst)
freealloc(vreg(operand))
process_operand_allocation(inst, operand)
insert(live_regs, vreg(operand))
current allocation = vreg_allocs[vreg(operand)]
if operand's current allocation isn't within constraints
prev_alloc = current allocation
alloc_operand(inst, operand)
if prev_alloc is not none
if operand is a definition
insert edit to move from vreg(operand) to prev_alloc after inst
else
insert edit to move from vreg(operand) to prev_alloc before inst
else
final_allocs[(inst, operand)] = current allocation
alloc_operand(inst, operand)
match operand constraint
any => alloc_reg(inst, operand)
reg => alloc_reg(inst, operand)
stack, fixed-reg, reuse => not yet handling this
final_allocs[(inst, operand)] = vreg_allocs[vreg(operand)]
alloc_reg(inst, operand)
if freepregs is empty
preg = evictreg()
else
preg = pop(freepregs)
vreg_allocs[vreg(operand)] = preg
frealloc(vreg)
alloc = vreg_allocs[vreg]
match alloc kind
reg => push(freepregs, preg)
stack => do nothing
vreg_allocs[vreg] = none
remove(live_regs, vreg)
vreg_allocs
is a map from any virtual register to their current allocation at any point in time during
processing. freepregs
is the list of all free physical registers. live_regs
is the set of registers
that are live during the processing of a block.
Invariants
Invariants that are maintained at all times:
- A virtual register that outlives the block it was defined in will be in its dedicated spillslot by the end of the block.
- At the end of a block, before edits are inserted to move values from branch arguments to block parameters spillslots, all branch arguments will be in their dedicated spillslots.
- At the beginning of a block, all branch parameters and livein virtual registers will be in their dedicated spillslots.
Workings
The algorithm works as follows:
- Blocks and their instructions are processed in reverse:
main
initialize vreg_allocs to none for all virtual registers
initialize freepregs to contain all available physical registers
initialize live_regs to the emtpy set
initialize final_allocs to none for all operands
for each block in reversed(blocks(function))
alloc_block(block)
alloc_block(block)
for inst in reversed(insts(block))
alloc_inst(block, inst)
reload_at_begin(block)
clear(live_regs)
During processing, vreg_allocs
is used to hold the current allocations of registers. Whenever a use
is encountered, the vreg must be live, so it's placed in live_regs
. Since definitions mark the end of a
liverange, vregs are removed from live_regs
when their definition is encountered.
Vregs that are block parameters or liveins from predecessors, after the processing of a block,
will still be in live_regs
, because their definitions weren't encountered.
reload_at_begin
ensures that invariant 3 is upheld: it inserts edits to move the liveins and block
parameters from their dedicated spillslots to wherever they are expected to be at the block's beginning.
The vreg's current allocation in vreg_alloc
is updated to be its spillslot.
Then, all registers are freed, and live_regs
is emptied before proceeding to the processing of another block.
- Instruction processing
alloc_inst(block, inst)
if inst is branch
add all branch arguments to live_regs
insert edits to move branch args from temporaries to post-block locations
insert edits to move from temporaries to block param spillslots
insert edits to move from branch arg spillslots to temporaries
for each operand in late_operands(inst)
process_operand_allocation(inst, operand)
for each operand in late_def_operands(inst)
freealloc(vreg(operand))
for each operand in early_operands(inst)
process_operand_allocation(inst, operand)
for each operand in early_def_operands(inst)
freealloc(vreg(operand))
The early and late positions of operands state when a definition or use takes effect. In regalloc2
, these
positions are decoupled from uses and defs: it is possible to have a late use or early def.
Because of this, late operands are processed first, with late def operands being freed. If a def takes place
in the late phase of an instruction, then the register (or other allocation) it will end up using should be
free in the early phase.
Early operands are processed next, with the early defs being freed last.
Branches are special cases because they may have branch arguments for the block parameters of their successor blocks. The edits inserted are to effect a parallel move from the branch argument spillslots to the successor block parameter spillslots.
The moves have to be parallel because it is possible for the block parameters of the successor block to be used as a branch argument.
For example:
block1(v1, v2)
1. goto block2
block2
2. goto block1(v2, v1)
According to the invariants, before instruction 2, v1
and v2
are in their spillslots, say stack0
and
stack1
, respectively. To process the block arguments, we must move their values from their spillslots to the
block parameter spillslots. If edits are inserted to perform the moves directly, we'll have:
move from stack1 to stack0
move from stack0 to stack1
But this is a problem because, after the first move, the value of v2
has overwritten the value of v1
in
v1
's spillslot. So, v2
's value will be passed as the two branch arguments to block1, which is incorrect.
The moves have to be done in parallel. The easiest way to do this is to just put everything in a separate temporary and then move from those temporaries to the destinations, the block param spillslots, which is what this does:
insert edits to move from temporaries to block param spillslots
insert edits to move from branch arg spillslots to temporaries
The first insertion:
insert edits to move non-block-param branch args from temporaries to post-block locations
has to be there because of liveout registers. That is, some registers that are defined in the block are used in successor blocks. Because allocation is done in reverse, those registers will have been in some allocation already. These edits move those liveouts to the allocations they're expected to be in.
In the final allocation output, the edits will actually look like:
insert edits to move from branch arg spillslots to temporaries
insert edits to move from temporaries to block param spillslots
insert edits to move branch args from temporaries to post-block locations
But the code has it reversed because the allocation proceeds in reverse.
- Operand processing
process_operand_allocation(inst, operand)
insert(live_regs, vreg(operand))
current allocation = vreg_allocs[vreg(operand)]
if operand's current allocation isn't within constraints
prev_alloc = current allocation
alloc_operand(inst, operand)
if prev_alloc is not none
if operand is a definition
insert edit to move from vreg(operand) to prev_alloc after inst
else
insert edit to move from vreg(operand) to prev_alloc before inst
else
final_allocs[(inst, operand)] = current allocation
To process the operands individually, their current allocations are first checked. If the operand's virtual register is already allocated within the operand constraints, then the final allocation for that operand is set to the current allocation. If it's not, then another allocation is found, and a move is inserted to place the value back into the place it is expected to be after the instruction.
An Example
Consider the following code:
block0
0. v0 = 0 [operand(late def, v0)]
1. v1 = 2 [operand(late def, v1)]
2. v2 = add v0, v1 [operand(early use, v0), operand(early use, v1), operand(late def, v2)]
3. goto block1(v1, v2)
block1(v3, v4)
4. v5 = add v3, v4 [operand(early use, v3), operand(early use, v4), operand(late def, v5)]
5. ret []
In the above code, to the right of each instruction is a simplified view of how regalloc2
views
the operands. For the purpose of this example, we'll assume that all the operand constraints say that
they should be in any register.
First, we initialize:
live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, none, none, none}
edits: {}
vreg_spillslots: {} (to keep track of spillslots for virtual registers)
temp_spillslots: {} (to keep track of the temporaries)
And we start from the last instruction in block1:
Instruction 5
There are no operands here to process, so nothing is done.
Instruction 4
[operand(early use, v3), operand(early use, v4), operand(late def, v5)]
The late operands are processed first:
p0
is selected as v5
's register. So, we have:
vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: p0}
freepregs: {p1}
final_allocs: {none, none, none, none, none, none, none, reg(p0)}
live_regs: {v5}
The late def operands are then freed: v5
is no longer live:
vreg_allocs: {v0: none, v1: none, v2: none, v3: none, v4: none, v5: none}
freepregs: {p0, p1}
live_regs: {}
The early operands are processed next:
p0
and p1
are selected as v3
and v4
's registers:
vreg_allocs: {v0: none, v1: none, v2: none, v3: p0, v4: p1, v5: none}
freepregs: {}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
live_regs: {v3, v4}
Current state:
live_regs: {v3, v4}
vreg_allocs: {v0: none, v1: none, v2: none, v3: p0, v4: p1, v5: none}
freepregs: {}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {}
vreg_spillslots: {}
temp_spillslots: {}
End of block1
We've reached the end of block1's instructions, but v3
and v4
are still live.
Now, we have to reload_at_begin
and clear them from live_regs
:
v3
and v4
don't have dedicated spillslots yet, so two are allocated for them:
vreg_spillslots: {v3: stack0, v4: stack1}
Now edits are inserted to move from the spillslots, which is where they should be now (by the invariants) to where they are expected to be before the first instruction.
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0)}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none, v6: none}
And v3
and v4
are freed.
Current state:
live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1),
(before inst 4, move from stack0 to p0)}
vreg_spillslots: {v3: stack0, v4: stack1}
temp_spillslots: {}
Instruction 3
This is a branch to block1 with arguments [v1, v2]
. In block1, v3
-> v1
and v4
-> v2
.
By the invariants, these branch arguments should be in their dedicated spillslots. So, we ought to insert instructions to move from their spillslots to temporaries and from temporaries to the spillslots of the successor block parameters.
v1
and v2
don't have dedicated spillslots yet, so we allocate for them. We also don't have enough temporaries
yet, so we allocate those, too:
temp_spillslots: {stack2, stack3}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
Now we insert the edits:
edits: {(before inst 5, move from stack1 to p1), (before inst 4, move from stack0 to p0),
(before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
(before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3)
}
And update the current vreg locations:
vreg_allocs: {v0: none, v1: stack4, v2: stack5, v3: stack0, v4: stack1, v5: none}
Current state:
live_regs: {v1, v2}
vreg_allocs: {v0: none, v1: stack4, v2: stack5, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {none, none, none, none, none, reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
(before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
(before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}
Instruction 2
[operand(early use, v0), operand(early use, v1), operand(late def, v2)]
Late operands:
v2
is already in stack5
, but it should be in a register to satisfy the constraint.
p0
is allocated to v2
and an edit (after inst 2, move from p0
to stack5
) is added to move v2
to where
it's expected to be after instruction 2. The edit is for after the instruction, not before, because
v2
is not defined before this instruction. Moving it before will be incorrect.
v2
is freed next (because it's a late def). Its liverange is over. And its current allocation is set to none.
p0
is now free for allocation.
Early operands:
v0
is allocated to p0
.
v1
is already in stack4
, which doesn't meet the operand constraints. p1
is allocated to v1
and an
edit (before inst 2, move from p1
to stack4
) is added.
Current state:
live_regs: {v0, v1}
vreg_allocs: {v0: p0, v1: p1, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {}
final_allocs: {none, none, reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
(before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
(before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
(after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}
Instruction 1
[operand(late def, v1)]
This is a single def. And its already allocated within constraints, so no allocation needs to be done.
This also marks the end of v1
's liverange, so it's freed now, leaving p1
for other use.
Current state:
live_regs: {v0}
vreg_allocs: {v0: p0, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p1}
final_allocs: {none, reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
(before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
(before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
(after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}
Instruction 0
[operand(late def, v0)]
Same as before. v0
is already allocated within constraints, and this is the end of its liverange.
Final state:
live_regs: {}
vreg_allocs: {v0: none, v1: none, v2: none, v3: stack0, v4: stack1, v5: none}
freepregs: {p0, p1}
final_allocs: {reg(p0), reg(p1), reg(p0), reg(p1), reg(p0), reg(p0), reg(p1), reg(p0)}
edits: {(before inst 4, move from stack1 to p1), (before inst 4, move from stack0 to p0),
(before inst 3, move from stack2 to stack0), (before inst 3, move from stack3 to stack1),
(before inst 3, move from stack4 to stack2), (before inst 3, move from stack5 to stack3),
(after inst 2, move from p0 to stack5), (before inst 2, move from p1 to stack4)
}
vreg_spillslots: {v3: stack0, v4: stack1, v1: stack4, v2: stack5}
temp_spillslots: {stack2, stack3}
Also, the edits were inserted in reverse, so they have to be reversed before presenting it to the
outside world (because regalloc2
guarantees that edits are sorted in order of program points).
Rewriting With The Allocation Results
block0
0. p0 = 0 {v0: p0}
1. p1 = 2 {v1: p1}
move from p1 to stack4
move from p0 to stack5
2. p0 = add p0, p1 {v0: p0, v1: p1, v2: v0}
move from stack5 to stack3
move from stack4 to stack2
move from stack3 to stack1
move from stack2 to stack0
3. goto block1 {stack4: v1, stack5: v0, temporaries: [stack3, stack2]}
block1(v3, v4) {stack0: v3, stack1: v4}
move from stack0 to p0
move from stack1 to p1
4. p0 = add p0, p1 {v3: p0, v4: p1, v5: p0}
5. ret []
Concerns About Efficiency
Edits
After seeing the output of this algorithm on some sample inputs given by the fuzzer, I got very queasy seeing the sheer amount of edits required for simple programs. It screams inefficiency. The use of one temporary spillslot for every branch argument is simple. And that's good. But I'm not sure if the resulting code may be acceptable, even for debug builds. I'll find out more about this.
Even more so, I think I'll do a more serious study of LLVM's fast regalloc (and maybe other similar algorithms) to see if and how they handle similar cases.
Single Pass?
This algorithm is supposed to be single-pass. That's the whole point of doing allocation as
liveranges are discovered. But because of some constraints arising from regalloc2
's outward interface
and representations, it's not so much.
In my current implementation, a bunch of heap allocations are first done in the beginning:
fn new(func: &'a F, env: &'a MachineEnv) -> Self {
// ...Others
Self {
func,
vreg_allocs: vec![Allocation::none(); func.num_vregs()],
vreg_spillslots: vec![SpillSlot::invalid(); func.num_vregs()],
live_vregs: HashSet::with_capacity(func.num_vregs()),
freepregs: PartedByRegClass { items: regs.clone() },
lrus: Lrus::new(
regs[0].len(),
regs[1].len(),
regs[2].len()
),
vreg_in_preg: PartedByRegClass { items: [
vec![VReg::invalid(); regs[0].len()],
vec![VReg::invalid(); regs[1].len()],
vec![VReg::invalid(); regs[2].len()],
] },
temp_spillslots: PartedByRegClass { items: [
Vec::with_capacity(func.num_vregs()),
Vec::with_capacity(func.num_vregs()),
Vec::with_capacity(func.num_vregs()),
] },
allocs: Allocs::new(func, env),
edits: Vec::new(),
num_spillslots: 0,
stats: Stats::default(),
}
}
Then the rest of the algorithm proceeds without any further allocations. The positions that register allocations in the final output will be placed in can be determined ahead of time.
For example, if we have:
block0
v0 = 1
v1 = 2
v2 = add v0, v1
v3 = 3
Then, from the way regalloc2
represents its output:
pub struct Output {
// ...Others
pub edits: Vec<(ProgPoint, Edit)>,
pub allocs: Vec<Allocation>,
pub inst_alloc_offsets: Vec<u32>,
// ...Others
}
we know that allocs
will be of length 6 (there are only six operands in the entire program), and
inst_alloc_offsets
will be [0, 1, 2, 5] because there are four instructions, and they have 1, 1, 3, and 1
operands, respectively.
Ideally, all allocation should be done upfront, and during the processing, all output produced is placed in their final output positions. Then, when the algorithm is done, it simply returns the output. No pre- or post-processing is needed.
But the Function
trait, which is used to get info like the operands of an instruction and the number of
virtual registers in the function and so on, has no info like "the number of operands in the function".
So, to preallocate the entire allocs
, all instructions in all the blocks have to be iterated over one
by one to count the number of operands each of them has so that the allocs
and inst_alloc_offsets
can be preallocated.
The alternative to this iteration and preallocation is to push the allocations onto the allocs
vector
as the allocations are being made. But because regalloc2
guarantees that allocations will be in the order
from first to last instruction, allocs
will have to be reversed. Some post-processing still has to be
done.
The same problem persists for edits
. There (as far as I know) isn't any way to determine the final position an
edit is supposed to occupy in the edits
vector. This necessitates post-processing. When edits
are added, they are added in reverse order (from the last program point to the first), because instructions
are processed in reverse order. Because of this, edits
is reversed after all blocks are processed.
These concerns will have to wait a little for now. What I have in mind is that when I've gotten the implementation to a point where it can handle all edge cases and constraints correctly, I'll experiment with the different pre- and post-processing approaches to see if they have any significant effect on performance. And if so, which is the fastest.
Structures Used
In my current implementation, I've used vectors to represent almost everything. The vectors are used as:
- Maps from virtual registers to allocations or spillslots (leveraging the fact that virtual registers are represented as indexes).
- Lists, which are mostly just pushed and popped.
The only thing that isn't represented by a vector is the live_vregs
. It's represented by a hash set
instead. And operations are performed on it for every operand.
I plan on experimenting with different representations to see which will be the most efficient.
Anyways, all this is still for the optimization phase. Right now, I'm focused on correctness.
Project Progress
My current implementation of the algorithm is here. I successfully fuzzed it overnight for 7 to 8 hours without errors. But it only handles some types of constraints, not all.
Next Steps
- Extend the allocator to handle all the remaining constraints correctly.
- Take a more serious look at LLVM's fast regalloc and see if ideas from there (or elsewhere) can be used to make the algorithm better.
- Improve the algorithm to make better use of machine resources, like scratch registers and non-preferred registers.
- Experiment with different implementations of the algorithm that differ in structures used for representations and methods of pre-processing and post-processing to see which combinations are more efficient.
I believe steps 1, 2, and 3 will be done in an intertwined manner, while step 4 will be done strictly after 1, 2, and 3 are complete.