Demilade Sonuga's blog

All posts

Sherlock Holmes And Deadlocks

2023-06-10 · 55 min read

If you run the game now, all you'll get are a bunch of double fault errors. In an earlier post, it was suggested that the reason we could have such non-deterministic errors is because of race conditions. With the refactoring of the previous post, we can be confident that we have eliminated any such race conditions now. But then, why is the code still not doing what we expect it to do? That is the question that we'll resolve in this post.

Sherlock

Sherlock Holmes: you've probably heard that name before. The fictional detective created by Arthur Conan Doyle. What does he have to do with this? The answer to that is summed up in this famous (again, fictional (but highly profound)) quote:

How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?

When we keyed down the code, we believed that the result will be a game scene, but rather, we have a double fault error. Impossible! you say, as the awesome programmer that conjured up the lines of code for the computer to execute. You believe that the code you've gotten down is flawless, logical, precise. But the outcome shows that something that we believe is not true. To find that something, we have to conduct a process of investigation and with the data retrieved, deduce why the outcome is not as expected.

In other words, we have to become detectives.

Investigation

To begin, we ask ourselves: what exactly is the problem? What exactly are we looking for?

When we run the code, we expect a game scene to be drawn, but rather, we have a message telling us a double fault error occurred. This is evidence that our code is telling the computer to do something that leads to a double fault error.

But what exactly is a double fault error?

According to sources in the references (check them out), a double fault occurs when a second exception occurs while handling a first exception. It requires two exceptions to occur and, supposedly, there are specific exception combinations that must take place for double faults to occur.

More specifically, from Philipp Oppermann's blog: whenever any of [divide-by-zero, invalid TSS, segment not present, stack-segment fault, general protection fault] is followed by any of [invalid TSS, segment not present, stack-segment fault, general protection fault], the result is a double fault.

So if you have a divide-by-zero exception followed by a stack-segment fault error, the result will be a double fault.

Another combination that could lead to a double fault is a page fault followed by any of [page fault, invalid TSS, segment not present, stack-segment fault, general protection fault].

Another thing that could cause a double fault is the absence of an entry to handle an interrupt. For example, if a keyboard interrupt occurs and there is no entry for it in the IDT, the result will be a double fault.

That leaves us with a lot of space to explore the potential causes of this error.

With all these potential combinations, how exactly do we proceed?

Well, there are different ways of approaching this. We could start looking up all the exceptions listed to understand exactly why they occur and then determine if we're doing anything that could trigger them. But jumping straight into that will leave the space for searching really wide, so we'll spend a lot of time googling.

Instead, we'll get more information to determine at what point exactly in the code the error is triggered.

Before the double fault occurs, we should still be able to print messages on the screen, so we'll do just that.

Our code starts executing at the first line in efi_main, so we print a message there just to verify that the double fault happens after that point:

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others
if let Err(msg) = init_graphics_result {
let simple_text_output = sys_table.simple_text_output();
write!(simple_text_output, "{}", msg);
loop {}
}

init_screen(init_graphics_result.unwrap());

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW

// ...Others
}

The game scene is never drawn on the screen. We can hypothesize that the reason why is because the double fault error happens before the execution of the line in charge of drawing the game scene. To verify this, we can place code to print a message before the game scene drawing.

extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

event_hook::hook_event(EventKind::Keyboard, boxed_fn::BoxedFn::new(|event_info| {
if let EventInfo::Keyboard(key_event) = event_info {
if key_event.direction == keyboard::KeyDirection::Down {
write!(SCREEN.lock().as_mut().unwrap(), "{:?}", key_event.keycode);
}
}
}, allocator::get_allocator()));

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW

game::blasterball(&SCREEN);

0
}

With these two statements inserted, we can begin our process of narrowing down the exact point where the double fault occurred.

If after running, only one "Reached here" is printed, then that means that the double fault occurred after init_screen is invoked and before game::blasterball gets the chance to be called.

If "Reached here" is printed twice, then that means that the double fault occurs after game::blasterball is called.

And if "Reached here" isn't printed at all, then that means that the double fault occurs either before or during the invocation of init_screen.

(Notice that we're making quite a number of assumptions here. We've assumed that our fmt::Write implementation which is used to do the printing is flawless. What if the reason a "Reached here" didn't get printed is because of a bug in the printing mechanism? Well, we've been using this printing mechanism for a while now without problems, so I think assuming it will still work correctly now is a relatively safe assumption.)

Running the code now, you'll get:

Reached here printed once

"Reached here" is printed only once, so we can conclude that the double fault happens after init_screen is invoked and before game::blasterball get the chance to be called.

Narrowing down the space even more:

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

init_screen(init_graphics_result.unwrap());

// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

// ...Others

event_hook::init();

boot_services.exit_boot_services(handle).unwrap();

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW

setup_gdt();
setup_idt();
setup_pics();

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW

// ...Others

// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

game::blasterball(&SCREEN);
}

Now, we've narrowed down our culprit. If one "Reached here" is printed, then the bug must lie in one of these setup functions. Running the code:

Reached here printed once

We have only one "Reached here" printed out. But in which of these functions exactly is the faulting code?

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

setup_gdt();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW
setup_idt();
setup_pics();

// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

// ...Others
}

Is it setup_gdt?

Running:

Reached here printed twice

Okay, "Reached here" was printed twice so I guess the problem is not in the setup_gdt function.

Onto setup_idt:

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

setup_gdt();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
setup_idt();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW
setup_pics();

// ...Others
}

Running:

Reached here printed twice

"Reached here" is printed twice, again. So, we can assume that the double fault does not occur during setup_idt's execution.

As for setup_pics:

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

setup_gdt();
// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
setup_idt();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
setup_pics();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW

// ...Others
}

After running this over and over, you'll get different results. In most instances, "Reached here" is printed once followed by the double fault panic message.

In just a tiny few, the only thing printed is "Reached here" without any panic messages. These only look something like this:

Only "Reached here" printed

These runs where "Reached here" is the only thing that gets printed are quite funny, baffling even. It seems like execution stopped before the double fault ever even occurred. Hmmm... seems like something's not quite right with our setup_pics. Let's comment it out and see what happens next.

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others


setup_gdt();
setup_idt();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
// COMMENTED OUT: setup_pics();
write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

// ...Others
}

Running this:

The Game Scene

And we have our game scene. That certifies it: something's up with setup_pics.

What exactly does setup_pics do?

fn setup_pics() {
PICS.lock().init();
}

It calls the PICS's init function.

What exactly does this init function do?

impl PICs {
pub fn init(&mut self) {
let mut wait_port = Port::new(0x80);
let mut wait = || wait_port.write(0);

self.first.command.write(CMD_INIT);
wait();
self.second.command.write(CMD_INIT);
wait();

self.first.data.write(self.first.offset);
wait();
self.second.data.write(self.second.offset);
wait();

self.first.data.write(4);
wait();
self.second.data.write(2);
wait();

self.first.data.write(MODE_8086);
wait();
self.second.data.write(MODE_8086);
wait();

self.first.data.write(0b11111100);
self.second.data.write(0b11111111);
}
}

It writes bytes to ports for the PIC's initialization sequence. So, somehow, writing bytes to ports in this function is what is causing the double fault. But which write exactly?

We continue our print investigation here. First, uncomment the setup_pics invocation in efi_main

#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others


setup_gdt();
setup_idt();
// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
setup_pics(); // NEW
// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

// ...Others
}

Now, in machine/pics.rs

impl PICs {
pub fn init(&mut self) {
let mut wait_port = Port::new(0x80);
let mut wait = || wait_port.write(0);

// NEW:
use core::fmt::Write;
use crate::SCREEN;

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW
self.first.command.write(CMD_INIT);
wait();
self.second.command.write(CMD_INIT);
wait();

self.first.data.write(self.first.offset);
wait();
self.second.data.write(self.second.offset);
wait();

self.first.data.write(4);
wait();
self.second.data.write(2);
wait();

self.first.data.write(MODE_8086);
wait();
self.second.data.write(MODE_8086);
wait();

self.first.data.write(0b11111100);
self.second.data.write(0b11111111);

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW
}
}

We first import core::fmt::Write and the SCREEN static for printing.

Upon running over and over, the results are, once again, non-deterministic. Most times, a "Reached here" is printed followed by a panic message. I even had a run here where a "Reached here" was printed followed by "Reach", then nothing else:

Reached here one and a half times

As if the computer suddenly stopped executing in the middle of the process of the second print.

Let's shift the first "Reached here" print down a little and see what happens:

```rust
impl PICs {
pub fn init(&mut self) {
// ...Others

// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
self.first.command.write(CMD_INIT);
wait();
self.second.command.write(CMD_INIT);
wait();

write!(SCREEN.lock().as_mut().unwrap(), "Reached here"); // NEW
self.first.data.write(self.first.offset);
wait();
self.second.data.write(self.second.offset);
wait();

self.first.data.write(4);
wait();
self.second.data.write(2);
wait();

self.first.data.write(MODE_8086);
wait();
self.second.data.write(MODE_8086);
wait();

self.first.data.write(0b11111100);
self.second.data.write(0b11111111);

write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
}
}

Again, the results are non-deterministic. In some runs, only the panic message shows:

Only Panic Message

which suggests that the deadlock must have happened before the first code for printing the first "Reached here" was executed.

And in some others, we have a blank screen:

A Blank Screen

which suggests that the computer somehow stopped executing code before printing any messages and before any double fault ever occurred.

This is really mysterious. When we put the code to print "Reached here" before the first port writes, the "Reached here" got printed. But when we put it after, it did not. It seems like it's this first port writes that may be the culprit.

But what exactly do they do? Why do we write to these ports? To figure out what's going on, we need to remember why we did all this byte movement in this function, and the implications of those byte movements. We probably missed something there.

From this print investigation, we may not have been able to pinpoint exactly what's causing the problem but we've got a valuable piece of information: when setup_pics isn't called at all, all is well. When it is, mysterious double faults and exceptions occur.

Referring back to the Programmable Interrupt Controller posts, the PICs::init function remaps the PICs and enables interrupts to be received from the timer and the keyboard.

We still need more information to determine the cause of this problem. After a bunch of guided Google searches and web page jumps, it shouldn't take long to find the I Can't Get Interrupts Working page on the OSDev wiki. Under the "I'm receiving a double fault after enabling interrupts" subheading, there's some interesting information.

It essentially says that a timer interrupt could cause a double fault soon after interrupts are enabled if its entry is not already in the IDT. This is because timers send interrupt requests frequently.

Now, how does that apply to us here? In our code, interrupts are disabled and enabled in two places.

Here:

fn setup_gdt() {
// ...Others
interrupts::disable_interrupts();
let gdt_pointer = GDT.as_pointer();
// ...Others
interrupts::enable_interrupts();
}

And here:

// Creates a new IDT and sets it up
fn setup_idt() {
IDT.call_once(|| {
let mut idt = IDT::new();
let sel = gdt::CS.read();
idt.breakpoint = Entry::exception(ServiceRoutine(breakpoint_handler), sel);
idt.page_fault = Entry::exception(ServiceRoutineWithErrCode(page_fault_handler), sel);
idt.double_fault = Entry::exception(ServiceRoutineWithNoReturn(double_fault_handler), sel);
idt.interrupts[0] = Entry::interrupt(ServiceRoutine(timer_handler), sel);
idt.interrupts[1] = Entry::interrupt(ServiceRoutine(keyboard_handler), sel);
idt
});
let pointer = IDT.as_pointer();
interrupts::disable_interrupts();
interrupts::load_idt(&pointer);
interrupts::enable_interrupts();
}

That is, before and after the GDT is loaded and before and after the IDT is loaded. Could it be that enabling interrupts after loading the GDT could be the cause of this error? That would make sense because the timer interrupt happens very frequently and if the entry to handle it is not yet there, a double fault will occur.

But the problem with this hypothesis is that the "Reached here" got printed after executing both setup_gdt and setup_idt. If enabling interrupts in these functions caused the double faults, why do we see the effects long after? Well, perhaps it's because timer interrupts happen every 100 milliseconds. From a computer's perspective, that's a fairly long amount of time, perhaps enough to start executing the PICs::init function.

Perhaps interrupts should be completely disabled through this setup process and enabled only after the GDT, IDT, and PICs have been initialized.

Let's try doing that and see what happens.

fn setup_gdt() {
// ...Others
// DELETED: interrupts::disable_interrupts();
let gdt_pointer = GDT.as_pointer();
// ...Others
// DELETED: interrupts::enable_interrupts();
}
// Creates a new IDT and sets it up
fn setup_idt() {
IDT.call_once(|| {
let mut idt = IDT::new();
let sel = gdt::CS.read();
idt.breakpoint = Entry::exception(ServiceRoutine(breakpoint_handler), sel);
idt.page_fault = Entry::exception(ServiceRoutineWithErrCode(page_fault_handler), sel);
idt.double_fault = Entry::exception(ServiceRoutineWithNoReturn(double_fault_handler), sel);
idt.interrupts[0] = Entry::interrupt(ServiceRoutine(timer_handler), sel);
idt.interrupts[1] = Entry::interrupt(ServiceRoutine(keyboard_handler), sel);
idt
});
let pointer = IDT.as_pointer();
// DELETED: interrupts::disable_interrupts();
interrupts::load_idt(&pointer);
// DELETED: interrupts::enable_interrupts();
}
#[no_mangle]
extern "efiapi" fn efi_main(
handle: *const core::ffi::c_void,
sys_table: *mut SystemTable,
) -> usize {
// ...Others

boot_services.exit_boot_services(handle).unwrap();

interrupts::disable_interrupts();
setup_gdt();
setup_idt();
setup_pics();
interrupts::enable_interrupts();

// ...Others
}

Running this over and over gives just slightly non-deterministic outputs. Most runs end up with "Reached here" on the screen twice, with no errors:

Reached Here Twice With No Error

The other runs are very similar, but with one small difference: at the top left corner, about half a block is drawn:

Half A Block Is Drawn

The computer had started drawing the block but for some mysterious reason, it didn't finish it. It just stopped execution completely.

At first, it may seem that our change of where the enabling and disabling of interrupts are performed didn't change anything, but it did. If you run this code over and over, you'll notice that there are no more double fault errors. "Reached here" gets printed twice every single time, meaning that setup_pics is now working correctly.

Okay, it seems like it was actually that timer interrupt stuff that was the culprit there.

But our mystery isn't over yet. Why does the computer stop executing when it's time to draw stuff on the screen?

It's almost like there's some devil that just interrupts the execution and prevents it from continuing.

Speaking of interrupts, maybe they could be the culprit. Every 100ms, the timer interrupt is sent and the processor switches to its interrupt service routine to execute it. Perhaps this interrupt is somehow not allowing the processor to continue execution?

To test this hypothesis, let's disable timer interrupts and see if everything works as intended:

In machine/pics.rs:

impl PICs {
pub fn init(&mut self) {
// ...Others
let mut wait = || wait_port.write(0);

/* DELETED:
use core::fmt::Write;
use crate::SCREEN;
*/


// ...Others
// DELETED: write!(SCREEN.lock().as_mut().unwrap(), "Reached here");

self.first.data.write(self.first.offset);
wait();
// ...Others

// DELETED: self.first.data.write(0b11111100);
self.first.data.write(0b11111101); // NEW
self.second.data.write(0b11111111);
// DELETED:write!(SCREEN.lock().as_mut().unwrap(), "Reached here");
}
}

Referring back to this post and this post we disable the timer interrupt by setting the bit which corresponds to its enablement in the interrupt mask to 1 (meaning disabled).

Running the code now should give you the game scene. No errors. No mysterious stoppings.

Once again, the timer interrupt is the culprit. Since its service routine gets executed every 100ms, there must be something in the function that somehow prevents the computer from making progress after a visit.

extern "x86-interrupt" fn timer_handler(frame: interrupts::InterruptStackFrame) {
let screen = SCREEN.lock().as_mut().unwrap();
event_hook::send_event(EventInfo::Timer);
PICS.lock().end_of_interrupt(0);
}

First, the SCREEN static is accessed, then the event hooker is notified of the event, then the PICS end_of_interrupt function is called.

After a moment of serious thought, it should become clear why this innocent-looking service routine was such a source of frustration.

The SCREEN static is a Mutex instance and to access the value in it, the service routine first has to acquire the lock. Now here's the problem: what if the lock had already been acquired but not released? What happens then? Well, then the SCREEN's lock function will remain in a loop waiting for that lock to get released and if it's never released, then the result will be an infinite loop. A never returning function.

The processor only resumes normal execution after the service routine function has returned.

Think about it. To print the string "Reached here", that lock must be acquired. If after acquiring the lock, a timer interrupt occurs, normal execution will be suspended and that lock will still remain acquired by whichever function that acquired it. When the timer service routine reaches the SCREEN.lock, it will wait for the lock to get unlocked: an event that will never happen because the function never returns.

In other words, we have a deadlock.

A deadlock is what happens when two things wait for the other to do something. Forever. In our case, a deadlock occurs because the service routine is waiting for the SCREEN's lock to get released. But the code that holds that lock is waiting for control of the processor to be returned back to it.

For control of the processor to be returned, the service routine has to return. For the service routine to return, the lock has to be released. For the lock to be released, control of the processor must be returned.

The service routine is waiting for an event that will never happen.

Deadlock.

Of course, this explains why only "Reached here" could be printed or only part of a string could get printed. If, in the process of printing a string, the timer interrupt occurs, the service routine will end up waiting forever because the code that's printing the string already has the lock.

And there we have it: the devil that just interrupts execution and prevents it from continuing.

This also explains why only half a block could get drawn or no blocks get drawn at all. Because of a deadlock occurring in the timer interrupt.

To fix this:

extern "x86-interrupt" fn timer_handler(frame: interrupts::InterruptStackFrame) {
// DELETED: let screen = SCREEN.lock().as_mut().unwrap();
event_hook::send_event(EventInfo::Timer);
PICS.lock().end_of_interrupt(0);
}

The screen wasn't even being used at all. We probably used it for testing and then forgot to remove it.

Now, enabling the timer interrupt:

impl PICs {
pub fn init(&mut self) {
// ...Others
// DELETED: self.first.data.write(0b11111101);
self.first.data.write(0b11111100); // NEW
self.second.data.write(0b11111111);
}
}

And running again will give you the game scene. No errors. No mysterious stoppings.

Elementary, my dear Watson.

This concludes our venture into building infrastructure for event handling. Next on our roadmap is figuring out how to make the game a real game.

Take Away

For the full code up to this point, go to the repo

In The Next Post

We'll start figuring out how to make the game a real game.

References