Flower is a programming language

Old problem, new solution

Intro

I've had a nagging problem on the side, and I wasn't exactly satisfied with the solutions I came up with. In hindsight the current solution seems a bit obvious to me, but sometimes things just don't cross our minds when they should. A quick warning: there's going to be code ahead, and not all of it is going to be pretty. So whether or not you're the coding type, prepare yourself.

The Problem

Since Flower has its own IR that needs to be both runnable and transpilable to LLVM, I've had trouble both finalising the format and having a good way to debug the code generation. I used to have completely separate implementations for generating the bytecode, executing the bytecode, and disassembling the bytecode. That proved to be quite cumbersome: if I had a bug in one of them, it was a pain to figure out which one was actually wrong. Codegen is still distinct from the other parts, but it's much less of a problem now than it used to be.

How things were implemented

The disassembler was written in a simple way: a switch on the opcode, then a call to a function that disassembles that particular instruction. This was a working solution and was "good enough" for a while.

This was basically the core of what I had:

static std::string disassemble_instruction(
    uint8_t*& ip,
    const uint8_t* end_ptr) noexcept
{
    if (ip >= end_ptr)
        return std::string(INVALID_INSTRUCTION);

    const auto instruction = static_cast<opcode>(*ip);
    switch (instruction) {
        case opcode::ret:
            return disassemble_return_statement(++ip,
                                                end_ptr);
        case opcode::add:
            return disassemble_add_instruction(++ip,
                                               end_ptr);
        // ... cut here for brevity
    }
    ++ip;
    return fmt::format("unhandled instruction: {}",
        to_string(instruction));
}

and then I would implement the bytecode handling for each instruction in the appropriate function. I don't think this was an entirely unreasonable approach, but we'll get to the problems it created shortly.

Meanwhile, the virtual machine used a different approach: I had a table of label addresses and jumped from one instruction to the next more directly using computed gotos.

#define OPCODE(x) \
    [static_cast<uint8_t>(fir::opcode::x)] = &&opcode ## _ ## x

// ... a bunch of code clipped off here ...

constexpr const static void* instructions[] = {
    OPCODE(noop),
    OPCODE(halt),
    OPCODE(zero),
    // ...
};

    goto* instructions[*ip++];

    opcode_noop: {
        // code to handle noop
    }
goto* instructions[*ip++];

    opcode_halt: {
        // code to handle halt
    }
goto* instructions[*ip++];

    // ...

There are reasons why I went this route instead of a simpler switch for the main bytecode loop, but those would probably be worth a couple of blog posts of their own.

Neither of these would've been a problem on its own, but I now had two duplicate read-and-do-whatever runners for the same bytecode. My virtual machine is quite simple, but it still has plenty of instructions to deal with.

For example, the add opcode would require me to do this in the VM code:

opcode_add: {
    if constexpr (Config::check_bounds == true) {
        if (end_ptr < ip + 4) {
            instruction_out_of_bounds = true;
            goto opcode_halt;
        }
    }
    uint8_t reg_lhs = *ip++;
    uint8_t reg_rhs = *ip++;
    uint8_t reg_res = *ip++;

    if constexpr (Config::trace_instructions)
        fmt::print("  add r{} + r{} -> r{}\n", reg_lhs,
                                               reg_rhs,
                                               reg_res);

    registers[reg_res] = registers[reg_lhs] 
                       + registers[reg_rhs];
}
goto* instructions[*ip++];

And then the disassembler would do pretty much the same:

static std::string disassemble_add_instruction(
    uint8_t*& ip,
    const uint8_t* end_ptr) noexcept
{
    uint8_t reg_lhs = checked_read(ip, end_ptr);
    uint8_t reg_rhs = checked_read(ip, end_ptr);
    uint8_t reg_res = checked_read(ip, end_ptr);

    return fmt::format("add %{}, %{} -> %{}", reg_lhs,
                                              reg_rhs,
                                              reg_res);
}

This wouldn't have been as problematic if the bytecode format had been stable. But the format is still changing. (Which is actually what I'm working on stabilising a bit at the moment!)

After every format change, neither of these implementations would be correct anymore, and having to keep two pretty separate implementations in sync was a bit of a time-drain. This duality also made it much harder to reason about how the instructions should be handled in the first place. There was a lot of extra stuff I had to keep in mind while working on this.

And it was getting ugly. You can see the if constexpr blocks in the add implementation for the VM above. I wanted a way to trace instructions for debugging, but without a runtime branch for every single instruction the VM runs. And I didn't want to have to recompile with different flags just to toggle it; I wanted to be able to switch on the fly.
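To give an idea of what that if constexpr setup boils down to, here's a minimal self-contained sketch. The names traced_config, fast_config and do_add are made up for illustration and not the actual Flower code; the point is only that the flags are compile-time constants, so the disabled branches never exist in the binary.

#include <fmt/core.h>

// Illustrative only: a bundle of compile-time flags.
struct traced_config {
    static constexpr bool check_bounds       = true;
    static constexpr bool trace_instructions = true;
};

struct fast_config {
    static constexpr bool check_bounds       = false;
    static constexpr bool trace_instructions = false;
};

template <typename Config>
void do_add(int lhs, int rhs, int& res)
{
    // Resolved at compile time: do_add<traced_config> prints the trace,
    // do_add<fast_config> compiles down to just the addition.
    if constexpr (Config::trace_instructions)
        fmt::print("  add {} + {}\n", lhs, rhs);
    res = lhs + rhs;
}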

The new (current) solution

With all those if constexprs around, I started to think that there must be a better way to handle these cases than the quick ad-hoc solution. And since I was already in the realm of compile-time programming and template magic, what better way to solve it?

template <vm_runner Internals>
class vm : public Internals

That one declaration is pretty much the key to solving every problem mentioned. I have no idea why I didn't come up with it sooner; it seems so obvious in hindsight. I get to keep my precious label-jumping in the virtual machine, I get to read the bytecode in only one place, and as a bonus, the language server is much happier with this solution.
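For context, vm_runner is a concept constraining what the Internals policy has to provide. The sketch below is hypothetical rather than the actual concept from Flower's sources; it only captures the shape that the calls later in this post rely on (check_overflow and add), with register_index assumed to be a small integer type.

#include <concepts>
#include <cstddef>
#include <cstdint>

using register_index = std::uint8_t;  // assumption for this sketch

// Hypothetical sketch: the policy type must provide whatever the
// dispatch loop delegates to.
template <typename T>
concept vm_runner = requires(T t,
                             std::uint8_t* ip,
                             const std::uint8_t* sentinel,
                             register_index reg)
{
    { t.check_overflow(ip, sentinel, std::size_t{}) }
        -> std::convertible_to<bool>;
    t.add(reg, reg, reg, ip);
};

The vm template then just inherits from whichever policy it's given, which is what makes the flvm::vm<disassembler> and flvm::vm<executor> pairing further down possible.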

The core of the VM didn't change much. It's just that now, instead of the previous opcode_add, I have something like this:

opcode_add: {
    if(this->check_overflow(ip, sentinel,
                            sizeof(register_index) * 3))
        return;

    auto lhs = fir::read<register_index>(ip);
    auto rhs = fir::read<register_index>(ip);
    auto res = fir::read<register_index>(ip);

    this->add(lhs, rhs, res, ip);
}
goto* instructions[*ip++];

We just read the data here and then pass it on to the inherited functions. Much cleaner, and faster to compile. I did have to worry a bit about the generated assembly, but it didn't take too much to get both clang and gcc to inline the function calls even without optimisations turned on, so there's no performance cost for this either. These are the helpers that force the inlining:

#define FLVM_VM_OPERATION __attribute__((always_inline))
#define FLVM_INLINE_STATEMENT [[clang::always_inline]]
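To make that concrete, here's roughly how the two policies could differ for add. This is an illustrative sketch rather than Flower's actual classes: the member names match the calls above, register_index is assumed to be a small integer type, and the register array and output string are made up.

#include <fmt/core.h>

#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

using register_index = std::uint8_t;  // assumption for this sketch

struct executor {
    std::array<std::int64_t, 256> registers{};

    FLVM_VM_OPERATION
    bool check_overflow(const std::uint8_t* ip, const std::uint8_t* sentinel,
                        std::size_t needed) noexcept
    {
        // true means the instruction would read past the end of the bytecode
        return sentinel < ip + needed;
    }

    FLVM_VM_OPERATION
    void add(register_index lhs, register_index rhs,
             register_index res, std::uint8_t*& /*ip*/) noexcept
    {
        registers[res] = registers[lhs] + registers[rhs];
    }
};

struct disassembler {
    std::string output;

    // check_overflow would look the same as in the executor

    FLVM_VM_OPERATION
    void add(register_index lhs, register_index rhs,
             register_index res, std::uint8_t*& /*ip*/)
    {
        output += fmt::format("add %{}, %{} -> %{}\n", lhs, rhs, res);
    }
};

The dispatch loop in the vm template never knows which of these it's driving; it just reads the operands and hands them over.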

The benefits

The largest benefit here was a bit unexpected, but very welcome. Looking at the code like this, I find it much easier to reason about what the instructions should look like in the bytecode representation. And if I decide to change an instruction, both the disassembler flvm::vm<disassembler> and the virtual machine flvm::vm<executor> fail in the exact same place, making it a pretty quick check whether codegen or the runner is to blame.

I also find this easier to work with, and especially to experiment with, so it will probably cut down development time in the long run. Not to mention that I imagine it cuts the WTFs-per-minute rate considerably. (Un)fortunately, this uncovered a couple of bugs in some parts of the implementation, so I spent a day fixing those afterwards. No progress visible from the outside, but plenty of internal progress.

All in all, I just wanted to share a little technical problem and the solution I came up with. It's nice to be back to writing Flower for real. Hopefully the second half of 2023 will be better for it than the first.

Posted 2023-07-12 in implementation