Anatomy of the Flower compiler, part 2

From syntax tree to bytecode

structure at language-examples/ex1.flow (0)
└── definition of function 'main' with return type 's32' at 1[4] (0)
    └── named scope '::__a0' at 1[26] (0)
        └── return statement at 2[0] (0)
            └── literal value '1' of type 'integer' at 2[11] (0)

We were left with this kind of tree and the question what is the next problem to solve.

Well, taking a closer look at two of the lines here reveals something we need to take care of next.

-- 'main' with return type 's32' --

and

-- literal value '1' of type 'integer' --

Welcome to the wonderful world of

Type checking

This far, we've not been too concerned about types. For all we've cared they have been just identifiers. But as we move towards the Flower bytecode, we need to take types into account. Handling just basic types is not too weird, but since we want to handle the integer type we got from the integer literal as well, we end up with something a bit more complex.

Flower bytecode is typed. And the Flower virtual machine itself handles 9 different kinds of types, which are represented by an enum

enum class base_type : uint8_t
{
    none        = 0b0000,
    sint,
    uint,
    boolean,
    real,
    integer,
    variant,
    function,
    object,
    type,

    invalid
};

none and invalid aren't really types in themselves. invalid marks strict type errors and are fatal compilation errors. none is basically a void type.

sint and uint are fixed-size integer types of some bit-depth. boolean should be self-explanatory. real is a floating point of some bit-depth; it would read float if that wouldn't be a keyword in C++. Currently it can only be either C float or double. object is a compound type, variant is a tagged union, function is a function and integer is a variable-size integer (i.e. a bigint). There's also the type type, but we're just going to ignore that for now.

In any case, integer literals in Flower are of type integer, and we need our main function to return a s32 (a 32-bit sint). As we generate bytecode for expressions, we mark their type and check it here. The code for doing so is pretty much as follows (some things omitted for brevity).

auto expression = create_expression(prg,
                                    expression_node,
                                    ctx,
                                    errors);

if (expression.is_value()) {
    auto& value = expression.get_value();

    if (value.var_type != expected_type) {
        if (try_builtin_conversion(value, 
                                   expected_type)) {
            // we good, generate code
        }
        else if (not find_conversion(
            value.var_type,
            expected_type,
            ctx))
        {
            errors.push_back(...

Our create_expression function here returns a expression_result that is either bytecode for the expression directly, or a type-value pair, which we check with expression.is_value(). Luckily for us, we don't need to think about other branches than both is_value() and try_builtin_conversion() succeeding with our simple code.

Builtin conversions and the `integer` type

Flower is pretty strict about type conversions. Currently, it does not even allow non-narrowing casts like s32 -> s64 implicitly. But we don't want to sprinkle explicit casts all around our code, so there's an exception.

The integer-type can be implicitly cast to either an unsigned or signed fixed-size integer. But only if the compiler can prove it fits. This requires the compiler to do range analysis with each of the integer variables. The compiler can only do this for simple cases for now, but later on it will need more sophisticated approach to deal with I/O events, etc.

Since our integer is just a literal that gets instantly assigned to the return value here, and our integer literal fits in the s32 type, we can just rest easy, and enjoy the fact that the try_builtin_conversion didn't error out and the compiler can just implicitly convert it to a s32. (Or rather, just treat the value as s32 from the beginning)

We end up with following bytecode disassembly

function main[<osdefault>]:
    ret s32 1

...well that was a lot of work to just get to a slightly different representation, but here we are describing a small binary sequence of just few bytes in text.

Internals and getting to native code

A Flower program is represented in the compiler as

struct function
{
    std::string     disassemble() const noexcept;

    std::string     identifier;

    type            return_type;
    smallvec<type>  parameter_types;
    std::string     abi_ident;

    buffer_type     bytecode;
};

struct program
{
    using error_code = int;

    std::string disassemble() const noexcept;
    bool is_empty() const noexcept {
        return functions.empty();
    }

    uint8_t* find_function_ip(
        const std::string& func) noexcept;

    smallvec<function>  functions;
    smallvec<type>      types;
    program_metadata    metadata;
};

which shouldn't contain any too interesting parts. smallvec is a small-size optimised vector, from which the compiler gets a lot use out of. It's one of the container types that I've built for the compiler. It will probably be a part of ftl later on, once it gets a bit more stable, better tested and gets rid of some non-freestanding parts.

buffer_type is actually also a smallvec with some extra functions so it's trivial to read and write values of different types.

My first attempt at this was way too low-level. While I technically could've stuffed all the extra necessary info into metadata and just have one buffer of bytecode and the metadata, that was way too complicated to handle. Not to mention it really didn't play nice with converting onwards to different bytecode format (e.g. for LLVM).

Our previous result

function main[<osdefault>]:
    ret s32 1

however, is (at least for now) pretty simple to convert to LLVM code using the tools given by LLVM itself. QBE is a bit harder since it requires us to deal with text format directly but should be doable as well.

But for now, we can run our bytecode program with

$ ./flvm ex1.fbc
$ echo $?
1

That is as far as our compiler can go at this stage. LLVM codegen will probably be in the agenda once more of the basic stuff functions and the Flower syntax has stabilized more.

But it'll probably be a good half a year before I even want to start to tackle that in depth, other than a quick test run here and there to make sure that our own bytecode representation doesn't get too difficult to turn into LLVM bytecode. Otherwise, running stuff with the VM is good enough for now.

Posted 2022-10-21 in compiler implementation

Anatomy of the Flower compiler, part 2

From syntax tree to bytecode

Type checking

Builtin conversions and the integer type

Internals and getting to native code

Builtin conversions and the `integer` type