Anatomy of the Flower compiler, part 2
From syntax tree to bytecode
structure at language-examples/ex1.flow (0)
└── definition of function 'main' with return type 's32' at 1[4] (0)
└── named scope '::__a0' at 1[26] (0)
└── return statement at 2[0] (0)
└── literal value '1' of type 'integer' at 2[11] (0)
We were left with this kind of tree and the question what is the next problem to solve.
Well, taking a closer look at two of the lines here reveals something we need to take care of next.
-- 'main' with return type 's32' --
and
-- literal value '1' of type 'integer' --
Welcome to the wonderful world of
Type checking
This far, we've not been too concerned about types. For all we've
cared they have been just identifiers. But as we move towards the
Flower bytecode, we need to take types into account. Handling just
basic types is not too weird, but since we want to handle the integer
type we got from the integer literal as well, we end up with something
a bit more complex.
Flower bytecode is typed. And the Flower virtual machine itself handles 9 different kinds of types, which are represented by an enum
enum class base_type : uint8_t
{
none = 0b0000,
sint,
uint,
boolean,
real,
integer,
variant,
function,
object,
type,
invalid
};
none
and invalid
aren't really types in themselves. invalid
marks
strict type errors and are fatal compilation errors. none
is basically
a void type.
sint
and uint
are fixed-size integer types of some bit-depth. boolean
should be self-explanatory. real
is a floating point of some bit-depth;
it would read float
if that wouldn't be a keyword in C++. Currently it
can only be either C float
or double
. object
is a compound type,
variant
is a tagged union, function
is a function and integer
is a
variable-size integer (i.e. a bigint). There's also the type type
, but
we're just going to ignore that for now.
In any case, integer literals in Flower are of type integer
, and we need
our main
function to return a s32
(a 32-bit sint
). As we generate
bytecode for expressions, we mark their type and check it here. The code
for doing so is pretty much as follows (some things omitted for brevity).
auto expression = create_expression(prg,
expression_node,
ctx,
errors);
if (expression.is_value()) {
auto& value = expression.get_value();
if (value.var_type != expected_type) {
if (try_builtin_conversion(value,
expected_type)) {
// we good, generate code
}
else if (not find_conversion(
value.var_type,
expected_type,
ctx))
{
errors.push_back(...
Our create_expression
function here returns a expression_result
that is
either bytecode for the expression directly, or a type-value pair, which we
check with expression.is_value()
. Luckily for us, we don't need to think
about other branches than both is_value()
and try_builtin_conversion()
succeeding with our simple code.
Builtin conversions and the integer
type
Flower is pretty strict about type conversions. Currently, it does not even
allow non-narrowing casts like s32 -> s64
implicitly. But we don't want to
sprinkle explicit casts all around our code, so there's an exception.
The integer
-type can be implicitly cast to either an unsigned or signed
fixed-size integer. But only if the compiler can prove it fits. This requires
the compiler to do range analysis with each of the integer
variables. The
compiler can only do this for simple cases for now, but later on it will need
more sophisticated approach to deal with I/O events, etc.
Since our integer
is just a literal that gets instantly assigned to the
return value here, and our integer literal fits in the s32
type, we can
just rest easy, and enjoy the fact that the try_builtin_conversion
didn't
error out and the compiler can just implicitly convert it to a s32
.
(Or rather, just treat the value as s32
from the beginning)
We end up with following bytecode disassembly
function main[<osdefault>]:
ret s32 1
...well that was a lot of work to just get to a slightly different representation, but here we are describing a small binary sequence of just few bytes in text.
Internals and getting to native code
A Flower program is represented in the compiler as
struct function
{
std::string disassemble() const noexcept;
std::string identifier;
type return_type;
smallvec<type> parameter_types;
std::string abi_ident;
buffer_type bytecode;
};
struct program
{
using error_code = int;
std::string disassemble() const noexcept;
bool is_empty() const noexcept {
return functions.empty();
}
uint8_t* find_function_ip(
const std::string& func) noexcept;
smallvec<function> functions;
smallvec<type> types;
program_metadata metadata;
};
which shouldn't contain any too interesting parts. smallvec
is a
small-size optimised vector, from which the compiler gets a lot
use out of. It's one of the container types that I've built
for the compiler. It will probably be a part of ftl
later on, once it gets a bit more stable, better tested and gets
rid of some non-freestanding parts.
buffer_type
is actually also a smallvec
with some extra
functions so it's trivial to read and write values of different
types.
My first attempt at this was way too low-level. While I technically could've stuffed all the extra necessary info into metadata and just have one buffer of bytecode and the metadata, that was way too complicated to handle. Not to mention it really didn't play nice with converting onwards to different bytecode format (e.g. for LLVM).
Our previous result
function main[<osdefault>]:
ret s32 1
however, is (at least for now) pretty simple to convert to LLVM code using the tools given by LLVM itself. QBE is a bit harder since it requires us to deal with text format directly but should be doable as well.
But for now, we can run our bytecode program with
$ ./flvm ex1.fbc
$ echo $?
1
That is as far as our compiler can go at this stage. LLVM codegen will probably be in the agenda once more of the basic stuff functions and the Flower syntax has stabilized more.
But it'll probably be a good half a year before I even want to start to tackle that in depth, other than a quick test run here and there to make sure that our own bytecode representation doesn't get too difficult to turn into LLVM bytecode. Otherwise, running stuff with the VM is good enough for now.