Anatomy of the Flower compiler, part 1

From a simple program to syntax tree

Let's make a really simple program and go through what the Flower compiler needs to do to get through that.

def func main() -> s32 as {
    return 1;
}

Well this is easy, it just needs to produce the following assembly.

mov eax, 1
ret

But since we need to work with generic code, not just this example, there are a couple of extra steps.

Lexing

The first thing most compilers do, is to turn the text representation into tokens. A token in my lexer looks like this.

struct token
{
    token_type          type {token_type::unknown};
    source_location     loc;
    std::string         value;
};

source_location holds the information where the token was located in the originally parsed source code. It is going to be useful later on. Other two variables are pretty self-explanatory. token_type is an enumeration that has values such as identifier, integer_literal, left_parenthesis, right_parenthesis, end_of_file etc.

The lexer is not really too interesting part, and the only unusual thing we do here is that we basically parse half of the stuff as identifiers, which will come up later. Mostly the lexer is quite similar to any other lexer, including the one in kaleidoscope.

Anyways, we parse the textual input and make a buffer of tokens. Some lexers/parsers handle tokens as streams, but in this compiler's case, it's just std::vector<token> that has all tokens a single file (or some source section, we accept buffer inputs) contains.

There are some lexer generators, but I chose to hand-roll my own, since its pretty fast to do and gives me full control over the tokenisation process and I get rid of one extra dependency that saves me less than a day's worth of coding.

At this point, we get output of tokens, which printed out, looks like this

at <ex1.flow, 1[0]> definition 
at <ex1.flow, 1[4]> function keyword 
at <ex1.flow, 1[9]> identifier value: main
at <ex1.flow, 1[14]> left parenthesis 
at <ex1.flow, 1[15]> right parenthesis 
at <ex1.flow, 1[16]> arrow right 
at <ex1.flow, 1[19]> identifier value: s32
at <ex1.flow, 1[23]> 'as' keyword 
at <ex1.flow, 1[26]> left brace 
at <ex1.flow, 2[0]> return statement 
at <ex1.flow, 2[11]> integer literal value: 1
at <ex1.flow, 2[13]> end of statement 
at <ex1.flow, 3[0]> right brace

Generating abstract syntax tree (semantic graph)

Next step, common to most compilers is to turn this stream of tokens into a structure more easily parseable by the compiler. This structure is the AST, or abstract syntax tree. It contains necessary information about the contents of the source code.

As you can probably figure out from the name, we generate a tree-like structure to represent the code from our tokens. In Flower, identical nodes of the tree are connected so that what we actually get is a semantic graph, but it's not very important detail at this point.

This is what our nodes in the tree look like

struct node_entry {
    source_location     loc;
    node::asg_data_ptr  storage_offset = node::empty{};
    node_index_type     parent_index = EMPTY_NODE_INDEX;
};

node_index_type is just an alias for ssize_t, and source_location we get directly from the tokens we used to generate the node. asg_data_ptr is a variant that can hold a pointer to any of the tree node types, where we store the actual important data bits. They themselves are stored sequentally in the memory.

For example, asg_data_ptr to a return statement points to a node that looks like this

struct return_statement {
    constexpr static const std::string_view idstr 
        = "return statement";
    asg_node return_expr;

    asg_node_list children() noexcept {
        return { return_expr };
    }

    const asg_node_list children() const noexcept {
        return { return_expr };
    }
};

We need asg_node return_expr; here, because the value returned might be more complex expression than a simple literal value. So, it is a link to the expression for the value. Each node type also needs a way to tell what its children are, so the graph can be traversed. So every node type must provide a list of its children with the children() function, if it has any.

From this stage of the compilation, we end up with a tree like this, with our return statement highlighted

structure at language-examples/ex1.flow (0)
└── definition of function 'main' with return type 's32' at 1[4] (0)
    └── named scope '::__a0' at 1[26] (0)
        └── return statement at 2[0] (0)
            └── expression chain at 2[11] (0)

The file is called "structure" here, but it doesn't really hold much meaning currently. Originally I thought that all files would just parse as being structures, but it hasn't led to anywhere yet. I will probably need to get back at that when I think about modules.

But this is the first point where we differ from the most basic compilers. Here our expressions are not even considered yet, we just mark them as "expression chains" and carry on.

Why we do this and most compilers do not, is because we actually need some of the declarations to be available when we are parsing through our expressions. This is because Flower allows user-defined operators. We do this first step so we know what the identifiers represent when we start unraveling our expression chains. The compiler adds the expression chain nodes to a set that we later go through to open each chain.

Chain nodes basically look like this

struct expression_chain {
    constexpr static const std::string_view idstr
        = "expression chain";
    std::vector<token>  tokens;
    scope_node          scope;
};

So it's basically just the scope where the chain is located, and the tokens associated with the chain itself.

We also created a scope/definition graph, that looks like this at the moment. This basically just tells us the scoping of different symbols so we can access them more quickly later on.

GLOBAL
├─  -
├─  +
├─  =
├─  main
└── __a0 (leaf)

At this point, we can already check if our definitions are available, where they are located and call the user out if they are using something that is not. We have three global operators defined for us already, and those are +, - and =. While those are built-in operators, at this stage, we treat them as any other identifier.

We also note that the compiler created a scope __a0, which is the scope for the main function. The compiler keeps track of these so it can later handle scoping easier.

Opening expression chains

Next we iterate through the set of expression chains and open them up. After the first round of generation, we know what definitions are available and what each of the tokens actually means. This is important because for example a bit more complex expression like

(-v1 + (v2())) - v3() = 4

tokenizes to

at <stage0.flow, 13[7]> left parenthesis 
at <stage0.flow, 13[9]> identifier value: -
at <stage0.flow, 13[10]> identifier value: v1
at <stage0.flow, 13[12]> identifier value: +
at <stage0.flow, 13[14]> left parenthesis 
at <stage0.flow, 13[16]> identifier value: v2
at <stage0.flow, 13[18]> left parenthesis 
at <stage0.flow, 13[19]> right parenthesis 
at <stage0.flow, 13[20]> right parenthesis 
at <stage0.flow, 13[21]> right parenthesis 
at <stage0.flow, 13[22]> identifier value: -
at <stage0.flow, 13[24]> identifier value: v3
at <stage0.flow, 13[27]> left parenthesis 
at <stage0.flow, 13[28]> right parenthesis 
at <stage0.flow, 13[29]> identifier value: =
at <stage0.flow, 13[31]> integer literal value: 4
at <stage0.flow, 13[33]> left brace

And you can see everything that's not a parenthesis is basically an identifier. If we didn't go through this in two steps, we would have real trouble trying to figure out what each of these identifiers actually means.

The good part here is that since we have all the required declarations from the first round to parse everything, and we have the references to each of the nodes we need to open, we could also do this opening of expressions in parallel without too many problems.

Anyways, now armed with the declarations for identifiers, we can open the expression chains and get to an AST that looks like

structure at language-examples/ex1.flow (0)
└── definition of function 'main' with return type 's32' at 1[4] (0)
    └── named scope '::__a0' at 1[26] (0)
        └── return statement at 2[0] (0)
            └── literal value '1' of type 'integer' at 2[11] (0)

Which is our final result for now. Next time we'll see how we turn the AST to Flower bytecode and what still has to happen afterwards.

For the sharp-eyed: Can you already notice a problem we have in the AST we ended up with?

Posted 2022-10-14 in compiler implementation