Representing LoxMocha in code: Fun with C++ Variants and Visitors
December 20, 2025 · Emma Tomlinson
The first problem encountered when implementing any language is how to represent, traverse, and manipulate the structure of its code. We want a representation that makes it easy to interact with that structure and to extend it with new language features. Any introductory book on compilers will introduce Abstract Syntax Trees (ASTs) within the first few chapters. In short, an AST is a tree representation of just the structure of the code: which expressions are nested in which other expressions, what order statements appear in, what functions are defined, and so on. But how best to represent an AST still leaves many decisions open.
The implementation language of choice often dictates how we represent ASTs.
In a language like Java, we would likely use a class hierarchy with inheritance.
So you can imagine you have a base class, Node, and derived classes, Expr and Decl.
You can then derive more classes from them, such as BinaryExpr, LiteralExpr, VariableDecl, and so on.
We can easily imagine how we could further extend this class hierarchy to add more language features. Notably, one of the most famous compiler toolchains, LLVM, uses this approach to represent its AST in Clang.
In other languages, such as Rust or Haskell, we would likely use algebraic data types (ADTs), called enums in Rust. Instead of a class hierarchy, we define a type that can be one of many possible variants. Any expression is then one of the variants defined in the type, meaning every possible kind of expression is known at compile time. In a class-hierarchy model, by contrast, anyone can add a new derived class at any time without updating any of the existing code.
A class hierarchy is useful when we have sufficiently defined all the operations we want to perform on the base class and wish to easily extend the hierarchy with new derived classes. But it leads to difficulties when we want to add new operations to the base class, since we then need to update every derived class to implement them, even the classes the new operation is not relevant to. In an ADT model, by contrast, we can easily add new operations by defining new functions that inspect the ADT's variant and unwrap it to perform the operation. But if we want to add new variants to the ADT, we suddenly need to update all existing operations to correctly handle (or ignore) them. This trade-off is the classic "expression problem".
Generally, for programming languages, it's rare that we add new kinds of expressions or statements, but we often add new operations on the AST, such as searching for certain patterns, transforming the tree, or generating code. We therefore want to make it easy to add new operations, and we can accept that adding new kinds of expressions or statements is harder; that extra friction helps ensure care is taken when adding new language features and that their handling is correctly implemented everywhere it is needed.
As mentioned earlier, object-oriented languages such as Java and C++ often represent ASTs with a class hierarchy (particularly Java and older C++).
But 'modern' C++ has features that make it easy to implement ADT-like structures.
Specifically, we have std::variant, which lets us define tagged unions.
Tagged unions have been common in functional programming languages for a long time, but more recently found their way into procedural and object-oriented languages, arriving in C++ as std::variant in C++17.
We use std::variant by listing the types it can hold via a template parameter pack, where each type in the pack is a possible alternative of the variant.
For example, if we wanted to represent a simple expression AST of literals, identifiers, binary expressions, function calls, and if expressions, we could define it as:
struct expr_t : variant<
    literal_t,
    identifier_t,
    binary_t,
    call_t,
    if_t> {};
Now our expr_t can hold any of the defined types.
We can see how similar this is to defining an ADT in Haskell:
data Expr
    = Literal Literal
    | Identifier Identifier
    | Binary Binary
    | Call Call
    | If If
In Haskell, to determine which variant we have and extract its value, we would use pattern matching.
Unfortunately, C++ does not have pattern matching as a language feature [1].
So instead, we are stuck using library functions to simulate pattern matching.
The fundamental functions for querying and extracting values from a std::variant are std::holds_alternative and std::get/std::get_if.
Let's say we wanted to check whether our expression is an if expression, and then print whether it has an else branch. We could do that with the following code:
if (holds_alternative<if_t>(expr)) {
    const auto& if_expr = get<if_t>(expr);
    if (if_expr.else_branch) {
        print("has else branch");
    } else {
        print("no else branch");
    }
}

// Or using get_if
if (const auto* if_expr = get_if<if_t>(&expr)) {
    if (if_expr->else_branch) {
        print("has else branch");
    } else {
        print("no else branch");
    }
}
But even from this small example, we can see how cumbersome it is to use these functions to interact with variants, and how quickly the control flow could get convoluted and confusing.
Instead, we can emulate pattern matching with std::visit and visitor objects.
This way, we can say, "I want to perform this operation on each kind of expression", and std::visit will do all the annoying unwrapping and function calling for us.
To define a visitor, we simply define some struct or class with overloaded operator() functions for each variant (hint: we can use templates, auto, and concepts to make this easier!).
We then pass an instance of this visitor to std::visit along with the variant to “visit”.
Let’s say we wanted to print out the structure of an expression AST.
We might have a visitor called expr_printer_t.
We can then define, for each kind of expression, how it should be printed as so:
struct expr_printer_t {
    void operator()(const literal_t& lit) {
        print("literal: {}", lit.value);
    }
    void operator()(const identifier_t& id) {
        print("identifier: {}", id.name);
    }
    void operator()(const binary_t& bin) {
        print("binary: {}", bin.op);
        visit(*this, *bin.left);
        visit(*this, *bin.right);
    }
    void operator()(const call_t& call) {
        print("call:");
        visit(*this, *call.callee);
        for (const auto& arg : call.arguments) {
            visit(*this, arg);
        }
    }
    void operator()(const if_t& if_expr) {
        print("if:");
        visit(*this, *if_expr.condition);
        visit(*this, *if_expr.then_branch);
        if (if_expr.else_branch) {
            visit(*this, *if_expr.else_branch);
        }
    }
};
Then, to print an expression, we can do the following single function call:
std::visit(expr_printer_t{}, expr);
All the complexity of unwrapping the variant and making sure the correct function is called is handled by std::visit, leading to much simpler and easier-to-read code.
To show how we could do partial matching, we could define another visitor that returns true if the expression is a leaf and false otherwise.
We do this by defining a default operator() overload that matches any type, and then a specific overload for literal_t and identifier_t.
struct is_leaf_t {
    bool operator()(const auto&) {
        return false;
    }
    bool operator()(const literal_t&) {
        return true;
    }
    bool operator()(const identifier_t&) {
        return true;
    }
};
Here, we use function overloading to perform pattern matching on the kind of expression the variant holds.
So let's move on to a more practical example. Suppose we have a LoxMocha function [2] that computes the Fibonacci number for a given input, and we want to output a graphical representation of the AST for this function. Having these kinds of visualisations is very useful for understanding how the input text is transformed into the AST, and how that AST is manipulated during compilation. So here is the recursive Fibonacci function in LoxMocha:
fun fib(n: u32): u32 =>
    if n <= 1 => n
    else fib(n - 1) + fib(n - 2)
Now, how would we represent this function as an AST using the variant approach we've discussed so far, and how do we output a graphical representation of it? First, let's extend our AST representation to include function declarations, simple types, and the concrete structure of our expressions [3].
// Forward declare expr_t to allow recursive definitions.
struct expr_t;

// A literal value in the language.
using literal_t = variant<int, bool>;

// A string representing some identifier or operator.
using identifier_t = string;

// An expression of the form expr op expr
struct binary_t {
    identifier_t op;
    unique_ptr<expr_t> left;
    unique_ptr<expr_t> right;
};

// An expression of the form expr(expr...)
struct call_t {
    unique_ptr<expr_t> callee;
    vector<expr_t> arguments;
};

// An expression of the form if cond => then_branch [else => else_branch]
struct if_t {
    unique_ptr<expr_t> condition;
    unique_ptr<expr_t> then_branch;
    unique_ptr<expr_t> else_branch; // optional, if nullptr then no else.
};

using type_t = identifier_t;

struct function_t {
    identifier_t name;
    vector<pair<identifier_t, type_t>> parameters;
    type_t return_type;
    expr_t body;
};

using decl_t = variant<function_t>;
So now we have all the necessary definitions to represent our Fibonacci function. We can now create a visitor that prints a representation of the AST in the D2 diagramming language. So let’s build this up piece by piece, starting with literals and identifiers.
struct d2_printer_t {
    // Keep track of the next id to use for nodes.
    int next_id;

    int operator()(const literal_t& lit) {
        if (holds_alternative<int>(lit)) {
            println("node{}: literal_t: {}", next_id, get<int>(lit));
        } else if (holds_alternative<bool>(lit)) {
            println("node{}: literal_t: {}", next_id, get<bool>(lit));
        }
        return next_id++;
    }

    int operator()(const identifier_t& id) {
        println("node{}: identifier_t: {}", next_id, id);
        return next_id++;
    }
};
So let’s talk about how this works so far.
Each node in the D2 diagram needs a unique identifier so that we can draw edges between two nodes.
Therefore, we need to keep track of the next ID to use as we traverse the AST, and which ID was used to print the current node, so we can return it to the caller and let them draw edges to it.
We do this by maintaining a member variable, next_id, that keeps track of the ID to use for the next node.
We will see more about how this is used later when we have to recurse into child nodes!
Now that we have our leaves defined, we can move on to binary expressions that require us to recurse into child nodes.
// Helper function to traverse a child node and print the edge.
void traverse(int parent_id, const auto& child) {
    // The visit call prints the child and returns its id; the visitor
    // advances next_id internally.
    int child_id = visit(*this, child);
    println("node{} -> node{}", parent_id, child_id);
}

int operator()(const binary_t& binary) {
    int current_id = next_id++;
    println("node{}: binary_t: {}", current_id, binary.op);
    // Print left node.
    traverse(current_id, *binary.left);
    // Print right node.
    traverse(current_id, *binary.right);
    return current_id;
}
Here we can see how to print out the current node, then recursively call visit on its left and right child nodes to build the full AST representation.
This allows us to build the full tree by having each function in the visitor handle its own node and correctly traverse its children.
Finally, let’s complete our visitor by adding the remaining expressions and function declaration!
int operator()(const call_t& call) {
    int current_id = next_id++;
    println("node{}: call_t", current_id);
    // Print callee node.
    traverse(current_id, *call.callee);
    // Then print each argument node.
    for (const auto& arg : call.arguments) {
        traverse(current_id, arg);
    }
    return current_id;
}
One interesting thing to note about call expressions is that they can have an arbitrary number of arguments, so we have to loop over the arguments and traverse each one in turn.
int operator()(const if_t& if_expr) {
    int current_id = next_id++;
    println("node{}: if_t", current_id);
    // Print the condition node.
    traverse(current_id, *if_expr.condition);
    // Print the then branch node.
    traverse(current_id, *if_expr.then_branch);
    // Print the else branch node if it exists.
    if (if_expr.else_branch) {
        traverse(current_id, *if_expr.else_branch);
    }
    return current_id;
}
Finally, we need to top it all off with our root function_t node.
// Overload of traverse for plain identifiers (used for parameter and
// return types), which are strings rather than variants, so we call
// the visitor directly instead of going through visit.
void traverse(int parent_id, const identifier_t& child) {
    int child_id = (*this)(child);
    println("node{} -> node{}", parent_id, child_id);
}

int operator()(const function_t& func) {
    int current_id = next_id++;
    println("node{}: function_t: {}", current_id, func.name);
    // Print each parameter as a "name: type" node.
    for (const auto& [param_name, param_type] : func.parameters) {
        traverse(current_id, identifier_t{param_name + ": " + param_type});
    }
    // Print the return type.
    traverse(current_id, func.return_type);
    // Finally print the body node.
    traverse(current_id, func.body);
    return current_id;
}
We see that the visitor for the function declaration looks more complicated, but it really does exactly the same thing as the other visitors. It prints the current node and then traverses each of its children. There just happen to be more kinds of children to traverse, making it look more complex.
Eventually, after we have completed our wonderful visitor, we can run it on our parsed Fibonacci function AST as we’ve seen before:
// Here fib_function is a decl_t (a variant) holding our parsed function_t.
visit(d2_printer_t{1}, fib_function);
Once we've done all of that and run it through d2, we get the following tree representation as an SVG:
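The SVG itself is generated from the D2 text the printer emits. For a tiny expression such as 1 + 2, that output would look roughly like this (node numbering depends on the starting id passed to the printer):

```
node1: binary_t: +
node2: literal_t: 1
node1 -> node2
node3: literal_t: 2
node1 -> node3
```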
For LoxMocha, I have decided to use this variant and visitor approach to represent the AST.
I have found that it makes it easy to interact with and traverse the AST, and (so far) I am happy with the approach, along with some helpers I wrote to make calling visit with extra parameters easier.
Notably, I have found that using this visitor pattern made it extremely easy to write tests for my parser.
The visitor for that is fairly simple: it takes my parsed AST as the thing to visit along with the expected AST, walks the two trees in parallel, and checks that each node matches where it should.
Footnotes

1. There are many proposals for pattern matching in C++; see the pattern matching section under experimental features on cppreference.
2. A description of LoxMocha's syntax will come in the future when the language has matured.
3. The AST types shown are simplified to focus on the concept rather than ideal code structure.