A Function Inliner for Wasmtime and Cranelift

Function inlining is one of the most important compiler optimizations, not because of its direct effects, but because of the follow-up optimizations it unlocks. It may reveal, for example, that an otherwise-unknown function parameter value is bound to a constant argument, which makes a conditional branch unconditional, which in turn exposes that the function will always return the same value. Inlining is the catalyst of modern compiler optimization.

Note: This is cross-posted from my personal blog.

Wasmtime is a WebAssembly runtime that focuses on safety and fast Wasm execution. But despite that focus on speed, Wasmtime has historically chosen not to perform inlining in its optimizing compiler backend, Cranelift. There were two reasons for this surprising decision: first, Cranelift is a per-function compiler designed such that Wasmtime can compile all of a Wasm module’s functions in parallel. Inlining is inter-procedural and requires synchronization between function compilations; that synchronization reduces parallelism. Second, Wasm modules are generally produced by an optimizing toolchain, like LLVM, that already did all the beneficial inlining. Any calls remaining in the module will not benefit from inlining — perhaps they are on slow paths marked [[unlikely]] or the callee is annotated with #[inline(never)]. But WebAssembly’s component model changes this calculus.

With the component model, developers can compose multiple Wasm modules — each produced by different toolchains — into a single program. Those toolchains only had a local view of the call graph, limited to their own module, and they couldn’t see cross-module or fused adapter function definitions. None of them, therefore, had an opportunity to inline calls to such functions. Only the Wasm runtime’s compiler, which has the final, complete call graph and function definitions in hand, has that opportunity.

Therefore we implemented function inlining in Wasmtime and Cranelift. Its initial implementation landed in Wasmtime version 36, however, it remains off-by-default and is still baking. You can test it out via the -C inlining=y command-line flag or the wasmtime::Config::compiler_inlining method. The rest of this article describes function inlining in more detail, digs into the guts of our implementation and rationale for its design choices, and finally looks at some early performance results.

Function Inlining

Function inlining is a compiler optimization where a call to a function f is replaced by a copy of f’s body. This removes function call overheads (spilling caller-save registers, setting up the call frame, etc…) which can be beneficial on its own. But inlining’s main benefits are indirect: it enables subsequent optimization of f’s body in the context of the call site. That context is important — a parameter’s previously unknown value might be bound to a constant argument and exposing that to the optimizer might cascade into a large code clean up.

Consider the following example, where function g calls function f:

fn f(x: u32) -> bool {
    return x < u32::MAX / 2;
}

fn g() -> u32 {
    let a = 42;
    if f(a) {
        return a;
    } else {
        return 0;
    }
}

After inlining the call to f, function g looks something like this:

fn g() -> u32 {
    let a = 42;

    let x = a;
    let f_result = x < u32::MAX / 2;

    if f_result {
        return a;
    } else {
        return 0;
    }
}

Now the whole subexpression that defines f_result only depends on constant values, so the optimizer can replace that subexpression with its known value:

fn g() -> u32 {
    let a = 42;

    let f_result = true;
    if f_result {
        return a;
    } else {
        return 0;
    }
}

This reveals that the if-else conditional will, in fact, unconditionally transfer control to the consequent, and g can be simplified into the following:

fn g() -> u32 {
    let a = 42;
    return a;
}

In isolation, inlining f was a marginal transformation. When considered holistically, however, it unlocked a plethora of subsequent simplifications that ultimately led to g returning a constant value rather than computing anything at run-time.

Implementation

Cranelift’s unit of compilation is a single function, which Wasmtime leverages to compile each function in a Wasm module in parallel, speeding up compile times on multi-core systems. But inlining a function at a particular call site requires that function’s definition, which implies parallelism-hurting synchronization or some other compromise, like additional read-only copies of function bodies. So this was the first goal of our implementation: to preserve as much parallelism as possible.

Additionally, although Cranelift is primarily developed for Wasmtime by Wasmtime’s developers, it is independent from Wasmtime. It is a reusable library and is reused, for example, by the Rust project as an alternative backend for rustc. But a large part of inlining, in practice, are the heuristics for deciding when inlining a call is likely beneficial, and those heuristics can be domain specific. Wasmtime generally wants to leave most calls out-of-line, inlining only cross-module calls, while rustc wants something much more aggressive to boil away its Iterator combinators and the like. So our second implementation goal was to separate how we inline a function call from the decision of whether to inline that call.

These goals led us to a layered design where Cranelift has an optional inlining pass, but the Cranelift embedder (e.g. Wasmtime) must provide a callback to it. The inlining pass invokes the callback for each call site, the callback returns a command of either “leave the call as-is” or “here is a function body, replace the call with it”. Cranelift is responsible for the inlining transformation and the embedder is responsible for deciding whether to inline a function call and, if so, getting that function’s body (along with whatever synchronization that requires).

The mechanics of the inlining transformation — wiring arguments to parameters, renaming values, and copying instructions and basic blocks into the caller — are, well, mechanical. Cranelift makes extensive uses of arenas for various entities in its IR, and we begin by appending the callee’s arenas to the caller’s arenas, renaming entity references from the callee’s arena indices to their new indices in the caller’s arenas as we do so. Next we copy the callee’s block layout into the caller and replace the original call instruction with a jump to the caller’s inlined version of the callee’s entry block. Cranelift uses block parameters, rather than phi nodes, so the call arguments simply become jump arguments. Finally, we translate each instruction from the callee into the caller. This is done via a pre-order traversal to ensure that we process value definitions before value uses, simplifying instruction operand rewriting. The changes to Wasmtime’s compilation orchestration are more interesting.

The following pseudocode describes Wasmtime’s compilation orchestration before Cranelift gained an inlining pass and also when inlining is disabled:

// Compile each function in parallel.
let objects = parallel map for func in wasm.functions {
    compile(func)
};

// Combine the functions into one region of executable memory, resolving
// relocations by mapping function references to PC-relative offsets.
return link(objects)

The naive way to update that process to use Cranelift’s inlining pass might look something like this:

// Optionally perform some pre-inlining optimizations in parallel.
parallel for func in wasm.functions {
    pre_optimize(func);
}

// Do inlining sequentially.
for func in wasm.functions {
    func.inline(|f| if should_inline(f) {
        Some(wasm.functions[f])
    } else {
        None
    })
}

// And then proceed as before.
let objects = parallel map for func in wasm.functions {
    compile(func)
};
return link(objects)

Inlining is performed sequentially, rather than in parallel, which is a bummer. But if we tried to make that loop parallel by logically running each function’s inlining pass in its own thread, then a callee function we are inlining might or might not have had its transitive function calls inlined already depending on the whims of the scheduler. That leads to non-deterministic output, and our compilation must be deterministic, so it’s a non-starter.¹ But whether a function has already had transitive inlining done or not leads to another problem.

With this naive approach, we are either limited to one layer of inlining or else potentially duplicating inlining effort, repeatedly inlining e into f each time we inline f into g, h, and i. This is because f may come before or after g in our wasm.functions list. We would prefer it if f already contained e and was already optimized accordingly, so that every caller of f didn’t have to redo that same work when inlining calls to f.

This suggests we should topologically sort our functions based on their call graph, so that we inline in a bottom-up manner, from leaf functions (those that do not call any others) towards root functions (those that are not called by any others, typically main and other top-level exported functions). Given a topological sort, we know that whenever we are inlining f into g either (a) f has already had its own inlining done or (b) f and g participate in a cycle. Case (a) is ideal: we aren’t repeating any work because it’s already been done. Case (b), when we find cycles, means that f and g are mutually recursive. We cannot fully inline recursive calls in general (just as you cannot fully unroll a loop in general) so we will simply avoid inlining these calls.² So topological sort avoids repeating work, but our inlining phase is still sequential.

At the heart of our proposed topological sort is a call graph traversal that visits callees before callers. To parallelize inlining, you could imagine that, while traversing the call graph, we track how many still-uninlined callees each caller function has. Then we batch all functions whose associated counts are currently zero (i.e. they aren’t waiting on anything else to be inlined first) into a layer and process them in parallel. Next, we decrement each of their callers’ counts and collect the next layer of ready-to-go functions, continuing until all functions have been processed.

let call_graph = CallGraph::new(wasm.functions);

let counts = { f: call_graph.num_callees_of(f) for f in wasm.functions };

let layer = [ f for f in wasm.functions if counts[f] == 0 ];
while layer is not empty {
    parallel for func in layer {
        func.inline(...);
    }

    let next_layer = [];
    for func in layer {
        for caller in call_graph.callers_of(func) {
            counts[caller] -= 1;
            if counts[caller] == 0 {
                next_layer.push(caller)
            }
        }
    }
    layer = next_layer;
}

This algorithm will leverage available parallelism, and it avoids repeating work via the same dependency-based scheduling that topological sorting did, but it has a flaw. It will not terminate when it encounters recursion cycles in the call graph. If function f calls function g which also calls f, for example, then it will not schedule either of them into a layer because they are both waiting for the other to be processed first. One way we can avoid this problem is by avoiding cycles.

If you partition a graph’s nodes into disjoint sets, where each set contains every node reachable from every other node in that set, you get that graph’s strongly-connected components (SCCs). If a node does not participate in a cycle, then it will be in its own singleton SCC. The members of a cycle, on the other hand, will all be grouped into the same SCC, since those nodes are all reachable from each other.

In the following example, the dotted boxes designate the graph’s SCCs:

Ignoring edges between nodes within the same SCC, and only considering edges across SCCs, gives us the graph’s condensation. The condensation is always acyclic, because the original graph’s cycles are “hidden” within the SCCs.

Here is the condensation of the previous example:

We can adapt our parallel-inlining algorithm to operate on strongly-connected components, and now it will correctly terminate because we’ve removed all cycles. First, we find the call graph’s SCCs and create the reverse (or transpose) condensation, where an edge a→b is flipped to b→a. We do this because we will query this graph to find the callers of a given function f, not the functions that f calls. I am not aware of an existing name for the reverse condensation, so, at Chris Fallin’s brilliant suggestion, I have decided to call it an evaporation. From there, the algorithm largely remains as it was before, although we keep track of counts and layers by SCC rather than by function.

let call_graph = CallGraph::new(wasm.functions);
let components = StronglyConnectedComponents::new(call_graph);
let evaoporation = Evaporation::new(components);

let counts = { c: evaporation.num_callees_of(c) for c in components };

let layer = [ c for c in components if counts[c] == 0 ];
while layer is not empty {
    parallel for func in scc in layer {
        func.inline(...);
    }

    let next_layer = [];
    for scc in layer {
        for caller_scc in evaporation.callers_of(scc) {
            counts[caller_scc] -= 1;
            if counts[caller_scc] == 0 {
                next_layer.push(caller_scc);
            }
        }
    }
    layer = next_layer;
}

This is the algorithm we use in Wasmtime, modulo minor tweaks here and there to engineer some data structures and combine some loops. After parallel inlining, the rest of the compiler pipeline continues in parallel for each function, yielding unlinked machine code. Finally, we link all that together and resolve relocations, same as we did previously.

Heuristics are the only implementation detail left to discuss, but there isn’t much to say that hasn’t already been said. Wasmtime prefers not to inline calls within the same Wasm module, while cross-module calls are a strong hint that we should consider inlining. Beyond that, our heuristics are extremely naive at the moment, and only consider the code sizes of the caller and callee functions. There is a lot of room for improvement here, and we intend to make those improvements on-demand as people start playing with the inliner. For example, there are many things we don’t consider in our heuristics today, but possibly should:

Hints from WebAssembly’s compilation-hints proposal
The number of edges to a callee function in the call graph
Whether any of a call’s arguments are constants
Whether the call is inside a loop or a block marked as “cold”
Etc…

Some Initial Results

The speed up you get (or don’t get) from enabling inlining is going to vary from program to program. Here are a couple synthetic benchmarks.

First, let’s investigate the simplest case possible, a cross-module call of an empty function in a loop:

(component
  ;; Define one module, exporting an empty function `f`.
  (core module $M
    (func (export "f")
      nop
    )
  )

  ;; Define another module, importing `f`, and exporting a function
  ;; that calls `f` in a loop.
  (core module $N
    (import "m" "f" (func $f))
    (func (export "g") (param $counter i32)
      (loop $loop
        ;; When counter is zero, return.
        (if (i32.eq (local.get $counter) (i32.const 0))
          (then (return)))
        ;; Do our cross-module call.
        (call $f)
        ;; Decrement the counter and continue to the next iteration
        ;; of the loop.
        (local.set $counter (i32.sub (local.get $counter)
                                     (i32.const 1)))
        (br $loop))
    )
  )

  ;; Instantiate and link our modules.
  (core instance $m (instantiate $M))
  (core instance $n (instantiate $N (with "m" (instance $m))))

  ;; Lift and export the looping function.
  (func (export "g") (param "n" u32)
    (canon lift (core func $n "g"))
  )
)

We can inspect the machine code that this compiles down to via the wasmtime compile and wasmtime objdump commands. Let’s focus only on the looping function. Without inlining, we see a loop around a call, as we would expect:

00000020 wasm[1]::function[1]:
        ;; Function prologue.
        20: pushq   %rbp
        21: movq    %rsp, %rbp

        ;; Check for stack overflow.
        24: movq    8(%rdi), %r10
        28: movq    0x10(%r10), %r10
        2c: addq    $0x30, %r10
        30: cmpq    %rsp, %r10
        33: ja      0x89

        ;; Allocate this function's stack frame, save callee-save
        ;; registers, and shuffle some registers.
        39: subq    $0x20, %rsp
        3d: movq    %rbx, (%rsp)
        41: movq    %r14, 8(%rsp)
        46: movq    %r15, 0x10(%rsp)
        4b: movq    0x40(%rdi), %rbx
        4f: movq    %rdi, %r15
        52: movq    %rdx, %r14

        ;; Begin loop.
        ;;
        ;; Test our counter for zero and break out if so.
        55: testl   %r14d, %r14d
        58: je      0x72
        ;; Do our cross-module call.
        5e: movq    %r15, %rsi
        61: movq    %rbx, %rdi
        64: callq   0
        ;; Decrement our counter.
        69: subl    $1, %r14d
        ;; Continue to the next iteration of the loop.
        6d: jmp     0x55

        ;; Function epilogue: restore callee-save registers and
        ;; deallocate this functions's stack frame.
        72: movq    (%rsp), %rbx
        76: movq    8(%rsp), %r14
        7b: movq    0x10(%rsp), %r15
        80: addq    $0x20, %rsp
        84: movq    %rbp, %rsp
        87: popq    %rbp
        88: retq

        ;; Out-of-line traps.
        89: ud2
            ╰─╼ trap: StackOverflow

When we enable inlining, then M::f gets inlined into N::g. Despite N::g becoming a leaf function, we will still push %rbp and all that in the prologue and pop it in the epilogue, because Wasmtime always enables frame pointers. But because it no longer needs to shuffle values into ABI argument registers or allocate any stack space, it doesn’t need to do any explicit stack checks, and nearly all the rest of the code also goes away. All that is left is a loop decrementing a counter to zero:³

00000020 wasm[1]::function[1]:
        ;; Function prologue.
        20: pushq   %rbp
        21: movq    %rsp, %rbp

        ;; Loop.
        24: testl   %edx, %edx
        26: je      0x34
        2c: subl    $1, %edx
        2f: jmp     0x24

        ;; Function epilogue.
        34: movq    %rbp, %rsp
        37: popq    %rbp
        38: retq

With this simplest of examples, we can just count the difference in number of instructions in each loop body:

12 without inlining (7 in N::g and 5 in M::f which are 2 to push the frame pointer, 2 to pop it, and 1 to return)
4 with inlining

But we might as well verify that the inlined version really is faster via some quick-and-dirty benchmarking with hyperfine. This won’t measure only Wasm execution time, it also measures spawning a whole Wasmtime process, loading code from disk, etc…, but it will work for our purposes if we crank up the number of iterations:

$ hyperfine \
    "wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm" \
    "wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm"

Benchmark 1: wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm
  Time (mean ± σ):     138.2 ms ±   9.6 ms    [User: 132.7 ms, System: 6.7 ms]
  Range (min … max):   128.7 ms … 167.7 ms    19 runs

Benchmark 2: wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm
  Time (mean ± σ):      37.5 ms ±   1.1 ms    [User: 33.0 ms, System: 5.8 ms]
  Range (min … max):    35.7 ms …  40.8 ms    77 runs

Summary
  'wasmtime run --allow-precompiled -Cinlining=y --invoke 'g(100000000)' yes-inline.cwasm' ran
    3.69 ± 0.28 times faster than 'wasmtime run --allow-precompiled -Cinlining=n --invoke 'g(100000000)' no-inline.cwasm'

Okay so if we measure Wasm doing almost nothing but empty function calls and then we measure again after removing function call overhead, we get a big speed up — it would be disappointing if we didn’t! But maybe we can benchmark something a tiny bit more realistic.

A program that we commonly reach for when benchmarking is a small wrapper around the pulldown-cmark markdown library that parses the CommonMark specification (which is itself written in markdown) and renders that to HTML. This is Real World™ code operating on Real World™ inputs that matches Real World™ use cases people have for Wasm. That is, good benchmarking is incredibly difficult, but this program is nonetheless a pretty good candidate for inclusion in our corpus. There’s just one hiccup: in order for our inliner to activate normally, we need a program using components and making cross-module calls, and this program doesn’t do that. But we don’t have a good corpus of such benchmarks yet because this kind of component composition is still relatively new, so let’s keep using our pulldown-cmark program but measure our inliner’s effects via a more circuitous route.

Wasmtime has tunables to enable the inlining of intra-module calls⁴ and rustc and LLVM have tunables for disabling inlining⁵. Therefore we can roughly estimate the speed ups our inliner might unlock on a similar, but extensively componentized and cross-module calling, program by:

Disabling inlining when compiling the Rust source code to Wasm
Compiling the resulting Wasm binary to native code with Wasmtime twice: once with inlining disabled, and once with intra-module call inlining enabled
Comparing those two different compilations’ execution speeds

Running this experiment with Sightglass, our internal benchmarking infrastructure and tooling, yields the following results:

execution :: instructions-retired :: pulldown-cmark.wasm

  Δ = 7329995.35 ± 2.47 (confidence = 99%)

  with-inlining is 1.26x to 1.26x faster than without-inlining!

  [35729153 35729164.72 35729173] without-inlining
  [28399156 28399169.37 28399179] with-inlining

Conclusion

Wasmtime and Cranelift now have a function inliner! Test it out via the -C inlining=y command-line flag or via the wasmtime::Config::compiler_inlining method. Let us know if you run into any bugs or whether you see any speed-ups when running Wasm components containing multiple core modules.

Thanks to Chris Fallin and Graydon Hoare for reading early drafts of this piece and providing valuable feedback. Any errors that remain are my own.

Deterministic compilation gives a number of benefits: testing is easier, debugging is easier, builds can be byte-for-byte reproducible, it is well-behaved in the face of incremental compilation and fine-grained caching, etc… ↩
For what it is worth, this still allows collapsing chains of mutually-recursive calls (a calls b calls c calls a) into a single, self-recursive call (abc calls abc). Our actual implementation does not do this in practice, preferring additional parallelism instead, but it could in theory. ↩
Cranelift cannot currently remove loops without side effects, and generally doesn’t mess with control-flow at all in its mid-end. We’ve had various discussions about how we might best fit control-flow-y optimizations into Cranelift’s mid-end architecture over the years, but it also isn’t something that we’ve seen would be very beneficial for actual, Real World™ Wasm programs, given that (a) LLVM has already done much of this kind of thing when producing the Wasm, and (b) we do some branch-folding when lowering from our mid-level IR to our machine-specific IR. Maybe we will revisit this sometime in the future if it crops up more often after inlining. ↩
-C cranelift-wasmtime-inlining-intra-module=yes ↩
-Cllvm-args=--inline-threshold=0, -Cllvm-args=--inlinehint-threshold=0, and -Zinline-mir=no ↩