Cranelift Progress in 2022
Continuing the tradition of a year-end progress report from last year, we are excited to report that this year has also been quite a productive one for the Cranelift project! Cranelift is our optimizing compiler backend that provides the foundation for Wasmtime, a production-ready WebAssembly virtual machine, the rustc_codegen_cranelift backend for the Rust compiler, and a number of other applications.
In 2022, we experienced a year of significant growth and maturation. Many of the projects that we set in motion in 2021 and planned to finish in 2022 – most significantly, a new register allocator, and ISLE, a pattern-matching DSL that has come to serve as the core of our backends – have come to fruition, entering production and improving compiler output quality, speed of compilation, and correctness. We took on significant new projects, building a mid-end optimization tier based on a novel twist on “equivalence graphs”. We added support for our fourth CPU architecture, RISC-V 64 (alongside the existing x86-64, aarch64, and s390x), and completed SIMD support on those three existing architectures, where it is now enabled by default. We continued working to improve correctness, building out our IR-level interpreter and continuous differential fuzzing against it, and collaborating with an academic effort to formally verify our instruction lowerings. We merged an incremental-compilation framework, allowing individual function compilations to be cached and reused when rebuilding a module. We re-examined core primitives, like Wasm heap accesses with Spectre mitigations, to make them more efficient and secure. We did the superficially mundane, but very important, work of cleaning up and clarifying semantics (removing bool types, endianness-correctness in vector bitcasts, and more). And we merged a long tail of compiler speedups and improvements too numerous to list here. Over the past year, the compiler has become significantly more complete, more robust, and more efficient.
We are also proud of our community: we have onboarded a number of new full-time engineers and we have continued to see volunteer contributions from across the community as well. In the last year we had 37 unique committers¹. If the body of work we describe here is impressive, it is only because of the significant number of brilliant, hard-working contributors we are grateful to have. Perhaps in 2023 you can add yourself to this list – we are always eager to help new folks learn the joys of compiler hacking!
regalloc2 in Production
The first long-running project (beginning in mid-2021) that we completed in 2022 was the integration and switchover to regalloc2, a new register allocator. The register allocator is the part of the compiler that decides where to place program values: because there are only so many registers in the CPU, and programs may have many more variables than that in use at various times, this is a difficult optimization problem. There are quite a few details about regalloc2’s design in this blog post, but in brief: it incorporates several techniques that our old register allocator did not, such as “live-range splitting” (the ability to put the same value in different places at different points in the program), and in general was designed to run faster and produce better code. This largely panned out: after switching over, compile time dropped by 10–20%, and runtime performance improved by up to 7%.
The project was long-running mostly because of compatibility and migration concerns. The regalloc2 API differed in some ways from the API and abstractions of regalloc.rs. We built some compatibility features into regalloc2, but still had a large atomic switchover PR as a result of some unavoidable cross-cutting changes. Following that, we have done a series of improvements and cleanups (1 2 3 4 5 6 7 8), gradually moving away from the compatibility-mode features to “native” regalloc2 concepts, like register constraints, until we finally achieved “pure SSA input” to the allocator and enabled SSA-checking mode. This is a milestone that will let us optimize the allocator further in the coming year.
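To make “register constraints” concrete: in an SSA-based allocator, each instruction describes its operands together with the constraints it places on them, and the allocator makes all placement decisions from that description alone. Below is a minimal, self-contained Rust sketch of the idea; the types are illustrative stand-ins, not regalloc2’s actual API.

```rust
/// A virtual register: an SSA value name, before allocation.
/// (Illustrative only; regalloc2's real types differ in detail.)
#[derive(Clone, Copy, Debug)]
struct VReg(u32);

/// A physical register chosen by the allocator.
#[derive(Clone, Copy, Debug)]
struct PReg(u8);

/// A constraint attached to one operand of one instruction.
#[derive(Clone, Copy, Debug)]
enum Constraint {
    /// Any register of the appropriate class will do.
    AnyReg,
    /// Must live in this specific register (e.g., an ABI-fixed argument).
    Fixed(PReg),
    /// Must share a register with input operand `n` (for two-address
    /// instructions like x86-64's `add dst, src`).
    ReuseInput(usize),
}

/// One operand: a virtual register plus the instruction's constraint on it.
/// The allocator consumes a list of these per instruction.
#[derive(Clone, Copy, Debug)]
struct Operand {
    vreg: VReg,
    constraint: Constraint,
}

fn main() {
    // A two-address add `v2 = v0 + v1`, where the output must reuse the
    // first input's register and the inputs can go anywhere:
    let operands = [
        Operand { vreg: VReg(0), constraint: Constraint::AnyReg },
        Operand { vreg: VReg(1), constraint: Constraint::AnyReg },
        Operand { vreg: VReg(2), constraint: Constraint::ReuseInput(0) },
    ];
    for op in &operands {
        println!("{:?}", op);
    }
}
```

Expressing requirements this way, rather than inserting moves by hand in each backend, is what lets the allocator itself decide where values live and where to split live ranges.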
We learned a lot about design-for-correctness from the development of regalloc2 and the transition to it, as detailed elsewhere; in summary, fuzzing is incredibly effective, and let us move to the new allocator with no serious issues despite the high complexity of the work.
ISLE DSL for Backends (and Mid-end!)
The next long-running project completed this year was the transition of all compiler backends to use ISLE, the instruction-selector / pattern-matching DSL that we introduced last year and have been refining since. ISLE is beneficial because it allows us to add new instruction lowerings in a simple, declarative way, and the metacompiler (which translates ISLE to Rust) generates an efficient body of code that performs pattern-matching based on the composition of all rules. We designed ISLE to allow for a gradual migration: initially, when no rules matched, the compiler backend fell back to its existing handwritten code. Over time, we translated this handwritten code to the DSL. This was a painstaking process that took hundreds of PRs (see the meta-issue above!) from a half-dozen or so people, at various points.
Was all of the effort worth it? We think so: for one, we have thought a lot about correctness, and writing our backends in a declarative way lets us reason about high-level properties like “overlap” of patterns and “unreachable” rules. These are properties that would be much harder or impossible to see in handwritten code, or to check for automatically. Second, generating the backend code from the declarative specification lets us consider systematic improvements to the code generation: this is an ongoing line of work that seems likely to optimize the pattern-matching to do measurably less work to evaluate the same patterns. Decoupling such optimizations from the delicate architecture-specific “knowledge base” of rules lets us work on both the content of the rules, and the efficiency of evaluating them, more effectively. Third, reasoning declaratively has let us build up a good library of type-safe abstractions that would have been much harder to design in verbose handwritten code: for example, `ProducesFlags` encapsulates an instruction that affects flags, and we have a series of combinators to safely work with flags producers and consumers. We have added convenience and expressivity features along the way, like `if-let` pattern clauses and implicit type conversions. Finally, there is the promise of verification: with a declarative specification of translations from our intermediate representation, or IR (which we call CLIF, the Cranelift Intermediate Format), to machine code, we can machine-check these translations. This is not just theoretical: we are actively collaborating with academics who are doing just this!
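To give a flavor of what the metacompiler does, consider a schematic lowering rule like `(rule (lower (iadd x (iconst k))) (add_imm x k))` (here `add_imm` is a hypothetical target instruction). The metacompiler merges all such rules into a single Rust matcher. The hand-written sketch below shows the rough shape of that generated code for two hypothetical `iadd` rules; the real generated code is far more elaborate.

```rust
/// A toy IR, just enough to show the shape of the generated matcher.
#[derive(Clone, Debug)]
enum Expr {
    Iconst(i64),
    Iadd(Box<Expr>, Box<Expr>),
}

/// Hypothetical machine instructions that the rules lower into.
#[derive(Debug, PartialEq)]
enum MachInst {
    AddImm(i64),
    AddRegReg,
}

/// All `iadd` rules merged into one dispatch: the opcode is examined
/// once, and the more specific pattern (register + constant) is tried
/// before the general fallback (register + register).
fn lower(e: &Expr) -> Option<MachInst> {
    match e {
        Expr::Iadd(_x, y) => {
            // Rule: (iadd x (iconst k)) => add-immediate.
            if let Expr::Iconst(k) = &**y {
                return Some(MachInst::AddImm(*k));
            }
            // Rule: (iadd x y) => register-register add.
            Some(MachInst::AddRegReg)
        }
        _ => None,
    }
}

fn main() {
    let e = Expr::Iadd(Box::new(Expr::Iconst(1)), Box::new(Expr::Iconst(41)));
    assert_eq!(lower(&e), Some(MachInst::AddImm(41)));
}
```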
The not-so-well-hidden secret of ISLE is that it is actually more general than an instruction-lowering language: it is a general way to build pattern matchers in Rust. Once we had the language, we were able to use it for other purposes too, like the “mid-end optimizer”, which translates CLIF to other CLIF. More on that below!
Mid-end Optimizer: “Acyclic E-graphs” and ISLE Rewrites
In the middle of 2022, we began to see the need for a unified optimization framework in the mid-end of the compiler. The “mid-end” is the part between the frontend and backend (of course!): it processes IR (unoptimized, from the frontend) and produces more IR (optimized, ready for the backend to lower to instructions).
Cranelift’s mid-end previously contained a few standard optimizations like GVN (global value numbering), which removes redundant computations of the same value; constant propagation/folding, which simplifies operators applied to constant values; and loop-invariant code motion, which moves computations that are the same on every iteration of a loop out of that loop.
However, as we added alias analysis, it became apparent that the pass-ordering problem would become more important. The pass-ordering problem is the question of which order to apply code transforms in. In particular, if we apply alias analysis before other optimizations, it may not recognize that some memory operations access the same address; but if we apply it after them, those other optimizations never get the chance to further simplify the code exposed when alias analysis merges or removes loads. Applying all passes multiple times across the entire function body is wasteful. Ideally, we would be able to “multiplex” or “dispatch” between optimizations at a much finer granularity.
Separately from that need, there was also a growing sense that it would be useful to perform “CLIF-to-CLIF” rewrites: certain simplifications, like algebraic identities, are not specific to compiling for a certain CPU; rather, they apply on all architectures. Applying rules like `x * 2 = x << 1` (“strength reduction”, useful when shifts are faster than multiplies) is ideally done in the mid-end. We had just developed a pattern-matching DSL, ISLE, for our backends, and were starting to wonder if we could use it for these sorts of rules as well. (We had previously explored a rewrite system like this in Peepmatic, experience from which then fed into the design of ISLE, so we had been wanting to do this for a while!)
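As a concrete illustration, here is a minimal, self-contained Rust sketch of that strength-reduction rule applied to a toy expression type. This stands in for the real mid-end rewrite, which is expressed as an ISLE rule over CLIF; the `Expr` type here is purely illustrative.

```rust
/// A minimal expression form, standing in for CLIF. Illustrative only.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Const(u64),
    Var(&'static str),
    Mul(Box<Expr>, Box<Expr>),
    Shl(Box<Expr>, Box<Expr>),
}

/// Rewrite `x * 2^k` into `x << k`: a machine-independent simplification
/// that belongs in the mid-end, since it helps on every architecture
/// where shifts are at least as cheap as multiplies.
fn strength_reduce(e: Expr) -> Expr {
    match e {
        Expr::Mul(x, y) => match *y {
            Expr::Const(c) if c.is_power_of_two() => {
                Expr::Shl(x, Box::new(Expr::Const(c.trailing_zeros() as u64)))
            }
            y => Expr::Mul(x, Box::new(y)),
        },
        other => other,
    }
}

fn main() {
    let e = Expr::Mul(Box::new(Expr::Var("x")), Box::new(Expr::Const(2)));
    // x * 2 becomes x << 1.
    assert_eq!(
        strength_reduce(e),
        Expr::Shl(Box::new(Expr::Var("x")), Box::new(Expr::Const(1)))
    );
}
```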
So, after some research, we determined that e-graphs, short for “equivalence graphs”, might serve as a good substrate for this sort of optimization. An e-graph is a program representation that allows rewrite rules to be applied while multiple representations of a value co-exist; after applying all possible rules to all forms of a value, a cost heuristic selects the best form of each value. The most appealing aspects of the e-graph were that it provided a unified framework in which to apply code rewrites, and that it had strongly-canonicalizing properties that could potentially subsume other optimizations as well.
It turns out that, in the course of making this work, we had to invent some new ways of using e-graphs: we came up with a simple way to hybridize the control-flow graph of CLIF, which holds a “side-effect skeleton”, with an e-graph of “pure” operators; we designed an algorithm called “scoped elaboration” that turns this back into a full CFG with scheduled instructions, subsuming GVN and LICM; and most importantly, we realized that a combination of acyclicity (an “acyclic e-graph”) and eager (rather than batched) application of rules allows for an efficient single-pass optimization algorithm, without any “fixpoints” (applying rules repeatedly until no more apply). The body of knowledge that came out of all this is still fairly new and we plan to document and write it up further; but it seems to work!
The path to getting this merged was a bit of a roller-coaster: the initial implementation (1, 2) translated CLIF to a separate e-graph data structure and back, and showed promising initial speedups on the development branch, but lost some of these gains when rebased to the latest `main`; so it remained off by default. Then we realized we could build the e-graph within CLIF data structures by adding one new kind of SSA `Value` node, the “union”; after implementing this and a few optimizations, we see ~16% runtime speedups on SpiderMonkey.wasm, a representative real-world workload, with compile times about the same as with the legacy optimization passes. The e-graph framework is still off by default, as we just merged it and are letting it fuzz for a while; but if all remains quiet, we’ll switch it on by default in January and then remove the legacy optimization passes it replaces.
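The key data-structure idea can be sketched in a few lines of Rust (illustrative only; Cranelift’s real representation differs in many details): a value definition is either an ordinary operator or a “union” of two equivalent forms, rewrites add unions eagerly as values are built, and extraction walks unions to pick the cheapest form. The operator names and costs below are assumptions made up for the example.

```rust
/// An index naming one value in the graph.
#[derive(Clone, Copy, Debug)]
struct Value(u32);

enum ValueDef {
    /// An ordinary pure operator applied to argument values.
    Op(&'static str, Vec<Value>),
    /// A "union": a record that two forms compute the same value.
    /// Rewrites add unions eagerly as values are built, so no fixpoint
    /// loop is needed, and the graph stays acyclic.
    Union(Value, Value),
}

struct EGraph {
    defs: Vec<ValueDef>,
}

impl EGraph {
    fn add(&mut self, def: ValueDef) -> Value {
        self.defs.push(def);
        Value(self.defs.len() as u32 - 1)
    }

    /// Extraction: pick the cheapest available form of each value,
    /// descending through unions. A real implementation memoizes; this
    /// sketch recomputes, which terminates because the graph is acyclic.
    fn cost(&self, v: Value) -> u32 {
        match &self.defs[v.0 as usize] {
            ValueDef::Op(name, args) => {
                let op = match *name {
                    "imul" => 3, // model multiplies as more expensive
                    _ => 1,
                };
                op + args.iter().map(|&a| self.cost(a)).sum::<u32>()
            }
            ValueDef::Union(a, b) => self.cost(*a).min(self.cost(*b)),
        }
    }
}

fn main() {
    let mut eg = EGraph { defs: Vec::new() };
    let x = eg.add(ValueDef::Op("param", vec![]));
    let two = eg.add(ValueDef::Op("iconst", vec![]));
    let mul = eg.add(ValueDef::Op("imul", vec![x, two]));
    // An eager rewrite notices x*2 == x<<1 and records both forms:
    let one = eg.add(ValueDef::Op("iconst", vec![]));
    let shl = eg.add(ValueDef::Op("ishl", vec![x, one]));
    let best = eg.add(ValueDef::Union(mul, shl));
    // Extraction picks the shift form (cost 3) over the multiply (cost 5).
    assert_eq!(eg.cost(best), 3);
}
```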
Instruction-Set Support
The purpose of a compiler is to generate code to run on a particular target (e.g., a CPU architecture). Going into 2022, Cranelift supported three CPU architectures: x86-64, aarch64 (aka ARM64), and s390x, covering most personal devices, servers, and mainframes too. In 2022, however, we gained support for one more: RISC-V (specifically RV64GC, the most common 64-bit variant), a new open-standard architecture designed at UC Berkeley that is becoming more common in many applications. This was a large amount of work, about 21k lines of code, contributed by an outside developer. We are extremely grateful for such contributions, and it is a strength of our open-source model that we can pool resources in this way.
We continue to improve the existing backends constantly, both in small ways – PRs to improve instruction lowerings in certain cases – and large, such as finishing SIMD support for s390x. With that, three of our four architectures (x86-64 and aarch64 as well) fully support SIMD, and thus allow Wasmtime to turn Wasm-SIMD on by default.
We also keep our eyes open for future CPU features that we might want to support, as well, and built abstractions where necessary to support them. One example is the Scalable Vector Extensions (SVE) set of instructions from Arm: these operate on vectors whose width is “dynamic” (depends on the specific processor model, for example). We thought for a while about how to support this, and settled on a design in an RFC that added “dynamically-sized types” to CLIF. These are vector types whose widths are given by some expression. We don’t yet have support for SVE based on this, but the foundation is now there.
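A rough sketch of what “a vector type whose width is given by some expression” means, in Rust for illustration: the lane count of a dynamic type is a base shape scaled by a factor known only at runtime. The query function and the 128-bit baseline below are assumptions for the example, not CLIF’s actual design, which expresses the scale as an IR-level expression.

```rust
/// Illustrative vector types: a static SIMD shape vs. a dynamic one.
#[derive(Clone, Copy, Debug)]
enum VectorType {
    /// A conventional SIMD type: lane width in bits and a static lane count.
    Fixed { lane_bits: u32, lanes: u32 },
    /// A dynamic type: `base_lanes` is multiplied by a per-target scale.
    Dynamic { lane_bits: u32, base_lanes: u32 },
}

/// Stand-in for querying the hardware vector width in bits; on SVE this
/// varies by processor model. Assume a 256-bit implementation here.
fn hardware_vector_bits() -> u32 {
    256
}

fn total_bits(ty: VectorType) -> u32 {
    match ty {
        VectorType::Fixed { lane_bits, lanes } => lane_bits * lanes,
        VectorType::Dynamic { lane_bits, base_lanes } => {
            // Scale relative to a 128-bit baseline vector register.
            let scale = hardware_vector_bits() / 128;
            lane_bits * base_lanes * scale
        }
    }
}

fn main() {
    let i32x4 = VectorType::Fixed { lane_bits: 32, lanes: 4 };
    let i32x4xn = VectorType::Dynamic { lane_bits: 32, base_lanes: 4 };
    // On this hypothetical CPU: 128 static bits vs. 256 dynamic bits.
    println!("{} vs {}", total_bits(i32x4), total_bits(i32x4xn));
}
```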
Correctness Efforts
In the Cranelift project we put a special emphasis on correctness through a diverse array of tools and techniques. In 2022 we put significant time into two approaches: formal verification, and new kinds of fuzzing.
We designed ISLE in 2021 with a view toward formal verification: by providing a declarative list of all lowerings from IR to machine code, we make it easier to check these lowerings against known semantics of the input (IR) and output (ISA). We began a collaboration in 2022 with a group of academic researchers who are actively working on tooling that lowers ISLE rules to a form that can be verified (or can produce a counterexample if not) via an SMT solver. As part of this effort, we have planned how we will upstream their work and maintain the necessary semantics annotations in the ISLE source, and we are exploring ways to incorporate verification into our everyday compiler-development workflow.
Alongside that effort, we have had a significant flow of contributions to the Cranelift interpreter, which has a separate and machine-independent definition of the execution semantics of our IR. This is useful for several reasons, but primarily as a comparison point against Cranelift’s backends that compile to machine code. By defining a differential fuzz target that compares the interpreter’s output to compiled code’s output, we have been able to find bugs (in both the interpreter and the machine backends!). This work also complements the verification work above: having a simple executable semantics of our IR, written in Rust, is invaluable as one source of truth for the formal spec.
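The overall shape of such a differential target is simple; here is a sketch with hypothetical stand-in functions, not Wasmtime’s actual fuzzing harness:

```rust
/// The result of executing a function: returned values, or a trap.
#[derive(Debug, PartialEq)]
enum Outcome {
    Values(Vec<u64>),
    Trap,
}

/// Stand-in: derive a random-but-valid function from raw fuzz input.
fn arbitrary_function(data: &[u8]) -> Option<Vec<u8>> {
    if data.is_empty() { None } else { Some(data.to_vec()) }
}

/// Stand-in: execute via the IR interpreter.
fn interpret(func: &[u8], args: &[u64]) -> Outcome {
    Outcome::Values(vec![func.len() as u64 + args.len() as u64])
}

/// Stand-in: compile with a backend and run the machine code.
fn compile_and_run(func: &[u8], args: &[u64]) -> Outcome {
    Outcome::Values(vec![func.len() as u64 + args.len() as u64])
}

/// The fuzzer calls this for every input; any divergence between the
/// two executions is a bug in the interpreter or in a machine backend.
fn fuzz_one(data: &[u8]) {
    let Some(func) = arbitrary_function(data) else { return };
    let args = [1u64, 2, 3];
    assert_eq!(interpret(&func, &args), compile_and_run(&func, &args));
}

fn main() {
    fuzz_one(b"example fuzz input");
}
```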
Improved Security
We have adopted a number of techniques in Cranelift to improve security. Post-2018, compilers and runtimes are largely concerned with speculative side-channel security due to the Spectre family of security vulnerabilities. Cranelift has had basic Spectre mitigations built in for bounded heap accesses (such as for WebAssembly) for several years now, but this past year we took another look and added Spectre mitigations for other bounds-check cases (e.g., branch tables), ensured we are using the processor vendor-recommended barrier instructions, and ensured that we were accurately modeling our guard regions. These are all deep dives into very subtle details, but the broad point is that we continue to actively think about our mitigations and ensure they remain effective.
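The core trick behind the bounds-check mitigation can be sketched in a few lines of Rust (illustrative only; Cranelift emits the equivalent machine code directly, typically with a conditional-move-style instruction): the access is guarded not only by a branch but also by a data dependency, so even a mispredicted branch cannot speculatively form an attacker-controlled out-of-bounds address.

```rust
/// Sketch of a Spectre-mitigated bounds-checked load. The mask creates
/// a data dependency on the comparison, so the speculative path sees a
/// clamped index rather than the attacker-controlled one.
fn load_u8(heap: &[u8], index: usize) -> Option<u8> {
    let in_bounds = index < heap.len();
    // All-ones mask if in bounds, zero otherwise -- no branch involved.
    let mask = (in_bounds as usize).wrapping_neg();
    let safe_index = index & mask;
    if in_bounds {
        Some(heap[safe_index])
    } else {
        None
    }
}

fn main() {
    let heap = [10u8, 20, 30];
    assert_eq!(load_u8(&heap, 1), Some(20));
    assert_eq!(load_u8(&heap, 99), None);
}
```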
We also work to adopt other defense-in-depth technologies when available: we discussed and merged an RFC on control-flow integrity and subsequently merged support for using pointer authentication and branch-target identification (BTI) on AArch64 (1, 2, 3) and turned on these features by default on macOS, where they are widely deployed. Similar work has been discussed for the analogous Intel features and will likely be done in the future.
Compiler Speed
The speed with which Cranelift generates code is an important metric that we track and work to improve: the project aims to be a viable option for JIT-compilation applications, so compilation must be as fast and predictable as possible.
In 2022, we merged a project that has a huge impact on compile times in the right scenarios: incremental compilation. The basic idea is to cache the result of compiling individual functions, keyed on a hash of the IR. This way, when the compiler input only changes slightly – which is a common occurrence when developing or debugging a program – most of the compilation can reuse cached results. The actual design is much more subtle and interesting: we split the IR into two parts, a “stencil” and “parameters”, such that compilation only depends on the stencil (and this is enforced at the type level in the compiler). The cache records the stencil-to-machine-code compilation. The parameters can be applied to the machine code as “fixups”, and if they change, they do not spoil the cache. We put things like function-reference relocations and debug source locations in the parameters, because these frequently change in a global but superficial way (i.e., a mass renumbering) when modifying a compiler input. We devised a way to fuzz this framework for correctness by mutating a function and comparing incremental to from-scratch compilation, and so far have not found any miscompilation bugs.
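Here is a minimal sketch of the caching scheme, with illustrative types rather than Cranelift’s actual ones: the cache key depends only on the stencil, and parameters are applied afterwards as cheap fixups.

```rust
use std::collections::HashMap;

/// The parts of the IR that affect generated code; compilation depends
/// only on this, so it is the cache key.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Stencil {
    ir_bytes: Vec<u8>,
}

/// Things like function-reference relocations and source locations:
/// frequently renumbered en masse, but never changing the code shape.
struct Parameters {
    reloc_targets: Vec<u32>,
}

#[derive(Clone)]
struct MachineCode(Vec<u8>);

/// Stand-in for the expensive part of compilation.
fn compile(stencil: &Stencil) -> MachineCode {
    MachineCode(stencil.ir_bytes.clone())
}

/// Stand-in for patching parameters into cached code: cheap, per-use.
fn apply_fixups(mut code: MachineCode, params: &Parameters) -> MachineCode {
    code.0.extend(params.reloc_targets.iter().map(|t| *t as u8));
    code
}

struct Cache(HashMap<Stencil, MachineCode>);

impl Cache {
    fn compile_cached(&mut self, stencil: Stencil, params: &Parameters) -> MachineCode {
        // Expensive compilation happens at most once per stencil.
        let code = self.0.entry(stencil).or_insert_with_key(compile).clone();
        apply_fixups(code, params)
    }
}

fn main() {
    let mut cache = Cache(HashMap::new());
    let stencil = Stencil { ir_bytes: vec![1, 2, 3] };
    let _first = cache.compile_cached(stencil.clone(), &Parameters { reloc_targets: vec![7] });
    // Same stencil, different parameters: hits the cache, re-applies fixups.
    let _second = cache.compile_cached(stencil, &Parameters { reloc_targets: vec![9] });
}
```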
We also worked specifically on Cranelift compile time as a program optimization problem – reducing data structure sizes, eliminating redundant work, avoiding or reusing memory allocations, and the like – in a large number of PRs (a sampling: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18). Many of these changes resulted in compile-time reductions of a few percent or less; but these little efficiencies add up over time, and we will continue to look for them when we can.
While not strictly Cranelift-related, it’s worth noting here that we also began a new subproject in 2022: a baseline compiler called Winch (RFC, initial skeleton), short for “WebAssembly Intentionally-Non-optimizing Compiler and Host”. Winch is meant to live alongside Cranelift as another compilation tier in Wasmtime. It reuses Cranelift’s “assembler” layer to generate machine code, but does a straightforward single-pass translation from Wasm input to machine instructions. The baseline compiler sprang out of compile-time discussions in which we realized that, fundamentally, Cranelift’s optimizing design (with a heavyweight register allocator and a “real” SSA IR) is too slow for some applications; so we’re building an alternative. The two compilers will be complementary, and perhaps someday we could dynamically transition from one to the other (“tier up”) as well, though we haven’t planned this yet.
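To show what “straightforward single-pass translation” means in practice, here is a toy Rust sketch; all names are illustrative, not Winch’s actual API. The compiler walks the Wasm operators once and emits machine code immediately, with no IR construction and no global register allocation.

```rust
/// A tiny slice of the Wasm operator set, for illustration.
enum WasmOp {
    I32Const(i32),
    I32Add,
    End,
}

/// Stand-in for an assembler layer that appends encoded instructions.
struct Asm {
    bytes: Vec<u8>,
}

impl Asm {
    fn push_const(&mut self, v: i32) {
        // Pretend encoding: push an immediate onto the operand stack.
        self.bytes.extend_from_slice(&v.to_le_bytes());
    }
    fn pop2_add_push(&mut self) {
        // Pretend opcode for "pop two values, add, push the result".
        self.bytes.push(0x01);
    }
}

/// One pass over the operators, emitting code as we go.
fn compile_baseline(ops: &[WasmOp]) -> Vec<u8> {
    let mut asm = Asm { bytes: Vec::new() };
    for op in ops {
        match op {
            WasmOp::I32Const(v) => asm.push_const(*v),
            WasmOp::I32Add => asm.pop2_add_push(),
            WasmOp::End => {}
        }
    }
    asm.bytes
}

fn main() {
    let ops = [WasmOp::I32Const(1), WasmOp::I32Const(2), WasmOp::I32Add, WasmOp::End];
    println!("{} bytes emitted", compile_baseline(&ops).len());
}
```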
Long Tail of Small but Consequential Projects
Finally, we have worked to pay down technical debt, simplify Cranelift’s design and implementation, and remove or fix abstractions that no longer make sense. The project has seen a lot of new code over the past few years, and we have learned a lot; as a result, like most maturing software projects, we have no shortage of “cleanup” work to do.
For example, we cleaned up the semantics of CLIF, our IR, considerably: we removed bool types (they were odd and restricted in many ways; just use integers at the IR level!), clarified endianness-correctness in bitcasts (1 2) (s390x, as a big-endian architecture, is excellent at highlighting these issues and ensuring we get them right!), removed a number of non-orthogonal or obsolete opcodes/instructions (1 2 3), and are just now completing work to remove “flags” values (which, like bools, have special restrictions and an unclear bit-level representation, and don’t apply to all architectures). This sort of work is not glamorous, but ensuring that we have a well-defined IR with clear semantics pays off significantly in many other places in the project.
We also performed “deep dives” on a number of topics: on compiler performance, as highlighted above, but also on fuzzing effectiveness, on benchmarking infrastructure, and more. In general, a theme of 2022 was a steady stream of incremental PRs, each improving something a little. These add up, and this sort of work is what makes a mature software project mature. We are happy to see this kind of progress, and happy that we are at the point where we can “fine-tune” many things; we hope to see more such work in the future, as many opportunities remain!
Future Plans
Finally, we have many plans for 2023! We haven’t yet put together a roadmap for next year, but that is planned. (See e.g. the 2022 roadmap for reference, most of which we actually managed to achieve this year, surprisingly!) We believe there is more work to be done still on compiler performance and on quality of generated code. Also, there is more work to do to keep up with evolving use-cases: Wasm features like tail-calls and exception handling (both of which need compiler support), for example, and extending Cranelift as needed in general to open new use-cases. Finally, there is always more potential to innovate in security and testing methodologies. We hope that the project will continue its healthy progress in all of these areas and more in 2023. See you next year!
Thanks to Nick Fitzgerald for reading and providing feedback on this post.
¹ Measured by taking author lines from `git log --since="Jan 1 2022 00:00:00 UTC" cranelift/` on commit 0456c1d2, and deduplicating based on the name before the email address on each commit.