Note: this post is part of an ongoing series on extensible compilation. See Part 1: “A Coming Revolution in Metaprogramming”.
We usually call a program a compiler when it takes code of one form and translates it into code of a significantly different form, oftentimes a different language (we usually call more minor transformations “macros”). These two components are called the “source” and the “target”: the kind of thing you compile from and the kind you compile to. An important question when building a new programming language is what target you choose, i.e. what language you will compile to. Traditionally, a compiler would target platform-specific machine code/assembly (e.g. x86 or ARM). If you compile C with GCC, or if you compile OCaml, then chances are your compiler is using one of these targets.
However, there’s no a priori reason why C has to compile to x86 assembly. A C compiler just needs to translate C into something which, when run, has the semantics that the programmer expects from his source code (i.e. the interface, or language, is separated from the implementation, or compiler). At the end of the day, whatever we write has to get translated down into instructions that our processor can understand. CPUs are basically assembly interpreters, and that’s the most fundamental unit of execution that we can target. That said, it’s cumbersome to write a compiler that can take your language and turn it into x86, ARM, MIPS, and a billion other assembly languages. This is part of the inspiration for LLVM, or Low Level Virtual Machine, which is a kind of “abstract assembly” that looks vaguely like machine code but is platform-independent 1.
Hence, in the modern day, many languages both old and new have acquired compilers that target LLVM. Clang, Apple’s counterpart to GCC, compiles C/C++ to LLVM. Haskell’s GHC compiler has an LLVM backend. Rust compiles to LLVM 2. The list goes on. LLVM owes this widespread adoption to two virtues: first, because it is platform-independent, a compiler writer can target LLVM and then have his language work across platforms (e.g. on Linux, Mac, and Windows machines) as well as with other languages that use LLVM. Second, LLVM is a simpler language to target than traditional assembly languages—it has nice features like infinite registers and, more recently, support for stack frames.
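You can see this pipeline for yourself: rustc will print the LLVM IR it generates if you pass the `--emit=llvm-ir` flag. As a small illustrative example (file name assumed):

```rust
// A tiny program whose LLVM IR you can inspect with:
//   rustc --emit=llvm-ir add.rs
// rustc lowers this Rust code to platform-independent LLVM IR,
// and LLVM then generates machine code for the chosen target.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    println!("{}", add(2, 3)); // prints 5
}
```

The same source compiles unchanged on Linux, Mac, or Windows; only LLVM’s final code generation step differs per platform.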
Of course, not every language must be compiled to x86/LLVM, or be compiled at all. Most famously, Java has its own cross-platform bytecode, which is the target of the Java compiler and is interpreted by the Java Virtual Machine (JVM). The most common JVM, HotSpot, is itself written in C++, which in turn gets compiled to assembly (it’s turtles all the way down). Many other popular scripting languages like Python, JavaScript, and Ruby are executed (or interpreted) by a compiled program.
In all of these cases, choosing a target language is influenced by a number of factors. A big one is portability—to my knowledge, the JVM was developed to get Java to run across multiple platforms, and this was a driving motivation for LLVM as well. Dynamic languages like Python choose to be interpreted because it makes it easier to run code shortly after writing it and get feedback. Rust compiles to LLVM because anything higher-level wouldn’t provide sufficient control over the processor and memory.
I bring up all these languages to get at what makes a good target language for compilers. In my mind, a lot of languages compile to assembly because assembly has the fewest opinions about how your program should run. If you compile to the JVM, you already have to shoulder the runtime overhead of a garbage collector. If you compile to Python, well, your program will probably run quite slowly and you lose any static typing guarantees. This is why LLVM is a great replacement for platform-specific assemblies—its abstraction doesn’t impose any significant overhead or opinions on the programmer while providing a number of benefits. For these same reasons, I believe that future languages that want LLVM-level control should consider targeting Rust instead of LLVM.
When I say that Rust should replace LLVM, the future I envision is this: languages would compile down into Rust code, so implementing a new language would mean writing a program that parses source code and then generates Rust code. Rust in this kind of ecosystem offers five primary advantages over LLVM.
This approach is not without drawbacks. I don’t believe these necessarily outweigh the benefits, but they must be considered or addressed when building this next generation of languages.
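To make the vision concrete, a compiler targeting Rust is essentially a source-to-source translator: parse the source language into an AST, then print Rust. Here’s a minimal, hypothetical sketch (the AST and function names are illustrative, not from any real tool) that compiles a toy arithmetic language into a Rust program:

```rust
// Hypothetical sketch: a "compiler" for a toy expression language
// that targets Rust source code instead of LLVM IR.

enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Code generation: translate the AST into a Rust expression string.
fn emit_rust(e: &Expr) -> String {
    match e {
        Expr::Num(n) => n.to_string(),
        Expr::Add(l, r) => format!("({} + {})", emit_rust(l), emit_rust(r)),
        Expr::Mul(l, r) => format!("({} * {})", emit_rust(l), emit_rust(r)),
    }
}

fn main() {
    // (1 + 2) * 3, as an AST in the toy language
    let ast = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
        Box::new(Expr::Num(3)),
    );
    // Wrap the generated expression in a runnable Rust program.
    println!("fn main() {{ println!(\"{{}}\", {}); }}", emit_rust(&ast));
}
```

The emitted string would then be handed to rustc, which takes care of optimization and machine code generation, the same role LLVM plays for Clang.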
For my dream to become reality, we need a lot of work to develop Rust as a compile target, both on the Rust compiler and on compilers targeting Rust. Several members of the Rust community are working hard on Rust’s metaprogramming features, but more work needs to be done on languages that compile to Rust. To my knowledge, Lia is the first major language that actually compiles directly to Rust. Work on such languages and tooling can slowly push us towards a world where languages are no longer developed in isolation but instead all belong to a cohesive ecosystem. If you have any ideas on how to do this or if you vehemently disagree, send your diatribes to wcrichto@stanford.edu or post in the comments.
Disclaimer: “platform-independent” is an overgeneralization, as several people have not hesitated to point out to me. To quote my brother, “I will say you do seem to think that LLVM gives you auto compat with all architectures/OSes, but oh man do I have stories for you.” ↩
Note that when someone says a phrase like “C compiles to x86,” that’s technically an incorrect statement, since C doesn’t intrinsically compile to anything. C is just the language specified by the C standard; specific compilers like GCC and Clang compile C into something that satisfies that standard. For a language like Rust, it’s a little weirder, since Rust doesn’t have a formal standard but is instead informally specified by the one major Rust compiler, rustc. Hence, I just say “Rust compiles to LLVM.” ↩