
Rust: The New LLVM

Will Crichton   —   July 23, 2016
New programming languages with a system-level compile target should choose Rust over LLVM. Targeting Rust can give new languages free package management, a type system, and memory safety while not imposing too many opinions on the language's runtime. With more work on languages, tooling, and Rust compiler development, we can create an ecosystem of beautifully interoperable programming languages.

Note: this post is part of an ongoing series on extensible compilation. See Part 1: “A Coming Revolution in Metaprogramming”.

Introduction

We usually call a program a compiler when it takes code of one form and translates it into code of a significantly different form, often a different language (more minor transformations are usually called “macros”). These two forms are the “source” and the “target”: the languages you compile from and to. An important question when building a new programming language is which target you will compile to. Traditionally, a compiler would target platform-specific machine code or assembly (e.g. x86 or ARM). If you compile C with GCC or if you compile OCaml, then chances are your compiler is using one of these targets.

However, there’s no a priori reason why C has to compile to x86 assembly. A C compiler just needs to translate C into something which, when run, has the semantics that the programmer expects from his source code (i.e. the interface, or language, is separated from the implementation, or compiler). At the end of the day, whatever we write has to get translated down into instructions that our processor can understand. CPUs are basically assembly interpreters, and that’s the most fundamental unit of execution that we can target. That said, it’s cumbersome to write a compiler that can take your language and turn it into x86, ARM, MIPS, and a billion other assembly languages. This is part of the inspiration for LLVM, or Low Level Virtual Machine, which is a kind of “abstract assembly” that looks vaguely like machine code but is platform-independent 1.

Hence, in the modern day, many languages both old and new have started to acquire compilers that target LLVM. Clang, a C/C++ compiler, is Apple’s counterpart to GCC that targets LLVM. Haskell’s GHC compiler can use LLVM as a backend. Rust compiles to LLVM 2. The list goes on. LLVM owes such widespread adoption to two virtues: first, because it is platform-independent, a compiler writer can target LLVM and then have his language work across platforms (e.g. on Linux, Mac, and Windows machines) as well as with other languages that use LLVM. Second, LLVM is a simpler language to target than traditional assembly languages: it has nice features like an unlimited supply of virtual registers and higher-level abstractions for functions and stack frames.

Of course, not every language must be compiled to x86/LLVM, or be compiled at all. Most famously, Java compiles to its own cross-platform bytecode, which is then executed by a bytecode interpreter called the Java Virtual Machine (JVM). The most common JVM, HotSpot, is itself written in C++, which in turn gets compiled to assembly (it’s turtles all the way down). Many other popular scripting languages like Python, JavaScript, Ruby, etc. are executed (or interpreted) by a compiled program.

In all of these cases, choosing a target language is influenced by a number of factors. A big one is portability and interoperability: to my knowledge, the JVM was developed to get Java to run across multiple platforms, and this was a driving motivation for LLVM as well. Dynamic languages like Python are interpreted because that makes it easier to run code shortly after writing it and get quick feedback. Rust compiles to LLVM because anything higher-level wouldn’t provide sufficient control over the processor and memory.

I bring up all these languages to get at what makes a good target language for compilers. In my mind, a lot of languages that compile to assembly do so because assembly has the fewest opinions on how your program should run. If you compile to the JVM, you already have to shoulder the runtime overhead of a garbage collector. If you compile to Python, well, your program will probably run pretty slowly and you lose any static typing guarantees. This is why LLVM is a great replacement for platform-specific assemblies: its abstraction doesn’t impose any significant overhead or opinions on the programmer while providing a number of benefits. For these same reasons, I believe that future languages that want LLVM-level control should consider targeting Rust instead of LLVM.

The Rust Advantage

When I say that Rust should replace LLVM, the future I envision is this: languages would compile down into Rust code, so implementing a new language would mean writing a program that parses source code and then generates Rust code (a minimal sketch of such a generator follows the list below). Rust in this kind of ecosystem offers five primary advantages over LLVM:

  1. Rust has an advanced type system. Interoperability between languages is limited because oftentimes the least common denominator between two programming languages is at the assembly level. Hence, many languages have decent support for calling out to C but not to Python or Java. However, if two languages compile to the same type system, then they can seamlessly interoperate. My current project Lia is a good example of this. This also means that new languages don’t need to fully re-implement type analysis (although they may need to augment the typechecker in some way).
  2. Rust has several useful programming constructs that don’t exist in LLVM. A lot of ideas we take for granted in our programming languages, like loops, closures, and enums (sum types), have to be re-implemented every time a new language compiles to LLVM. By compiling a language to Rust, not only do you get access to these constructs, but in my mind it’s generally far easier to write a code generator that emits Rust than one that emits LLVM by hand, particularly with quasiquotations.
  3. Rust comes with a package manager, Cargo. A good language is more than its specification and compiler: it needs an ecosystem of libraries and tooling to actually function in the real world. A bare minimum for any viable language today is a package manager for distributing libraries and handling external dependencies. If new languages compiled to Rust, not only would they get a package manager for free, but they would also get free access to programs written in every other language that compiles to Rust. Imagine a world where you could have Rust, Java, and Python all in the same ecosystem with seamless type-level interoperability and the same package manager.
  4. People actually write in Rust. Whereas LLVM is intended just as an intermediate representation for compilers, Rust is a language actually meant to be written by hand. This means a language that compiles to Rust potentially gets all existing Rust libraries for free: no need to rewrite HTTP servers or command-line parsing for the millionth time.
  5. Rust only imposes one opinion on your program: that it should be safe. I would argue that the only real difference at the runtime level between programs you can write in LLVM and programs you can write in Rust is that Rust requires all of your programs to be safe, i.e. not have memory errors/data races. Rust does not require you to have dynamic types, garbage collection, or really any particular runtime overhead—all of the hard work is done at compile time. And this is an opinion that I believe should hold for all programs regardless of how you intend to implement your language.
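
To make this concrete, here is a minimal sketch of what “writing a program to parse source code and then generate Rust code” could look like. Everything here is hypothetical and invented for illustration (the `Expr` AST, the `emit` function, and the toy surface language are not from Lia or any real project); it simply shows how constructs like let-bindings and closures from points 1 and 2 map directly onto generated Rust, with no lowering to registers or basic blocks.

```rust
// A toy expression language whose "compiler" is just a function that
// emits Rust source code as a string. Hypothetical sketch, not a real project.

enum Expr {
    Num(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Let(String, Box<Expr>, Box<Expr>), // let x = e1 in e2
    Lambda(String, Box<Expr>),         // fn x -> e, becomes a Rust closure
    App(Box<Expr>, Box<Expr>),         // function application
}

// Emit Rust source for an expression. High-level constructs map directly
// onto Rust's own constructs, so there is no manual lowering step.
fn emit(e: &Expr) -> String {
    match e {
        Expr::Num(n) => format!("{}i64", n),
        Expr::Var(x) => x.clone(),
        Expr::Add(l, r) => format!("({} + {})", emit(l), emit(r)),
        Expr::Let(x, v, body) => format!("{{ let {} = {}; {} }}", x, emit(v), emit(body)),
        Expr::Lambda(x, body) => format!("(|{}| {})", x, emit(body)),
        Expr::App(f, arg) => format!("{}({})", emit(f), emit(arg)),
    }
}

fn main() {
    // Source program: let add1 = fn x -> x + 1 in add1 41
    let prog = Expr::Let(
        "add1".into(),
        Box::new(Expr::Lambda(
            "x".into(),
            Box::new(Expr::Add(
                Box::new(Expr::Var("x".into())),
                Box::new(Expr::Num(1)),
            )),
        )),
        Box::new(Expr::App(
            Box::new(Expr::Var("add1".into())),
            Box::new(Expr::Num(41)),
        )),
    );
    // Prints the generated Rust: { let add1 = (|x| (x + 1i64)); add1(41i64) }
    println!("{}", emit(&prog));
}
```

A real compiler would likely emit syntax trees rather than strings (e.g. via quasiquotation, as mentioned in point 2), but even this string-based version inherits Rust’s type checking, borrow checking, and Cargo ecosystem once its output is handed to rustc.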

The Rust Disadvantage

This approach is not without drawbacks. I don’t believe these necessarily outweigh the benefits, but they must be considered or addressed when building this next generation of languages.

  1. Rust needs to expose more compiler internals. Currently, all compiler internals and syntax extensions are unstable and only available on nightly. There are plans in the works to change this, but those plans really only cover programmatic access to the parser/AST. If a language with interesting properties at the type level (e.g. a gradually typed language) were to be implemented on top of Rust, it would probably need access to the typechecker to avoid reimplementing the entire type system.
  2. The story for dynamic languages needs refinement. Specifically, it needs a battle-hardened GC (as far as I know, there’s really only one major attempt so far). Rust also needs some sort of JIT in order to build any kind of REPL (I’m working on this).
  3. Error reporting and IDE tooling are insufficient. Ideally, a language that compiles to Rust would want to leave checks like detecting invalid variable uses (e.g. using a variable that doesn’t exist) to the Rust compiler. However, when rustc emits such an error, there’s no meaningful way to map it back to where it came from in the original source. This is not an unsolvable problem: JavaScript has handled it for a while with source maps. Additionally, features like syntax highlighting in editors for new languages intermixed with Rust would need to be developed.

Moving Forward

For my dream to become reality, we need a lot of work to develop Rust as a compile target, both on the Rust compiler and on compilers targeting Rust. Several members of the Rust community are working hard on Rust’s metaprogramming features, but more work needs to be done on languages that compile to Rust. To my knowledge, Lia is the first major language that actually compiles directly to Rust. Work on such languages and tooling can slowly push us towards a world where languages are no longer developed in isolation but instead all belong to a cohesive ecosystem. If you have any ideas on how to do this or if you vehemently disagree, send your diatribes to wcrichto@stanford.edu or post in the comments.

References

  1. Disclaimer: “platform-independent” is an overgeneralization, as several people have not hesitated to point out to me. To quote my brother, “I will say you do seem to think that LLVM gives you auto compat with all architectures/OSes, but oh man do I have stories for you.”

  2. Note that when someone says a phrase like “C compiles to x86,” that’s technically an incorrect statement, since C doesn’t intrinsically compile to anything. C is just the language specified by the C standard. Specific compilers like GCC and Clang compile C into something that satisfies that standard. For a language like Rust, it’s a little weirder, since Rust doesn’t have a formal standard but is instead informally specified by the one major Rust compiler, rustc. Hence, I just say “Rust compiles to LLVM.”