$$ % Typography and symbols \newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\ctx}{\Gamma} \newcommand{\qamp}{&\quad} \newcommand{\qqamp}{&&\quad} \newcommand{\Coloneqq}{::=} \newcommand{\proves}{\vdash} \newcommand{\star}[1]{#1^{*}} \newcommand{\eps}{\varepsilon} \newcommand{\nul}{\varnothing} \newcommand{\brc}[1]{\{{#1}\}} \newcommand{\binopm}[2]{#1~\bar{\oplus}~#2} \newcommand{\mag}[1]{|{#1}|} \newcommand{\aequiv}{\equiv_\alpha} \newcommand{\semi}[2]{{#1};~{#2}} % Untyped lambda calculus \newcommand{\fun}[2]{\lambda ~ {#1} ~ . ~ {#2}} \newcommand{\app}[2]{#1 ~ #2} \newcommand{\fix}[3]{\msf{fix}~({#1} : {#2}) ~ . ~ #3 } \newcommand{\truet}{\msf{true}} \newcommand{\falset}{\msf{false}} \newcommand{\define}[2]{{#1} \triangleq {#2}} % Typed lambda calculus - expressions \newcommand{\funt}[3]{\lambda ~ \left(#1 : #2\right) ~ . ~ #3} \newcommand{\lett}[4]{\msf{let} ~ \hasType{#1}{#2} = #3 ~ \msf{in} ~ #4} \newcommand{\letrec}[4]{\msf{letrec} ~ \hasType{#1}{#2} = #3 ~ \msf{in} ~ #4}a \newcommand{\ift}[3]{\msf{if} ~ {#1} ~ \msf{then} ~ {#2} ~ \msf{else} ~ {#3}} \newcommand{\rec}[5]{\msf{rec}(#1; ~ #2.#3.#4)(#5)} \newcommand{\case}[5]{\msf{case} ~ {#1} ~ \{ L(#2) \to #3 \mid R(#4) \to #5 \}} \newcommand{\pair}[2]{\left({#1},~{#2}\right)} \newcommand{\proj}[2]{#1 . #2} \newcommand{\inj}[3]{\msf{inj} ~ #1 = #2 ~ \msf{as} ~ #3} \newcommand{\letv}[3]{\msf{let} ~ {#1} = {#2} ~ \msf{in} ~ {#3}} \newcommand{\fold}[2]{\msf{fold}~{#1}~\msf{as}~{#2}} \newcommand{\unfold}[1]{\msf{unfold}~{#1}} \newcommand{\poly}[2]{\Lambda~{#1}~.~ #2} \newcommand{\polyapp}[2]{{#1}~\left[{#2}\right]} \newcommand{\export}[3]{\msf{export}~ #1 ~\msf{without}~{#2}~\msf{as}~ #3} \newcommand{\import}[4]{\msf{import} ~ ({#1}, {#2}) = {#3} ~ \msf{in} ~ #4} % Typed lambda calculus - types \newcommand{\tnum}{\msf{num}} \newcommand{\tstr}{\msf{string}} \newcommand{\tint}{\msf{int}} \newcommand{\tbool}{\msf{bool}} \newcommand{\tfun}[2]{#1 \rightarrow #2} \newcommand{\tprod}[2]{#1 \times #2} \newcommand{\tsum}[2]{#1 + #2} \newcommand{\trec}[2]{\mu~{#1}~.~{#2}} \newcommand{\tvoid}{\msf{void}} \newcommand{\tunit}{\msf{unit}} \newcommand{\tpoly}[2]{\forall~{#1}~.~{#2}} \newcommand{\tmod}[2]{\exists ~ {#1} ~ . ~ #2} % WebAssembly \newcommand{\wconst}[1]{\msf{i32.const}~{#1}} \newcommand{\wbinop}[1]{\msf{i32}.{#1}} \newcommand{\wgetlocal}[1]{\msf{get\_local}~{#1}} \newcommand{\wsetlocal}[1]{\msf{set\_local}~{#1}} \newcommand{\wgetglobal}[1]{\msf{get\_global}~{#1}} \newcommand{\wsetglobal}[1]{\msf{set\_global}~{#1}} \newcommand{\wload}{\msf{i32.load}} \newcommand{\wstore}{\msf{i32.store}} \newcommand{\wsize}{\msf{memory.size}} \newcommand{\wgrow}{\msf{memory.grow}} \newcommand{\wunreachable}{\msf{unreachable}} \newcommand{\wblock}[1]{\msf{block}~{#1}} \newcommand{\wloop}[1]{\msf{loop}~{#1}} \newcommand{\wbr}[1]{\msf{br}~{#1}} \newcommand{\wbrif}[1]{\msf{br\_if}~{#1}} \newcommand{\wreturn}{\msf{return}} \newcommand{\wcall}[1]{\msf{call}~{#1}} \newcommand{\wlabel}[2]{\msf{label}~\{#1\}~{#2}} \newcommand{\wframe}[2]{\msf{frame}~({#1}, {#2})} \newcommand{\wtrapping}{\msf{trapping}} \newcommand{\wbreaking}[1]{\msf{breaking}~{#1}} \newcommand{\wreturning}[1]{\msf{returning}~{#1}} \newcommand{\wconfig}[5]{\{\msf{module}{:}~{#1};~\msf{mem}{:}~{#2};~\msf{locals}{:}~{#3};~\msf{stack}{:}~{#4};~\msf{instrs}{:}~{#5}\}} \newcommand{\wfunc}[4]{\{\msf{params}{:}~{#1};~\msf{locals}{:}~{#2};~\msf{return}~{#3};~\msf{body}{:}~{#4}\}} \newcommand{\wmodule}[1]{\{\msf{funcs}{:}~{#1}\}} \newcommand{\wcg}{\msf{globals}} \newcommand{\wcf}{\msf{funcs}} \newcommand{\wci}{\msf{instrs}} \newcommand{\wcs}{\msf{stack}} \newcommand{\wcl}{\msf{locals}} \newcommand{\wclab}{\msf{labels}} \newcommand{\wcm}{\msf{mem}} \newcommand{\wcmod}{\msf{module}} \newcommand{\wsteps}[2]{\steps{\brc{#1}}{\brc{#2}}} \newcommand{\with}{\underline{\msf{with}}} \newcommand{\wvalid}[2]{{#1} \vdash {#2}~\msf{valid}} \newcommand{\wif}[2]{\msf{if}~{#1}~{\msf{else}}~{#2}} \newcommand{\wfor}[4]{\msf{for}~(\msf{init}~{#1})~(\msf{cond}~{#2})~(\msf{post}~{#3})~{#4}} % assign4.3 custom \newcommand{\wtry}[2]{\msf{try}~{#1}~\msf{catch}~{#2}} \newcommand{\wraise}{\msf{raise}} \newcommand{\wraising}[1]{\msf{raising}~{#1}} \newcommand{\wconst}[1]{\msf{i32.const}~{#1}} \newcommand{\wbinop}[1]{\msf{i32}.{#1}} \newcommand{\wgetlocal}[1]{\msf{get\_local}~{#1}} \newcommand{\wsetlocal}[1]{\msf{set\_local}~{#1}} \newcommand{\wgetglobal}[1]{\msf{get\_global}~{#1}} \newcommand{\wsetglobal}[1]{\msf{set\_global}~{#1}} \newcommand{\wload}{\msf{i32.load}} \newcommand{\wstore}{\msf{i32.store}} \newcommand{\wsize}{\msf{memory.size}} \newcommand{\wgrow}{\msf{memory.grow}} \newcommand{\wunreachable}{\msf{unreachable}} \newcommand{\wblock}[1]{\msf{block}~{#1}} \newcommand{\wloop}[1]{\msf{loop}~{#1}} \newcommand{\wbr}[1]{\msf{br}~{#1}} \newcommand{\wbrif}[1]{\msf{br\_if}~{#1}} \newcommand{\wreturn}{\msf{return}} \newcommand{\wcall}[1]{\msf{call}~{#1}} \newcommand{\wlabel}[2]{\msf{label}~\{#1\}~{#2}} \newcommand{\wframe}[2]{\msf{frame}~({#1}, {#2})} \newcommand{\wtrapping}{\msf{trapping}} \newcommand{\wbreaking}[1]{\msf{breaking}~{#1}} \newcommand{\wreturning}[1]{\msf{returning}~{#1}} \newcommand{\wconfig}[5]{\{\msf{module}{:}~{#1};~\msf{mem}{:}~{#2};~\msf{locals}{:}~{#3};~\msf{stack}{:}~{#4};~\msf{instrs}{:}~{#5}\}} \newcommand{\wfunc}[4]{\{\msf{params}{:}~{#1};~\msf{locals}{:}~{#2};~\msf{return}~{#3};~\msf{body}{:}~{#4}\}} \newcommand{\wmodule}[1]{\{\msf{funcs}{:}~{#1}\}} \newcommand{\wcg}{\msf{globals}} \newcommand{\wcf}{\msf{funcs}} \newcommand{\wci}{\msf{instrs}} \newcommand{\wcs}{\msf{stack}} \newcommand{\wcl}{\msf{locals}} \newcommand{\wcm}{\msf{mem}} \newcommand{\wcmod}{\msf{module}} \newcommand{\wsteps}[2]{\steps{\brc{#1}}{\brc{#2}}} \newcommand{\with}{\underline{\msf{with}}} \newcommand{\wvalid}[2]{{#1} \vdash {#2}~\msf{valid}} % assign4.3 custom \newcommand{\wtry}[2]{\msf{try}~{#1}~\msf{catch}~{#2}} \newcommand{\wraise}{\msf{raise}} \newcommand{\wraising}[1]{\msf{raising}~{#1}} \newcommand{\wif}[2]{\msf{if}~{#1}~{\msf{else}}~{#2}} \newcommand{\wfor}[4]{\msf{for}~(\msf{init}~{#1})~(\msf{cond}~{#2})~(\msf{post}~{#3})~{#4}} \newcommand{\windirect}[1]{\msf{call\_indirect}~{#1}} % session types \newcommand{\ssend}[2]{\msf{send}~{#1};~{#2}} \newcommand{\srecv}[2]{\msf{recv}~{#1};~{#2}} \newcommand{\soffer}[4]{\msf{offer}~\{{#1}\colon({#2})\mid{#3}\colon({#4})\}} \newcommand{\schoose}[4]{\msf{choose}~\{{#1}\colon({#2})\mid{#3}\colon({#4})\}} \newcommand{\srec}[1]{\msf{label};~{#1}} \newcommand{\sgoto}[1]{\msf{goto}~{#1}} \newcommand{\dual}[1]{\overline{#1}} % Inference rules \newcommand{\inferrule}[3][]{\cfrac{#2}{#3}\;{#1}} \newcommand{\ir}[3]{\inferrule[\text{(#1)}]{#2}{#3}} \newcommand{\s}{\hspace{1em}} \newcommand{\nl}{\\[2em]} \newcommand{\evalto}{\boldsymbol{\overset{*}{\mapsto}}} \newcommand{\steps}[2]{#1 \boldsymbol{\mapsto} #2} \newcommand{\evals}[2]{#1 \evalto #2} \newcommand{\subst}[3]{[#1 \rightarrow #2] ~ #3} \newcommand{\dynJ}[2]{#1 \proves #2} \newcommand{\dynJC}[1]{\dynJ{\ctx}{#1}} \newcommand{\typeJ}[3]{#1 \proves \hasType{#2}{#3}} \newcommand{\typeJC}[2]{\typeJ{\ctx}{#1}{#2}} \newcommand{\hasType}[2]{#1 : #2} \newcommand{\val}[1]{#1~\msf{val}} \newcommand{\num}[1]{\msf{Int}(#1)} \newcommand{\err}[1]{#1~\msf{err}} \newcommand{\trans}[2]{#1 \leadsto #2} \newcommand{\size}[1]{\left|#1\right|} $$

&Notepad

Compiling Knowledge into Probabilistic Models

Will Crichton — March 23, 2019

Turning procedural and structural knowledge into programs has established methodologies, but what about turning knowledge into probabilistic models? I explore a few examples of what such a process could look like.

In programming, compilation is about translating between equivalent representations of things. For example, gcc (the C compiler) takes a C program, a representation of a thing your computer should do, and translates it into an assembly program, also a representation of a thing your computer should do, just in a language the computer actually understands. Ideally, those two programs are “equivalent” in the sense that the C specification defines a behavior for how the C program should work, and the computer executes the assembly in a manner consistent with the spec.

Unbeknownst to the compiler, a second kind of compilation occurs behind the scenes: from a C programmer’s mental model of their program into the C program. For example, if I think “I need to loop through this array and print each element”, then that translates into:

int array[N] = { ... };
for (int i = 0; i < N; ++i) {
  printf("%d\n", array[i]);
}

Again, there are two representations: the mental model and the textual C program, which the programmer wants to be equivalent, but may not be if they wrote a typo, or don’t know how C works, or have an incorrect expectation of how C works¹. And a final kind of compilation occurs when a programmer translates from their mental model of the problem being solved into a “programmatic” mental model of the solution, e.g. formulating an algorithm, designing a system, and so on. For example, if my problem is to sum the elements of an array, I can compile this into a C/systems/low-level/imperative programmatic model²:

Make a temporary sum variable and index. Write a for loop that iterates over the array. Get the i-th element and add into the sum.

Which then “compiles” into C:

int array[N] = { ... };
int sum = 0;
for (int i = 0; i < N; ++i) {
  sum += array[i];
}

I could also compile this into an OCaml/functional/declarative programmatic model:

Reduce the list with a sum operator.

let arr = [...] in
let sum = List.reduce (+) arr

This kind of commonplace “knowledge compilation” characterizes what I’ll call procedural knowledge, i.e. verbs or actions, how to actually do something in a domain, usually in the form of a function or code fragment. Programmers also frequently compile ontological (or structural) knowledge, i.e. nouns or objects, how a domain is structured, usually in the form of a type or class definition. For example, if two distinct quantities are related, like the x/y of a 2D point, this is programmatically a product type, e.g. a struct in C:

struct point_t { float x; float y; }

Or a record (or tuple) in OCaml:

type Point = { x : float; y : float; }
type PointTuple = float * float

If one quantity is dependent upon another, then we need a function type:

type Time = float
type PointGenerator = Time -> Point

If an entity is made up of categories, then we need a sum type (or variant, or tagged union):

type Chart = Scatter of Point list | Pie of float list | ...

These are all examples of identifying structure within a domain and encoding that structure, that ontology, into a program so that we may structure our remaining program around it. Programming languages offer an ontological toolbox of functions, products, sums, lists, and so on that we programmers adapt to a given domain.

While some programmers learn these skills implicitly through writing many programs, we have more resources than ever today on explicit methodologies for compiling knowledge into programs. For example, How to Design Programs is a great introductory CS textbook that explains a methodology for designing programs around data types.

Rich in methods, poor in methodologies

If such methodologies exist for general-purpose programming, do they exist for probabilistic modeling? Given a probabilistic mental model of a domain, how can I compile that model into an executable program? Over the last two quarters at Stanford, I took Stefano Ermon’s course on probabilistic graphical models and Noah Goodman’s course on cognitive science through probabilistic programming seeking such enlightenment. While the path to nirvana was not neatly laid out before me, I have seen an ephemeral glimpse at a brighter future.

At their core, I think probabilistic models offer two fundamental capabilities: reasoning under uncertainty, and learning from data. One form of uncertainty arises from nondeterminism, i.e. some input to your problem changes in ways not fully predictable, e.g. Google modeling search inputs from people, or NASA modeling noisy communications from space probes. Probability theory has a rich standard library of models (e.g. Bernoulli, Gaussian, Poisson, …) with mathematical properties that make them easy to use programmatically in modeling sources of noise/nondeterminisim.

Another kind of uncertainty is incomplete information. Even in a system that’s fully deterministic, if a function is not bijective, then it can be impossible to reconstruct events only knowing partial information about inputs, outputs and intermediates. The standard example from a probability textbook is that if you wake up in the morning to see your lawn is wet, did it rain or did your sprinkler turn on? Probabilistic models offer the ability to explicitly represent these situations and answer these questions by marginalizing out unknown information.

Moreover, these models can permit nonmonotonic reasoning, i.e. my conclusion can change after seeing more evidence. For example, I might assume my sprinkler turned on because that happens more often than rain, but if I go outside and the roof is wet, then that suggests rain was the true culprit—adding information changed my conclusion.

Lastly, probabilistic models provide an explicit formalism for bridging domain knowledge and domain data. For example, if I know the causal structure of a domain, I can represent that as a graphical model: either rain or a sprinkler caused my lawn to get wet. The parameters (probability lawn is wet given rain or given sprinkler, and baseline probability of rain and sprinkler) can then be learned from the data. Parameters can also be augmented with human knowledge through priors in a Bayesian learning setup. In the general-purpose programming context, imagine if you could give examples of a program output (domain data) along with a skeleton of a program (source file with incomplete parts) and ask a system to fill in the holes. This kind of program synthesis task fits well as a probabilistic model.

Example knowledge compilations

To concretize these points, I’ll show a few examples of compiling knowledge into probabilistic models.

Example: I have a fair coin.

General case: Something happens with two possible outcomes, or an object has two possible categories. Each outcome/category is equally likely.

Mathematical model:

$\text{CoinFlip} \sim \bern(p = 0.5)$

Example: I have an unfair 6-sided dice, where the first side is rolled 50% of the time, and the rest 10%.

General case: Something happens with N possible outcomes, and each outcome can have a different probability.

Mathematical model:

$\text{DiceRoll} \sim \text{Categorical}(\theta = \{0.5, 0.1, 0.1, 0.1, 0.1, 0.1\})$

Example: I have a 6-sided dice, and I think that one of the sides is biased, but I’m not sure which.

General case: Something happens with N possible outcomes, and one outcome is more likely than the rest.

Mathematical model:

$\begin{align*} \text{DicePrior} &\sim \text{Dirichlet}(\alpha = \{0.5, 0.5, 0.5, 0.5, 0.5, 0.5\}) \\ \text{DiceRoll} &\sim \text{Categorical}(\theta = \text{DicePrior}) \end{align*}$

Example: The higher my GPA, the more likely I’ll get into college.

General case: The probability of an event is correlated with a real-valued outcome, with an expected positive correlation.

Mathematical model:

$\begin{align*} \text{GPAWeight} &\sim \mathcal{N}(\mu = 1, \sigma = 0.5) \\ \text{Accepted} &\sim \text{Bernoulli}\left(p = \cfrac{1}{1 + \exp(-(\text{GPAWeight} * \text{GPA}))}\right) \end{align*}$

These are all examples of structural/ontological probabilistic knowledge, i.e. they describe the shape of a situation with uncertainty. Next, let’s look at procedural probabilistic knowledge, i.e. how to answer questions against these models.

Example: If 5 people with GPA 3.0 and 10 people with GPA 4.0 got accepted, what’s the most suitable parameters for GPA weight?

General case: What parameters best explain the dataset, i.e. maximize the probability of all data under the model?

Mathematical model:

$\hat{\mu}, \hat{\sigma} = \argmax_{\mu, \sigma} \prod_{(\text{acc}, \text{gpa}) \in \dataset} P_{\mu,\sigma}(\text{Accepted} = \text{acc}, \text{GPA} = \text{gpa})$

Example: If someone was accepted to college, what was their most likely GPA?

General case: Given observations, what values for unknown data maximize the joint probability under the model?

Mathematical model:

$\hat{\text{gpa}} = \argmax_{\text{gpa}} P(\text{GPA} = \text{gpa} \mid \text{Accepted} = \text{yes})$

Both of these questions, parameter learning and conditional inference, are still answered at a high level of abstraction. From my impression taking the graphical models course, a lot of work in practice is picking the right method to accomplish a particular task, e.g. doing inference with exact (variable elimination, belief propagation) vs. approximate methods (MCMC, variational inference). These methods also have hyperparameters that need to be carefully tuned, e.g. mixing time for MCMC, or variational family for variational inference.³

Regardless, it seems to me that the the act of compiling knowledge into probabilistic models is still more art than science. At best, we are left to learn from repositories of examples and to try to back out a methodology accordingly. That’s quite inefficient! And as we hurtle towards a data-driven world where more and more programmers are likely to need probabilistic models as a part of their daily routine, we ought to invest more in developing explicit techniques for probabilistic knowledge compilation. The examples given here are just a seed of an idea, but who knows, maybe a How to Design Probabilistic Programs textbook could be just around the corner.

For example, if I thought arrays in C were 1-indexed, then I could translate my goal into a C program that’s consistent with my expectations, but will not execute as I expect because of my misunderstanding. That’s a different kind of bug then if I accidentally 1-indexed and didn’t mean to. ↩
Entertainingly, some old psychology of programming literature cite this kind of an example as the way “all experienced programmers” would decompose the problem. Sheil, B. A. “The Psychological Study of Programming.” ACM Computing Surveys, 1981. ↩
Worth noting that the industrial-strength probabilistic programming frameworks focus mostly on variational inference, suggesting that for models of real-world scale, approximate methods are necessary for computational feasibility. ↩