Expressiveness Benchmark

Dataset analysis

1. Overall, how concise are programs in each language and category?

Our goal is to use quantitative metrics to compare the conciseness of programs in each language. We measure program length as the number of tokens. For example, the program "(var_name + 1) * 2" has the tokens "[(, var_name, +, 1, ), *, 2]", for a length of 7. Counting tokens rather than lines of code or characters helps control for stylistic differences such as indentation, and it does not penalize choices like longer variable names. The boxplots below show the distribution of the number of tokens in programs for each category, sorted by median.
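As a minimal sketch of the token-count metric described above, the following uses Python's standard tokenize module to count tokens in a Python snippet. The benchmark's actual tokenizers are not specified here, so the choice of tokenizer and the set of token types counted (CONTENT_TYPES, a hypothetical name) are assumptions made for illustration.

```python
import io
import tokenize

# Assumption: only names, numbers, strings, and operators count as tokens.
# Layout tokens (NEWLINE, INDENT, DEDENT, ENDMARKER) and comments are ignored.
CONTENT_TYPES = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP}


def count_tokens(source):
    """Count content tokens in a snippet of Python source code."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type in CONTENT_TYPES)


print(count_tokens("(var_name + 1) * 2"))                     # 7
print(count_tokens("(a_much_longer_variable_name + 1) * 2"))  # 7
```

Both calls give the same count, which illustrates why a token-based length is insensitive to identifier length and whitespace.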

2. For a given language, what are its most and least concise tasks?

To compare languages within categories, we take each task and assign each of its programs a z-score based on length: the number of standard deviations by which the program's length differs from the mean length of that task's programs. The z-score tells us: for a given task (e.g. Youngest over 35), how does a program's size in one language compare to its size in other languages? A high z-score means a larger program than normal, and a low z-score means a smaller one. Because z-scores are normalized, we can compare them across multiple tasks. A language's highest z-score marks its worst category, and its lowest z-score marks its best category. Below we plot the z-scores for each language and category (z-scores within a given category/language pair are averaged).
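The per-task normalization above can be sketched as follows. The token counts and language names are hypothetical, not taken from the dataset, and the use of the population standard deviation (pstdev) is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev


def z_scores(lengths):
    """Map each language to the z-score of its program length for one task."""
    mu = mean(lengths.values())
    sigma = pstdev(lengths.values())  # assumption: population standard deviation
    if sigma == 0:
        # All programs have the same length, so none is above or below normal.
        return {lang: 0.0 for lang in lengths}
    return {lang: (n - mu) / sigma for lang, n in lengths.items()}


# Hypothetical token counts for two tasks of very different overall size.
task_a = {"lang_a": 20, "lang_b": 30, "lang_c": 40}
task_b = {"lang_a": 200, "lang_b": 300, "lang_c": 400}

print(z_scores(task_a))
print(z_scores(task_b))
```

Although the programs for task_b are ten times longer, each language receives the same z-score in both tasks, which is what makes scores comparable (and averageable) across tasks.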

To understand these statistics, let's dig into an example. Datalog's best category is Joins and its worst category is Strings. Here are the two Datalog join programs:

Datalog has more concise programs for joins because relationships between tables are implicitly expressed by sharing variables across tables. By contrast, languages like SQL require explicit JOIN clauses. But if we look to Strings, we can see when Datalog gets verbose:

The main issue is that Datalog (i.e. Souffle) does not have many built-in primitives for string processing, such as splitting or removing characters, so re-implementing those primitives requires a lot of code.

3. For a given task, what are its most and least concise languages?

To answer this question, we can transpose the previous analysis. For each category, we can compare the z-scores for different languages, shown below.

Python - Imperative has the most verbose programs for every category except Strings. The most concise programs vary mostly between SQL, R, and Q.

4. How much do plans overlap in each language?

Each program is annotated with which pieces of the code implement which sub-goals of a task. For example, consider the Continent with highest average population task:

Specification: Find the name of the continent with the highest average population by country.

Python - Imperative

from collections import defaultdict

def continent_by_population(countries):
  continent_stats = defaultdict(lambda: [0, 0])
  for country in countries:
    continent = country['continent']
    continent_stats[continent][0] += country['population']
    continent_stats[continent][1] += 1

  max_continent = None
  max_average = None
  for continent, [total, count] in continent_stats.items():
    average = total / count
    if max_average is None or max_average < average:
      max_average = average
      max_continent = continent

  return max_continent

Python - Functional

def continent_by_population(countries):
  continents = set([c['continent'] for c in countries])
  populations_by_continent = [
    (continent, [c['population'] for c in countries 
                 if c['continent'] == continent])
    for continent in continents
  ]
  averages = [
    (continent, sum(pops) / len(pops))
    for continent, pops in populations_by_continent
  ]
  return max(averages, key=lambda t: t[1])[0]

For a given sub-goal, e.g. "average population", the set of corresponding highlighted regions is collectively its plan. Plans can tell us how hard or easy a program may be to write or read. For example, Elliot Soloway found in the 1980s that merging two plans together is hard for programmers. In the Python - Imperative solution, the "highest" and "average population" plans are merged together into a single for-loop, whereas in the Python - Functional solution the "highest" plan is separated by use of a higher-order function max.

Based on this observation, Duran et al. proposed that the number of overlapping plans in a program could be used as a metric of cognitive complexity. Using this dataset's plan annotations, we can compute this metric directly. Specifically, for each program, we count the number of pairs of plans that overlap. For example, the Python - Functional program above has the "name" and "highest" plans overlapping, but does not have the "average" and "highest" plans overlapping. Below we plot the distribution of plan overlaps in each language, sorted by median:

The languages Q, SQL, Python - Pandas, and R had fewer overlapping plans (median 1), while the languages Python - Functional, Python - Imperative, and Datalog had a median of 3 overlapping plans. For example, here are two programs, the first with 0 overlapping plans and the second with 10 overlapping plans.

Continent with the highest average population / R - Tidyverse

continent_by_population <- function(countries) {
  countries %>%
    group_by(continent) %>%
    summarize(mean_pop = mean(population)) %>%
    slice(which.max(mean_pop)) %>%
    .$continent
}

Row per family to row per child / Datalog - Souffle

row_per_child("child1", dob, family, height) :-
  families(dob, _, _, family, height, _, _).
row_per_child("child2", dob, family, height) :-
  families(_, dob, _, family, _, height, _).
row_per_child("child3", dob, family, height) :-
  families(_, _, dob, family, _, _, height).

The R program cleanly separates its plans, with each row serving its own part of the task. The Datalog program has every plan overlapping with every other plan, for a total of 5 choose 2 = 10 overlaps.
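The pairwise overlap count can be sketched as below. How the dataset decides that two plans overlap is not spelled out here, so this sketch assumes one reading that is consistent with the examples above: each plan is a list of highlighted character spans, and two plans overlap when the ranges from their first to their last highlighted character intersect. The function names and all offsets are hypothetical.

```python
from itertools import combinations


def extent(spans):
    """Smallest half-open (start, end) range covering all of a plan's spans."""
    return min(start for start, _ in spans), max(end for _, end in spans)


def count_overlaps(plans):
    """Count unordered pairs of plans whose extents intersect."""
    extents = [extent(spans) for spans in plans.values()]
    return sum(
        1
        for (s1, e1), (s2, e2) in combinations(extents, 2)
        if s1 < e2 and s2 < e1
    )


# Hypothetical offsets: three plans, each confined to its own region.
separated = {"group": [(0, 10)], "average": [(10, 20)], "highest": [(20, 30)]}

# Hypothetical offsets: five plans, each with one piece in each of three rules.
interleaved = {
    name: [(i, i + 1), (i + 10, i + 11), (i + 20, i + 21)]
    for i, name in enumerate(["a", "b", "c", "d", "e"])
}

print(count_overlaps(separated))    # 0
print(count_overlaps(interleaved))  # 10
```

With five plans that all interleave, every one of the 5 choose 2 pairs is counted, matching the total of 10 in the Datalog example.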