Expressiveness Benchmark

Task: Filter and clean tweets

Specification: Select the lower-case body and timestamp of tweets that are in English and not retweets

Input: data

language	is_retweet	likes	body	ts
`en`	`false`	`8`	`Some Text`	`1604534320`
`en`	`true`	`8`	`some Text`	`1604534321`
`en`	`false`	`8`	`some Text`	`1604534322`
`fr`	`false`	`8`	`some Text`	`1604534322`

Output:

body	ts
`some text`	`1604534320`
`some text`	`1604534322`
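
For checking an implementation against the tables above, the sample rows can be written as Python data. This is a minimal sketch that assumes every field is stored as a string, which matches the comparisons against "false" in the Python listings; the names SAMPLE_DATA, EXPECTED, and check are hypothetical and not part of the benchmark.

```python
# Sample rows from the input table; every field is assumed to be a string.
SAMPLE_DATA = [
    {"language": "en", "is_retweet": "false", "likes": "8",
     "body": "Some Text", "ts": "1604534320"},
    {"language": "en", "is_retweet": "true", "likes": "8",
     "body": "some Text", "ts": "1604534321"},
    {"language": "en", "is_retweet": "false", "likes": "8",
     "body": "some Text", "ts": "1604534322"},
    {"language": "fr", "is_retweet": "false", "likes": "8",
     "body": "some Text", "ts": "1604534322"},
]

# Rows from the output table: lower-cased body and timestamp only.
EXPECTED = [
    {"body": "some text", "ts": "1604534320"},
    {"body": "some text", "ts": "1604534322"},
]


def check(process_tweets):
    """Return True if a list-of-dicts implementation reproduces EXPECTED."""
    return process_tweets(SAMPLE_DATA) == EXPECTED
```

Passing a function that takes and returns a list of dicts to check() reports whether it matches the expected output for this sample.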

Python - Imperative

def process_tweets(data):
  result = []
  for value in data:
    if (value["language"] == "en" and
        value["is_retweet"] == "false"):
      result.append({
        "body": value["body"].lower(),
        "ts": value["ts"]
      })
  return result

Python - Functional

def process_tweets(data):
  return [
    {"body": value["body"].lower(),
     "ts": value["ts"]}
    for value in data
    if value["language"] == "en" and
       value["is_retweet"] == "false"
  ]

Python - Pandas

def process_tweets(data):
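  # Copy the filtered rows so the assignment below does not write to a view
  result = data[
    (data.language == 'en') &
    (data.is_retweet == 'false')].copy()
  # Vectorized lower-casing of the body column
  result["body"] = result.body.str.lower()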
  return result[["body", "ts"]]

R - Tidyverse

process_tweets <- function(data) {
  data %>%
    filter(language == "en" & is_retweet == "false") %>%
    mutate(body = tolower(body)) %>%
    select(body, ts)
}

SQL - SQLite

SELECT LOWER(body) AS body, ts
FROM data
WHERE language = 'en' AND is_retweet = 'false'
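
To try the query outside a database shell, one option is Python's built-in sqlite3 module. The sketch below assumes an in-memory table named data with TEXT columns for language, is_retweet, and body and INTEGER columns for likes and ts; that schema, the helper run_query, and SAMPLE_ROWS are assumptions for illustration, not part of the benchmark.

```python
import sqlite3

# Sample rows from the input table, in column order:
# language, is_retweet, likes, body, ts
SAMPLE_ROWS = [
    ("en", "false", 8, "Some Text", 1604534320),
    ("en", "true", 8, "some Text", 1604534321),
    ("en", "false", 8, "some Text", 1604534322),
    ("fr", "false", 8, "some Text", 1604534322),
]


def run_query(rows):
    """Load rows into an in-memory SQLite table and run the filter query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: is_retweet is stored as the text 'true' / 'false'.
        conn.execute(
            "CREATE TABLE data ("
            "language TEXT, is_retweet TEXT, likes INTEGER, "
            "body TEXT, ts INTEGER)"
        )
        conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)
        # Single quotes mark string literals in standard SQL.
        cursor = conn.execute(
            "SELECT LOWER(body) AS body, ts "
            "FROM data "
            "WHERE language = 'en' AND is_retweet = 'false'"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for body, ts in run_query(SAMPLE_ROWS):
        print(body, ts)
```

The query has no ORDER BY clause, so the row order of the result should not be relied on.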

Q - kdb+

process_tweets:
  select lower[body], ts from data 
  where (is_retweet ~\: "false") and (language ~\: "en")