14 Tidy evaluation

library(tidyverse)

Now, we’ll go into some more detail about tidy evaluation. We’ll explain when and why you need tidy evaluation, and what’s going on with enquo(), !!, and !!!.

You don’t always need tidy evaluation when programming with dplyr. For example, the following function works just fine.

filter_fun_1 <- function(df, value) {
  df %>% 
    filter(model == value) %>% 
    nrow()
}

filter_fun_1(df = mpg, value = "corvette")
#> [1] 5

But if we want to make this function more general by adding an argument to specify a column, we’ll run into trouble.

filter_fun_2 <- function(df, var, value) {
  df %>% 
    filter(var == value) %>% 
    nrow()
}

filter_fun_2(df = mpg, var = model, value = "corvette")
#> Error: object 'model' not found

You need tidy evaluation if you want to build a function that passes the names of tibble columns into dplyr verbs.

Here’s another example of a function that doesn’t work:

grouped_mean <- function(df, group_var, summary_var) {
  df %>% 
    group_by(group_var) %>% 
    summarize(mean = mean(summary_var))
}

grouped_mean(df = mpg, group_var = manufacturer, summary_var = cty)
#> Error: Column `group_var` is unknown

In the following sections, we’ll explore why grouped_mean() doesn’t work and show you how to create a function that does.

14.1 Quoting functions

14.1.1 Quoted arguments

Before we can explain what’s going wrong in functions like grouped_mean(), we need to lay some groundwork. In R, you can divide function arguments into two classes: evaluated and quoted.

Evaluated arguments are what you might think of as “normal.” Here’s an example of a function that evaluates its argument.

log(2)
#> [1] 0.693

We can also save 2 as a variable and get the same answer.

x <- 2

log(x)
#> [1] 0.693

Code in an evaluated argument executes the same regardless of whether or not it’s in a function argument. So x evaluates to 2 whether it’s outside log()

x
#> [1] 2

or inside.

log(x) == log(2)
#> [1] TRUE

Because log() evaluates its arguments, it figures out what x refers to before operating on x. In our example, log() figures out that x refers to 2, and then takes the log of 2. For a function like log(), evaluating its argument seems like an obviously good idea. It would be meaningless to take the log of the letter “x”.

Sometimes, however, functions don’t want to evaluate their arguments. Let’s find out if dplyr functions use evaluated arguments. Here’s a dplyr function.

mpg %>% 
  select(cty)
#> # A tibble: 234 x 1
#>     cty
#>   <int>
#> 1    18
#> 2    21
#> 3    20
#> 4    21
#> 5    16
#> 6    18
#> # … with 228 more rows

In this case, let’s just consider the cty argument, even though mpg is technically an argument as well.

If select() evaluates its argument, then cty should evaluate to the same thing inside select() as it does outside select(). Inside select(), we know that cty somehow refers to a column of mpg. Does it evaluate to a column of mpg outside select()?

cty
#> Error in eval(expr, envir, enclos): object 'cty' not found

No. R doesn’t know what cty refers to, because cty, unlike x, doesn’t exist in the global environment. It only exists as an element of mpg.

To make this point even more explicit, let’s assign a number to cty.

cty <- 3

cty
#> [1] 3

Now, cty evaluates to 3 outside select(), but select(cty) still works as expected.

mpg %>% 
  select(cty)
#> # A tibble: 234 x 1
#>     cty
#>   <int>
#> 1    18
#> 2    21
#> 3    20
#> 4    21
#> 5    16
#> 6    18
#> # … with 228 more rows

Arguments like cty are quoted. Instead of immediately evaluating cty before operating, select() and the other dplyr verbs hold on to what was literally supplied as an argument. Here, that’s just “cty”, but other cases can be more complicated. Quoting allows select() and the other dplyr verbs to use their input as they want without worrying about how that input evaluates in the global environment. In our example, quoting its argument allows select() to look for cty inside the mpg tibble, without worrying about cty referring to 3 in the global environment.

Note that dplyr verbs only quote the arguments that supply column names. They do not quote the argument that refers to the data you pipe in, or non-column-name arguments like count()’s sort argument or top_n()’s n argument.
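
The difference between evaluating and quoting an argument can be sketched in base R, without dplyr. The two helper functions below are hypothetical names made up for this illustration. The first returns its argument after R evaluates it; the second uses substitute() to hold on to what was literally typed.

```r
# A minimal sketch in base R; evaluating_fun() and quoting_fun() are
# hypothetical helpers, not part of dplyr.
evaluating_fun <- function(arg) {
  arg              # R evaluates arg before returning it
}

quoting_fun <- function(arg) {
  substitute(arg)  # returns the code that was supplied, unevaluated
}

x <- 2

evaluating_fun(x + 1)
#> [1] 3

quoting_fun(x + 1)
#> x + 1
```

dplyr verbs use a more elaborate mechanism than substitute(), but the idea is similar: the verb receives the code you typed and decides for itself how to interpret it.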

dplyr’s quoting behavior makes it really easy to use. You can supply bare names of columns to dplyr functions and they just work. However, the quoting behavior causes some wrinkles when you want to program with dplyr.

You might now have a hypothesis about why grouped_mean() didn’t work. Here it is again.

grouped_mean <- function(df, group_var, summary_var) {
  df %>% 
    group_by(group_var) %>% 
    summarize(mean = mean(summary_var))
}

grouped_mean(df = mpg, group_var = manufacturer, summary_var = cty)
#> Error: Column `group_var` is unknown

The error says that the column group_var is unknown. Because group_by() quotes its argument, it didn’t evaluate group_var, find out that it refers to manufacturer, and then look for a column named manufacturer. Instead, it took its input literally and looked for a column named group_var. mpg doesn’t have a column called group_var, so we got an error.

We want group_by() to understand that group_var refers to manufacturer, so we need to change our function. Before diving into these changes, here’s a summary of the points covered so far:

Some functions evaluate their arguments and some functions quote their arguments.
If a function quotes its argument, the argument can evaluate differently inside and outside the function.
dplyr functions quote the arguments that take tibble column names.

14.1.2 Strings and `glue()`

If we want group_by() to understand that group_var refers to manufacturer, we’re going to have to unquote group_var.

Before we talk about how to do this with dplyr, let’s take a moment to examine a situation in which you’ve actually already been quoting and unquoting input.

When you create a string, R doesn’t evaluate its contents. When you type something like:

"y <- 1"
#> [1] "y <- 1"

R creates a string, not an object named y with a value of 1.

Sometimes, though, you’ll want to write a function that inserts a variable into a string. For example, say we want to write a function that tells you what species of animal you are.

You can probably predict that the following function won’t work.

species <- function(my_species) {
  "I am a my_species"
}

species("human")
#> [1] "I am a my_species"

We need to tell our function that we actually do want to evaluate my_species. As you’ve already learned, you can do this with str_glue() and {}.

species <- function(my_species) {
  str_glue("I am a {my_species}")
}

species("human")
#> I am a human

The {} tells str_glue() to evaluate my_species before constructing the string. Unfortunately, we can’t just use {} to get dplyr verbs to evaluate the arguments we want. We’ll need to figure out an equivalent to {} for dplyr function calls.

14.1.3 Quosures

One reason we can’t just generalize from our string example and use {} is that when dplyr functions quote their arguments, they don’t quote and create strings. When you quote with quotation marks, as you know, you create a string.

"y <- 1"
#> [1] "y <- 1"

But when dplyr functions quote their arguments, they create something called a quosure.

You can create your own quosure with the function quo().

quo(y <- 1)
#> <quosure>
#> expr: ^y <- 1
#> env:  global

Our quosure has two parts: the expr (which stands for expression) and the env (which stands for environment).

You can think of expressions like recipes. A recipe for chocolate chip cookies specifies how to make the cookies, but does not itself create any cookies. Similarly, the expression y <- 1 specifies how to create a variable, but doesn’t actually create that variable. Just as you need to carry out the recipe to create cookies, R needs to evaluate the expression to produce the results.

Recipes, unfortunately, aren’t sufficient for cookies. You also need a stock of ingredients, like flour and chocolate chips. Similarly, in order to evaluate an expression, R needs an environment that supplies variables. Different types of flour and chocolate chips can create different cookies, and different environments can cause the same expression to be evaluated differently.
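
To make the recipe analogy concrete, here is a small base R sketch. The names recipe, pantry_small, and pantry_large are made up for this illustration. The same expression produces different results depending on which environment supplies a and b.

```r
# One expression (the recipe), created with base R's quote()
recipe <- quote(a + b)

# Two environments (the ingredients) that define a and b differently
pantry_small <- list2env(list(a = 1, b = 2))
pantry_large <- list2env(list(a = 10, b = 20))

eval(recipe, envir = pantry_small)
#> [1] 3

eval(recipe, envir = pantry_large)
#> [1] 30
```

A quosure saves you from having to keep track of the right environment yourself, because it stores the expression and its environment together.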

If we place quo(y <- 1) inside a function, the environment will change.

quo_fun <- function() {
  quo(y <- 1)
}

quo_fun()
#> <quosure>
#> expr: ^y <- 1
#> env:  0x7f8f88edac08

Quosures are objects and so can be passed around, carrying their environment with them.

more_quo_fun <- function(my_quosure) {
  my_quosure
}

more_quo_fun(quo(y <- 1))
#> <quosure>
#> expr: ^y <- 1
#> env:  global

Now you know that:

dplyr quotes its arguments and creates quosures, which consist of an expression and an environment. An expression is kind of like a recipe, and the environment is what supplies the ingredients you use to carry out that recipe.
You can create a quosure with quo().
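
To see a quosure’s environment at work, you can evaluate a quosure with rlang’s eval_tidy(). The sketch below assumes the rlang package is available (it is installed with the tidyverse); make_quosure() is a hypothetical helper written for this example.

```r
library(rlang)

y <- 1

make_quosure <- function() {
  y <- 10      # a different y, local to this function
  quo(y * 2)   # the quosure remembers the function's environment
}

my_quosure <- make_quosure()

# Evaluation uses the y from inside make_quosure(), not the y defined above
eval_tidy(my_quosure)
#> [1] 20
```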

14.2 Wrapping quoting functions

14.2.1 `enquo()` and `!!`

Now that we’ve gone over some theory, we can return to the task of fixing grouped_mean().

You just learned that dplyr verbs create quosures. In order to make our function work, we’ll need to make our own quosure that captures our desired meaning of group_var and summary_var. Here’s our earlier, unsuccessful function.

grouped_mean <- function(df, group_var, summarize_var) {
  df %>% 
    group_by(group_var) %>% 
    summarize(mean = mean(summarize_var))
}

grouped_mean(df = mpg, group_var = manufacturer, summarize_var = cty)
#> Error: Column `group_var` is unknown

We want to track the environment of group_var so that group_by() knows that it refers to manufacturer. We can do this with quo().

grouped_mean <- function(df, group_var, summarize_var) {
  print(group_var)
  
  df %>% 
    group_by(group_var) %>% 
    summarize(mean = mean(summarize_var))
}

grouped_mean(
  df = mpg, 
  group_var = quo(manufacturer), 
  summarize_var = quo(cty)
)
#> <quosure>
#> expr: ^manufacturer
#> env:  global
#> Error: Column `group_var` is unknown

Our print() statement lets us know what group_var looks like inside the function. group_var is now a quosure whose expression is manufacturer, which seems like a step in the right direction.

However, our function still isn’t giving us what we want. group_by() still quotes group_var and looks for a column called group_var. We’ve already quoted the input with quo(), but group_by() doesn’t know that and so is just carrying on as usual.

We can tell group_by() not to quote by using !! (pronounced “bang bang”). !! says something like “evaluate me!” or “unquote!”

grouped_mean <- function(df, group_var, summarize_var) {
  print(group_var)
  
  df %>% 
    group_by(!! group_var) %>% 
    summarize(mean = mean(!! summarize_var))
}

grouped_mean(
  df = mpg, 
  group_var = quo(manufacturer), 
  summarize_var = quo(cty)
)
#> <quosure>
#> expr: ^manufacturer
#> env:  global
#> # A tibble: 15 x 2
#>   manufacturer  mean
#>   <chr>        <dbl>
#> 1 audi          17.6
#> 2 chevrolet     15  
#> 3 dodge         13.1
#> 4 ford          14  
#> 5 honda         24.4
#> 6 hyundai       18.6
#> # … with 9 more rows

Success!!
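
If you want to see what !! changes, one option is to build the call with rlang’s expr() instead of running it. This is only a sketch for inspection, assuming rlang is available. expr() quotes its input the way dplyr verbs do, so it shows us what group_by() would receive.

```r
library(rlang)

group_var <- quo(manufacturer)

# Without !!, the quoting function sees the literal name group_var
without_bang <- expr(group_by(group_var))
without_bang
#> group_by(group_var)

# With !!, the quosure stored in group_var is inserted into the call
with_bang <- expr(group_by(!! group_var))
quo_get_expr(with_bang[[2]])
#> manufacturer
```

The second call contains the quosure for manufacturer rather than the name group_var, which is why group_by() can now find the right column.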

quo() and !! work well, but it’s kind of a hassle to have to quo() our input each time. It would be even better if we could write the function call like this:

grouped_mean(df = mpg, group_var = manufacturer, summarize_var = cty)

To do so, we’ll take care of the quoting inside our function. Unfortunately, we can’t simply use quo() to do that quoting, as the following attempt shows.

grouped_mean <- function(df, group_var, summary_var) {
  group_var <- quo(group_var)
  summary_var <- quo(summary_var)
  print(group_var)
  
  df %>% 
    group_by(!! group_var) %>% 
    summarize(mean = mean(!! summary_var))
}

grouped_mean(df = mpg, group_var = manufacturer, summary_var = cty)
#> <quosure>
#> expr: ^group_var
#> env:  0x7f8f88e064c0
#> Error: Column `group_var` is unknown

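Our quosure is wrong. quo() quoted the literal name group_var, in the environment of our function, instead of the manufacturer that the caller supplied. We want the quosure to hold the expression the user typed, along with the environment it was typed in (here, the global environment). We’ll need quo()’s cousin, enquo(), in order to capture the argument as the caller wrote it.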

enquo_fun <- function(group_var) {
  print(enquo(group_var))
}

enquo_fun(group_var = manufacturer)
#> <quosure>
#> expr: ^manufacturer
#> env:  global

Now, we can rewrite grouped_mean().

grouped_mean <- function(df, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  df %>% 
    group_by(!! group_var) %>% 
    summarize(mean = mean(!! summary_var))
}

grouped_mean(df = mpg, group_var = manufacturer, summary_var = cty)
#> # A tibble: 15 x 2
#>   manufacturer  mean
#>   <chr>        <dbl>
#> 1 audi          17.6
#> 2 chevrolet     15  
#> 3 dodge         13.1
#> 4 ford          14  
#> 5 honda         24.4
#> 6 hyundai       18.6
#> # … with 9 more rows

You can use this same technique for any of the dplyr verbs.

filter_var <- function(var, value) {
  var <- enquo(var)
  
  mpg %>% 
    filter(!! var == value)
}

filter_var(class, "minivan")
#> # A tibble: 11 x 11
#>   manufacturer model  displ  year   cyl trans drv     cty   hwy fl    class
#>   <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 dodge        carav…   2.4  1999     4 auto… f        18    24 r     mini…
#> 2 dodge        carav…   3    1999     6 auto… f        17    24 r     mini…
#> 3 dodge        carav…   3.3  1999     6 auto… f        16    22 r     mini…
#> 4 dodge        carav…   3.3  1999     6 auto… f        16    22 r     mini…
#> 5 dodge        carav…   3.3  2008     6 auto… f        17    24 r     mini…
#> 6 dodge        carav…   3.3  2008     6 auto… f        17    24 r     mini…
#> # … with 5 more rows

The enquo() and !! strategy is incredibly useful, and you don’t need to fully understand the theory behind it in order to write successful functions. If some of the earlier explanation is still confusing, don’t worry about it too much. Tidy evaluation is a complicated subject, and it takes a while to really grasp what’s going on behind enquo() and !!.

In summary, to build a function that passes a column name to a dplyr verb, use the following template:

my_tidyeval_function <- function(df, column_name) {
  # Quote the column name the caller supplied
  column_name <- enquo(column_name)
  
  df %>% 
    # Unquote it inside the dplyr verb
    dplyr_verb(!! column_name)
}
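
As one possible instance of this template, here is a sketch that sorts a tibble in descending order of a column. The function arrange_desc() and the tibble toy are made up for this example; only enquo(), !!, arrange(), and desc() come from dplyr.

```r
library(dplyr)

# Hypothetical helper following the enquo() and !! template
arrange_desc <- function(df, column_name) {
  column_name <- enquo(column_name)

  df %>%
    arrange(desc(!! column_name))
}

# A small made-up tibble to try it on
toy <- tibble(id = c("a", "b", "c"), score = c(2, 9, 5))

arrange_desc(toy, score)
```

The rows come back ordered b, c, a, because score is looked up as a column of toy rather than as a variable in the global environment.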

14.2.2 Passing `...`

Say you want to extend grouped_mean() so that you can group by any number of variables. You might have noticed that some functions, like scoped verbs and the purrr functions, take ... as a final argument, allowing you to specify additional arguments. We can use that same functionality here.

grouped_mean_2 <- function(df, summary_var, ...) {
  summary_var <- enquo(summary_var)
  
  df %>% 
    group_by(...) %>% 
    summarize(mean = mean(!! summary_var))
}

grouped_mean_2(df = mpg, summary_var = cty, manufacturer, model)
#> # A tibble: 38 x 3
#> # Groups:   manufacturer [15]
#>   manufacturer model               mean
#>   <chr>        <chr>              <dbl>
#> 1 audi         a4                  18.9
#> 2 audi         a4 quattro          17.1
#> 3 audi         a6 quattro          16  
#> 4 chevrolet    c1500 suburban 2wd  12.8
#> 5 chevrolet    corvette            15.4
#> 6 chevrolet    k1500 tahoe 4wd     12.5
#> # … with 32 more rows

Notice that with ..., we didn’t have to use enquo() or !!. ... takes care of all the quoting and unquoting for you.

You can also use ... to pass in full expressions to dplyr verbs.

filter_fun <- function(df, ...) {
  # Forward every condition in ... to filter()
  df %>% 
    filter(...) 
}

filter_fun(mpg, manufacturer == "audi", model == "a4")
#> # A tibble: 7 x 11
#>   manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto(… f        18    29 p     comp…
#> 2 audi         a4      1.8  1999     4 manua… f        21    29 p     comp…
#> 3 audi         a4      2    2008     4 manua… f        20    31 p     comp…
#> 4 audi         a4      2    2008     4 auto(… f        21    30 p     comp…
#> 5 audi         a4      2.8  1999     6 auto(… f        16    26 p     comp…
#> 6 audi         a4      2.8  1999     6 manua… f        18    26 p     comp…
#> # … with 1 more row

14.2.3 Assigning names

Let’s return to our grouped_mean() function. We finally got it working in the last section. Here it is again:

grouped_mean <- function(df, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)
  
  df %>% 
    group_by(!! group_var) %>% 
    summarize(mean = mean(!! summary_var))
}

It would be nice if we could name the mean column something more informative.

Maybe we can just apply our enquo() and !! strategy?

grouped_mean <- function(df, group_var, summary_var, summary_name) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)
  summary_name <- enquo(summary_name)
  
  df %>% 
    group_by(!! group_var) %>% 
    summarize(!! summary_name = mean(!! summary_var))
}

grouped_mean(
  df = mpg, 
  group_var = manufacturer, 
  summary_var = hwy, 
  summary_name = mean_hwy
)

This doesn’t work. R’s syntax doesn’t allow an expression like !! summary_name on the left-hand side of = in a function call, so the code fails to parse. We have to use a special = that looks like :=.

grouped_mean <- function(df, group_var, summary_var, summary_name) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)
  summary_name <- enquo(summary_name)
  
  df %>% 
    group_by(!! group_var) %>% 
    summarize(!! summary_name := mean(!! summary_var))
}

grouped_mean(
  df = mpg, 
  group_var = manufacturer, 
  summary_var = hwy, 
  summary_name = mean_hwy
)
#> # A tibble: 15 x 2
#>   manufacturer mean_hwy
#>   <chr>           <dbl>
#> 1 audi             26.4
#> 2 chevrolet        21.9
#> 3 dodge            17.9
#> 4 ford             19.4
#> 5 honda            32.6
#> 6 hyundai          26.9
#> # … with 9 more rows

Success!!
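
The := operator also works when the new column name arrives as a string instead of a bare name. The sketch below is a variation under that assumption. grouped_mean_chr() and the tibble toy are hypothetical names for this example, and summary_name is expected to be a string, so it needs no enquo().

```r
library(dplyr)

# Variation: summary_name is a string, so we unquote it directly
grouped_mean_chr <- function(df, group_var, summary_var, summary_name) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  df %>%
    group_by(!! group_var) %>%
    summarize(!! summary_name := mean(!! summary_var))
}

# A small made-up tibble to try it on
toy <- tibble(team = c("a", "a", "b"), points = c(1, 3, 10))

grouped_mean_chr(toy, team, points, summary_name = "mean_points")
```

This version can be handy when the column name is built programmatically, for example with str_glue().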

14.3 Passing vectors with `!!!`

Here’s one more common tidy evaluation use case.

Say you want to use recode() to recode a variable.

mpg %>% 
  mutate(drv = recode(drv, "f" = "front", "r" = "rear", "4" = "four")) %>% 
  select(drv)
#> # A tibble: 234 x 1
#>   drv  
#>   <chr>
#> 1 front
#> 2 front
#> 3 front
#> 4 front
#> 5 front
#> 6 front
#> # … with 228 more rows

It’s often a good idea to store your recode mapping in a parameter because you might want to change the mapping later on or use it in other locations.

We can store the mapping in a named character vector.

drv_recode <- c("f" = "front", "r" = "rear", "4" = "four")

However, now recode() doesn’t work.

mpg %>% 
  mutate(drv = recode(drv, drv_recode)) %>% 
  select(drv)
#> Argument 2 must be named, not unnamed

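recode(), like group_by(), summarize(), and the other dplyr functions, supports the tidy evaluation operators in its ... argument. It expects each replacement as a separate named argument, so passing the whole vector as a single unnamed argument fails. We need to insert the contents of drv_recode into the call. Let’s try !!.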

mpg %>% 
  mutate(drv = recode(drv, !! drv_recode)) %>% 
  select(drv)
#> Argument 2 must be named, not unnamed

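!! doesn’t work because drv_recode is a vector with several elements. !! inserts the whole vector as a single unnamed argument, but recode() needs each element as its own named argument. Not only do we need to evaluate drv_recode immediately, we also need to unpack its contents. To do so, we’ll use !!!.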

mpg %>% 
  mutate(drv = recode(drv, !!! drv_recode)) %>% 
  select(drv)
#> # A tibble: 234 x 1
#>   drv  
#>   <chr>
#> 1 front
#> 2 front
#> 3 front
#> 4 front
#> 5 front
#> 6 front
#> # … with 228 more rows

Success!!!
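
To see the unpacking that !!! performs, you can try it with rlang’s list2(), which collects its arguments into a list and understands !!!. This is a sketch for inspection only, assuming rlang is available; as_one and spliced are names made up for this example.

```r
library(rlang)

drv_recode <- c("f" = "front", "r" = "rear", "4" = "four")

# Without !!!, the vector arrives as a single argument
as_one <- list2(drv_recode)
length(as_one)
#> [1] 1

# With !!!, each element becomes its own named argument
spliced <- list2(!!! drv_recode)
length(spliced)
#> [1] 3
names(spliced)
#> [1] "f" "r" "4"
```

In other words, recode(drv, !!! drv_recode) behaves as if we had typed the three named arguments by hand.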