3 List columns
Recall that tibbles are lists of vectors.
Usually, these vectors are atomic vectors, so the elements in the columns are single values, like “a” or 1.
Tibbles can also have columns that are lists. These columns are (appropriately) called list columns.
List columns are more flexible than normal, atomic vector columns. Lists can contain anything, so a list column can be made up of atomic vectors, other lists, tibbles, etc.
As you’ll see, this can be a useful way to store data. In this chapter, you’ll learn how to create list columns, how to turn list columns back into normal columns, and how to manipulate list columns.
Typically, you’ll create list columns by manipulating an existing tibble. There are three primary ways to create list columns: nest(), summarize() with list(), and mutate() with map().
countries is a simplified version of dcldata::gm_countries, which contains Gapminder data on 197 countries.
countries <-
  gm_countries %>%
  select(name, region_gm4, un_status, un_admission, income_wb_2017)

countries
#> # A tibble: 197 x 5
#>   name                region_gm4 un_status un_admission income_wb_2017
#>   <chr>               <chr>      <chr>     <date>       <chr>
#> 1 Afghanistan         asia       member    1946-11-19   Low income
#> 2 Albania             europe     member    1955-12-14   Upper middle income
#> 3 Algeria             africa     member    1962-10-08   Upper middle income
#> 4 Andorra             europe     member    1993-07-28   High income
#> 5 Angola              africa     member    1976-12-01   Lower middle income
#> 6 Antigua and Barbuda americas   member    1981-11-11   High income
#> # … with 191 more rows
The tidyr function nest() creates list columns of tibbles. Pass nest() the names of the columns to put into each individual tibble.
nest() will create one row for each unique value of the remaining variables. For example, say we select just two columns from countries and then nest name.
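The nesting step can be sketched on a small stand-in tibble. This is a minimal illustration, not the chapter's original code: toy_countries and toy_regions are hypothetical names, and the six rows are the ones printed for countries above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `countries`: the six rows printed above.
toy_countries <- tibble(
  name = c(
    "Afghanistan", "Albania", "Algeria",
    "Andorra", "Angola", "Antigua and Barbuda"
  ),
  region_gm4 = c("asia", "europe", "africa", "europe", "africa", "americas")
)

# Keep two columns, then nest `name` into a list column called `countries`.
# The remaining variable, `region_gm4`, determines the rows of the result.
toy_regions <-
  toy_countries %>%
  select(region_gm4, name) %>%
  nest(countries = name)

toy_regions
```

The result has one row per region, and each element of the countries column is a one-column tibble of names.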
nest() created one tibble for each region_gm4. Each of these tibbles contains all the countries that belong to the continent.
The entire column is a list.
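One way to check this is to inspect the column's type and pull out a single element with [[ ]]. The sketch below uses a hypothetical toy tibble (toy_regions) rather than the Gapminder data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: four countries in two regions.
toy_regions <-
  tibble(
    name = c("Albania", "Algeria", "Andorra", "Angola"),
    region_gm4 = c("europe", "africa", "europe", "africa")
  ) %>%
  nest(countries = name)

# The column itself is a list ...
column_type <- typeof(toy_regions$countries)
column_type

# ... and each element of that list is a tibble.
first_element <- toy_regions$countries[[1]]
first_element
```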
If we nest multiple variables, the individual tibbles will have multiple columns.
regions_data$data[[1]]
#> # A tibble: 59 x 4
#>   name        un_status un_admission income_wb_2017
#>   <chr>       <chr>     <date>       <chr>
#> 1 Afghanistan member    1946-11-19   Low income
#> 2 Australia   member    1945-11-01   High income
#> 3 Bahrain     member    1971-09-21   High income
#> 4 Bangladesh  member    1974-09-17   Lower middle income
#> 5 Bhutan      member    1971-09-21   Lower middle income
#> 6 Brunei      member    1984-09-21   High income
#> # … with 53 more rows
You can specify columns to nest using the same syntax as select().
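As a small sketch of that selection syntax, the following nests columns chosen with a tidyselect helper. The tibble toy and the choice of starts_with("un_") are illustrative assumptions. The rows come from the countries output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical three-row stand-in for `countries`.
toy <- tibble(
  name = c("Afghanistan", "Albania", "Algeria"),
  region_gm4 = c("asia", "europe", "africa"),
  un_status = "member",
  un_admission = as.Date(c("1946-11-19", "1955-12-14", "1962-10-08"))
)

# Select the columns to nest with a bare name plus a selection helper.
nested <- toy %>% nest(data = c(name, starts_with("un_")))

names(nested$data[[1]])
```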
You can also create multiple list columns at once.
countries %>%
  nest(countries = name, data = c(name, contains("un"), income_wb_2017))
#> # A tibble: 4 x 3
#>   region_gm4 countries         data
#>   <chr>      <list>            <list>
#> 1 asia       <tibble [59 × 1]> <tibble [59 × 4]>
#> 2 europe     <tibble [49 × 1]> <tibble [49 × 4]>
#> 3 africa     <tibble [54 × 1]> <tibble [54 × 4]>
#> 4 americas   <tibble [35 × 1]> <tibble [35 × 4]>
summarize() collapses groups into single rows. We can also use summarize() to create a list column, where each element is a vector, list, or tibble.
If you supply list() with multiple atomic vectors, it will create a list of atomic vectors.
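A quick base R illustration, with made-up values:

```r
# Three atomic vectors of different types and lengths, collected into one list.
vectors <- list(c("a", "b"), 1:3, TRUE)

length(vectors)        # number of elements in the list
typeof(vectors[[2]])   # each element keeps its own type
```

Unlike c(), list() does not coerce the vectors to a common type or flatten them.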
We can use list() to create a list of atomic vectors where each vector corresponds to one region_gm4. For example, the following creates a list column of countries.
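A minimal sketch of this pattern on toy data follows. The names toy_countries and by_region are hypothetical, and the rows are the six printed for countries above.

```r
library(dplyr)

# Hypothetical stand-in for `countries`.
toy_countries <- tibble(
  name = c(
    "Afghanistan", "Albania", "Algeria",
    "Andorra", "Angola", "Antigua and Barbuda"
  ),
  region_gm4 = c("asia", "europe", "africa", "europe", "africa", "americas")
)

# Wrapping `name` in list() gives summarize() a single value per group:
# a one-element list that holds the group's character vector.
by_region <-
  toy_countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(name))

by_region
```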
The countries column is similar to the one created earlier with nest(), except each element is an atomic vector, not a tibble.
What if we want to manipulate each vector before creating the list column? For example, say we want to arrange all country names alphabetically. The following doesn’t work:
You need to collect all the atomic vectors into a list.
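For instance, wrapping the sorted vector in list() might look like the sketch below. It uses hypothetical toy data, deliberately out of alphabetical order.

```r
library(dplyr)

# Hypothetical toy data, not in alphabetical order.
toy_countries <- tibble(
  name = c("Angola", "Algeria", "Andorra", "Albania"),
  region_gm4 = c("africa", "africa", "europe", "europe")
)

# Sort within each group first, then wrap the result in list() so that
# each group contributes exactly one list element.
sorted_by_region <-
  toy_countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(sort(name)))

sorted_by_region$countries
```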
Here’s another example, which stores only the countries that begin with “A”.
a_countries <-
  countries %>%
  group_by(region_gm4) %>%
  summarize(countries = list(str_subset(name, "^A")))

a_countries
#> # A tibble: 4 x 2
#>   region_gm4 countries
#> * <chr>      <list>
#> 1 africa     <chr >
#> 2 americas   <chr >
#> 3 asia       <chr >
#> 4 europe     <chr >

a_countries$countries[[1]]
#> [1] "Algeria" "Angola"
The third way to create a list column involves functions that take a vector as input and return a list. map() is one such function.
In the following code, map() iterates over name, generating a vector of random numbers for each country.
countries %>%
  select(name) %>%
  mutate(random = map(name, ~ rnorm(n = str_length(.))))
#> # A tibble: 197 x 2
#>   name                random
#>   <chr>               <list>
#> 1 Afghanistan         <dbl [11]>
#> 2 Albania             <dbl [7]>
#> 3 Algeria             <dbl [7]>
#> 4 Andorra             <dbl [7]>
#> 5 Angola              <dbl [6]>
#> 6 Antigua and Barbuda <dbl [19]>
#> # … with 191 more rows
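To confirm that each element of such a column really is a vector whose length depends on the country name, one option is to compare lengths() against str_length(). This is a small sketch on two hypothetical rows.

```r
library(dplyr)
library(purrr)
library(stringr)

set.seed(1)  # the random values themselves do not matter here

# Hypothetical two-row tibble.
toy <-
  tibble(name = c("Angola", "Albania")) %>%
  mutate(random = map(name, ~ rnorm(n = str_length(.))))

# Each list element has as many numbers as the name has characters.
lengths(toy$random)
```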
To transform a list column into normal columns, use unnest(). Here’s our tibble with a list column of country names.
Supply the cols argument with the names of the columns to unnest.
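A minimal round trip, assuming toy data and the hypothetical names by_region and unnested, could look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical tibble with a list column of character vectors.
by_region <-
  tibble(
    name = c(
      "Afghanistan", "Albania", "Algeria",
      "Andorra", "Angola", "Antigua and Barbuda"
    ),
    region_gm4 = c("asia", "europe", "africa", "europe", "africa", "americas")
  ) %>%
  group_by(region_gm4) %>%
  summarize(countries = list(name))

# unnest() expands each list element back into ordinary rows.
unnested <- by_region %>% unnest(cols = countries)

unnested
```

After unnesting, countries is an ordinary character column again, with one row per country.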
To manipulate list columns, you’ll need to use purrr. For example, say we want to find the number of countries in each continent. We’ll work with regions, whose countries column is a list column of tibbles.
We can’t call length() directly on countries, because we’ll just get the length of the entire column.
Instead, we need to iterate over countries, taking the length of each element separately.
Note that we need to use nrow() because each element of countries is actually a tibble.
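The difference between the two approaches can be sketched as follows. The toy tibble and the column name n are illustrative assumptions, not the chapter's original code.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical nested tibble: one row per region, `countries` holds tibbles.
toy_regions <-
  tibble(
    name = c(
      "Afghanistan", "Albania", "Algeria",
      "Andorra", "Angola", "Antigua and Barbuda"
    ),
    region_gm4 = c("asia", "europe", "africa", "europe", "africa", "americas")
  ) %>%
  nest(countries = name)

# length() on the whole column counts list elements, i.e. rows of toy_regions.
column_length <- length(toy_regions$countries)

# map_int() applies nrow() to each tibble and returns an integer vector.
with_counts <- toy_regions %>% mutate(n = map_int(countries, nrow))

with_counts
```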
When manipulating list columns, it’s often helpful to use the . strategy discussed in Basic map functions. Assign the first element of the column to a variable named . (a single dot).
Then, test out your manipulation on that variable. For example, the following finds the proportion of country names that end in “a.”
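The workflow might look like the sketch below, run on hypothetical toy data. It uses mean() of a logical vector, which is equivalent to dividing the sum by n(), and extracts the result with $ rather than a pipe.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

# Hypothetical nested tibble standing in for `regions`.
toy_regions <-
  tibble(
    name = c(
      "Afghanistan", "Albania", "Algeria",
      "Andorra", "Angola", "Antigua and Barbuda"
    ),
    region_gm4 = c("asia", "europe", "africa", "europe", "africa", "americas")
  ) %>%
  nest(countries = name)

# Step 1: assign the first element of the list column to `.`.
. <- toy_regions$countries[[1]]

# Step 2: develop the manipulation on `.` alone.
first_prop <- summarize(., prop = mean(str_detect(name, "a$")))$prop

# Step 3: paste the same expression after `~` in a map function.
props <- map_dbl(
  toy_regions$countries,
  ~ summarize(., prop = mean(str_detect(name, "a$")))$prop
)

props
```

Because the expression was written in terms of ., it can be copied into the anonymous function unchanged.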
Then, supply your code to a map function.
regions %>%
  mutate(
    ends_in_a =
      map_dbl(
        countries,
        ~ summarize(., prop = sum(str_detect(name, "a$")) / n()) %>% pull(prop)
      )
  ) %>%
  arrange(desc(ends_in_a))
#> # A tibble: 4 x 3
#>   region_gm4 countries         ends_in_a
#>   <chr>      <list>                <dbl>
#> 1 americas   <tibble [35 × 1]>     0.457
#> 2 africa     <tibble [54 × 1]>     0.407
#> 3 europe     <tibble [49 × 1]>     0.347
#> 4 asia       <tibble [59 × 1]>     0.271
. does not like to be piped into functions, so we couldn’t pipe