5 Working with Data
In this course, we will be working with the tidyverse
packages for
manipulating and reshaping data. Of course, there are other ways of manipulating
data, but in our view, the tidyverse
method provides
the most consistent way of working with data, and reduces the cognitive load on
the user.
The packages used for data manipulations are tidyr
for reshaping data frames, and
dplyr
for manipulating data frames. The functions we are going to look at are:
Package | Function name | |
---|---|---|
tidyr |
pivot_longer() |
increases the number of rows, decreases the number of columns |
tidyr |
pivot_wider() |
decreases the number of rows, increases the number of columns |
tidyr |
drop_na() |
omits all rows in a data frame containing NA values |
dplyr |
filter() |
selects observations |
dplyr |
select() |
selects variables |
dplyr |
summarize() or summarise() |
summarizes data |
dplyr |
mutate() |
creates new variables and changes already existing variables |
dplyr |
group_by() |
lets you perform operations on subsets of the data |
The functions filter()
, select()
, summarize()
and mutate()
have
so-called scoped versions, which let you apply the same transformation to
multiple variables. There are three variants of each of these functions:
*_all
applies to every variable (e.g.mutate_all()
)*_at
applies to variables selected with a character vector or the helper functionvars()
(e.g.mutate_at()
)*_if
applies to variables selected with a predicate function (e.g.mutate_if()
)
These functions allow you to perform very complex data manipulation, using
clean and elegant code. Besides the functions mentioned above, dplyr
contains
many more. These include functions that perform set operations, e.g. finding the
intersection of two data frames, and allow you to join data frames together.
In addition tidyr
and dplyr
, we will also use the forcats
package for manipulating factors,
and the stringr
package for working with strings.
5.1 The pipe operator: %>%
Before we begin manipulating data, we need to introduce a new operator. We have
already seen that code can become hard to read when we have a sequence of
operations, i.e. nested function calls. Suppose we have a numeric vector (n =
10), which we want to center (we will generate the data using rnorm()
). Then we
want to compute and round the standard deviation to two decimal places.
set.seed(1283)
<- rnorm(n = 10, mean = 24, sd = 5)
data data
## [1] 24.74984 21.91726 23.98551 19.63019 23.96428 22.83092 18.86240 19.08125
## [9] 23.76589 21.88846
We can use nested function calls:
round(sd(scale(data,
center = TRUE,
scale = FALSE)),
digits = 2)
## [1] 2.19
The functions scale()
, sd()
and round()
are called in order (inner to
outer). The output of one function is passed as an argument to the next
function.
The functions scale()
and round()
have additional arguments: center = TRUE, scale = FALSE
for scale
, and digits = 2
for round
.
This way of writing code can be difficult to read, although using indentation (with line breaks as in the code above) really helps here.
An alternative approach is to save each intermediate step:
<- scale(data, center = TRUE,
data_z scale = FALSE)
<- sd(data_z)
sd_data_z <- round(sd_data_z,
sd_data_z_rounded digits = 2)
sd_data_z_rounded
## [1] 2.19
Doing this means that we create many variables that we don’t really need, cluttering up our workspace.
Luckily, there is an elegant solution to this problem. Using the pipe
operator, we can pass the output of one function as an argument to the next
function without using nested function calls. Note that this is really just a
type of ‘syntactic sugar’; the result is exactly the same as using traditional
function calls.
The pipe
operator is available in the dplyr
package.
library(dplyr)
%>%
It is defined as in infix operator, meaning that it is used in between two objects.
The pipe operator has its own key combo in RStudio: Cmd
+Shift
+M
(MacOS) or
Ctrl
+Shift
+M
(Windows and Linux).
Using the %>%
operator, our example from above:
round(sd(scale(data,
center = TRUE,
scale = FALSE)),
digits = 2)
## [1] 2.19
becomes:
library(dplyr)
%>%
data scale(center = TRUE, scale = FALSE) %>%
sd() %>%
round(digits = 2)
## [1] 2.19
This should be read as:
- We start with the object
data
and pass it to the functionscale()
as an argument - We use
scale()
, with argumentscenter = TRUE, scale = FALSE
and then pass the output to the functionsd()
- We apply the function
sd()
and pass the output to the functionround()
round()
is called, with the argumentdigits = 2
. The output is printed to the console.
Per default, %>%
passes an object to a function as its first argument. Many functions have the data argument as its first argument. In our example, the first argument of scale(x, center = TRUE, scale = TRUE)
is the data argument and it is data
that we pipe to scale()
. The same holds for the functions sd()
and round()
: we pipe the results of the previous function as the (first and) data argument to the following function.
However, it doesn’t have to be the first argument the we “pipe to”. In this case we need to
use the placeholder .
:
%>% f(y, .) x
This is equivalent to:
f(y, x)
Now x
is piped to the second argument of the function f()
.
5.2 Reshaping: tidyr
In statistics and data science, you will often hear the term wide and long when referring to data. This refers to the shape of a dataset; a wide dataset has more columns than a corresponding long dataset. A long dataset has more rows than a corresponding wide dataset. However, the terms long and wide aren’t just about shape. In a long data set, each observation has its own row and each variable is a column. In a wide dataset, variables can be spread out over several columns.
Consider this example: we have two observations for two subjects, at three
separate time points. Let’s call these t1
, t2
and t3
. The following
representation is considered to be in the wide format (the function tribble()
lets you easily create a tibble by rows):
<- tribble(
df ~name, ~t1, ~t2, ~t3,
"Marshall", 4, 5, 7,
"Skye", 3, 6, 7
%>% mutate_if(is.character, as_factor) )
Each row represents one subject, and the observations are spread over the three
columns t1
, t2
and t3
. These columns actually represent three levels of a common
factor, time
. If you were going to perform a repeated-measures ANOVA in SPSS, this is
the kind of dataset that you would use.
We can represent exactly the same data in a longer format, in which each row
corresponds to an observation, and each column represents a variable. In total,
there are three variables: name
, time
and score
.
## # A tibble: 6 × 3
## name time score
## <fct> <chr> <dbl>
## 1 Marshall t1 4
## 2 Marshall t2 5
## 3 Marshall t3 7
## 4 Skye t1 3
## 5 Skye t2 6
## 6 Skye t3 7
We no longer have one row for each subject. In fact, variables that do not change for each observation are repeated across rows.
A data set that is organized in this way is often referred to as tidy. As we will see later, many kinds of statistical analysis and in particular plotting functions require a long data set, and thus, a surprisingly large amount of time has to be spent on organizing data for further analysis (this type of work is often called ‘data wrangling’).
In order to reshape a data set from wide to long, we will need the function pivot_longer()
. The function for the opposite operation (long to wide) is pivot_wider()
, both of which are provided by the tidyr
package.
5.2.1 pivot_longer()
pivot_longer()
is used to combine columns, which may represent levels of a common factor, into
one column. The values of the original columns are combined in a new value column. Another
way of putting this is that pivot_longer()
increases the number of rows and decreases the
number of columns of a data frame.
pivot_longer()` is used like this:
pivot_longer(data, cols, names_to, values_to)
or, using %>%
:
%>%
data pivot_longer(cols, names_to, values_to)
The arguments are:
data: a data frame
cols: the columns you want to pivot
names_to: a string specifying the name of the new column created from
the selected column names
values_to: a string specifying the name of the column created from the
observations in the cells
We assume that t1
, t2
and t3
are not really separate variables, but should
be considered as levels of a time factor (e.g. as three time points in a
repeated measures data set). The values in the t1
, t2
and t3
columns refer
to the same kind of measurement (some score
) and therefore should be actually
one variable in a long data set. We would like our new factor to be called
time
and the measured variable should be called score
. The name
variable should be ignored, i.e. it should not be pivoted.
library(tidyr)
<- df %>%
df_long pivot_longer(-name, names_to = "time", values_to = "score")
df_long
## # A tibble: 6 × 3
## name time score
## <fct> <chr> <dbl>
## 1 Marshall t1 4
## 2 Marshall t2 5
## 3 Marshall t3 7
## 4 Skye t1 3
## 5 Skye t2 6
## 6 Skye t3 7
time
should be defined as a factor in the next step:
$time <- as_factor(df_long$time) df_long
df_long
## # A tibble: 6 × 3
## name time score
## <fct> <fct> <dbl>
## 1 Marshall t1 4
## 2 Marshall t2 5
## 3 Marshall t3 7
## 4 Skye t1 3
## 5 Skye t2 6
## 6 Skye t3 7
Our call to pivot_longer()
could also have been written more explicitly:
%>%
df pivot_longer(cols = -name,
names_to = "time",
values_to = "score")
In the cols
argument we used -name
, which tells the function that we want
to pivot all columns except the column called name
. We could also have selected the three time columns:
%>%
df pivot_longer(cols = c(t1, t2, t3),
names_to = "time",
values_to = "score")
## # A tibble: 6 × 3
## name time score
## <fct> <chr> <dbl>
## 1 Marshall t1 4
## 2 Marshall t2 5
## 3 Marshall t3 7
## 4 Skye t1 3
## 5 Skye t2 6
## 6 Skye t3 7
We will see more examples of how to select columns when we look at the dplyr
package.
5.2.2 pivot_wider()
pivot_wider()
is the complement of pivot_longer()
. This function takes a
factor and ‘spreads’ the values of a measured variable over the levels of
the factor, creating as many new variables as there are factor levels.
Obviously, pivot_wider()
is used to convert from long to wide.
The pivot_wider()
syntax looks like this:
pivot_wider(data, names_from, values_from)
# or
%>%
data pivot_wider(names_from, values_from)
Using our previous example, we want to create new variables for each level of the
factor time
, using the values in the score
variable.
df_long
## # A tibble: 6 × 3
## name time score
## <fct> <fct> <dbl>
## 1 Marshall t1 4
## 2 Marshall t2 5
## 3 Marshall t3 7
## 4 Skye t1 3
## 5 Skye t2 6
## 6 Skye t3 7
<- df_long %>%
df_wide pivot_wider(names_from = time, values_from = score)
df_wide
## # A tibble: 2 × 4
## name t1 t2 t3
## <fct> <dbl> <dbl> <dbl>
## 1 Marshall 4 5 7
## 2 Skye 3 6 7
df_wide
looks exactly the same as the original dataset df
. In fact, we can
test whether this is the case:
all_equal(df, df_wide)
## [1] TRUE
5.2.3 Excluding missing values: drop_na()
A very important function from the tidyr
package is drop_na()
. It is used to
omit all rows in a data frame containing missing values:
Let’s create another toy dataset containing NAs:
<- tibble(var1 = c(1, 2, NA), var2 = c("a", NA, "b"))
df_1 df_1
## # A tibble: 3 × 2
## var1 var2
## <dbl> <chr>
## 1 1 a
## 2 2 <NA>
## 3 NA b
<- df_1 %>%
df_2 drop_na()
df_2
## # A tibble: 1 × 2
## var1 var2
## <dbl> <chr>
## 1 1 a
This is very useful, but of course should be used carefully, otherwise you will end up omitting rows that contain NAs in columns that might not be relevant to the question at hand.
5.3 Data manipulation: dplyr
Transforming data sets from wide to long is only one step when ‘tidying’ data. We will often need to select and rename variables, select a subset of observations, recode values, create new variables and create summary variables, often grouped by certain categorical variables.
All these transformations can be performed by functions in the dplyr
package.
These can be seen as a kind of grammar of data manipulation, with each function
performing a specific action. These actions can be combined in many ways,
resulting in an extremely flexible yet highly consistent workflow.
We cannot cover all of the available functions in this course, but they are all described in the dplyr documentation.
We need to load the dplyr
package (or the tidyverse
package):
library(dplyr)
5.3.1 Selecting variables: select()
The function select()
is used to select variables from a data frame.
Syntax:
# df is a data frame
select(df, variable1, variable2, -variable3)
# or
%>%
df select(variable1, variable2, -variable3)
select()
selects the variables variable1
and variable2
from the data frame
df
. variable3
is omitted.
Examples
# only name
%>%
df_long select(name)
## # A tibble: 6 × 1
## name
## <fct>
## 1 Marshall
## 2 Marshall
## 3 Marshall
## 4 Skye
## 5 Skye
## 6 Skye
# time and score
%>%
df_long select(time, score)
## # A tibble: 6 × 2
## time score
## <fct> <dbl>
## 1 t1 4
## 2 t2 5
## 3 t3 7
## 4 t1 3
## 5 t2 6
## 6 t3 7
# or
%>%
df_long select(-name)
## # A tibble: 6 × 2
## time score
## <fct> <dbl>
## 1 t1 4
## 2 t2 5
## 3 t3 7
## 4 t1 3
## 5 t2 6
## 6 t3 7
Changing the order of variables
select()
can also be used to rearrange (reorder) variables in a data frame (without actually selecting anything):
%>%
df_long select(score, time, name)
## # A tibble: 6 × 3
## score time name
## <dbl> <fct> <fct>
## 1 4 t1 Marshall
## 2 5 t2 Marshall
## 3 7 t3 Marshall
## 4 3 t1 Skye
## 5 6 t2 Skye
## 6 7 t3 Skye
“Helper” functions for select()
There are a number of “helper functions” we can use for selecting variables:
# including
%>% select(starts_with("leben")) beispieldaten
## # A tibble: 286 × 16
## leben1 leben2 leben3 leben4 leben5 leben6 leben7 leben8 leben9 leben10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 4 6 2 6 7 7 7 7 7
## 2 5 4 5 4 5 5 5 5 5 4
## 3 7 7 2 1 5 6 5 7 6 6
## 4 1 6 6 5 6 6 7 7 7 7
## 5 5 6 6 5 6 6 7 6 5 6
## 6 4 5 6 5 6 5 6 6 5 5
## 7 6 5 4 2 5 7 5 6 7 4
## 8 7 5 6 6 5 7 NA 6 6 5
## 9 5 6 5 5 5 6 7 6 6 6
## 10 5 5 4 4 5 6 5 5 4 5
## # … with 276 more rows, and 6 more variables: leben11 <dbl>,
## # leben_selbst <dbl>, leben_familie <dbl>, leben_schule <dbl>,
## # leben_freunde <dbl>, leben_gesamt <dbl>
%>% select(ends_with("note")) beispieldaten
## # A tibble: 286 × 4
## Deutschnote Mathenote Fremdsprachenote Gesamtnote
## <dbl> <dbl> <dbl> <dbl>
## 1 4 3 5 NA
## 2 4 4 3 4
## 3 5 5 4 5
## 4 4 2 3 4
## 5 4 4 4 5
## 6 4 4 3 4
## 7 4 4 4 4
## 8 5 5 5 5
## 9 4 4 3 4
## 10 5 5 4 4
## # … with 276 more rows
%>% select(contains("mutter")) beispieldaten
## # A tibble: 286 × 2
## bildung_mutter bildung_mutter_binaer
## <fct> <fct>
## 1 Fachhochschulabschluss, Universitätsabschluss hoch
## 2 Realschulabschluss (mittlere Reife) niedrig
## 3 Hauptschulabschluss oder niedriger niedrig
## 4 Fachhochschulabschluss, Universitätsabschluss hoch
## 5 Realschulabschluss (mittlere Reife) niedrig
## 6 Realschulabschluss (mittlere Reife) niedrig
## 7 Realschulabschluss (mittlere Reife) niedrig
## 8 Fachhochschulabschluss, Universitätsabschluss hoch
## 9 Realschulabschluss (mittlere Reife) niedrig
## 10 Realschulabschluss (mittlere Reife) niedrig
## # … with 276 more rows
# OR using one_of()
<- c("westost", "geschlecht", "alter")
vars %>% select(one_of(vars)) beispieldaten
## # A tibble: 286 × 3
## westost geschlecht alter
## <fct> <fct> <dbl>
## 1 West weiblich 13
## 2 West männlich 14
## 3 West weiblich 14
## 4 West weiblich 14
## 5 West weiblich 14
## 6 West männlich 14
## 7 West männlich 14
## 8 West weiblich 15
## 9 West männlich 15
## 10 West männlich 14
## # … with 276 more rows
# excluding
%>% select(-starts_with(c("swk", "unt", "leben", "stress"))) beispieldaten
## # A tibble: 286 × 12
## ID westost geschlecht alter Deutschnote Mathenote Fremdsprachenote
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 4 3 5
## 2 2 West männlich 14 4 4 3
## 3 10 West weiblich 14 5 5 4
## 4 11 West weiblich 14 4 2 3
## 5 12 West weiblich 14 4 4 4
## 6 14 West männlich 14 4 4 3
## 7 15 West männlich 14 4 4 4
## 8 16 West weiblich 15 5 5 5
## 9 17 West männlich 15 4 4 3
## 10 18 West männlich 14 5 5 4
## # … with 276 more rows, and 5 more variables: Gesamtnote <dbl>,
## # bildung_vater <fct>, bildung_mutter <fct>, bildung_vater_binaer <fct>,
## # bildung_mutter_binaer <fct>
5.3.1.1 Scoped versions of select
We can also use the scoped versions of select
: select_if()
, select_at()
and select_all()
.
For instance, we can select only the factors from the beispieldaten
data frame with select_if()
combined with the function is.factor()
(in this case no parentheses are included in the function call) :
%>%
beispieldaten select_if(is.factor)
## # A tibble: 286 × 6
## westost geschlecht bildung_vater bildung_mutter bildung_vater_b…
## <fct> <fct> <fct> <fct> <fct>
## 1 West weiblich Fachhochschulabschl… Fachhochschulabschl… hoch
## 2 West männlich Fachabitur, Abitur Realschulabschluss … hoch
## 3 West weiblich Realschulabschluss … Hauptschulabschluss… niedrig
## 4 West weiblich Realschulabschluss … Fachhochschulabschl… niedrig
## 5 West weiblich Fachhochschulabschl… Realschulabschluss … hoch
## 6 West männlich Realschulabschluss … Realschulabschluss … niedrig
## 7 West männlich Realschulabschluss … Realschulabschluss … niedrig
## 8 West weiblich Fachhochschulabschl… Fachhochschulabschl… hoch
## 9 West männlich Realschulabschluss … Realschulabschluss … niedrig
## 10 West männlich Fachabitur, Abitur Realschulabschluss … hoch
## # … with 276 more rows, and 1 more variable: bildung_mutter_binaer <fct>
5.3.2 Renaming variables: rename()
Variables can be renamed using rename()
. Variables that have not been selected are not changed.
Syntax:
%>% rename(new_name = old_name) df
For the following examples we first create a subdataset from beispieldaten
to obtain less cluttered output:
<- beispieldaten %>%
beispieldaten_red select(ID:alter, unteltern1:unteltern3)
Example
# rename westost to region, geschlecht to gender, and alter to age
<- beispieldaten_red %>%
beispieldaten_red rename(region = westost,
gender = geschlecht,
age = alter)
Similarly to select
, we can use a scoped version. Using rename_all
, we can - for example - change all variable names to upper case using the base R function toupper()
.
%>%
beispieldaten_red rename_all(toupper)
## # A tibble: 286 × 7
## ID REGION GENDER AGE UNTELTERN1 UNTELTERN2 UNTELTERN3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 2 West männlich 14 6 6 3
## 3 10 West weiblich 14 7 7 6
## 4 11 West weiblich 14 7 7 7
## 5 12 West weiblich 14 6 7 7
## 6 14 West männlich 14 6 7 6
## 7 15 West männlich 14 6 5 7
## 8 16 West weiblich 15 5 5 6
## 9 17 West männlich 15 6 6 7
## 10 18 West männlich 14 4 5 6
## # … with 276 more rows
5.3.3 Selecting observations (cases): filter()
filter()
is used to select rows from a data frame that fulfill certain logical conditions (or not).
We can combine multiple logical operators.
Syntax:
%>%
df filter(variable1 < VALUE1 & variable2 == VALUE2)
Examples
# only adolescents from Eastern Germany
%>%
beispieldaten_red filter(region == "Ost")
## # A tibble: 143 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 201 Ost weiblich 14 7 5 7
## 2 204 Ost weiblich 15 7 6 5
## 3 205 Ost männlich 15 7 7 7
## 4 206 Ost weiblich 14 6 5 7
## 5 207 Ost weiblich 15 7 6 7
## 6 209 Ost weiblich 15 6 4 5
## 7 210 Ost weiblich 15 6 6 7
## 8 211 Ost weiblich 15 6 4 7
## 9 212 Ost weiblich 15 7 6 7
## 10 213 Ost weiblich 14 6 5 5
## # … with 133 more rows
# only only adolescents from Western Germany
%>%
beispieldaten_red filter(region != "Ost")
## # A tibble: 143 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 2 West männlich 14 6 6 3
## 3 10 West weiblich 14 7 7 6
## 4 11 West weiblich 14 7 7 7
## 5 12 West weiblich 14 6 7 7
## 6 14 West männlich 14 6 7 6
## 7 15 West männlich 14 6 5 7
## 8 16 West weiblich 15 5 5 6
## 9 17 West männlich 15 6 6 7
## 10 18 West männlich 14 4 5 6
## # … with 133 more rows
# only age 13 or younger
%>%
beispieldaten_red filter(age <= 13)
## # A tibble: 6 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 49 West männlich 13 6 5 6
## 3 138 West weiblich 13 4 3 5
## 4 142 West weiblich 13 3 3 2
## 5 159 West weiblich 13 4 3 5
## 6 321 Ost weiblich 13 5 5 7
# ages between 15 and 17
%>%
beispieldaten_red filter(age >= 15 & age <= 17)
## # A tibble: 163 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 16 West weiblich 15 5 5 6
## 2 17 West männlich 15 6 6 7
## 3 19 West männlich 15 4 6 7
## 4 20 West weiblich 15 6 5 4
## 5 25 West weiblich 15 7 7 7
## 6 26 West weiblich 15 7 7 6
## 7 27 West männlich 15 7 6 6
## 8 29 West weiblich 15 5 3 4
## 9 31 West weiblich 16 7 7 6
## 10 33 West weiblich 15 7 7 7
## # … with 153 more rows
# only subjects with IDs 1 and 49
%>%
beispieldaten_red filter(ID == 1 | ID == 49)
## # A tibble: 2 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 49 West männlich 13 6 5 6
# alternative: use the %in% operator
%>%
beispieldaten_red filter(ID %in% c(1, 49))
## # A tibble: 2 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 49 West männlich 13 6 5 6
5.3.4 Sorting data frames: arrange()
With the arrange()
function, we can sort rows according to variables in a data frame, either in ascending or descending order.
# ascending
%>%
beispieldaten_red arrange(ID)
## # A tibble: 286 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 7 6 6
## 2 2 West männlich 14 6 6 3
## 3 10 West weiblich 14 7 7 6
## 4 11 West weiblich 14 7 7 7
## 5 12 West weiblich 14 6 7 7
## 6 14 West männlich 14 6 7 6
## 7 15 West männlich 14 6 5 7
## 8 16 West weiblich 15 5 5 6
## 9 17 West männlich 15 6 6 7
## 10 18 West männlich 14 4 5 6
## # … with 276 more rows
# descending
%>%
beispieldaten_red arrange(desc(ID))
## # A tibble: 286 × 7
## ID region gender age unteltern1 unteltern2 unteltern3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 367 Ost weiblich 14 5 6 7
## 2 364 Ost männlich 15 6 6 6
## 3 363 Ost männlich 15 6 7 6
## 4 362 Ost weiblich 15 4 3 4
## 5 361 Ost weiblich 14 3 1 4
## 6 359 Ost weiblich 14 5 1 4
## 7 356 Ost männlich 15 6 4 6
## 8 355 Ost weiblich 14 6 4 4
## 9 353 Ost männlich 14 6 6 7
## 10 352 Ost weiblich 14 6 6 6
## # … with 276 more rows
5.3.5 Creating new variables: mutate()
The mutate()
function can be used to either create new variables or to change already existing ones.
Syntax:
%>% mutate(new_variable_1 = FORMULA_1,
df new_variable_2 = FORMULA_2,
old_variable_1 = FORMULA_3)
For example, let’s compute a variable with an average score for the first three items measuring the support received from parents (unteltern1
, unteltern2
, unteltern3
):
<- beispieldaten_red %>%
beispieldaten_red mutate(unteltern_1to3 = (unteltern1 + unteltern2 + unteltern3)/3)
We just reassigned beispieldaten_red
to beispieldaten_red
because we want to save the new variable. Now we realize that we did not save earlier operations since we did not reassign. For the examples above with arrange()
we may have merely wanted to look at the data set in a different order and not save it in that order.
Below we will see more examples where we will not reassign, because we only want to illustrate how the functions work, thus we do not need to save the manipulations every time, we only create the respective output. In “real (data-analytic) life” we will usually reassign to the current data frame since we want to work with the transformed or “mutated” data set.
There is also a function called transmute()
, which returns only the newly created variables. (In this case reassigning would mean erasing the content of beispieldaten_red
and replacing it by the newly created variable.)
%>%
beispieldaten_red transmute(unteltern_1to3 = (unteltern1 + unteltern2 + unteltern3)/3)
## # A tibble: 286 × 1
## unteltern_1to3
## <dbl>
## 1 6.33
## 2 5
## 3 6.67
## 4 7
## 5 6.67
## 6 6.33
## 7 6
## 8 5.33
## 9 6.33
## 10 5
## # … with 276 more rows
This looks all quite easy and functional for the often needed operation of aggregating over items (per subject) to compute a scale mean. However, this works only for subjects who have no missing values in any of the variables/items involved. As soon as missing values exist, the computation will return NA
for the respective subject. As to our knowledge there is not yet a package/function that easily circumvents this problem. We will therefore introduce rowwise()
for row-wise operations at the end of this chapter and present a custom-made function to deal with the issue.
5.3.5.1 Scoped versions of mutate
mutate
has scoped versions that allows you to apply transformations to multiple variables at once.
As an example, say we want to convert all ratings (Likert-scale from 1-7) in the beispieldaten
data frame to a different scale. First, we create a function that does the job of converting the rating to a percentage.
<- function(x) 100 * x / 7 change_scale
Then we indicate using mutate_if
together with is.numeric
that the function should be applied only to the numeric variables in the data frame.
%>%
beispieldaten_red mutate_if(is.numeric, change_scale)
## # A tibble: 286 × 8
## ID region gender age unteltern1 unteltern2 unteltern3 unteltern_1to3
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14.3 West weiblich 186. 100 85.7 85.7 90.5
## 2 28.6 West männlich 200 85.7 85.7 42.9 71.4
## 3 143. West weiblich 200 100 100 85.7 95.2
## 4 157. West weiblich 200 100 100 100 100
## 5 171. West weiblich 200 85.7 100 100 95.2
## 6 200 West männlich 200 85.7 100 85.7 90.5
## 7 214. West männlich 200 85.7 71.4 100 85.7
## 8 229. West weiblich 214. 71.4 71.4 85.7 76.2
## 9 243. West männlich 214. 85.7 85.7 100 90.5
## 10 257. West männlich 200 57.1 71.4 85.7 71.4
## # … with 276 more rows
Now we accidentally also converted ID
and alter
to percentages! ID
should actually not be a numeric variable, we should convert it to a factor! We can do this also using mutate()
:
<- beispieldaten_red %>% mutate(ID = as.factor(ID)) beispieldaten_red
We can also use mutate_at
with the helper function vars()
and starts_with
to select only those variables we really want to convert into percentages:
%>%
beispieldaten_red mutate_at(vars(starts_with("unt")), change_scale)
## # A tibble: 286 × 8
## ID region gender age unteltern1 unteltern2 unteltern3 unteltern_1to3
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 West weiblich 13 100 85.7 85.7 90.5
## 2 2 West männlich 14 85.7 85.7 42.9 71.4
## 3 10 West weiblich 14 100 100 85.7 95.2
## 4 11 West weiblich 14 100 100 100 100
## 5 12 West weiblich 14 85.7 100 100 95.2
## 6 14 West männlich 14 85.7 100 85.7 90.5
## 7 15 West männlich 14 85.7 71.4 100 85.7
## 8 16 West weiblich 15 71.4 71.4 85.7 76.2
## 9 17 West männlich 15 85.7 85.7 100 90.5
## 10 18 West männlich 14 57.1 71.4 85.7 71.4
## # … with 276 more rows
5.3.6 Grouping data frames: group_by()
We often want to perform operations on subsets of a data frame, given by levels of a grouping factor. The function group_by()
groups the data frame according to one or more grouping variables, applies a function to every group and then recombines the data frame (split-apply-combine). group_by()
is usually used in combination with other functions.
Syntax:
<- group_by(grouping_1, grouping_2, grouping_3) df
Example
Let’s import a new dataset. In this example, 12 subjects were allocated to 4 experimental conditions, and the level of aggressive behaviour was measured. Suppose that we want to add a group-centred aggression variable.
library(readr)
<- read_csv("data/alkohol-aggression.csv") %>%
alk_aggr mutate(alkoholbedingung = as.factor(alkoholbedingung))
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): alkoholbedingung
## dbl (1): aggressivitaet
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can use group_by()
to split the data frame into two separate pieces,
compute the group mean for each, and put the pieces back together again. Before we do that, we will select only the relevant variables and remove all cases with missing values on those:
%>%
beispieldaten_red select(ID, gender, unteltern_1to3) %>%
drop_na() %>%
group_by(gender) %>%
mutate(group_mean = mean(unteltern_1to3),
unteltern_c = unteltern_1to3 - group_mean)
## # A tibble: 283 × 5
## # Groups: gender [2]
## ID gender unteltern_1to3 group_mean unteltern_c
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 1 weiblich 6.33 5.60 0.732
## 2 2 männlich 5 5.88 -0.882
## 3 10 weiblich 6.67 5.60 1.07
## 4 11 weiblich 7 5.60 1.40
## 5 12 weiblich 6.67 5.60 1.07
## 6 14 männlich 6.33 5.88 0.451
## 7 15 männlich 6 5.88 0.118
## 8 16 weiblich 5.33 5.60 -0.268
## 9 17 männlich 6.33 5.88 0.451
## 10 18 männlich 5 5.88 -0.882
## # … with 273 more rows
5.3.7 Recoding values: recode()
Together with mutate()
, recode()
can be used to change values of a
variable. This can be done with numeric or character variables.
Syntax:
recode(variable,
old_value_1 = "new_value_1",
old_value_2 = "new_value_2")
If we want to change the levels of a factor, the forcats
package provides more functionality. To manually recode the factor, we use fct_recode()
.
Syntax:
fct_recode(variable,
new_value_1 = "old_value_1",
new_value_2 = "old_value_2")
Let us recode the German factor levels of region
and gender
to English:
5.3.8 Summarizing variables: summarize()
Using the functions summarize()
or summarise()
(both spellings are possible)
we can summarize variables and compute descriptive statistics. In contrast to
mutate()
, summarize()
returns a value for each group, not each observation.
Syntax:
%>% summarize(summary = FUNCTION(variable)) df
FUNCTION
is a placeholder for any function that can be used to summarize data, e.g. mean()
, sd()
.
If we want the mean and the standard error, we could use the function that we created earlier:
<- function(x) {
se <- length(x)
n sd(x) / sqrt(n)
}
Let’s compute the group mean, standard deviation and standard error of unteltern_1to3
for male and female as well as for east and west German adolescents separately:
%>%
beispieldaten_red drop_na() %>%
group_by(region, gender) %>%
summarize(mean = mean(unteltern_1to3),
sd = sd(unteltern_1to3),
se = se(unteltern_1to3))
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 4 × 5
## # Groups: region [2]
## region gender mean sd se
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 west male 5.68 1.10 0.120
## 2 west female 5.51 1.22 0.158
## 3 east male 6.13 0.711 0.0868
## 4 east female 5.68 1.10 0.129
A more advanced example of this uses summarize_if
. We can apply a list of
functions to all variables that pass the test is.numeric
. This gives the same output as the code above.
%>%
beispieldaten_red drop_na() %>%
group_by(region, gender) %>%
summarize_if(is.numeric, list(mean = mean, sd = sd, se = se))
## # A tibble: 4 × 17
## # Groups: region [2]
## region gender age_mean unteltern1_mean unteltern2_mean unteltern3_mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 west male 14.5 5.63 5.45 5.96
## 2 west female 14.5 5.73 5.2 5.58
## 3 east male 15.0 6.13 5.97 6.30
## 4 east female 14.7 5.86 5.29 5.89
## # … with 11 more variables: unteltern_1to3_mean <dbl>, age_sd <dbl>,
## # unteltern1_sd <dbl>, unteltern2_sd <dbl>, unteltern3_sd <dbl>,
## # unteltern_1to3_sd <dbl>, age_se <dbl>, unteltern1_se <dbl>,
## # unteltern2_se <dbl>, unteltern3_se <dbl>, unteltern_1to3_se <dbl>
5.4 Row-wise operations
We already saw above how we can create a scale mean across multiple items using mutate()
together with a FORMULA. Unfortunately this did not work with missing values present. We could think that we could just use mean()
together with its argument na.rm = TRUE
. However, operations in R (and the tidyverse
packages) are intended to operate on columns of data frames, not on rows, so this will also not work.
Let’s create a toy example to illustrate this:
library(tidyverse)
<- tribble(
df ~name, ~n1, ~n2, ~n3, ~n4,
"Lydia", 7, 8, 9, 4,
"Todd", 3, 2, 3, 3)
We have two subjects, who responded to four questions n1
to n4
that represent a common n
factor.
Like above, as long as there are no missing values, this will work:
%>%
df mutate(n = (n1 + n2 + n3 + n4)/4)
## # A tibble: 2 × 6
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 7
## 2 Todd 3 2 3 3 2.75
Let’s add a missing value:
<- tribble(
dfNA ~name, ~n1, ~n2, ~n3, ~n4,
"Lydia", 7, 8, 9, 4,
"Todd", NA, 2, 3, 3)
dfNA
## # A tibble: 2 × 5
## name n1 n2 n3 n4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4
## 2 Todd NA 2 3 3
This approach will no longer work (at least not for Todd):
%>%
dfNA mutate(n = (n1 + n2 + n3 + n4)/4)
## # A tibble: 2 × 6
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 7
## 2 Todd NA 2 3 3 NA
One can argue that someone who did not answer to one of four n
items should not be included in the analysis of an n
scale. However, often we have much more than 4 items and anyone who did not answer to only one of - say - 10 items would not be included.
We could use a multiple imputation approach that predicts missing values from other variables in the data set and where the uncertainty associated with the imputation is taken into account by creating multiple versions of imputed data sets, analyzing them separately and combining their results afterwards.
This is not always possible, though, and one might rather compute the scale mean for this person by averaging across the existing items (i.e. imputing/replacing missing values with the mean of existing values of a scale). This is what we try to do here…
Unfortunately, this does also not work:
%>%
dfNA mutate(n = mean(c(n1, n2, n3, n4), na.rm = TRUE))
## # A tibble: 2 × 6
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 5.14
## 2 Todd NA 2 3 3 5.14
Here we get the mean of all 7 values for both of our participants…
Fortunately, though, dplyr
does provide a function, rowwise()
that will help us here:
%>%
dfNA rowwise() %>%
mutate(n = mean(c(n1, n2, n3, n4), na.rm = TRUE))
## # A tibble: 2 × 6
## # Rowwise:
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 7
## 2 Todd NA 2 3 3 2.67
We could even recreate some missing functionality from SPSS here. Our summary should only be created for rows that do not exceed the maximum permissible number of missing values.
The goal is to compute some sort of row-wise score, i.e. a sum or an average, over several columns of a data frame, while being able to specify how many missing values you will allow per row.
We can create a function that does this:
<- function(v, max_na = NULL) {
score_items if (is.null(max_na)) {
<- length(v)
max_na
}
<- length(v[is.na(v)])
n_na
if (n_na <= max_na) {
<- mean(v, na.rm = TRUE)
out else {
} <- NA
out
}
out }
This function will compute the mean of a vector, but only if that vector contains fewer than a pre-determined number of missing values.
%>%
dfNA rowwise() %>%
mutate(n = score_items(c(n1, n2, n3, n4), max_na = 1))
## # A tibble: 2 × 6
## # Rowwise:
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 7
## 2 Todd NA 2 3 3 2.67
This will compute the mean over the specified columns, but only if the subject has at most 1 missing value.
If we are not willing to accept any NAs:
%>%
dfNA rowwise() %>%
mutate(n = score_items(c(n1, n2, n3, n4), max_na = 0))
## # A tibble: 2 × 6
## # Rowwise:
## name n1 n2 n3 n4 n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia 7 8 9 4 7
## 2 Todd NA 2 3 3 NA