6 Working with Data
6.1 Tidying data
In statistics and data science, you will often hear the term wide and long when referring to data. This refers to the shape of a dataset; a wide dataset has more columns than a corresponding long dataset. A long dataset has more rows than a corresponding wide dataset. However, the terms long and wide aren’t just about shape. In a long data set, each observation has its own row and each variable is a column. In a wide dataset, variables can be spread out over several columns.
Consider this example: we have two observations for two subjects, at three
separate time points. Let’s call these t1
, t2
and t3
. The following
representation is considered to be in the wide format:
#> # A tibble: 2 x 4
#> name t1 t2 t3
#> <fct> <dbl> <dbl> <dbl>
#> 1 Marshall 4 5 7
#> 2 Skye 3 6 7
Each row represents one subject, and the observations are spread over the three
columns t1
, t2
and t3
. These columns actually represent three levels of a common
factor, time
. If you were going to perform a repeated-measures ANOVA in SPSS, this is
the kind of dataset that you would use.
We can represent exactly the same data in a longer format, in which each row
corresponds to an observation, and each column represents a variable. In total,
there are three variables: name
, time
and score
.
df %>%
pivot_longer(-name, names_to = "time", values_to = "score")
#> # A tibble: 6 x 3
#> name time score
#> <fct> <chr> <dbl>
#> 1 Marshall t1 4
#> 2 Marshall t2 5
#> 3 Marshall t3 7
#> 4 Skye t1 3
#> 5 Skye t2 6
#> 6 Skye t3 7
We no longer have one row for each subject. In fact, variables that do not change for each observation are repeated across rows.
A data set that is organized in this way is often referred to as tidy.
As we will see later, many kinds of statistical analysis and in particular plotting functions require a long data set, and thus, a surprisingly large amount of time has to be spent on organizing data for further analysis (this type of work is often called ‘data wrangling’).
In this course, we will be working with the tidyverse
packages for
manipulating and reshaping data. Of course, there are other ways of manipulating
data, but in our view, the tidyverse
method provides
the most consistent way of working with data, and reduces the cognitive load on
the user.
The packages used for data manipulations are tidyr
for reshaping data frames, and
dplyr
for manipulating data frames. The functions we are going to look at are:
Package | Function name | |
---|---|---|
tidyr |
pivot_longer() |
increases the number of rows, decreases the number of columns |
tidyr |
pivot_wider() |
decreases the number of rows, increases the number of columns |
tidyr |
drop_na() |
omits all rows in a data frame containing NA values |
dplyr |
filter() |
selects observations |
dplyr |
select() |
selects variables |
dplyr |
summarize() or summarise() |
summarizes data |
dplyr |
mutate() |
creates new variables and changes already existing variables |
dplyr |
group_by() |
lets you perform operations on subsets of the data |
The functions filter()
, select()
, summarize()
and mutate()
have
so-called scoped versions, which let you apply the same transformation to
multiple variables. There are three variants of each of these functions:
*_all
applies to every variable (e.g.mutate_all()
)*_at
applies to variables selected with a character vector or the helper functionvars()
(e.g.mutate_at()
)*_if
applies to variables selected with a predicate function (e.g.mutate_if()
)
These functions allow you to perform very complex data manipulation, using
clean and elegant code. Besides the functions mentioned above, dplyr
contains
many more. These include functions that perform set operations, e.g. finding the
intersection of two data frames, and allow you to join data frames together.
In addition tidyr
and dplyr
, we will also use the forcats
package for manipulating factors,
and the stringr
package for working with strings.
6.2 The pipe operator: %>%
Before we begin manipulating data, we need to introduce a new operator. We have
already seen that code can become hard to read when we have a sequence of
operations, i.e. nested function calls. Suppose we have a numeric vector (n =
10), which we want to center (we will generate the data using rnorm
). Then we
want to compute and round the standard deviation to two decimal places.
set.seed(1283)
data <- rnorm(10, 24, 5)
data
#> [1] 24.74984 21.91726 23.98551 19.63019 23.96428 22.83092 18.86240 19.08125
#> [9] 23.76589 21.88846
We can use nested function calls:
The functions scale()
, sd()
and round()
are called in order (inner to
outer). The output of one function is passed as an argument to the next
function.
The functions scale()
and round()
have additional arguments: center = TRUE, scale = FALSE
for scale
, and digits = 2
for round
.
This way of writing code can be difficult to read, although using indentation really helps here.
An alternative approach is to save each intermediate step:
data_z <- scale(data, center = TRUE,
scale = FALSE)
sd_data_z <- sd(data_z)
sd_data_z_rounded <- round(sd_data_z,
digits = 2)
sd_data_z_rounded
#> [1] 2.19
Doing this means that we create many variables that we don’t really need, cluttering up our workspace.
Luckily, there is an elegant solution to this problem. Using the pipe
operator, we can pass the output of one function as an argument to the next
function without using nested function calls. Note that this is really just a
type of ‘syntactic sugar’; the result is exactly the same as using traditional
function calls.
The pipe
operator is available in the dplyr
package.
It is defined as in infix operator, meaning that it is used in between two objects.
The pipe operator has its own key combo in RStudio: Cmd
+Shift
+M
(MacOS) or
Ctrl
+Shift
+M
(Windows and Linux).
Using the %>%
operator, our example from above:
becomes:
library(dplyr)
data %>%
scale(center = TRUE, scale = FALSE) %>%
sd() %>%
round(digits = 2)
#> [1] 2.19
This should be read as:
- We start with the object
data
and pass it to the functionscale()
as an argument - We use
scale()
, with argumentscenter = TRUE, scale = FALSE
and then pass the output to the functionsd()
- We apply the function
sd()
and pass the output to the functionround()
round()
is called, with the argumentdigits = 2
. The output is printed to the console.
If we want to save the last output, we therefore have to assign the entire computation to a variable:
sd_data_z_rounded <- data %>%
scale(center = TRUE, scale = FALSE) %>%
sd() %>%
round(digits = 2)
sd_data_z_rounded
#> [1] 2.19
%>%
passes an object to a function as its first argument (unless specified
otherwise). The advantages are:
- code becomes easier to read
- we don’t need intermediate variables
Pipe syntax
Suppose that f()
, g()
and h()
are functions. Then, the %>%
operator is
used like this:
is equivalent to:
if y
is an argument of f()
:
Applying f()
, g()
and h()
in sequence:
is equivalent to:
We can use x
as any argument, it doesn’t have to be the first. We just need to
use the placeholder .
:
This is equivalent to:
is equivalent to:
6.3 Reshaping: tidyr
In order to reshape a data set, we will need two functions:
pivot_longer()
and pivot_wider()
, both of which are provided by the tidyr
package.
6.3.1 pivot_longer()
pivot_longer()
is used for converting a wide to a long data set. This means that
pivot_longer()
is used to combine columns, which may represent levels of a common factor, into
one column. The values of the original columns are combined in a new value column. Another
way of putting this is that pivot_longer()’ increases the number of rows and decreases the
number of columns of a data frame.
pivot_longer()` is used like this:
or, using %>%
:
The arguments are:
data: a data frame
cols: the columns you want to pivot
names_to: a string specifying the name of the new column created from the selected column names
values_to: a string specifying the name of the column created from the observations in the cells
Let’s have another look at the example used above. We have a wide data set
with two subjects, measured at three time points (the function tribble()
lets you easily create a tibble by rows).
df <- tribble(
~name, ~t1, ~t2, ~t3,
"Marshall", 4, 5, 7,
"Skye", 3, 6, 7
) %>% mutate_if(is.character, as_factor)
We assume that t1
, t2
and t3
are not really separate variables, but should
be considered as levels of a time factor (e.g. as three time points in a
repeated measures data set). The values in the t1
, t2
and t3
columns refer
to the same kind of measurement (e.g. score) and therefore should be actually
one variable in a long data set. We would like our new factor to be called
time
and the measured variable should be called score
. The name
variable should be ignored, i.e. it should not be pivoted.
df_long
#> # A tibble: 6 x 3
#> name time score
#> <fct> <chr> <dbl>
#> 1 Marshall t1 4
#> 2 Marshall t2 5
#> 3 Marshall t3 7
#> 4 Skye t1 3
#> 5 Skye t2 6
#> 6 Skye t3 7
time
should be defined as a factor in the next step:
df_long
#> # A tibble: 6 x 3
#> name time score
#> <fct> <fct> <dbl>
#> 1 Marshall t1 4
#> 2 Marshall t2 5
#> 3 Marshall t3 7
#> 4 Skye t1 3
#> 5 Skye t2 6
#> 6 Skye t3 7
Our call to pivot_longer()
could also have been written more explicitly:
In the cols
argument we used -name
, which tells the function that we want
to pivot all columns except the column called name
. We could also have selected the three time columns:
df %>%
pivot_longer(cols = c(t1, t2, t3),
names_to = "time",
values_to = "score")
#> # A tibble: 6 x 3
#> name time score
#> <fct> <chr> <dbl>
#> 1 Marshall t1 4
#> 2 Marshall t2 5
#> 3 Marshall t3 7
#> 4 Skye t1 3
#> 5 Skye t2 6
#> 6 Skye t3 7
We will see more examples of how to select columns when we look at the dplyr
package.
Example
Let’s look at another example, this time using the Therapy
dataset:
library(haven)
Therapy <- read_sav("data/Therapy.sav")
Therapy$Vpnr <- as_factor(Therapy$Vpnr)
Therapy$Gruppe <- as_factor(Therapy$Gruppe)
Therapy
#> # A tibble: 100 x 5
#> Vpnr Gruppe Pretest Posttest Difference_PrePost
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 1 Kontrollgruppe 4.29 3.21 1.08
#> 2 2 Kontrollgruppe 6.18 5.99 0.190
#> 3 3 Kontrollgruppe 3.93 4.17 -0.239
#> 4 4 Kontrollgruppe 5.06 4.76 0.295
#> 5 5 Kontrollgruppe 6.45 5.64 0.814
#> 6 6 Kontrollgruppe 4.49 4.67 -0.180
#> # … with 94 more rows
The structure of this dataset is similar to that of the toy example given above.
We want to combine the ‘variables’ Pretest
and Posttest
, as they represent
levels of a common factor time
. The data in the cells are repeated
measurements. Vpnr
and Gruppe
should be ignored for pivoting but kept in the
data set.
First we can get rid of the Difference_PrePost
variable.
The following code selects all rows, and the columns specified in the vector
c("Vpnr", "Gruppe", "Pretest", "Posttest")
.
Now we are ready to pivot:
Therapy_long <- Therapy %>%
pivot_longer(c(Pretest, Posttest),
names_to = "time",
values_to = "rating")
# time should be a factor
Therapy_long$time <- factor(Therapy_long$time,
levels = c("Pretest", "Posttest"))
Therapy_long
#> # A tibble: 200 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 2 Kontrollgruppe Pretest 6.18
#> 4 2 Kontrollgruppe Posttest 5.99
#> 5 3 Kontrollgruppe Pretest 3.93
#> 6 3 Kontrollgruppe Posttest 4.17
#> # … with 194 more rows
6.3.2 pivot_wider()
pivot_wider()
is the complement of pivot_longer()
. This function takes a
factor and ‘spreads’ the values of a measured variable over the levels of
the factor, creating as many new variables as there are factor levels.
Obviously, pivot_wider()
is used to convert from long to wide.
The pivot_wider()
syntax looks like this:
Using our previous example, we want to create new variables for each level of the
factor time
, using the values in the score
variable.
df_long
#> # A tibble: 6 x 3
#> name time score
#> <fct> <fct> <dbl>
#> 1 Marshall t1 4
#> 2 Marshall t2 5
#> 3 Marshall t3 7
#> 4 Skye t1 3
#> 5 Skye t2 6
#> 6 Skye t3 7
df_wide
#> # A tibble: 2 x 4
#> name t1 t2 t3
#> <fct> <dbl> <dbl> <dbl>
#> 1 Marshall 4 5 7
#> 2 Skye 3 6 7
df_wide
looks exaclty the same as the original dataset df
. In fact, we can
test whether this is the case:
Example
Let’s convert the “Therapy” data set from long back to wide:
Therapy_long
#> # A tibble: 200 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 2 Kontrollgruppe Pretest 6.18
#> 4 2 Kontrollgruppe Posttest 5.99
#> 5 3 Kontrollgruppe Pretest 3.93
#> 6 3 Kontrollgruppe Posttest 4.17
#> # … with 194 more rows
Therapy_wide
#> # A tibble: 100 x 4
#> Vpnr Gruppe Pretest Posttest
#> <fct> <fct> <dbl> <dbl>
#> 1 1 Kontrollgruppe 4.29 3.21
#> 2 2 Kontrollgruppe 6.18 5.99
#> 3 3 Kontrollgruppe 3.93 4.17
#> 4 4 Kontrollgruppe 5.06 4.76
#> 5 5 Kontrollgruppe 6.45 5.64
#> 6 6 Kontrollgruppe 4.49 4.67
#> # … with 94 more rows
6.3.3 Excluding missing values: drop_na()
A very important function from the tidyr
package is drop_na()
. It is used to
omit all rows in a data frame containing missing values:
Let’s create another toy dataset containing NAs:
df_1 <- tibble(var1 = c(1, 2, NA), var2 = c("a", NA, "b"))
df_1
#> # A tibble: 3 x 2
#> var1 var2
#> <dbl> <chr>
#> 1 1 a
#> 2 2 <NA>
#> 3 NA b
df_2 <- df_1 %>%
drop_na()
df_2
#> # A tibble: 1 x 2
#> var1 var2
#> <dbl> <chr>
#> 1 1 a
This is very useful, but of course should be used carefully, otherwise you will end up omitting rows that contain NAs in columns that might not be relevant to the question at hand.
6.4 Data manipulation: dplyr
Transforming data sets from wide to long is only one step when ‘tidying’ data. We will often need to select and rename variables, select a subset of observations, recode values, create new variables and create summary variables, often grouped by certain categorical variables.
All these transformations can be performed by functions in the dplyr
package.
These can be seen as a kind of grammar of data manipulation, with each function
performing a specific action. These actions can be combined in many ways,
resulting in an extremely flexible yet highly consistent workflow.
We cannot cover all of the available functions in this course, but they are all described in the dplyr documentation.
We need to load the dplyr
package (or the tidyverse
package):
6.4.1 Selecting variables: select()
The function select()
is used to select variables from a data frame.
Syntax:
# df is a data frame
select(df, variable1, variable2, -variable3)
# or
df %>%
select(variable1, variable2, -variable3)
select()
selects the variables variable1
and variable2
from the data frame
df
. variable3
is omitted.
Examples
# only name
df_long %>%
select(name)
#> # A tibble: 6 x 1
#> name
#> <fct>
#> 1 Marshall
#> 2 Marshall
#> 3 Marshall
#> 4 Skye
#> 5 Skye
#> 6 Skye
# time and score
df_long %>%
select(time, score)
#> # A tibble: 6 x 2
#> time score
#> <fct> <dbl>
#> 1 t1 4
#> 2 t2 5
#> 3 t3 7
#> 4 t1 3
#> 5 t2 6
#> 6 t3 7
# or
df_long %>%
select(-name)
#> # A tibble: 6 x 2
#> time score
#> <fct> <dbl>
#> 1 t1 4
#> 2 t2 5
#> 3 t3 7
#> 4 t1 3
#> 5 t2 6
#> 6 t3 7
Therapy_long %>%
select(Vpnr, Gruppe, rating)
#> # A tibble: 200 x 3
#> Vpnr Gruppe rating
#> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe 4.29
#> 2 1 Kontrollgruppe 3.21
#> 3 2 Kontrollgruppe 6.18
#> 4 2 Kontrollgruppe 5.99
#> 5 3 Kontrollgruppe 3.93
#> 6 3 Kontrollgruppe 4.17
#> # … with 194 more rows
# selects all variables from Gruppe to rating (including everything in between)
Therapy_long %>%
select(Gruppe:rating)
#> # A tibble: 200 x 3
#> Gruppe time rating
#> <fct> <fct> <dbl>
#> 1 Kontrollgruppe Pretest 4.29
#> 2 Kontrollgruppe Posttest 3.21
#> 3 Kontrollgruppe Pretest 6.18
#> 4 Kontrollgruppe Posttest 5.99
#> 5 Kontrollgruppe Pretest 3.93
#> 6 Kontrollgruppe Posttest 4.17
#> # … with 194 more rows
Changing the order of variables
select()
can be used to rearrange (reorder) variables in a data frame:
Therapy_long %>%
select(rating, time, Gruppe, Vpnr)
#> # A tibble: 200 x 4
#> rating time Gruppe Vpnr
#> <dbl> <fct> <fct> <fct>
#> 1 4.29 Pretest Kontrollgruppe 1
#> 2 3.21 Posttest Kontrollgruppe 1
#> 3 6.18 Pretest Kontrollgruppe 2
#> 4 5.99 Posttest Kontrollgruppe 2
#> 5 3.93 Pretest Kontrollgruppe 3
#> 6 4.17 Posttest Kontrollgruppe 3
#> # … with 194 more rows
“Helper” functions for select()
There are a number of “helper functions” we can use for selecting variables:
# including
Therapy_long %>% select(starts_with("Gr"))
#> # A tibble: 200 x 1
#> Gruppe
#> <fct>
#> 1 Kontrollgruppe
#> 2 Kontrollgruppe
#> 3 Kontrollgruppe
#> 4 Kontrollgruppe
#> 5 Kontrollgruppe
#> 6 Kontrollgruppe
#> # … with 194 more rows
Therapy_long %>% select(ends_with("ng"))
#> # A tibble: 200 x 1
#> rating
#> <dbl>
#> 1 4.29
#> 2 3.21
#> 3 6.18
#> 4 5.99
#> 5 3.93
#> 6 4.17
#> # … with 194 more rows
Therapy_long %>% select(contains("u"))
#> # A tibble: 200 x 1
#> Gruppe
#> <fct>
#> 1 Kontrollgruppe
#> 2 Kontrollgruppe
#> 3 Kontrollgruppe
#> 4 Kontrollgruppe
#> 5 Kontrollgruppe
#> 6 Kontrollgruppe
#> # … with 194 more rows
# OR using one_of()
vars <- c("Gruppe", "rating")
Therapy_long %>% select(one_of(vars))
#> # A tibble: 200 x 2
#> Gruppe rating
#> <fct> <dbl>
#> 1 Kontrollgruppe 4.29
#> 2 Kontrollgruppe 3.21
#> 3 Kontrollgruppe 6.18
#> 4 Kontrollgruppe 5.99
#> 5 Kontrollgruppe 3.93
#> 6 Kontrollgruppe 4.17
#> # … with 194 more rows
# excluding
Therapy_long %>% select(-starts_with("Gr"))
#> # A tibble: 200 x 3
#> Vpnr time rating
#> <fct> <fct> <dbl>
#> 1 1 Pretest 4.29
#> 2 1 Posttest 3.21
#> 3 2 Pretest 6.18
#> 4 2 Posttest 5.99
#> 5 3 Pretest 3.93
#> 6 3 Posttest 4.17
#> # … with 194 more rows
Therapy_long %>% select(-ends_with("ng"))
#> # A tibble: 200 x 3
#> Vpnr Gruppe time
#> <fct> <fct> <fct>
#> 1 1 Kontrollgruppe Pretest
#> 2 1 Kontrollgruppe Posttest
#> 3 2 Kontrollgruppe Pretest
#> 4 2 Kontrollgruppe Posttest
#> 5 3 Kontrollgruppe Pretest
#> 6 3 Kontrollgruppe Posttest
#> # … with 194 more rows
Therapy_long %>% select(-contains("u"))
#> # A tibble: 200 x 3
#> Vpnr time rating
#> <fct> <fct> <dbl>
#> 1 1 Pretest 4.29
#> 2 1 Posttest 3.21
#> 3 2 Pretest 6.18
#> 4 2 Posttest 5.99
#> 5 3 Pretest 3.93
#> 6 3 Posttest 4.17
#> # … with 194 more rows
vars <- c("Gruppe", "rating")
Therapy_long %>% select(-one_of(vars))
#> # A tibble: 200 x 2
#> Vpnr time
#> <fct> <fct>
#> 1 1 Pretest
#> 2 1 Posttest
#> 3 2 Pretest
#> 4 2 Posttest
#> 5 3 Pretest
#> 6 3 Posttest
#> # … with 194 more rows
6.4.1.1 Scoped versions of select
We can also use the scoped versions of select
: select_if()
, select_at()
and select_all()
.
For instance, we can select the numeric variables in the wide Therapy
data frame:
6.4.2 Renaming variables: rename()
Variables can be renamed using rename()
. Variables that have not been selected are not changed.
Syntax:
Example
# rename "Vpnr" to "ID"
Therapy_long %>%
rename(ID = Vpnr)
#> # A tibble: 200 x 4
#> ID Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 2 Kontrollgruppe Pretest 6.18
#> 4 2 Kontrollgruppe Posttest 5.99
#> 5 3 Kontrollgruppe Pretest 3.93
#> 6 3 Kontrollgruppe Posttest 4.17
#> # … with 194 more rows
Similarly to select
, we can use a scoped version. Using rename_all
, we can change all variable names to lower case.
Therapy_wide %>%
rename_all(tolower)
#> # A tibble: 100 x 4
#> vpnr gruppe pretest posttest
#> <fct> <fct> <dbl> <dbl>
#> 1 1 Kontrollgruppe 4.29 3.21
#> 2 2 Kontrollgruppe 6.18 5.99
#> 3 3 Kontrollgruppe 3.93 4.17
#> 4 4 Kontrollgruppe 5.06 4.76
#> 5 5 Kontrollgruppe 6.45 5.64
#> 6 6 Kontrollgruppe 4.49 4.67
#> # … with 94 more rows
6.4.3 Selecting observations (cases): filter()
filter()
is used to select rows from a data frame that fulfill certain logical
conditions (or not).
We can combine multiple logical operators.
Syntax:
Examples
# only control group
Therapy_long %>%
filter(Gruppe == "Kontrollgruppe")
#> # A tibble: 100 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 2 Kontrollgruppe Pretest 6.18
#> 4 2 Kontrollgruppe Posttest 5.99
#> 5 3 Kontrollgruppe Pretest 3.93
#> 6 3 Kontrollgruppe Posttest 4.17
#> # … with 94 more rows
# only Posttest
Therapy_long %>%
filter(time == "Posttest")
#> # A tibble: 100 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Posttest 3.21
#> 2 2 Kontrollgruppe Posttest 5.99
#> 3 3 Kontrollgruppe Posttest 4.17
#> 4 4 Kontrollgruppe Posttest 4.76
#> 5 5 Kontrollgruppe Posttest 5.64
#> 6 6 Kontrollgruppe Posttest 4.67
#> # … with 94 more rows
# only Pretest
Therapy_long %>%
filter(time != "Posttest")
#> # A tibble: 100 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 2 Kontrollgruppe Pretest 6.18
#> 3 3 Kontrollgruppe Pretest 3.93
#> 4 4 Kontrollgruppe Pretest 5.06
#> 5 5 Kontrollgruppe Pretest 6.45
#> 6 6 Kontrollgruppe Pretest 4.49
#> # … with 94 more rows
# only ratings >= 5
Therapy_long %>%
filter(rating >= 5)
#> # A tibble: 66 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 2 Kontrollgruppe Pretest 6.18
#> 2 2 Kontrollgruppe Posttest 5.99
#> 3 4 Kontrollgruppe Pretest 5.06
#> 4 5 Kontrollgruppe Pretest 6.45
#> 5 5 Kontrollgruppe Posttest 5.64
#> 6 10 Kontrollgruppe Pretest 5.12
#> # … with 60 more rows
# ratings between 3 and 5
Therapy_long %>%
filter(rating >= 3 & rating <= 5)
#> # A tibble: 130 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 3 Kontrollgruppe Pretest 3.93
#> 4 3 Kontrollgruppe Posttest 4.17
#> 5 4 Kontrollgruppe Posttest 4.76
#> 6 6 Kontrollgruppe Pretest 4.49
#> # … with 124 more rows
# only subjects 3 and 5
Therapy_long %>%
filter(Vpnr == 3 | Vpnr == 5)
#> # A tibble: 4 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 3 Kontrollgruppe Pretest 3.93
#> 2 3 Kontrollgruppe Posttest 4.17
#> 3 5 Kontrollgruppe Pretest 6.45
#> 4 5 Kontrollgruppe Posttest 5.64
# alternative: use the %in% operator
Therapy_long %>%
filter(Vpnr %in% c(3, 5))
#> # A tibble: 4 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 3 Kontrollgruppe Pretest 3.93
#> 2 3 Kontrollgruppe Posttest 4.17
#> 3 5 Kontrollgruppe Pretest 6.45
#> 4 5 Kontrollgruppe Posttest 5.64
6.4.4 Sorting data frames: arrange()
With the arrange()
function, we can sort rows according to variables in a data
frame, either in ascending or descending order.
# ascending
Therapy_long %>%
arrange(Vpnr)
#> # A tibble: 200 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29
#> 2 1 Kontrollgruppe Posttest 3.21
#> 3 2 Kontrollgruppe Pretest 6.18
#> 4 2 Kontrollgruppe Posttest 5.99
#> 5 3 Kontrollgruppe Pretest 3.93
#> 6 3 Kontrollgruppe Posttest 4.17
#> # … with 194 more rows
# descending
Therapy_long %>%
arrange(desc(Vpnr))
#> # A tibble: 200 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 100 Therapiegruppe Pretest 4.77
#> 2 100 Therapiegruppe Posttest 4.50
#> 3 99 Therapiegruppe Pretest 4.66
#> 4 99 Therapiegruppe Posttest 3.80
#> 5 98 Therapiegruppe Pretest 4.36
#> 6 98 Therapiegruppe Posttest 3.91
#> # … with 194 more rows
6.4.5 Creating new variables: mutate()
The mutate()
function can be used to either create new variables or to change
already existing ones.
Syntax:
Examples
# we want ratings in % (percent)
Therapy_long %>%
mutate(rating_p = round(rating/7 * 100, digits = 1))
#> # A tibble: 200 x 5
#> Vpnr Gruppe time rating rating_p
#> <fct> <fct> <fct> <dbl> <dbl>
#> 1 1 Kontrollgruppe Pretest 4.29 61.2
#> 2 1 Kontrollgruppe Posttest 3.21 45.8
#> 3 2 Kontrollgruppe Pretest 6.18 88.2
#> 4 2 Kontrollgruppe Posttest 5.99 85.5
#> 5 3 Kontrollgruppe Pretest 3.93 56.2
#> 6 3 Kontrollgruppe Posttest 4.17 59.6
#> # … with 194 more rows
There is also a function called transmute()
, which returns only the newly created variables.
6.4.5.1 Scoped versions of mutate
mutate
has scoped versions that allows you to apply transformations to multiple variables at once.
As an example, say we want to convert the ratings in the Therapy data frame to a different scale. First, we create a function that does the job of converting the rating to a percentage.
Then we select both Pretest and Posttest columns. They are the only numeric
variables, so we can use mutate_if
with is.numeric
.
Therapy_wide %>%
mutate_if(is.numeric, change_scale)
#> # A tibble: 100 x 4
#> Vpnr Gruppe Pretest Posttest
#> <fct> <fct> <dbl> <dbl>
#> 1 1 Kontrollgruppe 61.2 45.8
#> 2 2 Kontrollgruppe 88.2 85.5
#> 3 3 Kontrollgruppe 56.2 59.6
#> 4 4 Kontrollgruppe 72.3 68.1
#> 5 5 Kontrollgruppe 92.2 80.6
#> 6 6 Kontrollgruppe 64.1 66.7
#> # … with 94 more rows
Alternatively, we can use mutate_at
with the helper functions vars()
and starts_with
.
Therapy_wide %>%
mutate_at(vars(starts_with("P")), change_scale)
#> # A tibble: 100 x 4
#> Vpnr Gruppe Pretest Posttest
#> <fct> <fct> <dbl> <dbl>
#> 1 1 Kontrollgruppe 61.2 45.8
#> 2 2 Kontrollgruppe 88.2 85.5
#> 3 3 Kontrollgruppe 56.2 59.6
#> 4 4 Kontrollgruppe 72.3 68.1
#> 5 5 Kontrollgruppe 92.2 80.6
#> 6 6 Kontrollgruppe 64.1 66.7
#> # … with 94 more rows
6.4.6 Grouping data frames: group_by()
We often want to perform operations on subsets of a data frame, given by levels
of a grouping factor. The function group_by()
groups the data frame according
to one or more grouping variables, applies a function to every group and then
recombines the data frame (split-apply-combine). group_by()
is usually used in
combination with other functions.
Syntax:
Examples
Let’s import a new dataset. In this example, taken from the Statistics II course, 12 subjects were allocated to 4 experimental conditions, and the level of aggressive behaviour was measured. Suppose that we want to add a group-centred aggression variable.
library(readr)
alk_aggr <- read_csv("data/alkohol-aggression.csv") %>%
mutate(alkoholbedingung = as_factor(alkoholbedingung))
#> Parsed with column specification:
#> cols(
#> aggressivitaet = col_double(),
#> alkoholbedingung = col_character()
#> )
We can use group_by()
to split the data frame into four separate pieces,
compute the group mean for each, and put the pieces back together again. Before
we do that, we will change the order of the columns using select
.
alk_aggr %>%
select(alkoholbedingung, aggressivitaet) %>%
group_by(alkoholbedingung) %>%
mutate(group_mean = mean(aggressivitaet),
aggr_c = aggressivitaet - group_mean)
#> # A tibble: 12 x 4
#> # Groups: alkoholbedingung [4]
#> alkoholbedingung aggressivitaet group_mean aggr_c
#> <fct> <dbl> <dbl> <dbl>
#> 1 kein_alkohol 64 62 2
#> 2 kein_alkohol 58 62 -4
#> 3 kein_alkohol 64 62 2
#> 4 placebo 74 75 -1
#> 5 placebo 79 75 4
#> 6 placebo 72 75 -3
#> # … with 6 more rows
6.4.7 Recoding values: recode()
Together with mutate()
, recode()
can be used to change values of a
variable. This can be done with numeric or character variables.
Syntax:
If we want to change the levels of a factor, the forcats
package provides more
functionality. To manually recode the factor, we use fct_recode()
.
Syntax:
# we want to rename the control group to "control" ,
# and the therapy group to "treatment"
Therapy_long %>%
mutate(Gruppe = fct_recode(Gruppe,
control = "Kontrollgruppe",
treatment = "Therapiegruppe"))
#> # A tibble: 200 x 4
#> Vpnr Gruppe time rating
#> <fct> <fct> <fct> <dbl>
#> 1 1 control Pretest 4.29
#> 2 1 control Posttest 3.21
#> 3 2 control Pretest 6.18
#> 4 2 control Posttest 5.99
#> 5 3 control Pretest 3.93
#> 6 3 control Posttest 4.17
#> # … with 194 more rows
6.4.8 Summarizing variables: summarize()
Using the functions summarize()
or summarise()
(both spellings are possible)
we can summarize variables and compute descriptive statistics. In contrast to
mutate()
, summarize()
returns a value for each group, not each observation.
Syntax:
FUNCTION
is a placeholder for any function that can be used to summarize data, e.g. mean()
, sd()
.
If we want the mean and the standard error, we could use the function that we created in chapter 4:
Let’s compute the group mean, standard deviation and standard error for each time point:
Therapy_long %>%
group_by(Gruppe, time) %>%
summarize(mean= mean(rating),
sd = sd(rating),
se = se(rating))
#> # A tibble: 4 x 5
#> # Groups: Gruppe [2]
#> Gruppe time mean sd se
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Kontrollgruppe Pretest 5.06 0.908 0.128
#> 2 Kontrollgruppe Posttest 4.65 0.834 0.118
#> 3 Therapiegruppe Pretest 4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest 4.23 0.752 0.106
A more advanced example of this uses summarize_if
. We can apply a list of
functions to all variables that pass the test is.numeric
. This gives the same output as the code above.
Therapy_long %>%
group_by(Gruppe, time) %>%
summarize_if(is.numeric, list(mean = mean, sd = sd, se = se))
#> # A tibble: 4 x 5
#> # Groups: Gruppe [2]
#> Gruppe time mean sd se
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Kontrollgruppe Pretest 5.06 0.908 0.128
#> 2 Kontrollgruppe Posttest 4.65 0.834 0.118
#> 3 Therapiegruppe Pretest 4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest 4.23 0.752 0.106
Or we can use summarize_at
and use vars()
to select the rating
variable.
Therapy_long %>%
group_by(Gruppe, time) %>%
summarize_at(vars(rating), list(~mean(.), ~sd(.), ~se(.)))
#> # A tibble: 4 x 5
#> # Groups: Gruppe [2]
#> Gruppe time mean sd se
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Kontrollgruppe Pretest 5.06 0.908 0.128
#> 2 Kontrollgruppe Posttest 4.65 0.834 0.118
#> 3 Therapiegruppe Pretest 4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest 4.23 0.752 0.106
# or
Therapy_long %>%
group_by(Gruppe, time) %>%
summarize_at("rating", list(~mean(.), ~sd(.), ~se(.)))
#> # A tibble: 4 x 5
#> # Groups: Gruppe [2]
#> Gruppe time mean sd se
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Kontrollgruppe Pretest 5.06 0.908 0.128
#> 2 Kontrollgruppe Posttest 4.65 0.834 0.118
#> 3 Therapiegruppe Pretest 4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest 4.23 0.752 0.106
6.4.9 Summary
We can now start combining all functions. As an example, we can compute the
difference between pretest and posttest measurements in the Therapy
data set
(we previously omitted this).
Therapy_long %>%
pivot_wider(names_from = time, values_from = rating) %>%
mutate(Difference = Pretest - Posttest)
#> # A tibble: 100 x 5
#> Vpnr Gruppe Pretest Posttest Difference
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 1 Kontrollgruppe 4.29 3.21 1.08
#> 2 2 Kontrollgruppe 6.18 5.99 0.190
#> 3 3 Kontrollgruppe 3.93 4.17 -0.239
#> 4 4 Kontrollgruppe 5.06 4.76 0.295
#> 5 5 Kontrollgruppe 6.45 5.64 0.814
#> 6 6 Kontrollgruppe 4.49 4.67 -0.180
#> # … with 94 more rows
6.5 Row-wise operations
Suppose that you have your data in a wide data frame, and you want to create summary scores across a number of columns. In other words, you want to perform an operation on the rows of a data frame.
The operations in the tidyverse
packages are intended to operate on columns, not
rows, so this can be rather tricky.
Let’s create a toy example to illustrate this:
library(tidyverse)
df <- tribble(
~name, ~n1, ~n2, ~e1, ~e2,
"Diane", 7, 8, 9, 4,
"Todd", 3, 2, 3, 3)
We have two subjects, who responded to four questions. Two of those, n1
and
n2
represent a common n
factor, whereas the other two variables, e1
and
e2
, represent a common e
factor.
It is tempting to try out something like this:
df %>%
mutate(n = (n1 + n2)/2,
e = (e1 + e2)/2)
#> # A tibble: 2 x 7
#> name n1 n2 e1 e2 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 7.5 6.5
#> 2 Todd 3 2 3 3 2.5 3
However, this solution is very specific to this particular example, and involves a lot of repetition.
dplyr
does provide a function, rowwise()
that will help us here:
df %>%
rowwise() %>%
mutate(n = mean(c(n1, n2)),
e = mean(c(e1, e2)))
#> Source: local data frame [2 x 7]
#> Groups: <by row>
#>
#> # A tibble: 2 x 7
#> name n1 n2 e1 e2 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 7.5 6.5
#> 2 Todd 3 2 3 3 2.5 3
This is much better, in the sense that we can now apply a function rather than
specifying the calculations by hand. A further advantage is that functions, such
as mean
or sd
, can handle missing values.
Let’s add a missing value:
df <- tribble(
~name, ~n1, ~n2, ~e1, ~e2,
"Diane", 7, 8, 9, 4,
"Todd", NA, 2, 3, 3)
df
#> # A tibble: 2 x 5
#> name n1 n2 e1 e2
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4
#> 2 Todd NA 2 3 3
This approach will no longer work:
df %>%
mutate(n = (n1 + n2)/2,
e = (e1 + e2)/2)
#> # A tibble: 2 x 7
#> name n1 n2 e1 e2 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 7.5 6.5
#> 2 Todd NA 2 3 3 NA 3
but this one will:
df %>%
rowwise() %>%
mutate(n = mean(c(n1, n2), na.rm = TRUE),
e = mean(c(e1, e2), na.rm = TRUE))
#> Source: local data frame [2 x 7]
#> Groups: <by row>
#>
#> # A tibble: 2 x 7
#> name n1 n2 e1 e2 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 7.5 6.5
#> 2 Todd NA 2 3 3 2 3
We could even recreate some missing functionality from SPSS here. Our summary should only be created for rows that do not exceed the maximum permissible number of missing values.
The goal is to compute some sort of row-wise score, i.e. a sum or an average, over several columns of a data frame, while being able to specify how many missing values you will allow per row.
We can create a function that does this:
score_items <- function(v, max_na = NULL) {
if (is.null(max_na)) {
max_na <- length(v)
}
n_na <- length(v[is.na(v)])
if (n_na <= max_na) {
out <- mean(v, na.rm = TRUE)
} else {
out <- NA
}
out
}
This function will compute the mean of a vector, but only if that vector contains fewer than a pre-determined number of missing values.
Let’s create a new data frame to test this:
df_2 <- tribble(
~name, ~n1, ~n2, ~n3, ~n4, ~e1, ~e2, ~e3, ~e4,
"Diane", 7, 8, 9, 4, NA, 2, 3, 2,
"Todd", NA, 2, NA, 3, 1, 4, 5, 1)
df_2
#> # A tibble: 2 x 9
#> name n1 n2 n3 n4 e1 e2 e3 e4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 NA 2 3 2
#> 2 Todd NA 2 NA 3 1 4 5 1
df_2 %>%
rowwise() %>%
mutate(n = score_items(c(n1, n2, n3, n4), max_na = 1),
e = score_items(c(e1, e2, e3, e4), max_na = 1))
#> Source: local data frame [2 x 11]
#> Groups: <by row>
#>
#> # A tibble: 2 x 11
#> name n1 n2 n3 n4 e1 e2 e3 e4 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 NA 2 3 2 7 2.33
#> 2 Todd NA 2 NA 3 1 4 5 1 NA 2.75
This will compute the mean over the specified columns, but only if the subject has at most 1 missing value.
If we are not willing to accept any NAs:
df_2 %>%
rowwise() %>%
mutate(n = score_items(c(n1, n2, n3, n4), max_na = 0),
e = score_items(c(e1, e2, e3, e4), max_na = 0))
#> Source: local data frame [2 x 11]
#> Groups: <by row>
#>
#> # A tibble: 2 x 11
#> name n1 n2 n3 n4 e1 e2 e3 e4 n e
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 8 9 4 NA 2 3 2 7 NA
#> 2 Todd NA 2 NA 3 1 4 5 1 NA 2.75
This code still seems rather inelegant, and involves far too much much repetition. If the data frame were long, they would look like this:
df_2_long <- df_2 %>%
pivot_longer(-name,
names_to = c(".value", "obs"),
names_pattern = "(.)(.)")
df_2_long
#> # A tibble: 8 x 4
#> name obs n e
#> <chr> <chr> <dbl> <dbl>
#> 1 Diane 1 7 NA
#> 2 Diane 2 8 2
#> 3 Diane 3 9 3
#> 4 Diane 4 4 2
#> 5 Todd 1 NA 1
#> 6 Todd 2 2 4
#> # … with 2 more rows
Since the columns in df_2
contain two variables, we need to split them. This
is achieved by using the names_pattern
argument.
In this case it would be pretty straightforward to summarize the variables n
and e
after using group_by(name)
.
df_2_long %>%
group_by(name) %>%
select(-obs) %>%
summarize_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 2 x 3
#> name n e
#> <chr> <dbl> <dbl>
#> 1 Diane 7 2.33
#> 2 Todd 2.5 2.75
We can also use our score_items()
function:
df_2_long %>%
group_by(name) %>%
select(-obs) %>%
summarize_if(is.numeric, score_items, max_na = 1)
#> # A tibble: 2 x 3
#> name n e
#> <chr> <dbl> <dbl>
#> 1 Diane 7 2.33
#> 2 Todd NA 2.75
Here, we are using scoped version of summarize
, summarize_if
,
which summarizes all columns for which the provided predicate (is.numeric
in
this case) evaluates to TRUE
. The mean
function is used to summarize.
df_2 %>%
pivot_longer(-name,
names_to = c(".value", "obs"),
names_pattern = "(.)(.)") %>%
select(-obs) %>%
group_by(name) %>%
summarize_if(is.numeric, ~score_items(.)) %>%
right_join(df_2)
#> Joining, by = "name"
#> # A tibble: 2 x 11
#> name n e n1 n2 n3 n4 e1 e2 e3 e4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane 7 2.33 7 8 9 4 NA 2 3 2
#> 2 Todd 2.5 2.75 NA 2 NA 3 1 4 5 1
6.6 Exercises
Download and import the data files “rtata.csv” and “beispieldaten.sav” from ILIAS.
Reaction times
In this simulated data set, 3 participants performed a reaction time experiment in 3 conditions. In one condition, participants should respond quickly, in another they are expected to be slower. The third condition is a control.
library(readr)
rt_wide <- read_csv("data/rtdata.csv")
#> Parsed with column specification:
#> cols(
#> ID = col_double(),
#> control = col_double(),
#> slow = col_double(),
#> fast = col_double()
#> )
rt_wide
#> # A tibble: 300 x 4
#> ID control slow fast
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1.55 3.62 3.31
#> 2 1 0.782 1.50 1.57
#> 3 1 3.10 2.30 1.13
#> 4 1 2.97 1.83 1.97
#> 5 1 1.38 3.53 3.26
#> 6 1 0.329 3.44 5.34
#> # … with 294 more rows
- This data set is wide and first needs to be converted into a long format. Grouping variables should be converted into factors.
rt_long <- rt_wide %>%
pivot_longer(-ID, names_to = "condition", values_to = "rt") %>%
mutate(ID = factor(ID),
condition = factor(condition))
rt_long
#> # A tibble: 900 x 3
#> ID condition rt
#> <fct> <fct> <dbl>
#> 1 1 control 1.55
#> 2 1 slow 3.62
#> 3 1 fast 3.31
#> 4 1 control 0.782
#> 5 1 slow 1.50
#> 6 1 fast 1.57
#> # … with 894 more rows
- Compute the mean reaction time and standard deviation for each participant in each condition.
summary_rt <- rt_long %>%
group_by(ID, condition) %>%
summarise(mean_rt = mean(rt),
sd_rt = sd(rt))
summary_rt
#> # A tibble: 9 x 4
#> # Groups: ID [3]
#> ID condition mean_rt sd_rt
#> <fct> <fct> <dbl> <dbl>
#> 1 1 control 1.68 1.02
#> 2 1 fast 2.15 1.58
#> 3 1 slow 3.20 1.93
#> 4 2 control 2.12 1.31
#> 5 2 fast 1.69 1.21
#> 6 2 slow 3.23 1.69
#> # … with 3 more rows
- Compute the log RT and add this to the data frame as a new variable.
rt_long <- rt_long %>%
mutate(log_rt = log(rt))
rt_long
#> # A tibble: 900 x 4
#> ID condition rt log_rt
#> <fct> <fct> <dbl> <dbl>
#> 1 1 control 1.55 0.439
#> 2 1 slow 3.62 1.29
#> 3 1 fast 3.31 1.20
#> 4 1 control 0.782 -0.246
#> 5 1 slow 1.50 0.407
#> 6 1 fast 1.57 0.452
#> # … with 894 more rows
A further reason to use the long format is that the ggplot2
package requires this. The following code shows an example.
Adolescents in West and East Germany
library(dplyr)
library(tidyr)
library(stringr)
library(haven)
adolescents <- read_sav("data/beispieldaten.sav")
adolescents
#> # A tibble: 286 x 98
#> ID westost geschlecht alter swk1 swk2 swk3 swk4 swk5 swk6 swk7 swk8
#> <dbl> <dbl+l> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 [Wes… 1 [weibli… 13 2 5 6 4 7 5 5 6
#> 2 2 0 [Wes… 0 [männli… 14 4 6 4 4 6 5 5 6
#> 3 10 0 [Wes… 1 [weibli… 14 5 4 5 6 5 7 7 6
#> 4 11 0 [Wes… 1 [weibli… 14 3 6 6 5 7 7 6 6
#> 5 12 0 [Wes… 1 [weibli… 14 5 6 6 5 7 6 5 5
#> 6 14 0 [Wes… 0 [männli… 14 3 5 5 4 5 7 5 6
#> # … with 280 more rows, and 86 more variables: swk9 <dbl>, swk10 <dbl>,
#> # swk11 <dbl>, swk12 <dbl>, swk13 <dbl>, swk14 <dbl>, swk15 <dbl>,
#> # swk16 <dbl>, swk17 <dbl>, swk18 <dbl>, swk19 <dbl>, swk20 <dbl>,
#> # swk21 <dbl>, swk22 <dbl>, swk23 <dbl>, swk24 <dbl>, swk25 <dbl>,
#> # swk26 <dbl>, swk27 <dbl>, swk28 <dbl>, swk29 <dbl>, swk30 <dbl>,
#> # swk31 <dbl>, swk32 <dbl>, swk33 <dbl>, swk34 <dbl>, swk35 <dbl>,
#> # swk36 <dbl>, unteltern1 <dbl>, unteltern2 <dbl>, unteltern3 <dbl>,
#> # unteltern4 <dbl>, unteltern5 <dbl>, unteltern6 <dbl>, untfreunde1 <dbl>,
#> # untfreunde2 <dbl>, untfreunde3 <dbl>, untfreunde4 <dbl>, untfreunde5 <dbl>,
#> # untfreunde6 <dbl>, leben1 <dbl>, leben2 <dbl>, leben3 <dbl>, leben4 <dbl>,
#> # leben5 <dbl>, leben6 <dbl>, leben7 <dbl>, leben8 <dbl>, leben9 <dbl>,
#> # leben10 <dbl>, leben11 <dbl>, stress1 <dbl>, stress2 <dbl>, stress3 <dbl>,
#> # stress4 <dbl>, stress5 <dbl>, stress6 <dbl>, stress7 <dbl>, stress8 <dbl>,
#> # stress9 <dbl>, stress10 <dbl>, stress11 <dbl>, stress12 <dbl>,
#> # Deutschnote <dbl>, Mathenote <dbl+lbl>, Fremdsprachenote <dbl+lbl>,
#> # Gesamtnote <dbl>, bildung_vater <dbl+lbl>, bildung_mutter <dbl+lbl>,
#> # bildung_vater_binaer <dbl+lbl>, bildung_mutter_binaer <dbl+lbl>,
#> # swk_neueslernen <dbl>, swk_lernregulation <dbl>, swk_motivation <dbl>,
#> # swk_durchsetzung <dbl>, swk_sozialkomp <dbl>, swk_beziehung <dbl>,
#> # unt_eltern <dbl>, unt_freunde <dbl>, leben_selbst <dbl>,
#> # leben_familie <dbl>, leben_schule <dbl>, leben_freunde <dbl>,
#> # leben_gesamt <dbl>, stress_somatisch <dbl>, stress_psychisch <dbl>
adolescents <- adolescents %>%
mutate(westost = factor(westost),
geschlecht = factor(geschlecht),
ID = factor(ID))
Mean age
Compute the mean age and standard deviation for both sexes.
Emotional support
Select the variables unt_freunde
(emotional support from friends) and unt_eltern
(emotional support from parents) and use these to create a long data frame.
Step 1:
# select variables
unterstuetzung <- adolescents %>%
dplyr::select(ID, westost, unt_freunde, unt_eltern)
Step 2:
# handle missing values: are there any?
unterstuetzung[!complete.cases(unterstuetzung), ]
#> # A tibble: 2 x 4
#> ID westost unt_freunde unt_eltern
#> <fct> <fct> <dbl> <dbl>
#> 1 332 1 4.83 NA
#> 2 16 0 NA 5.17
Step 3:
# convert wide to long
unterstuetzung <- unterstuetzung %>%
pivot_longer(starts_with("unt"), names_to = "freunde_eltern",
values_to = "unterstuetzung")
Step 4:
# convert freunde_eltern variable to a factor
unterstuetzung <- unterstuetzung %>%
mutate(freunde_eltern = factor(freunde_eltern))
Step 5 (advanced):
# rename factor levels (remove "unt_" prefix)
unterstuetzung <- unterstuetzung %>%
mutate(freunde_eltern = str_replace(freunde_eltern, ".*_", ""))
Step 6:
# inspect resulting data frame
unterstuetzung
#> # A tibble: 568 x 4
#> ID westost freunde_eltern unterstuetzung
#> <fct> <fct> <chr> <dbl>
#> 1 2 0 freunde 4.83
#> 2 2 0 eltern 5.17
#> 3 14 0 freunde 6.33
#> 4 14 0 eltern 6.5
#> 5 15 0 freunde 5
#> 6 15 0 eltern 5.83
#> # … with 562 more rows
And now all in one go:
unterstuetzung <- adolescents %>%
dplyr::select(ID, westost, unt_freunde, unt_eltern) %>%
# fehlende Werte ausschliessen
drop_na() %>%
# von wide zu long
pivot_longer(starts_with("unt"), names_to = "freunde_eltern",
values_to = "unterstuetzung") %>%
# freunde_eltern variable bearbeiten
mutate(freunde_eltern = factor(str_replace(freunde_eltern,
".*_", "")))
unterstuetzung
#> # A tibble: 568 x 4
#> ID westost freunde_eltern unterstuetzung
#> <fct> <fct> <fct> <dbl>
#> 1 1 0 freunde 6.67
#> 2 1 0 eltern 6.67
#> 3 2 0 freunde 4.83
#> 4 2 0 eltern 5.17
#> 5 10 0 freunde 4
#> 6 10 0 eltern 6.83
#> # … with 562 more rows
Life Satisfaction
We want to investigate life satisfaction in various contexts. The relevant variables
all begin with leben_
. We want to select all of these (excluding
leben_gesamt
, which is already a summary).
We could choose all variables starting with leben_
using
starts_with("leben_")
, and then exclude leben_gesamt
individually. We also
want to retain each person ID
.
zufriedenheit_wide <- adolescents %>%
dplyr::select(ID, starts_with("leben_"), -ends_with("gesamt"))
zufriedenheit_wide
#> # A tibble: 286 x 5
#> ID leben_selbst leben_familie leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.5 7 4.33 6.5
#> 2 2 4.5 4.67 4.67 5
#> 3 10 6.5 6.33 3.33 5
#> 4 11 6 7 4 6.5
#> 5 12 6 5.67 5.33 6.5
#> 6 14 5 5.33 5 6
#> # … with 280 more rows
Can you think of any other ways of achieving this?
Solution
adolescents %>%
dplyr::select(ID, starts_with("leben_"), -leben_gesamt)
#> # A tibble: 286 x 5
#> ID leben_selbst leben_familie leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.5 7 4.33 6.5
#> 2 2 4.5 4.67 4.67 5
#> 3 10 6.5 6.33 3.33 5
#> 4 11 6 7 4 6.5
#> 5 12 6 5.67 5.33 6.5
#> 6 14 5 5.33 5 6
#> # … with 280 more rows
adolescents %>%
dplyr::select(ID, leben_selbst, leben_familie, leben_schule, leben_freunde)
#> # A tibble: 286 x 5
#> ID leben_selbst leben_familie leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.5 7 4.33 6.5
#> 2 2 4.5 4.67 4.67 5
#> 3 10 6.5 6.33 3.33 5
#> 4 11 6 7 4 6.5
#> 5 12 6 5.67 5.33 6.5
#> 6 14 5 5.33 5 6
#> # … with 280 more rows
variablen <- c("leben_selbst", "leben_familie",
"leben_schule", "leben_freunde")
adolescents %>%
dplyr::select(ID, one_of(variablen))
#> # A tibble: 286 x 5
#> ID leben_selbst leben_familie leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.5 7 4.33 6.5
#> 2 2 4.5 4.67 4.67 5
#> 3 10 6.5 6.33 3.33 5
#> 4 11 6 7 4 6.5
#> 5 12 6 5.67 5.33 6.5
#> 6 14 5 5.33 5 6
#> # … with 280 more rows
# advanced (using a regular expression):
adolescents %>%
dplyr::select(ID, matches("^leben_[^g]"))
#> # A tibble: 286 x 5
#> ID leben_selbst leben_familie leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.5 7 4.33 6.5
#> 2 2 4.5 4.67 4.67 5
#> 3 10 6.5 6.33 3.33 5
#> 4 11 6 7 4 6.5
#> 5 12 6 5.67 5.33 6.5
#> 6 14 5 5.33 5 6
#> # … with 280 more rows
All selected variables can be considered as levels of an underlying (repeated-measures) factor. We can convert the data frame into a long format.
zufriedenheit_long <- zufriedenheit_wide %>%
pivot_longer(-ID, names_to = "lebensbereich", values_to = "zufriedenheit")
Now we need to consider what to do with missing values. Should we exclude all people with NAs
in one context, or should we keep all observations?
First, we look for rows with NAs
.
# head(20) displays the first 20 rows:
!complete.cases(zufriedenheit_wide) %>%
head(20)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This returns a logical vector
, with either TRUE
or FALSE
for each row (it will be true if there is an NA
in that row).
zufriedenheit_wide[!complete.cases(zufriedenheit_wide), ]
#> # A tibble: 2 x 5
#> ID leben_selbst leben_fam leben_schule leben_freunde
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 175 NA NA NA NA
#> 2 206 5 NA 5.33 5
There are two participants with missing values - we decide to exclude these.
zufriedenheit_long <- zufriedenheit_wide %>%
drop_na() %>%
pivot_longer(-ID, names_to = "lebensbereich", values_to = "zufriedenheit")
The variable lebensbereich
(context) should be converted into a factor:
zufriedenheit_long <- zufriedenheit_long %>%
mutate(lebensbereich = as_factor(lebensbereich))
zufriedenheit_long
#> # A tibble: 1,136 x 3
#> ID lebensbereich zufriedenheit
#> <fct> <fct> <dbl>
#> 1 2 leben_selbst 4.5
#> 2 2 leben_fam 4.67
#> 3 2 leben_schule 4.67
#> 4 2 leben_freunde 5
#> 5 14 leben_selbst 5
#> 6 14 leben_fam 5.33
#> # … with 1,130 more rows
We can also remove the prefix leben_
from each fator level.
library(stringr)
zufriedenheit_long <- zufriedenheit_long %>%
mutate(lebensbereich = as.factor(str_replace(lebensbereich,
".*_", "")))
levels(zufriedenheit_long$lebensbereich)
#> [1] "fam" "freunde" "schule" "selbst"
Now we will sort by ID
in ascending order:
zufriedenheit_long <- zufriedenheit_long %>%
arrange(ID)
zufriedenheit_long
#> # A tibble: 1,136 x 3
#> ID lebensbereich zufriedenheit
#> <fct> <fct> <dbl>
#> 1 1 selbst 5.5
#> 2 1 fam 7
#> 3 1 schule 4.33
#> 4 1 freunde 6.5
#> 5 2 selbst 4.5
#> 6 2 fam 4.67
#> # … with 1,130 more rows
All in one code chunk:
Stress
Female adolescents are thought to be more prone to depressive disorders than
males. We can use a t-test to test this hypothesis, but first we want to look
at the data set and compute the mean, standard deviation and standard error for the variable
stress_psychisch
, for each sex separately (rounded to 4 decimal places).
stress <- adolescents %>%
dplyr::select(ID, geschlecht, stress_psychisch) %>%
mutate_at(vars(geschlecht, ID), as_factor)
stress
#> # A tibble: 286 x 3
#> ID geschlecht stress_psychisch
#> <fct> <fct> <dbl>
#> 1 1 1 1.67
#> 2 2 0 3.5
#> 3 10 1 3.67
#> 4 11 1 1.5
#> 5 12 1 2.5
#> 6 14 0 1
#> # … with 280 more rows
There might be missing values. We have two options:
- We remove all participants with NAs
- We can leave all participants in the data set and handle any NAs using the
na.rm
argument in themean()
andsd()
functions.
stress_summary <- stress %>%
drop_na() %>%
group_by(geschlecht) %>%
summarize_at("stress_psychisch", list(mean = mean, sd = sd, se = se)) %>%
mutate_if(is.numeric, round, digits = 4)
pivot_longer()
takes one or several columns as input and creates one or multiple columns from the column names. It also creates a column from the data stored in the cells.