6 Working with Data

6.1 Tidying data

In statistics and data science, you will often hear the term wide and long when referring to data. This refers to the shape of a dataset; a wide dataset has more columns than a corresponding long dataset. A long dataset has more rows than a corresponding wide dataset. However, the terms long and wide aren’t just about shape. In a long data set, each observation has its own row and each variable is a column. In a wide dataset, variables can be spread out over several columns.

Consider this example: we have two observations for two subjects, at three separate time points. Let’s call these t1, t2 and t3. The following representation is considered to be in the wide format:

#> # A tibble: 2 x 4
#>   name        t1    t2    t3
#>   <fct>    <dbl> <dbl> <dbl>
#> 1 Marshall     4     5     7
#> 2 Skye         3     6     7

Each row represents one subject, and the observations are spread over the three columns t1, t2 and t3. These columns actually represent three levels of a common factor, time. If you were going to perform a repeated-measures ANOVA in SPSS, this is the kind of dataset that you would use.

We can represent exactly the same data in a longer format, in which each row corresponds to an observation, and each column represents a variable. In total, there are three variables: name, time and score.

df %>% 
    pivot_longer(-name, names_to = "time", values_to = "score")
#> # A tibble: 6 x 3
#>   name     time  score
#>   <fct>    <chr> <dbl>
#> 1 Marshall t1        4
#> 2 Marshall t2        5
#> 3 Marshall t3        7
#> 4 Skye     t1        3
#> 5 Skye     t2        6
#> 6 Skye     t3        7

We no longer have one row for each subject. In fact, variables that do not change for each observation are repeated across rows.

A data set that is organized in this way is often referred to as tidy.

As we will see later, many kinds of statistical analysis and in particular plotting functions require a long data set, and thus, a surprisingly large amount of time has to be spent on organizing data for further analysis (this type of work is often called ‘data wrangling’).

In this course, we will be working with the tidyverse packages for manipulating and reshaping data. Of course, there are other ways of manipulating data, but in our view, the tidyverse method provides the most consistent way of working with data, and reduces the cognitive load on the user.

The packages used for data manipulations are tidyr for reshaping data frames, and dplyr for manipulating data frames. The functions we are going to look at are:

Package Function name
tidyr pivot_longer() increases the number of rows, decreases the number of columns
tidyr pivot_wider() decreases the number of rows, increases the number of columns
tidyr drop_na() omits all rows in a data frame containing NA values
dplyr filter() selects observations
dplyr select() selects variables
dplyr summarize() or summarise() summarizes data
dplyr mutate() creates new variables and changes already existing variables
dplyr group_by() lets you perform operations on subsets of the data

The functions filter(), select(), summarize() and mutate() have so-called scoped versions, which let you apply the same transformation to multiple variables. There are three variants of each of these functions:

  • *_all applies to every variable (e.g. mutate_all())

  • *_at applies to variables selected with a character vector or the helper function vars() (e.g. mutate_at())

  • *_if applies to variables selected with a predicate function (e.g. mutate_if())

These functions allow you to perform very complex data manipulation, using clean and elegant code. Besides the functions mentioned above, dplyr contains many more. These include functions that perform set operations, e.g. finding the intersection of two data frames, and allow you to join data frames together.

In addition tidyr and dplyr, we will also use the forcats package for manipulating factors, and the stringr package for working with strings.

6.2 The pipe operator: %>%

Before we begin manipulating data, we need to introduce a new operator. We have already seen that code can become hard to read when we have a sequence of operations, i.e. nested function calls. Suppose we have a numeric vector (n = 10), which we want to center (we will generate the data using rnorm). Then we want to compute and round the standard deviation to two decimal places.

set.seed(1283)
data <- rnorm(10, 24, 5)
data
#>  [1] 24.74984 21.91726 23.98551 19.63019 23.96428 22.83092 18.86240 19.08125
#>  [9] 23.76589 21.88846

We can use nested function calls:

round(sd(scale(data,
               center = TRUE,
               scale = FALSE)),
      digits = 2)
#> [1] 2.19

The functions scale(), sd() and round() are called in order (inner to outer). The output of one function is passed as an argument to the next function.

The functions scale() and round() have additional arguments: center = TRUE, scale = FALSE for scale, and digits = 2 for round.

This way of writing code can be difficult to read, although using indentation really helps here.

An alternative approach is to save each intermediate step:

data_z <- scale(data, center = TRUE,
                      scale = FALSE)

sd_data_z <- sd(data_z)
sd_data_z_rounded <- round(sd_data_z,
                           digits = 2)

sd_data_z_rounded
#> [1] 2.19

Doing this means that we create many variables that we don’t really need, cluttering up our workspace.

Luckily, there is an elegant solution to this problem. Using the pipe operator, we can pass the output of one function as an argument to the next function without using nested function calls. Note that this is really just a type of ‘syntactic sugar’; the result is exactly the same as using traditional function calls.

The pipe operator is available in the dplyr package.

library(dplyr)
%>%

It is defined as in infix operator, meaning that it is used in between two objects.

The pipe operator has its own key combo in RStudio: Cmd+Shift+M (MacOS) or Ctrl+Shift+M (Windows and Linux).

Using the %>% operator, our example from above:

round(sd(scale(data,
               center = TRUE,
               scale = FALSE)),
      digits = 2)
#> [1] 2.19

becomes:

library(dplyr)
data %>%
    scale(center = TRUE, scale = FALSE) %>%
    sd() %>%
    round(digits = 2)
#> [1] 2.19

This should be read as:

  1. We start with the object data and pass it to the function scale() as an argument
  2. We use scale(), with arguments center = TRUE, scale = FALSE and then pass the output to the function sd()
  3. We apply the function sd() and pass the output to the function round()
  4. round() is called, with the argument digits = 2. The output is printed to the console.

If we want to save the last output, we therefore have to assign the entire computation to a variable:

sd_data_z_rounded <- data %>%
    scale(center = TRUE, scale = FALSE) %>%
    sd() %>%
    round(digits = 2)

sd_data_z_rounded
#> [1] 2.19

%>% passes an object to a function as its first argument (unless specified otherwise). The advantages are:

  • code becomes easier to read
  • we don’t need intermediate variables

Pipe syntax

Suppose that f(), g() and h() are functions. Then, the %>% operator is used like this:

x %>% f
    or
x %>% f()

is equivalent to:

f(x)

if y is an argument of f():

x %>% f(y)
f(x, y)

Applying f(), g() and h() in sequence:

x %>% f() %>% g() %>% h()

# or

x %>%
  f() %>%
  g() %>%
  h()

is equivalent to:

h(g(f(x)))

We can use x as any argument, it doesn’t have to be the first. We just need to use the placeholder .:

x %>% f(y, .)

This is equivalent to:

f(y, x)
x %>% f(y, z = .)

is equivalent to:

f(y, z = x)

6.3 Reshaping: tidyr

In order to reshape a data set, we will need two functions: pivot_longer() and pivot_wider(), both of which are provided by the tidyr package.

library(tidyr)

6.3.1 pivot_longer()

pivot_longer() is used for converting a wide to a long data set. This means that pivot_longer() is used to combine columns, which may represent levels of a common factor, into one column. The values of the original columns are combined in a new value column. Another way of putting this is that pivot_longer()’ increases the number of rows and decreases the number of columns of a data frame.

pivot_longer() takes one or several columns as input and creates one or multiple columns from the column names. It also creates a column from the data stored in the cells.

pivot_longer()` is used like this:

pivot_longer(data, cols, names_to, values_to)

or, using %>%:

data %>%
    pivot_longer(cols, names_to, values_to)

The arguments are:

data:       a data frame
cols:       the columns you want to pivot 
names_to:   a string specifying the name of the new column created from the selected column names
values_to:  a string specifying the name of the column created from the observations in the cells

Let’s have another look at the example used above. We have a wide data set with two subjects, measured at three time points (the function tribble() lets you easily create a tibble by rows).

df <- tribble(
    ~name,    ~t1, ~t2, ~t3,
    "Marshall", 4,   5,  7,
    "Skye",     3,   6,  7
) %>% mutate_if(is.character, as_factor)

We assume that t1, t2 and t3 are not really separate variables, but should be considered as levels of a time factor (e.g. as three time points in a repeated measures data set). The values in the t1, t2 and t3 columns refer to the same kind of measurement (e.g. score) and therefore should be actually one variable in a long data set. We would like our new factor to be called time and the measured variable should be called score. The name variable should be ignored, i.e. it should not be pivoted.

library(tidyr)
df_long <- df %>%
  pivot_longer(-name, names_to = "time", values_to = "score")
df_long
#> # A tibble: 6 x 3
#>   name     time  score
#>   <fct>    <chr> <dbl>
#> 1 Marshall t1        4
#> 2 Marshall t2        5
#> 3 Marshall t3        7
#> 4 Skye     t1        3
#> 5 Skye     t2        6
#> 6 Skye     t3        7

time should be defined as a factor in the next step:

df_long$time <- as_factor(df_long$time)
df_long
#> # A tibble: 6 x 3
#>   name     time  score
#>   <fct>    <fct> <dbl>
#> 1 Marshall t1        4
#> 2 Marshall t2        5
#> 3 Marshall t3        7
#> 4 Skye     t1        3
#> 5 Skye     t2        6
#> 6 Skye     t3        7

Our call to pivot_longer() could also have been written more explicitly:

df %>% 
    pivot_longer(cols = -name, 
                 names_to = "time", 
                 values_to = "score")

In the cols argument we used -name, which tells the function that we want to pivot all columns except the column called name. We could also have selected the three time columns:

df %>% 
    pivot_longer(cols = c(t1, t2, t3), 
                 names_to = "time", 
                 values_to = "score")
#> # A tibble: 6 x 3
#>   name     time  score
#>   <fct>    <chr> <dbl>
#> 1 Marshall t1        4
#> 2 Marshall t2        5
#> 3 Marshall t3        7
#> 4 Skye     t1        3
#> 5 Skye     t2        6
#> 6 Skye     t3        7

We will see more examples of how to select columns when we look at the dplyr package.

Example

Let’s look at another example, this time using the Therapy dataset:

library(haven)
Therapy <- read_sav("data/Therapy.sav")
Therapy$Vpnr <- as_factor(Therapy$Vpnr)
Therapy$Gruppe <- as_factor(Therapy$Gruppe)
Therapy
#> # A tibble: 100 x 5
#>   Vpnr  Gruppe         Pretest Posttest Difference_PrePost
#>   <fct> <fct>            <dbl>    <dbl>              <dbl>
#> 1 1     Kontrollgruppe    4.29     3.21              1.08 
#> 2 2     Kontrollgruppe    6.18     5.99              0.190
#> 3 3     Kontrollgruppe    3.93     4.17             -0.239
#> 4 4     Kontrollgruppe    5.06     4.76              0.295
#> 5 5     Kontrollgruppe    6.45     5.64              0.814
#> 6 6     Kontrollgruppe    4.49     4.67             -0.180
#> # … with 94 more rows

The structure of this dataset is similar to that of the toy example given above. We want to combine the ‘variables’ Pretest and Posttest, as they represent levels of a common factor time. The data in the cells are repeated measurements. Vpnr and Gruppe should be ignored for pivoting but kept in the data set.

First we can get rid of the Difference_PrePost variable.

The following code selects all rows, and the columns specified in the vector c("Vpnr", "Gruppe", "Pretest", "Posttest").

Therapy <- Therapy[, c("Vpnr", "Gruppe", "Pretest", "Posttest")]

Now we are ready to pivot:

Therapy_long <- Therapy %>%
    pivot_longer(c(Pretest, Posttest),
           names_to = "time",
           values_to = "rating")

# time should be a factor
Therapy_long$time <- factor(Therapy_long$time,
                                     levels = c("Pretest", "Posttest"))
Therapy_long
#> # A tibble: 200 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 2     Kontrollgruppe Pretest    6.18
#> 4 2     Kontrollgruppe Posttest   5.99
#> 5 3     Kontrollgruppe Pretest    3.93
#> 6 3     Kontrollgruppe Posttest   4.17
#> # … with 194 more rows

6.3.2 pivot_wider()

pivot_wider() is the complement of pivot_longer(). This function takes a factor and ‘spreads’ the values of a measured variable over the levels of the factor, creating as many new variables as there are factor levels. Obviously, pivot_wider() is used to convert from long to wide.

The pivot_wider() syntax looks like this:

pivot_wider(data, names_from, values_from)

# or

data %>%
    pivot_wider(names_from, values_from)

Using our previous example, we want to create new variables for each level of the factor time, using the values in the score variable.

df_long
#> # A tibble: 6 x 3
#>   name     time  score
#>   <fct>    <fct> <dbl>
#> 1 Marshall t1        4
#> 2 Marshall t2        5
#> 3 Marshall t3        7
#> 4 Skye     t1        3
#> 5 Skye     t2        6
#> 6 Skye     t3        7
df_wide <- df_long %>%
    pivot_wider(names_from = time, values_from = score)
df_wide
#> # A tibble: 2 x 4
#>   name        t1    t2    t3
#>   <fct>    <dbl> <dbl> <dbl>
#> 1 Marshall     4     5     7
#> 2 Skye         3     6     7

df_wide looks exaclty the same as the original dataset df. In fact, we can test whether this is the case:

all_equal(df, df_wide)   
#> [1] TRUE

Example

Let’s convert the “Therapy” data set from long back to wide:

Therapy_long
#> # A tibble: 200 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 2     Kontrollgruppe Pretest    6.18
#> 4 2     Kontrollgruppe Posttest   5.99
#> 5 3     Kontrollgruppe Pretest    3.93
#> 6 3     Kontrollgruppe Posttest   4.17
#> # … with 194 more rows
Therapy_wide <- Therapy_long %>%
    pivot_wider(names_from = time, values_from = rating)
Therapy_wide
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         Pretest Posttest
#>   <fct> <fct>            <dbl>    <dbl>
#> 1 1     Kontrollgruppe    4.29     3.21
#> 2 2     Kontrollgruppe    6.18     5.99
#> 3 3     Kontrollgruppe    3.93     4.17
#> 4 4     Kontrollgruppe    5.06     4.76
#> 5 5     Kontrollgruppe    6.45     5.64
#> 6 6     Kontrollgruppe    4.49     4.67
#> # … with 94 more rows

6.3.3 Excluding missing values: drop_na()

A very important function from the tidyr package is drop_na(). It is used to omit all rows in a data frame containing missing values:

Let’s create another toy dataset containing NAs:

df_1 <- tibble(var1 = c(1, 2, NA), var2 = c("a", NA, "b"))
df_1
#> # A tibble: 3 x 2
#>    var1 var2 
#>   <dbl> <chr>
#> 1     1 a    
#> 2     2 <NA> 
#> 3    NA b

df_2 <- df_1 %>% 
    drop_na() 
df_2 
#> # A tibble: 1 x 2
#>    var1 var2 
#>   <dbl> <chr>
#> 1     1 a

This is very useful, but of course should be used carefully, otherwise you will end up omitting rows that contain NAs in columns that might not be relevant to the question at hand.

6.4 Data manipulation: dplyr

Transforming data sets from wide to long is only one step when ‘tidying’ data. We will often need to select and rename variables, select a subset of observations, recode values, create new variables and create summary variables, often grouped by certain categorical variables.

All these transformations can be performed by functions in the dplyr package. These can be seen as a kind of grammar of data manipulation, with each function performing a specific action. These actions can be combined in many ways, resulting in an extremely flexible yet highly consistent workflow.

We cannot cover all of the available functions in this course, but they are all described in the dplyr documentation.

We need to load the dplyr package (or the tidyverse package):

library(dplyr)

6.4.1 Selecting variables: select()

The function select() is used to select variables from a data frame.

Syntax:

# df is a data frame
select(df, variable1, variable2, -variable3)

# or

df %>% 
    select(variable1, variable2, -variable3)

select() selects the variables variable1 and variable2 from the data frame df. variable3 is omitted.

Examples

# only name
df_long %>% 
    select(name)
#> # A tibble: 6 x 1
#>   name    
#>   <fct>   
#> 1 Marshall
#> 2 Marshall
#> 3 Marshall
#> 4 Skye    
#> 5 Skye    
#> 6 Skye
# time and score
df_long %>% 
    select(time, score)
#> # A tibble: 6 x 2
#>   time  score
#>   <fct> <dbl>
#> 1 t1        4
#> 2 t2        5
#> 3 t3        7
#> 4 t1        3
#> 5 t2        6
#> 6 t3        7

# or

df_long %>% 
    select(-name)
#> # A tibble: 6 x 2
#>   time  score
#>   <fct> <dbl>
#> 1 t1        4
#> 2 t2        5
#> 3 t3        7
#> 4 t1        3
#> 5 t2        6
#> 6 t3        7
Therapy_long %>%
    select(Vpnr, Gruppe, rating)
#> # A tibble: 200 x 3
#>   Vpnr  Gruppe         rating
#>   <fct> <fct>           <dbl>
#> 1 1     Kontrollgruppe   4.29
#> 2 1     Kontrollgruppe   3.21
#> 3 2     Kontrollgruppe   6.18
#> 4 2     Kontrollgruppe   5.99
#> 5 3     Kontrollgruppe   3.93
#> 6 3     Kontrollgruppe   4.17
#> # … with 194 more rows
# selects all variables from Gruppe to rating (including everything in between)
Therapy_long %>%
    select(Gruppe:rating)
#> # A tibble: 200 x 3
#>   Gruppe         time     rating
#>   <fct>          <fct>     <dbl>
#> 1 Kontrollgruppe Pretest    4.29
#> 2 Kontrollgruppe Posttest   3.21
#> 3 Kontrollgruppe Pretest    6.18
#> 4 Kontrollgruppe Posttest   5.99
#> 5 Kontrollgruppe Pretest    3.93
#> 6 Kontrollgruppe Posttest   4.17
#> # … with 194 more rows

Changing the order of variables

select() can be used to rearrange (reorder) variables in a data frame:

Therapy_long %>%
    select(rating, time, Gruppe, Vpnr)
#> # A tibble: 200 x 4
#>   rating time     Gruppe         Vpnr 
#>    <dbl> <fct>    <fct>          <fct>
#> 1   4.29 Pretest  Kontrollgruppe 1    
#> 2   3.21 Posttest Kontrollgruppe 1    
#> 3   6.18 Pretest  Kontrollgruppe 2    
#> 4   5.99 Posttest Kontrollgruppe 2    
#> 5   3.93 Pretest  Kontrollgruppe 3    
#> 6   4.17 Posttest Kontrollgruppe 3    
#> # … with 194 more rows

“Helper” functions for select()

There are a number of “helper functions” we can use for selecting variables:

# including
Therapy_long %>% select(starts_with("Gr"))
#> # A tibble: 200 x 1
#>   Gruppe        
#>   <fct>         
#> 1 Kontrollgruppe
#> 2 Kontrollgruppe
#> 3 Kontrollgruppe
#> 4 Kontrollgruppe
#> 5 Kontrollgruppe
#> 6 Kontrollgruppe
#> # … with 194 more rows

Therapy_long %>% select(ends_with("ng"))
#> # A tibble: 200 x 1
#>   rating
#>    <dbl>
#> 1   4.29
#> 2   3.21
#> 3   6.18
#> 4   5.99
#> 5   3.93
#> 6   4.17
#> # … with 194 more rows

Therapy_long %>% select(contains("u"))
#> # A tibble: 200 x 1
#>   Gruppe        
#>   <fct>         
#> 1 Kontrollgruppe
#> 2 Kontrollgruppe
#> 3 Kontrollgruppe
#> 4 Kontrollgruppe
#> 5 Kontrollgruppe
#> 6 Kontrollgruppe
#> # … with 194 more rows


# OR using one_of()
vars <- c("Gruppe", "rating")
Therapy_long %>% select(one_of(vars))
#> # A tibble: 200 x 2
#>   Gruppe         rating
#>   <fct>           <dbl>
#> 1 Kontrollgruppe   4.29
#> 2 Kontrollgruppe   3.21
#> 3 Kontrollgruppe   6.18
#> 4 Kontrollgruppe   5.99
#> 5 Kontrollgruppe   3.93
#> 6 Kontrollgruppe   4.17
#> # … with 194 more rows

# excluding
Therapy_long %>% select(-starts_with("Gr"))
#> # A tibble: 200 x 3
#>   Vpnr  time     rating
#>   <fct> <fct>     <dbl>
#> 1 1     Pretest    4.29
#> 2 1     Posttest   3.21
#> 3 2     Pretest    6.18
#> 4 2     Posttest   5.99
#> 5 3     Pretest    3.93
#> 6 3     Posttest   4.17
#> # … with 194 more rows

Therapy_long %>% select(-ends_with("ng"))
#> # A tibble: 200 x 3
#>   Vpnr  Gruppe         time    
#>   <fct> <fct>          <fct>   
#> 1 1     Kontrollgruppe Pretest 
#> 2 1     Kontrollgruppe Posttest
#> 3 2     Kontrollgruppe Pretest 
#> 4 2     Kontrollgruppe Posttest
#> 5 3     Kontrollgruppe Pretest 
#> 6 3     Kontrollgruppe Posttest
#> # … with 194 more rows

Therapy_long %>% select(-contains("u"))
#> # A tibble: 200 x 3
#>   Vpnr  time     rating
#>   <fct> <fct>     <dbl>
#> 1 1     Pretest    4.29
#> 2 1     Posttest   3.21
#> 3 2     Pretest    6.18
#> 4 2     Posttest   5.99
#> 5 3     Pretest    3.93
#> 6 3     Posttest   4.17
#> # … with 194 more rows

vars <- c("Gruppe", "rating")
Therapy_long %>% select(-one_of(vars))
#> # A tibble: 200 x 2
#>   Vpnr  time    
#>   <fct> <fct>   
#> 1 1     Pretest 
#> 2 1     Posttest
#> 3 2     Pretest 
#> 4 2     Posttest
#> 5 3     Pretest 
#> 6 3     Posttest
#> # … with 194 more rows

6.4.1.1 Scoped versions of select

We can also use the scoped versions of select: select_if(), select_at() and select_all().

For instance, we can select the numeric variables in the wide Therapy data frame:

Therapy_wide %>% 
    select_if(is.numeric)
#> # A tibble: 100 x 2
#>   Pretest Posttest
#>     <dbl>    <dbl>
#> 1    4.29     3.21
#> 2    6.18     5.99
#> 3    3.93     4.17
#> 4    5.06     4.76
#> 5    6.45     5.64
#> 6    4.49     4.67
#> # … with 94 more rows

6.4.2 Renaming variables: rename()

Variables can be renamed using rename(). Variables that have not been selected are not changed.

Syntax:

df %>% rename(new_name = old_name)

Example

# rename "Vpnr" to "ID"
Therapy_long %>%
    rename(ID = Vpnr)
#> # A tibble: 200 x 4
#>   ID    Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 2     Kontrollgruppe Pretest    6.18
#> 4 2     Kontrollgruppe Posttest   5.99
#> 5 3     Kontrollgruppe Pretest    3.93
#> 6 3     Kontrollgruppe Posttest   4.17
#> # … with 194 more rows

Similarly to select, we can use a scoped version. Using rename_all, we can change all variable names to lower case.

Therapy_wide %>% 
    rename_all(tolower)
#> # A tibble: 100 x 4
#>   vpnr  gruppe         pretest posttest
#>   <fct> <fct>            <dbl>    <dbl>
#> 1 1     Kontrollgruppe    4.29     3.21
#> 2 2     Kontrollgruppe    6.18     5.99
#> 3 3     Kontrollgruppe    3.93     4.17
#> 4 4     Kontrollgruppe    5.06     4.76
#> 5 5     Kontrollgruppe    6.45     5.64
#> 6 6     Kontrollgruppe    4.49     4.67
#> # … with 94 more rows

6.4.3 Selecting observations (cases): filter()

filter() is used to select rows from a data frame that fulfill certain logical conditions (or not).

We can combine multiple logical operators.

Syntax:

df %>% 
    filter(variable1 < VALUE1 & variable2 == VALUE2)

Examples

# only control group
Therapy_long %>%
    filter(Gruppe == "Kontrollgruppe")
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 2     Kontrollgruppe Pretest    6.18
#> 4 2     Kontrollgruppe Posttest   5.99
#> 5 3     Kontrollgruppe Pretest    3.93
#> 6 3     Kontrollgruppe Posttest   4.17
#> # … with 94 more rows
# only Posttest
Therapy_long %>%
    filter(time == "Posttest")
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Posttest   3.21
#> 2 2     Kontrollgruppe Posttest   5.99
#> 3 3     Kontrollgruppe Posttest   4.17
#> 4 4     Kontrollgruppe Posttest   4.76
#> 5 5     Kontrollgruppe Posttest   5.64
#> 6 6     Kontrollgruppe Posttest   4.67
#> # … with 94 more rows
# only Pretest
Therapy_long %>%
    filter(time != "Posttest")
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         time    rating
#>   <fct> <fct>          <fct>    <dbl>
#> 1 1     Kontrollgruppe Pretest   4.29
#> 2 2     Kontrollgruppe Pretest   6.18
#> 3 3     Kontrollgruppe Pretest   3.93
#> 4 4     Kontrollgruppe Pretest   5.06
#> 5 5     Kontrollgruppe Pretest   6.45
#> 6 6     Kontrollgruppe Pretest   4.49
#> # … with 94 more rows
# only ratings >= 5
Therapy_long %>%
    filter(rating >= 5)
#> # A tibble: 66 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 2     Kontrollgruppe Pretest    6.18
#> 2 2     Kontrollgruppe Posttest   5.99
#> 3 4     Kontrollgruppe Pretest    5.06
#> 4 5     Kontrollgruppe Pretest    6.45
#> 5 5     Kontrollgruppe Posttest   5.64
#> 6 10    Kontrollgruppe Pretest    5.12
#> # … with 60 more rows
# ratings between 3 and 5
Therapy_long %>%
    filter(rating >= 3 & rating <= 5)
#> # A tibble: 130 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 3     Kontrollgruppe Pretest    3.93
#> 4 3     Kontrollgruppe Posttest   4.17
#> 5 4     Kontrollgruppe Posttest   4.76
#> 6 6     Kontrollgruppe Pretest    4.49
#> # … with 124 more rows
# only subjects 3 and 5
Therapy_long %>%
    filter(Vpnr == 3 | Vpnr == 5)
#> # A tibble: 4 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 3     Kontrollgruppe Pretest    3.93
#> 2 3     Kontrollgruppe Posttest   4.17
#> 3 5     Kontrollgruppe Pretest    6.45
#> 4 5     Kontrollgruppe Posttest   5.64

# alternative: use the %in% operator
Therapy_long %>%
    filter(Vpnr %in% c(3, 5))
#> # A tibble: 4 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 3     Kontrollgruppe Pretest    3.93
#> 2 3     Kontrollgruppe Posttest   4.17
#> 3 5     Kontrollgruppe Pretest    6.45
#> 4 5     Kontrollgruppe Posttest   5.64

6.4.4 Sorting data frames: arrange()

With the arrange() function, we can sort rows according to variables in a data frame, either in ascending or descending order.

# ascending
Therapy_long %>%
    arrange(Vpnr)
#> # A tibble: 200 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29
#> 2 1     Kontrollgruppe Posttest   3.21
#> 3 2     Kontrollgruppe Pretest    6.18
#> 4 2     Kontrollgruppe Posttest   5.99
#> 5 3     Kontrollgruppe Pretest    3.93
#> 6 3     Kontrollgruppe Posttest   4.17
#> # … with 194 more rows
# descending
Therapy_long %>%
    arrange(desc(Vpnr))
#> # A tibble: 200 x 4
#>   Vpnr  Gruppe         time     rating
#>   <fct> <fct>          <fct>     <dbl>
#> 1 100   Therapiegruppe Pretest    4.77
#> 2 100   Therapiegruppe Posttest   4.50
#> 3 99    Therapiegruppe Pretest    4.66
#> 4 99    Therapiegruppe Posttest   3.80
#> 5 98    Therapiegruppe Pretest    4.36
#> 6 98    Therapiegruppe Posttest   3.91
#> # … with 194 more rows

6.4.5 Creating new variables: mutate()

The mutate() function can be used to either create new variables or to change already existing ones.

Syntax:

df %>% mutate(new_variable_1 = FORMULA_1,
              new_variable_2 = FORMULA_2,
              old_variable_1 = FORMULA_3)

Examples

# we want ratings in % (percent)
Therapy_long %>%
    mutate(rating_p = round(rating/7 * 100, digits = 1))
#> # A tibble: 200 x 5
#>   Vpnr  Gruppe         time     rating rating_p
#>   <fct> <fct>          <fct>     <dbl>    <dbl>
#> 1 1     Kontrollgruppe Pretest    4.29     61.2
#> 2 1     Kontrollgruppe Posttest   3.21     45.8
#> 3 2     Kontrollgruppe Pretest    6.18     88.2
#> 4 2     Kontrollgruppe Posttest   5.99     85.5
#> 5 3     Kontrollgruppe Pretest    3.93     56.2
#> 6 3     Kontrollgruppe Posttest   4.17     59.6
#> # … with 194 more rows

There is also a function called transmute(), which returns only the newly created variables.

Therapy_long %>%
    transmute(rating_p = round(rating/7 * 100, digits = 1))
#> # A tibble: 200 x 1
#>   rating_p
#>      <dbl>
#> 1     61.2
#> 2     45.8
#> 3     88.2
#> 4     85.5
#> 5     56.2
#> 6     59.6
#> # … with 194 more rows

6.4.5.1 Scoped versions of mutate

mutate has scoped versions that allows you to apply transformations to multiple variables at once.

As an example, say we want to convert the ratings in the Therapy data frame to a different scale. First, we create a function that does the job of converting the rating to a percentage.

change_scale <- function(x) 100 * x / 7

Then we select both Pretest and Posttest columns. They are the only numeric variables, so we can use mutate_if with is.numeric.

Therapy_wide %>% 
    mutate_if(is.numeric, change_scale)
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         Pretest Posttest
#>   <fct> <fct>            <dbl>    <dbl>
#> 1 1     Kontrollgruppe    61.2     45.8
#> 2 2     Kontrollgruppe    88.2     85.5
#> 3 3     Kontrollgruppe    56.2     59.6
#> 4 4     Kontrollgruppe    72.3     68.1
#> 5 5     Kontrollgruppe    92.2     80.6
#> 6 6     Kontrollgruppe    64.1     66.7
#> # … with 94 more rows

Alternatively, we can use mutate_at with the helper functions vars() and starts_with.

Therapy_wide %>% 
    mutate_at(vars(starts_with("P")), change_scale)
#> # A tibble: 100 x 4
#>   Vpnr  Gruppe         Pretest Posttest
#>   <fct> <fct>            <dbl>    <dbl>
#> 1 1     Kontrollgruppe    61.2     45.8
#> 2 2     Kontrollgruppe    88.2     85.5
#> 3 3     Kontrollgruppe    56.2     59.6
#> 4 4     Kontrollgruppe    72.3     68.1
#> 5 5     Kontrollgruppe    92.2     80.6
#> 6 6     Kontrollgruppe    64.1     66.7
#> # … with 94 more rows

6.4.6 Grouping data frames: group_by()

We often want to perform operations on subsets of a data frame, given by levels of a grouping factor. The function group_by() groups the data frame according to one or more grouping variables, applies a function to every group and then recombines the data frame (split-apply-combine). group_by() is usually used in combination with other functions.

Syntax:

df <- group_by(grouping_1, grouping_2, grouping_3)

Examples

Let’s import a new dataset. In this example, taken from the Statistics II course, 12 subjects were allocated to 4 experimental conditions, and the level of aggressive behaviour was measured. Suppose that we want to add a group-centred aggression variable.

library(readr)
alk_aggr <- read_csv("data/alkohol-aggression.csv") %>% 
            mutate(alkoholbedingung = as_factor(alkoholbedingung))
#> Parsed with column specification:
#> cols(
#>   aggressivitaet = col_double(),
#>   alkoholbedingung = col_character()
#> )

We can use group_by() to split the data frame into four separate pieces, compute the group mean for each, and put the pieces back together again. Before we do that, we will change the order of the columns using select.

alk_aggr %>%
    select(alkoholbedingung, aggressivitaet) %>% 
    group_by(alkoholbedingung) %>%
    mutate(group_mean = mean(aggressivitaet),
           aggr_c = aggressivitaet - group_mean)
#> # A tibble: 12 x 4
#> # Groups:   alkoholbedingung [4]
#>   alkoholbedingung aggressivitaet group_mean aggr_c
#>   <fct>                     <dbl>      <dbl>  <dbl>
#> 1 kein_alkohol                 64         62      2
#> 2 kein_alkohol                 58         62     -4
#> 3 kein_alkohol                 64         62      2
#> 4 placebo                      74         75     -1
#> 5 placebo                      79         75      4
#> 6 placebo                      72         75     -3
#> # … with 6 more rows

6.4.7 Recoding values: recode()

Together with mutate(), recode() can be used to change values of a variable. This can be done with numeric or character variables.

Syntax:

recode(variable,
       old_value_1 = "new_value_1",
       old_value_2 = "new_value_2")

If we want to change the levels of a factor, the forcats package provides more functionality. To manually recode the factor, we use fct_recode().

Syntax:

fct_recode(variable,
       new_value_1 = "old_value_1",
       new_value_2 = "old_value_2")
# we want to rename the control group to "control" ,
# and the therapy group to "treatment"
Therapy_long %>%
    mutate(Gruppe = fct_recode(Gruppe,
                           control = "Kontrollgruppe",
                           treatment = "Therapiegruppe"))
#> # A tibble: 200 x 4
#>   Vpnr  Gruppe  time     rating
#>   <fct> <fct>   <fct>     <dbl>
#> 1 1     control Pretest    4.29
#> 2 1     control Posttest   3.21
#> 3 2     control Pretest    6.18
#> 4 2     control Posttest   5.99
#> 5 3     control Pretest    3.93
#> 6 3     control Posttest   4.17
#> # … with 194 more rows

6.4.8 Summarizing variables: summarize()

Using the functions summarize() or summarise() (both spellings are possible) we can summarize variables and compute descriptive statistics. In contrast to mutate(), summarize() returns a value for each group, not each observation.

Syntax:

df %>% summarize(summary = FUNCTION(variable))

FUNCTION is a placeholder for any function that can be used to summarize data, e.g. mean(), sd(). If we want the mean and the standard error, we could use the function that we created in chapter 4:

se <- function(x) {
    n <- length(x)
    sd(x) / sqrt(n)
}

Let’s compute the group mean, standard deviation and standard error for each time point:

Therapy_long %>%
    group_by(Gruppe, time) %>%
    summarize(mean= mean(rating),
              sd = sd(rating),
              se = se(rating))
#> # A tibble: 4 x 5
#> # Groups:   Gruppe [2]
#>   Gruppe         time      mean    sd     se
#>   <fct>          <fct>    <dbl> <dbl>  <dbl>
#> 1 Kontrollgruppe Pretest   5.06 0.908 0.128 
#> 2 Kontrollgruppe Posttest  4.65 0.834 0.118 
#> 3 Therapiegruppe Pretest   4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest  4.23 0.752 0.106

A more advanced example of this uses summarize_if. We can apply a list of functions to all variables that pass the test is.numeric. This gives the same output as the code above.

Therapy_long %>%
    group_by(Gruppe, time) %>%
    summarize_if(is.numeric, list(mean = mean, sd = sd, se = se))
#> # A tibble: 4 x 5
#> # Groups:   Gruppe [2]
#>   Gruppe         time      mean    sd     se
#>   <fct>          <fct>    <dbl> <dbl>  <dbl>
#> 1 Kontrollgruppe Pretest   5.06 0.908 0.128 
#> 2 Kontrollgruppe Posttest  4.65 0.834 0.118 
#> 3 Therapiegruppe Pretest   4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest  4.23 0.752 0.106

Or we can use summarize_at and use vars() to select the rating variable.

Therapy_long %>%
    group_by(Gruppe, time) %>%
    summarize_at(vars(rating), list(~mean(.), ~sd(.), ~se(.)))
#> # A tibble: 4 x 5
#> # Groups:   Gruppe [2]
#>   Gruppe         time      mean    sd     se
#>   <fct>          <fct>    <dbl> <dbl>  <dbl>
#> 1 Kontrollgruppe Pretest   5.06 0.908 0.128 
#> 2 Kontrollgruppe Posttest  4.65 0.834 0.118 
#> 3 Therapiegruppe Pretest   4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest  4.23 0.752 0.106

# or 

Therapy_long %>%
    group_by(Gruppe, time) %>%
    summarize_at("rating", list(~mean(.), ~sd(.), ~se(.)))
#> # A tibble: 4 x 5
#> # Groups:   Gruppe [2]
#>   Gruppe         time      mean    sd     se
#>   <fct>          <fct>    <dbl> <dbl>  <dbl>
#> 1 Kontrollgruppe Pretest   5.06 0.908 0.128 
#> 2 Kontrollgruppe Posttest  4.65 0.834 0.118 
#> 3 Therapiegruppe Pretest   4.82 0.691 0.0977
#> 4 Therapiegruppe Posttest  4.23 0.752 0.106

6.4.9 Summary

We can now start combining all functions. As an example, we can compute the difference between pretest and posttest measurements in the Therapy data set (we previously omitted this).

Therapy_long %>%
    pivot_wider(names_from = time, values_from = rating) %>%
    mutate(Difference = Pretest - Posttest)
#> # A tibble: 100 x 5
#>   Vpnr  Gruppe         Pretest Posttest Difference
#>   <fct> <fct>            <dbl>    <dbl>      <dbl>
#> 1 1     Kontrollgruppe    4.29     3.21      1.08 
#> 2 2     Kontrollgruppe    6.18     5.99      0.190
#> 3 3     Kontrollgruppe    3.93     4.17     -0.239
#> 4 4     Kontrollgruppe    5.06     4.76      0.295
#> 5 5     Kontrollgruppe    6.45     5.64      0.814
#> 6 6     Kontrollgruppe    4.49     4.67     -0.180
#> # … with 94 more rows

6.5 Row-wise operations

Suppose that you have your data in a wide data frame, and you want to create summary scores across a number of columns. In other words, you want to perform an operation on the rows of a data frame.

The operations in the tidyverse packages are intended to operate on columns, not rows, so this can be rather tricky.

Let’s create a toy example to illustrate this:

library(tidyverse)

df <- tribble(
    ~name,   ~n1, ~n2,  ~e1,  ~e2,
    "Diane",  7,   8,    9,    4, 
    "Todd",   3,   2,    3,    3)

We have two subjects, who responded to four questions. Two of those, n1 and n2 represent a common n factor, whereas the other two variables, e1 and e2, represent a common e factor.

It is tempting to try out something like this:

df %>% 
    mutate(n = (n1 + n2)/2,
           e = (e1 + e2)/2)
#> # A tibble: 2 x 7
#>   name     n1    n2    e1    e2     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4   7.5   6.5
#> 2 Todd      3     2     3     3   2.5   3

However, this solution is very specific to this particular example, and involves a lot of repetition.

dplyr does provide a function, rowwise() that will help us here:

df %>%
  rowwise() %>%
  mutate(n = mean(c(n1, n2)),
         e = mean(c(e1, e2)))
#> Source: local data frame [2 x 7]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 7
#>   name     n1    n2    e1    e2     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4   7.5   6.5
#> 2 Todd      3     2     3     3   2.5   3

This is much better, in the sense that we can now apply a function rather than specifying the calculations by hand. A further advantage is that functions, such as mean or sd, can handle missing values.

Let’s add a missing value:

df <- tribble(
    ~name,   ~n1, ~n2,  ~e1,  ~e2,
    "Diane",  7,   8,    9,    4, 
    "Todd",   NA,   2,    3,    3)
df
#> # A tibble: 2 x 5
#>   name     n1    n2    e1    e2
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4
#> 2 Todd     NA     2     3     3

This approach will no longer work:

df %>% 
    mutate(n = (n1 + n2)/2,
           e = (e1 + e2)/2)
#> # A tibble: 2 x 7
#>   name     n1    n2    e1    e2     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4   7.5   6.5
#> 2 Todd     NA     2     3     3  NA     3

but this one will:

df %>%
  rowwise() %>%
  mutate(n = mean(c(n1, n2), na.rm = TRUE),
         e = mean(c(e1, e2), na.rm = TRUE))
#> Source: local data frame [2 x 7]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 7
#>   name     n1    n2    e1    e2     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4   7.5   6.5
#> 2 Todd     NA     2     3     3   2     3

We could even recreate some missing functionality from SPSS here. Our summary should only be created for rows that do not exceed the maximum permissible number of missing values.

The goal is to compute some sort of row-wise score, i.e. a sum or an average, over several columns of a data frame, while being able to specify how many missing values you will allow per row.

We can create a function that does this:

score_items <- function(v, max_na = NULL) {
    if (is.null(max_na)) {
        max_na <- length(v)
    }

    n_na <- length(v[is.na(v)])

    if (n_na <= max_na) {
        out <- mean(v, na.rm = TRUE)
    } else {
        out <- NA
    }
    out
}

This function will compute the mean of a vector, but only if that vector contains fewer than a pre-determined number of missing values.

Let’s create a new data frame to test this:

df_2 <- tribble(
    ~name, ~n1, ~n2, ~n3, ~n4, ~e1, ~e2, ~e3, ~e4,
    "Diane", 7, 8, 9, 4, NA, 2, 3, 2,
    "Todd", NA, 2, NA, 3, 1, 4, 5, 1)
df_2
#> # A tibble: 2 x 9
#>   name     n1    n2    n3    n4    e1    e2    e3    e4
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4    NA     2     3     2
#> 2 Todd     NA     2    NA     3     1     4     5     1
df_2 %>%
  rowwise() %>%
  mutate(n = score_items(c(n1, n2, n3, n4), max_na = 1),
         e = score_items(c(e1, e2, e3, e4), max_na = 1))
#> Source: local data frame [2 x 11]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 11
#>   name     n1    n2    n3    n4    e1    e2    e3    e4     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4    NA     2     3     2     7  2.33
#> 2 Todd     NA     2    NA     3     1     4     5     1    NA  2.75

This will compute the mean over the specified columns, but only if the subject has at most 1 missing value.

If we are not willing to accept any NAs:

df_2 %>%
  rowwise() %>%
  mutate(n = score_items(c(n1, n2, n3, n4), max_na = 0),
         e = score_items(c(e1, e2, e3, e4), max_na = 0))
#> Source: local data frame [2 x 11]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 11
#>   name     n1    n2    n3    n4    e1    e2    e3    e4     n     e
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane     7     8     9     4    NA     2     3     2     7 NA   
#> 2 Todd     NA     2    NA     3     1     4     5     1    NA  2.75

This code still seems rather inelegant, and involves far too much much repetition. If the data frame were long, they would look like this:

df_2_long <- df_2 %>% 
    pivot_longer(-name,
        names_to = c(".value", "obs"),
        names_pattern = "(.)(.)")
df_2_long
#> # A tibble: 8 x 4
#>   name  obs       n     e
#>   <chr> <chr> <dbl> <dbl>
#> 1 Diane 1         7    NA
#> 2 Diane 2         8     2
#> 3 Diane 3         9     3
#> 4 Diane 4         4     2
#> 5 Todd  1        NA     1
#> 6 Todd  2         2     4
#> # … with 2 more rows

Since the columns in df_2 contain two variables, we need to split them. This is achieved by using the names_pattern argument.

In this case it would be pretty straightforward to summarize the variables n and e after using group_by(name).

df_2_long %>% 
group_by(name) %>% 
    select(-obs) %>% 
    summarize_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 2 x 3
#>   name      n     e
#>   <chr> <dbl> <dbl>
#> 1 Diane   7    2.33
#> 2 Todd    2.5  2.75

We can also use our score_items() function:

df_2_long %>% 
group_by(name) %>% 
    select(-obs) %>% 
    summarize_if(is.numeric, score_items, max_na = 1)
#> # A tibble: 2 x 3
#>   name      n     e
#>   <chr> <dbl> <dbl>
#> 1 Diane     7  2.33
#> 2 Todd     NA  2.75

Here, we are using scoped version of summarize, summarize_if, which summarizes all columns for which the provided predicate (is.numeric in this case) evaluates to TRUE. The mean function is used to summarize.

df_2 %>%
    pivot_longer(-name,
        names_to = c(".value", "obs"),
        names_pattern = "(.)(.)") %>%
    select(-obs) %>%
    group_by(name) %>%
    summarize_if(is.numeric, ~score_items(.)) %>%
    right_join(df_2)
#> Joining, by = "name"
#> # A tibble: 2 x 11
#>   name      n     e    n1    n2    n3    n4    e1    e2    e3    e4
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Diane   7    2.33     7     8     9     4    NA     2     3     2
#> 2 Todd    2.5  2.75    NA     2    NA     3     1     4     5     1

6.6 Exercises

Download and import the data files “rtata.csv” and “beispieldaten.sav” from ILIAS.

library(tidyverse)

Reaction times

In this simulated data set, 3 participants performed a reaction time experiment in 3 conditions. In one condition, participants should respond quickly, in another they are expected to be slower. The third condition is a control.

library(readr)
rt_wide <- read_csv("data/rtdata.csv")
#> Parsed with column specification:
#> cols(
#>   ID = col_double(),
#>   control = col_double(),
#>   slow = col_double(),
#>   fast = col_double()
#> )
rt_wide
#> # A tibble: 300 x 4
#>      ID control  slow  fast
#>   <dbl>   <dbl> <dbl> <dbl>
#> 1     1   1.55   3.62  3.31
#> 2     1   0.782  1.50  1.57
#> 3     1   3.10   2.30  1.13
#> 4     1   2.97   1.83  1.97
#> 5     1   1.38   3.53  3.26
#> 6     1   0.329  3.44  5.34
#> # … with 294 more rows
  1. This data set is wide and first needs to be converted into a long format. Grouping variables should be converted into factors.
rt_long <- rt_wide %>%
    pivot_longer(-ID, names_to = "condition", values_to = "rt") %>%
    mutate(ID = factor(ID),
           condition = factor(condition))
rt_long
#> # A tibble: 900 x 3
#>   ID    condition    rt
#>   <fct> <fct>     <dbl>
#> 1 1     control   1.55 
#> 2 1     slow      3.62 
#> 3 1     fast      3.31 
#> 4 1     control   0.782
#> 5 1     slow      1.50 
#> 6 1     fast      1.57 
#> # … with 894 more rows
  1. Compute the mean reaction time and standard deviation for each participant in each condition.
summary_rt <- rt_long %>%
    group_by(ID, condition) %>%
    summarise(mean_rt = mean(rt),
              sd_rt = sd(rt))
summary_rt
#> # A tibble: 9 x 4
#> # Groups:   ID [3]
#>   ID    condition mean_rt sd_rt
#>   <fct> <fct>       <dbl> <dbl>
#> 1 1     control      1.68  1.02
#> 2 1     fast         2.15  1.58
#> 3 1     slow         3.20  1.93
#> 4 2     control      2.12  1.31
#> 5 2     fast         1.69  1.21
#> 6 2     slow         3.23  1.69
#> # … with 3 more rows
  1. Compute the log RT and add this to the data frame as a new variable.
rt_long <- rt_long %>%
    mutate(log_rt = log(rt))
rt_long
#> # A tibble: 900 x 4
#>   ID    condition    rt log_rt
#>   <fct> <fct>     <dbl>  <dbl>
#> 1 1     control   1.55   0.439
#> 2 1     slow      3.62   1.29 
#> 3 1     fast      3.31   1.20 
#> 4 1     control   0.782 -0.246
#> 5 1     slow      1.50   0.407
#> 6 1     fast      1.57   0.452
#> # … with 894 more rows

A further reason to use the long format is that the ggplot2 package requires this. The following code shows an example.

library(ggplot2)

rt_long %>%
    ggplot(aes(x = rt, y = ..density.., fill = condition)) +
    geom_histogram(bins = 50) +
    facet_grid(condition ~ ID) +
    scale_fill_viridis_d() +
    theme_bw()

Adolescents in West and East Germany

library(dplyr)
library(tidyr)
library(stringr)
library(haven)

adolescents <- read_sav("data/beispieldaten.sav")
adolescents
#> # A tibble: 286 x 98
#>      ID westost geschlecht alter  swk1  swk2  swk3  swk4  swk5  swk6  swk7  swk8
#>   <dbl> <dbl+l>  <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0 [Wes… 1 [weibli…    13     2     5     6     4     7     5     5     6
#> 2     2 0 [Wes… 0 [männli…    14     4     6     4     4     6     5     5     6
#> 3    10 0 [Wes… 1 [weibli…    14     5     4     5     6     5     7     7     6
#> 4    11 0 [Wes… 1 [weibli…    14     3     6     6     5     7     7     6     6
#> 5    12 0 [Wes… 1 [weibli…    14     5     6     6     5     7     6     5     5
#> 6    14 0 [Wes… 0 [männli…    14     3     5     5     4     5     7     5     6
#> # … with 280 more rows, and 86 more variables: swk9 <dbl>, swk10 <dbl>,
#> #   swk11 <dbl>, swk12 <dbl>, swk13 <dbl>, swk14 <dbl>, swk15 <dbl>,
#> #   swk16 <dbl>, swk17 <dbl>, swk18 <dbl>, swk19 <dbl>, swk20 <dbl>,
#> #   swk21 <dbl>, swk22 <dbl>, swk23 <dbl>, swk24 <dbl>, swk25 <dbl>,
#> #   swk26 <dbl>, swk27 <dbl>, swk28 <dbl>, swk29 <dbl>, swk30 <dbl>,
#> #   swk31 <dbl>, swk32 <dbl>, swk33 <dbl>, swk34 <dbl>, swk35 <dbl>,
#> #   swk36 <dbl>, unteltern1 <dbl>, unteltern2 <dbl>, unteltern3 <dbl>,
#> #   unteltern4 <dbl>, unteltern5 <dbl>, unteltern6 <dbl>, untfreunde1 <dbl>,
#> #   untfreunde2 <dbl>, untfreunde3 <dbl>, untfreunde4 <dbl>, untfreunde5 <dbl>,
#> #   untfreunde6 <dbl>, leben1 <dbl>, leben2 <dbl>, leben3 <dbl>, leben4 <dbl>,
#> #   leben5 <dbl>, leben6 <dbl>, leben7 <dbl>, leben8 <dbl>, leben9 <dbl>,
#> #   leben10 <dbl>, leben11 <dbl>, stress1 <dbl>, stress2 <dbl>, stress3 <dbl>,
#> #   stress4 <dbl>, stress5 <dbl>, stress6 <dbl>, stress7 <dbl>, stress8 <dbl>,
#> #   stress9 <dbl>, stress10 <dbl>, stress11 <dbl>, stress12 <dbl>,
#> #   Deutschnote <dbl>, Mathenote <dbl+lbl>, Fremdsprachenote <dbl+lbl>,
#> #   Gesamtnote <dbl>, bildung_vater <dbl+lbl>, bildung_mutter <dbl+lbl>,
#> #   bildung_vater_binaer <dbl+lbl>, bildung_mutter_binaer <dbl+lbl>,
#> #   swk_neueslernen <dbl>, swk_lernregulation <dbl>, swk_motivation <dbl>,
#> #   swk_durchsetzung <dbl>, swk_sozialkomp <dbl>, swk_beziehung <dbl>,
#> #   unt_eltern <dbl>, unt_freunde <dbl>, leben_selbst <dbl>,
#> #   leben_familie <dbl>, leben_schule <dbl>, leben_freunde <dbl>,
#> #   leben_gesamt <dbl>, stress_somatisch <dbl>, stress_psychisch <dbl>
adolescents <- adolescents %>%
    mutate(westost = factor(westost),
           geschlecht = factor(geschlecht),
           ID = factor(ID))

Mean age

Compute the mean age and standard deviation for both sexes.

adolescents %>%
    group_by(geschlecht) %>%
    summarise(alter_mw = round(mean(alter), 1),
              alter_sd = round(sd(alter), 1))
#> # A tibble: 2 x 3
#>   geschlecht alter_mw alter_sd
#>   <fct>         <dbl>    <dbl>
#> 1 0              14.8      0.7
#> 2 1              14.6      0.8

Emotional support

Select the variables unt_freunde (emotional support from friends) and unt_eltern (emotional support from parents) and use these to create a long data frame.

Step 1:

# select variables
unterstuetzung <- adolescents %>%
    dplyr::select(ID, westost, unt_freunde, unt_eltern)

Step 2:

# handle missing values: are there any?
unterstuetzung[!complete.cases(unterstuetzung), ]
#> # A tibble: 2 x 4
#>   ID    westost unt_freunde unt_eltern
#>   <fct> <fct>         <dbl>      <dbl>
#> 1 332   1              4.83      NA   
#> 2 16    0             NA          5.17
# exclude missing values
unterstuetzung <- unterstuetzung %>%
    drop_na()

Step 3:

# convert wide to long
unterstuetzung <- unterstuetzung %>%
    pivot_longer(starts_with("unt"), names_to = "freunde_eltern",
           values_to = "unterstuetzung")

Step 4:

# convert freunde_eltern variable to a factor
unterstuetzung <- unterstuetzung %>%
    mutate(freunde_eltern = factor(freunde_eltern))

Step 5 (advanced):

# rename factor levels (remove "unt_" prefix)
unterstuetzung <- unterstuetzung %>%
        mutate(freunde_eltern = str_replace(freunde_eltern, ".*_", ""))

Step 6:

# inspect resulting data frame
unterstuetzung
#> # A tibble: 568 x 4
#>   ID    westost freunde_eltern unterstuetzung
#>   <fct> <fct>   <chr>                   <dbl>
#> 1 2     0       freunde                  4.83
#> 2 2     0       eltern                   5.17
#> 3 14    0       freunde                  6.33
#> 4 14    0       eltern                   6.5 
#> 5 15    0       freunde                  5   
#> 6 15    0       eltern                   5.83
#> # … with 562 more rows

And now all in one go:

unterstuetzung <- adolescents %>%
    dplyr::select(ID, westost, unt_freunde, unt_eltern) %>%
    # fehlende Werte ausschliessen
    drop_na() %>%
    # von wide zu long
    pivot_longer(starts_with("unt"), names_to = "freunde_eltern",
           values_to = "unterstuetzung") %>%
    # freunde_eltern variable bearbeiten
    mutate(freunde_eltern = factor(str_replace(freunde_eltern,
                                               ".*_", "")))
unterstuetzung
#> # A tibble: 568 x 4
#>   ID    westost freunde_eltern unterstuetzung
#>   <fct> <fct>   <fct>                   <dbl>
#> 1 1     0       freunde                  6.67
#> 2 1     0       eltern                   6.67
#> 3 2     0       freunde                  4.83
#> 4 2     0       eltern                   5.17
#> 5 10    0       freunde                  4   
#> 6 10    0       eltern                   6.83
#> # … with 562 more rows

Life Satisfaction

We want to investigate life satisfaction in various contexts. The relevant variables all begin with leben_. We want to select all of these (excluding leben_gesamt, which is already a summary).

We could choose all variables starting with leben_ using starts_with("leben_"), and then exclude leben_gesamt individually. We also want to retain each person ID.

zufriedenheit_wide <- adolescents %>%
    dplyr::select(ID, starts_with("leben_"), -ends_with("gesamt"))
zufriedenheit_wide
#> # A tibble: 286 x 5
#>   ID    leben_selbst leben_familie leben_schule leben_freunde
#>   <fct>        <dbl>         <dbl>        <dbl>         <dbl>
#> 1 1              5.5          7            4.33           6.5
#> 2 2              4.5          4.67         4.67           5  
#> 3 10             6.5          6.33         3.33           5  
#> 4 11             6            7            4              6.5
#> 5 12             6            5.67         5.33           6.5
#> 6 14             5            5.33         5              6  
#> # … with 280 more rows

Can you think of any other ways of achieving this?

Solution

adolescents %>%
    dplyr::select(ID, starts_with("leben_"), -leben_gesamt)
#> # A tibble: 286 x 5
#>   ID    leben_selbst leben_familie leben_schule leben_freunde
#>   <fct>        <dbl>         <dbl>        <dbl>         <dbl>
#> 1 1              5.5          7            4.33           6.5
#> 2 2              4.5          4.67         4.67           5  
#> 3 10             6.5          6.33         3.33           5  
#> 4 11             6            7            4              6.5
#> 5 12             6            5.67         5.33           6.5
#> 6 14             5            5.33         5              6  
#> # … with 280 more rows
adolescents %>%
    dplyr::select(ID, leben_selbst, leben_familie, leben_schule, leben_freunde)
#> # A tibble: 286 x 5
#>   ID    leben_selbst leben_familie leben_schule leben_freunde
#>   <fct>        <dbl>         <dbl>        <dbl>         <dbl>
#> 1 1              5.5          7            4.33           6.5
#> 2 2              4.5          4.67         4.67           5  
#> 3 10             6.5          6.33         3.33           5  
#> 4 11             6            7            4              6.5
#> 5 12             6            5.67         5.33           6.5
#> 6 14             5            5.33         5              6  
#> # … with 280 more rows
variablen <- c("leben_selbst", "leben_familie",
               "leben_schule", "leben_freunde")
adolescents %>%
    dplyr::select(ID, one_of(variablen))
#> # A tibble: 286 x 5
#>   ID    leben_selbst leben_familie leben_schule leben_freunde
#>   <fct>        <dbl>         <dbl>        <dbl>         <dbl>
#> 1 1              5.5          7            4.33           6.5
#> 2 2              4.5          4.67         4.67           5  
#> 3 10             6.5          6.33         3.33           5  
#> 4 11             6            7            4              6.5
#> 5 12             6            5.67         5.33           6.5
#> 6 14             5            5.33         5              6  
#> # … with 280 more rows
# advanced (using a regular expression):
adolescents %>%
    dplyr::select(ID, matches("^leben_[^g]"))
#> # A tibble: 286 x 5
#>   ID    leben_selbst leben_familie leben_schule leben_freunde
#>   <fct>        <dbl>         <dbl>        <dbl>         <dbl>
#> 1 1              5.5          7            4.33           6.5
#> 2 2              4.5          4.67         4.67           5  
#> 3 10             6.5          6.33         3.33           5  
#> 4 11             6            7            4              6.5
#> 5 12             6            5.67         5.33           6.5
#> 6 14             5            5.33         5              6  
#> # … with 280 more rows

All selected variables can be considered as levels of an underlying (repeated-measures) factor. We can convert the data frame into a long format.

zufriedenheit_long <- zufriedenheit_wide %>%
    pivot_longer(-ID, names_to = "lebensbereich", values_to = "zufriedenheit")

Now we need to consider what to do with missing values. Should we exclude all people with NAs in one context, or should we keep all observations?

First, we look for rows with NAs.

# head(20) displays the first 20 rows:
!complete.cases(zufriedenheit_wide) %>%
    head(20)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This returns a logical vector, with either TRUE or FALSE for each row (it will be true if there is an NA in that row).

zufriedenheit_wide[!complete.cases(zufriedenheit_wide), ]
#> # A tibble: 2 x 5
#>   ID    leben_selbst leben_fam leben_schule leben_freunde
#>   <fct>        <dbl>     <dbl>        <dbl>         <dbl>
#> 1 175             NA        NA        NA               NA
#> 2 206              5        NA         5.33             5

There are two participants with missing values - we decide to exclude these.

zufriedenheit_long <- zufriedenheit_wide %>%
    drop_na() %>%
    pivot_longer(-ID, names_to = "lebensbereich", values_to = "zufriedenheit")

The variable lebensbereich (context) should be converted into a factor:

zufriedenheit_long <- zufriedenheit_long %>%
    mutate(lebensbereich = as_factor(lebensbereich))
zufriedenheit_long
#> # A tibble: 1,136 x 3
#>   ID    lebensbereich zufriedenheit
#>   <fct> <fct>                 <dbl>
#> 1 2     leben_selbst           4.5 
#> 2 2     leben_fam              4.67
#> 3 2     leben_schule           4.67
#> 4 2     leben_freunde          5   
#> 5 14    leben_selbst           5   
#> 6 14    leben_fam              5.33
#> # … with 1,130 more rows

We can also remove the prefix leben_ from each fator level.

library(stringr)
zufriedenheit_long <- zufriedenheit_long %>%
    mutate(lebensbereich = as.factor(str_replace(lebensbereich,
                                       ".*_", "")))

levels(zufriedenheit_long$lebensbereich)
#> [1] "fam"     "freunde" "schule"  "selbst"

Now we will sort by ID in ascending order:

zufriedenheit_long <- zufriedenheit_long %>%
    arrange(ID)
zufriedenheit_long
#> # A tibble: 1,136 x 3
#>   ID    lebensbereich zufriedenheit
#>   <fct> <fct>                 <dbl>
#> 1 1     selbst                 5.5 
#> 2 1     fam                    7   
#> 3 1     schule                 4.33
#> 4 1     freunde                6.5 
#> 5 2     selbst                 4.5 
#> 6 2     fam                    4.67
#> # … with 1,130 more rows

All in one code chunk:

library(stringr) # for the str_replace() function

zufriedenheit_long <- zufriedenheit_wide %>%
    drop_na() %>%
    pivot_longer(-ID, names_to = "lebensbereich", values_to = "zufriedenheit") %>%
    mutate(lebensbereich = as.factor(str_replace(lebensbereich,
                                       ".*_", ""))) %>%
    arrange(ID)

Stress

Female adolescents are thought to be more prone to depressive disorders than males. We can use a t-test to test this hypothesis, but first we want to look at the data set and compute the mean, standard deviation and standard error for the variable stress_psychisch, for each sex separately (rounded to 4 decimal places).

stress <- adolescents %>% 
    dplyr::select(ID, geschlecht, stress_psychisch) %>% 
    mutate_at(vars(geschlecht, ID), as_factor)
stress
#> # A tibble: 286 x 3
#>   ID    geschlecht stress_psychisch
#>   <fct> <fct>                 <dbl>
#> 1 1     1                      1.67
#> 2 2     0                      3.5 
#> 3 10    1                      3.67
#> 4 11    1                      1.5 
#> 5 12    1                      2.5 
#> 6 14    0                      1   
#> # … with 280 more rows

There might be missing values. We have two options:

  1. We remove all participants with NAs
  2. We can leave all participants in the data set and handle any NAs using the na.rm argument in the mean() and sd() functions.
stress_summary <- stress %>%
    drop_na() %>% 
    group_by(geschlecht) %>%
    summarize_at("stress_psychisch", list(mean = mean, sd = sd, se = se)) %>%
    mutate_if(is.numeric, round, digits = 4)
stress_summary
#> # A tibble: 2 x 4
#>   geschlecht  mean    sd     se
#>   <fct>      <dbl> <dbl>  <dbl>
#> 1 0           2.89  1.25 0.102 
#> 2 1           3.10  1.12 0.0968
library(ggplot2)

stress %>%
    drop_na() %>%
    ggplot(aes(x = geschlecht, y = stress_psychisch, fill = geschlecht)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
    theme_bw(base_size = 18)