5 Working with Data

In this course, we will be working with the tidyverse packages for manipulating and reshaping data. Of course, there are other ways of manipulating data, but in our view, the tidyverse method provides the most consistent way of working with data, and reduces the cognitive load on the user.

The packages used for data manipulations are tidyr for reshaping data frames, and dplyr for manipulating data frames. The functions we are going to look at are:

Package Function name
tidyr pivot_longer() increases the number of rows, decreases the number of columns
tidyr pivot_wider() decreases the number of rows, increases the number of columns
tidyr drop_na() omits all rows in a data frame containing NA values
dplyr filter() selects observations
dplyr select() selects variables
dplyr summarize() or summarise() summarizes data
dplyr mutate() creates new variables and changes already existing variables
dplyr group_by() lets you perform operations on subsets of the data

The functions filter(), select(), summarize() and mutate() have so-called scoped versions, which let you apply the same transformation to multiple variables. There are three variants of each of these functions:

  • *_all applies to every variable (e.g. mutate_all())

  • *_at applies to variables selected with a character vector or the helper function vars() (e.g. mutate_at())

  • *_if applies to variables selected with a predicate function (e.g. mutate_if())

These functions allow you to perform very complex data manipulation, using clean and elegant code. Besides the functions mentioned above, dplyr contains many more. These include functions that perform set operations, e.g. finding the intersection of two data frames, and allow you to join data frames together.

In addition tidyr and dplyr, we will also use the forcats package for manipulating factors, and the stringr package for working with strings.

5.1 The pipe operator: %>%

Before we begin manipulating data, we need to introduce a new operator. We have already seen that code can become hard to read when we have a sequence of operations, i.e. nested function calls. Suppose we have a numeric vector (n = 10), which we want to center (we will generate the data using rnorm()). Then we want to compute and round the standard deviation to two decimal places.

set.seed(1283)
data <- rnorm(n = 10, mean = 24, sd = 5)
data
##  [1] 24.74984 21.91726 23.98551 19.63019 23.96428 22.83092 18.86240 19.08125
##  [9] 23.76589 21.88846

We can use nested function calls:

round(sd(scale(data,
               center = TRUE,
               scale = FALSE)),
      digits = 2)
## [1] 2.19

The functions scale(), sd() and round() are called in order (inner to outer). The output of one function is passed as an argument to the next function.

The functions scale() and round() have additional arguments: center = TRUE, scale = FALSE for scale, and digits = 2 for round.

This way of writing code can be difficult to read, although using indentation (with line breaks as in the code above) really helps here.

An alternative approach is to save each intermediate step:

data_z <- scale(data, center = TRUE,
                      scale = FALSE)

sd_data_z <- sd(data_z)
sd_data_z_rounded <- round(sd_data_z,
                           digits = 2)

sd_data_z_rounded
## [1] 2.19

Doing this means that we create many variables that we don’t really need, cluttering up our workspace.

Luckily, there is an elegant solution to this problem. Using the pipe operator, we can pass the output of one function as an argument to the next function without using nested function calls. Note that this is really just a type of ‘syntactic sugar’; the result is exactly the same as using traditional function calls.

The pipe operator is available in the dplyr package.

library(dplyr)
%>%

It is defined as in infix operator, meaning that it is used in between two objects.

The pipe operator has its own key combo in RStudio: Cmd+Shift+M (MacOS) or Ctrl+Shift+M (Windows and Linux).

Using the %>% operator, our example from above:

round(sd(scale(data,
               center = TRUE,
               scale = FALSE)),
      digits = 2)
## [1] 2.19

becomes:

library(dplyr)
data %>%
    scale(center = TRUE, scale = FALSE) %>%
    sd() %>%
    round(digits = 2)
## [1] 2.19

This should be read as:

  1. We start with the object data and pass it to the function scale() as an argument
  2. We use scale(), with arguments center = TRUE, scale = FALSE and then pass the output to the function sd()
  3. We apply the function sd() and pass the output to the function round()
  4. round() is called, with the argument digits = 2. The output is printed to the console.

Per default, %>% passes an object to a function as its first argument. Many functions have the data argument as its first argument. In our example, the first argument of scale(x, center = TRUE, scale = TRUE) is the data argument and it is data that we pipe to scale(). The same holds for the functions sd() and round(): we pipe the results of the previous function as the (first and) data argument to the following function.

However, it doesn’t have to be the first argument the we “pipe to”. In this case we need to use the placeholder .:

x %>% f(y, .)

This is equivalent to:

f(y, x)

Now x is piped to the second argument of the function f().

5.2 Reshaping: tidyr

In statistics and data science, you will often hear the term wide and long when referring to data. This refers to the shape of a dataset; a wide dataset has more columns than a corresponding long dataset. A long dataset has more rows than a corresponding wide dataset. However, the terms long and wide aren’t just about shape. In a long data set, each observation has its own row and each variable is a column. In a wide dataset, variables can be spread out over several columns.

Consider this example: we have two observations for two subjects, at three separate time points. Let’s call these t1, t2 and t3. The following representation is considered to be in the wide format (the function tribble() lets you easily create a tibble by rows):

df <- tribble(
    ~name,    ~t1, ~t2, ~t3,
    "Marshall", 4,   5,  7,
    "Skye",     3,   6,  7
) %>% mutate_if(is.character, as_factor)

Each row represents one subject, and the observations are spread over the three columns t1, t2 and t3. These columns actually represent three levels of a common factor, time. If you were going to perform a repeated-measures ANOVA in SPSS, this is the kind of dataset that you would use.

We can represent exactly the same data in a longer format, in which each row corresponds to an observation, and each column represents a variable. In total, there are three variables: name, time and score.

## # A tibble: 6 × 3
##   name     time  score
##   <fct>    <chr> <dbl>
## 1 Marshall t1        4
## 2 Marshall t2        5
## 3 Marshall t3        7
## 4 Skye     t1        3
## 5 Skye     t2        6
## 6 Skye     t3        7

We no longer have one row for each subject. In fact, variables that do not change for each observation are repeated across rows.

A data set that is organized in this way is often referred to as tidy. As we will see later, many kinds of statistical analysis and in particular plotting functions require a long data set, and thus, a surprisingly large amount of time has to be spent on organizing data for further analysis (this type of work is often called ‘data wrangling’).

In order to reshape a data set from wide to long, we will need the function pivot_longer(). The function for the opposite operation (long to wide) is pivot_wider(), both of which are provided by the tidyr package.

5.2.1 pivot_longer()

pivot_longer() is used to combine columns, which may represent levels of a common factor, into one column. The values of the original columns are combined in a new value column. Another way of putting this is that pivot_longer() increases the number of rows and decreases the number of columns of a data frame.

pivot_longer()` is used like this:

pivot_longer(data, cols, names_to, values_to)

or, using %>%:

data %>%
    pivot_longer(cols, names_to, values_to)

The arguments are:

data:       a data frame
cols:       the columns you want to pivot 
names_to:   a string specifying the name of the new column created from 
            the selected column names
values_to:  a string specifying the name of the column created from the 
            observations in the cells

We assume that t1, t2 and t3 are not really separate variables, but should be considered as levels of a time factor (e.g. as three time points in a repeated measures data set). The values in the t1, t2 and t3 columns refer to the same kind of measurement (some score) and therefore should be actually one variable in a long data set. We would like our new factor to be called time and the measured variable should be called score. The name variable should be ignored, i.e. it should not be pivoted.

library(tidyr)
df_long <- df %>%
  pivot_longer(-name, names_to = "time", values_to = "score")
df_long
## # A tibble: 6 × 3
##   name     time  score
##   <fct>    <chr> <dbl>
## 1 Marshall t1        4
## 2 Marshall t2        5
## 3 Marshall t3        7
## 4 Skye     t1        3
## 5 Skye     t2        6
## 6 Skye     t3        7

time should be defined as a factor in the next step:

df_long$time <- as_factor(df_long$time)
df_long
## # A tibble: 6 × 3
##   name     time  score
##   <fct>    <fct> <dbl>
## 1 Marshall t1        4
## 2 Marshall t2        5
## 3 Marshall t3        7
## 4 Skye     t1        3
## 5 Skye     t2        6
## 6 Skye     t3        7

Our call to pivot_longer() could also have been written more explicitly:

df %>% 
    pivot_longer(cols = -name, 
                 names_to = "time", 
                 values_to = "score")

In the cols argument we used -name, which tells the function that we want to pivot all columns except the column called name. We could also have selected the three time columns:

df %>% 
    pivot_longer(cols = c(t1, t2, t3), 
                 names_to = "time", 
                 values_to = "score")
## # A tibble: 6 × 3
##   name     time  score
##   <fct>    <chr> <dbl>
## 1 Marshall t1        4
## 2 Marshall t2        5
## 3 Marshall t3        7
## 4 Skye     t1        3
## 5 Skye     t2        6
## 6 Skye     t3        7

We will see more examples of how to select columns when we look at the dplyr package.

5.2.2 pivot_wider()

pivot_wider() is the complement of pivot_longer(). This function takes a factor and ‘spreads’ the values of a measured variable over the levels of the factor, creating as many new variables as there are factor levels. Obviously, pivot_wider() is used to convert from long to wide.

The pivot_wider() syntax looks like this:

pivot_wider(data, names_from, values_from)

# or

data %>%
    pivot_wider(names_from, values_from)

Using our previous example, we want to create new variables for each level of the factor time, using the values in the score variable.

df_long
## # A tibble: 6 × 3
##   name     time  score
##   <fct>    <fct> <dbl>
## 1 Marshall t1        4
## 2 Marshall t2        5
## 3 Marshall t3        7
## 4 Skye     t1        3
## 5 Skye     t2        6
## 6 Skye     t3        7
df_wide <- df_long %>%
    pivot_wider(names_from = time, values_from = score)
df_wide
## # A tibble: 2 × 4
##   name        t1    t2    t3
##   <fct>    <dbl> <dbl> <dbl>
## 1 Marshall     4     5     7
## 2 Skye         3     6     7

df_wide looks exactly the same as the original dataset df. In fact, we can test whether this is the case:

all_equal(df, df_wide)   
## [1] TRUE

5.2.3 Excluding missing values: drop_na()

A very important function from the tidyr package is drop_na(). It is used to omit all rows in a data frame containing missing values:

Let’s create another toy dataset containing NAs:

df_1 <- tibble(var1 = c(1, 2, NA), var2 = c("a", NA, "b"))
df_1
## # A tibble: 3 × 2
##    var1 var2 
##   <dbl> <chr>
## 1     1 a    
## 2     2 <NA> 
## 3    NA b
df_2 <- df_1 %>% 
    drop_na() 
df_2 
## # A tibble: 1 × 2
##    var1 var2 
##   <dbl> <chr>
## 1     1 a

This is very useful, but of course should be used carefully, otherwise you will end up omitting rows that contain NAs in columns that might not be relevant to the question at hand.

5.3 Data manipulation: dplyr

Transforming data sets from wide to long is only one step when ‘tidying’ data. We will often need to select and rename variables, select a subset of observations, recode values, create new variables and create summary variables, often grouped by certain categorical variables.

All these transformations can be performed by functions in the dplyr package. These can be seen as a kind of grammar of data manipulation, with each function performing a specific action. These actions can be combined in many ways, resulting in an extremely flexible yet highly consistent workflow.

We cannot cover all of the available functions in this course, but they are all described in the dplyr documentation.

We need to load the dplyr package (or the tidyverse package):

library(dplyr)

5.3.1 Selecting variables: select()

The function select() is used to select variables from a data frame.

Syntax:

# df is a data frame
select(df, variable1, variable2, -variable3)

# or

df %>% 
    select(variable1, variable2, -variable3)

select() selects the variables variable1 and variable2 from the data frame df. variable3 is omitted.

Examples

# only name
df_long %>% 
    select(name)
## # A tibble: 6 × 1
##   name    
##   <fct>   
## 1 Marshall
## 2 Marshall
## 3 Marshall
## 4 Skye    
## 5 Skye    
## 6 Skye
# time and score
df_long %>% 
    select(time, score)
## # A tibble: 6 × 2
##   time  score
##   <fct> <dbl>
## 1 t1        4
## 2 t2        5
## 3 t3        7
## 4 t1        3
## 5 t2        6
## 6 t3        7
# or

df_long %>% 
    select(-name)
## # A tibble: 6 × 2
##   time  score
##   <fct> <dbl>
## 1 t1        4
## 2 t2        5
## 3 t3        7
## 4 t1        3
## 5 t2        6
## 6 t3        7

Changing the order of variables

select() can also be used to rearrange (reorder) variables in a data frame (without actually selecting anything):

df_long %>%
    select(score, time, name)
## # A tibble: 6 × 3
##   score time  name    
##   <dbl> <fct> <fct>   
## 1     4 t1    Marshall
## 2     5 t2    Marshall
## 3     7 t3    Marshall
## 4     3 t1    Skye    
## 5     6 t2    Skye    
## 6     7 t3    Skye

“Helper” functions for select()

There are a number of “helper functions” we can use for selecting variables:

# including
beispieldaten %>% select(starts_with("leben"))
## # A tibble: 286 × 16
##    leben1 leben2 leben3 leben4 leben5 leben6 leben7 leben8 leben9 leben10
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1      5      4      6      2      6      7      7      7      7       7
##  2      5      4      5      4      5      5      5      5      5       4
##  3      7      7      2      1      5      6      5      7      6       6
##  4      1      6      6      5      6      6      7      7      7       7
##  5      5      6      6      5      6      6      7      6      5       6
##  6      4      5      6      5      6      5      6      6      5       5
##  7      6      5      4      2      5      7      5      6      7       4
##  8      7      5      6      6      5      7     NA      6      6       5
##  9      5      6      5      5      5      6      7      6      6       6
## 10      5      5      4      4      5      6      5      5      4       5
## # … with 276 more rows, and 6 more variables: leben11 <dbl>,
## #   leben_selbst <dbl>, leben_familie <dbl>, leben_schule <dbl>,
## #   leben_freunde <dbl>, leben_gesamt <dbl>
beispieldaten %>% select(ends_with("note"))
## # A tibble: 286 × 4
##    Deutschnote Mathenote Fremdsprachenote Gesamtnote
##          <dbl>     <dbl>            <dbl>      <dbl>
##  1           4         3                5         NA
##  2           4         4                3          4
##  3           5         5                4          5
##  4           4         2                3          4
##  5           4         4                4          5
##  6           4         4                3          4
##  7           4         4                4          4
##  8           5         5                5          5
##  9           4         4                3          4
## 10           5         5                4          4
## # … with 276 more rows
beispieldaten %>% select(contains("mutter"))
## # A tibble: 286 × 2
##    bildung_mutter                                bildung_mutter_binaer
##    <fct>                                         <fct>                
##  1 Fachhochschulabschluss, Universitätsabschluss hoch                 
##  2 Realschulabschluss (mittlere Reife)           niedrig              
##  3 Hauptschulabschluss oder niedriger            niedrig              
##  4 Fachhochschulabschluss, Universitätsabschluss hoch                 
##  5 Realschulabschluss (mittlere Reife)           niedrig              
##  6 Realschulabschluss (mittlere Reife)           niedrig              
##  7 Realschulabschluss (mittlere Reife)           niedrig              
##  8 Fachhochschulabschluss, Universitätsabschluss hoch                 
##  9 Realschulabschluss (mittlere Reife)           niedrig              
## 10 Realschulabschluss (mittlere Reife)           niedrig              
## # … with 276 more rows
# OR using one_of()
vars <- c("westost", "geschlecht", "alter")
beispieldaten %>% select(one_of(vars))
## # A tibble: 286 × 3
##    westost geschlecht alter
##    <fct>   <fct>      <dbl>
##  1 West    weiblich      13
##  2 West    männlich      14
##  3 West    weiblich      14
##  4 West    weiblich      14
##  5 West    weiblich      14
##  6 West    männlich      14
##  7 West    männlich      14
##  8 West    weiblich      15
##  9 West    männlich      15
## 10 West    männlich      14
## # … with 276 more rows
# excluding
beispieldaten %>% select(-starts_with(c("swk", "unt", "leben", "stress")))
## # A tibble: 286 × 12
##       ID westost geschlecht alter Deutschnote Mathenote Fremdsprachenote
##    <dbl> <fct>   <fct>      <dbl>       <dbl>     <dbl>            <dbl>
##  1     1 West    weiblich      13           4         3                5
##  2     2 West    männlich      14           4         4                3
##  3    10 West    weiblich      14           5         5                4
##  4    11 West    weiblich      14           4         2                3
##  5    12 West    weiblich      14           4         4                4
##  6    14 West    männlich      14           4         4                3
##  7    15 West    männlich      14           4         4                4
##  8    16 West    weiblich      15           5         5                5
##  9    17 West    männlich      15           4         4                3
## 10    18 West    männlich      14           5         5                4
## # … with 276 more rows, and 5 more variables: Gesamtnote <dbl>,
## #   bildung_vater <fct>, bildung_mutter <fct>, bildung_vater_binaer <fct>,
## #   bildung_mutter_binaer <fct>

5.3.1.1 Scoped versions of select

We can also use the scoped versions of select: select_if(), select_at() and select_all().

For instance, we can select only the factors from the beispieldaten data frame with select_if() combined with the function is.factor() (in this case no parentheses are included in the function call) :

beispieldaten %>% 
    select_if(is.factor)
## # A tibble: 286 × 6
##    westost geschlecht bildung_vater        bildung_mutter       bildung_vater_b…
##    <fct>   <fct>      <fct>                <fct>                <fct>           
##  1 West    weiblich   Fachhochschulabschl… Fachhochschulabschl… hoch            
##  2 West    männlich   Fachabitur, Abitur   Realschulabschluss … hoch            
##  3 West    weiblich   Realschulabschluss … Hauptschulabschluss… niedrig         
##  4 West    weiblich   Realschulabschluss … Fachhochschulabschl… niedrig         
##  5 West    weiblich   Fachhochschulabschl… Realschulabschluss … hoch            
##  6 West    männlich   Realschulabschluss … Realschulabschluss … niedrig         
##  7 West    männlich   Realschulabschluss … Realschulabschluss … niedrig         
##  8 West    weiblich   Fachhochschulabschl… Fachhochschulabschl… hoch            
##  9 West    männlich   Realschulabschluss … Realschulabschluss … niedrig         
## 10 West    männlich   Fachabitur, Abitur   Realschulabschluss … hoch            
## # … with 276 more rows, and 1 more variable: bildung_mutter_binaer <fct>

5.3.2 Renaming variables: rename()

Variables can be renamed using rename(). Variables that have not been selected are not changed.

Syntax:

df %>% rename(new_name = old_name)

For the following examples we first create a subdataset from beispieldaten to obtain less cluttered output:

beispieldaten_red <- beispieldaten %>% 
  select(ID:alter, unteltern1:unteltern3)

Example

# rename westost to region, geschlecht to gender, and alter to age
beispieldaten_red <- beispieldaten_red %>%
  rename(region = westost,
         gender = geschlecht,
         age = alter)

Similarly to select, we can use a scoped version. Using rename_all, we can - for example - change all variable names to upper case using the base R function toupper().

beispieldaten_red %>% 
    rename_all(toupper)
## # A tibble: 286 × 7
##       ID REGION GENDER     AGE UNTELTERN1 UNTELTERN2 UNTELTERN3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1     1 West   weiblich    13          7          6          6
##  2     2 West   männlich    14          6          6          3
##  3    10 West   weiblich    14          7          7          6
##  4    11 West   weiblich    14          7          7          7
##  5    12 West   weiblich    14          6          7          7
##  6    14 West   männlich    14          6          7          6
##  7    15 West   männlich    14          6          5          7
##  8    16 West   weiblich    15          5          5          6
##  9    17 West   männlich    15          6          6          7
## 10    18 West   männlich    14          4          5          6
## # … with 276 more rows

5.3.3 Selecting observations (cases): filter()

filter() is used to select rows from a data frame that fulfill certain logical conditions (or not).

We can combine multiple logical operators.

Syntax:

df %>% 
    filter(variable1 < VALUE1 & variable2 == VALUE2)

Examples

# only adolescents from Eastern Germany
beispieldaten_red %>%
    filter(region == "Ost")
## # A tibble: 143 × 7
##       ID region gender     age unteltern1 unteltern2 unteltern3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1   201 Ost    weiblich    14          7          5          7
##  2   204 Ost    weiblich    15          7          6          5
##  3   205 Ost    männlich    15          7          7          7
##  4   206 Ost    weiblich    14          6          5          7
##  5   207 Ost    weiblich    15          7          6          7
##  6   209 Ost    weiblich    15          6          4          5
##  7   210 Ost    weiblich    15          6          6          7
##  8   211 Ost    weiblich    15          6          4          7
##  9   212 Ost    weiblich    15          7          6          7
## 10   213 Ost    weiblich    14          6          5          5
## # … with 133 more rows
# only only adolescents from Western Germany
beispieldaten_red %>%
    filter(region != "Ost")
## # A tibble: 143 × 7
##       ID region gender     age unteltern1 unteltern2 unteltern3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1     1 West   weiblich    13          7          6          6
##  2     2 West   männlich    14          6          6          3
##  3    10 West   weiblich    14          7          7          6
##  4    11 West   weiblich    14          7          7          7
##  5    12 West   weiblich    14          6          7          7
##  6    14 West   männlich    14          6          7          6
##  7    15 West   männlich    14          6          5          7
##  8    16 West   weiblich    15          5          5          6
##  9    17 West   männlich    15          6          6          7
## 10    18 West   männlich    14          4          5          6
## # … with 133 more rows
# only age 13 or younger
beispieldaten_red %>%
    filter(age <= 13)
## # A tibble: 6 × 7
##      ID region gender     age unteltern1 unteltern2 unteltern3
##   <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
## 1     1 West   weiblich    13          7          6          6
## 2    49 West   männlich    13          6          5          6
## 3   138 West   weiblich    13          4          3          5
## 4   142 West   weiblich    13          3          3          2
## 5   159 West   weiblich    13          4          3          5
## 6   321 Ost    weiblich    13          5          5          7
# ages between 15 and 17
beispieldaten_red %>%
    filter(age >= 15 & age <= 17)
## # A tibble: 163 × 7
##       ID region gender     age unteltern1 unteltern2 unteltern3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1    16 West   weiblich    15          5          5          6
##  2    17 West   männlich    15          6          6          7
##  3    19 West   männlich    15          4          6          7
##  4    20 West   weiblich    15          6          5          4
##  5    25 West   weiblich    15          7          7          7
##  6    26 West   weiblich    15          7          7          6
##  7    27 West   männlich    15          7          6          6
##  8    29 West   weiblich    15          5          3          4
##  9    31 West   weiblich    16          7          7          6
## 10    33 West   weiblich    15          7          7          7
## # … with 153 more rows
# only subjects with IDs 1 and 49
beispieldaten_red %>%
    filter(ID == 1 | ID == 49)
## # A tibble: 2 × 7
##      ID region gender     age unteltern1 unteltern2 unteltern3
##   <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
## 1     1 West   weiblich    13          7          6          6
## 2    49 West   männlich    13          6          5          6
# alternative: use the %in% operator
beispieldaten_red %>%
    filter(ID %in% c(1, 49))
## # A tibble: 2 × 7
##      ID region gender     age unteltern1 unteltern2 unteltern3
##   <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
## 1     1 West   weiblich    13          7          6          6
## 2    49 West   männlich    13          6          5          6

5.3.4 Sorting data frames: arrange()

With the arrange() function, we can sort rows according to variables in a data frame, either in ascending or descending order.

# ascending
beispieldaten_red %>%
    arrange(ID)
## # A tibble: 286 × 7
##       ID region gender     age unteltern1 unteltern2 unteltern3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1     1 West   weiblich    13          7          6          6
##  2     2 West   männlich    14          6          6          3
##  3    10 West   weiblich    14          7          7          6
##  4    11 West   weiblich    14          7          7          7
##  5    12 West   weiblich    14          6          7          7
##  6    14 West   männlich    14          6          7          6
##  7    15 West   männlich    14          6          5          7
##  8    16 West   weiblich    15          5          5          6
##  9    17 West   männlich    15          6          6          7
## 10    18 West   männlich    14          4          5          6
## # … with 276 more rows
# descending
beispieldaten_red %>%
    arrange(desc(ID))
## # A tibble: 286 × 7
##       ID region gender     age unteltern1 unteltern2 unteltern3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>
##  1   367 Ost    weiblich    14          5          6          7
##  2   364 Ost    männlich    15          6          6          6
##  3   363 Ost    männlich    15          6          7          6
##  4   362 Ost    weiblich    15          4          3          4
##  5   361 Ost    weiblich    14          3          1          4
##  6   359 Ost    weiblich    14          5          1          4
##  7   356 Ost    männlich    15          6          4          6
##  8   355 Ost    weiblich    14          6          4          4
##  9   353 Ost    männlich    14          6          6          7
## 10   352 Ost    weiblich    14          6          6          6
## # … with 276 more rows

5.3.5 Creating new variables: mutate()

The mutate() function can be used to either create new variables or to change already existing ones.

Syntax:

df %>% mutate(new_variable_1 = FORMULA_1,
              new_variable_2 = FORMULA_2,
              old_variable_1 = FORMULA_3)

For example, let’s compute a variable with an average score for the first three items measuring the support received from parents (unteltern1, unteltern2, unteltern3):

beispieldaten_red <- beispieldaten_red %>%
  mutate(unteltern_1to3 = (unteltern1 + unteltern2 + unteltern3)/3) 

We just reassigned beispieldaten_red to beispieldaten_red because we want to save the new variable. Now we realize that we did not save earlier operations since we did not reassign. For the examples above with arrange() we may have merely wanted to look at the data set in a different order and not save it in that order.

Below we will see more examples where we will not reassign, because we only want to illustrate how the functions work, thus we do not need to save the manipulations every time, we only create the respective output. In “real (data-analytic) life” we will usually reassign to the current data frame since we want to work with the transformed or “mutated” data set.

There is also a function called transmute(), which returns only the newly created variables. (In this case reassigning would mean erasing the content of beispieldaten_red and replacing it by the newly created variable.)

beispieldaten_red %>%
    transmute(unteltern_1to3 = (unteltern1 + unteltern2 + unteltern3)/3)
## # A tibble: 286 × 1
##    unteltern_1to3
##             <dbl>
##  1           6.33
##  2           5   
##  3           6.67
##  4           7   
##  5           6.67
##  6           6.33
##  7           6   
##  8           5.33
##  9           6.33
## 10           5   
## # … with 276 more rows

This looks all quite easy and functional for the often needed operation of aggregating over items (per subject) to compute a scale mean. However, this works only for subjects who have no missing values in any of the variables/items involved. As soon as missing values exist, the computation will return NA for the respective subject. As to our knowledge there is not yet a package/function that easily circumvents this problem. We will therefore introduce rowwise() for row-wise operations at the end of this chapter and present a custom-made function to deal with the issue.

5.3.5.1 Scoped versions of mutate

mutate has scoped versions that allows you to apply transformations to multiple variables at once.

As an example, say we want to convert all ratings (Likert-scale from 1-7) in the beispieldaten data frame to a different scale. First, we create a function that does the job of converting the rating to a percentage.

change_scale <- function(x) 100 * x / 7

Then we indicate using mutate_if together with is.numeric that the function should be applied only to the numeric variables in the data frame.

beispieldaten_red %>% 
    mutate_if(is.numeric, change_scale)
## # A tibble: 286 × 8
##       ID region gender     age unteltern1 unteltern2 unteltern3 unteltern_1to3
##    <dbl> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>          <dbl>
##  1  14.3 West   weiblich  186.      100         85.7       85.7           90.5
##  2  28.6 West   männlich  200        85.7       85.7       42.9           71.4
##  3 143.  West   weiblich  200       100        100         85.7           95.2
##  4 157.  West   weiblich  200       100        100        100            100  
##  5 171.  West   weiblich  200        85.7      100        100             95.2
##  6 200   West   männlich  200        85.7      100         85.7           90.5
##  7 214.  West   männlich  200        85.7       71.4      100             85.7
##  8 229.  West   weiblich  214.       71.4       71.4       85.7           76.2
##  9 243.  West   männlich  214.       85.7       85.7      100             90.5
## 10 257.  West   männlich  200        57.1       71.4       85.7           71.4
## # … with 276 more rows

Now we accidentally also converted ID and alter to percentages! ID should actually not be a numeric variable, we should convert it to a factor! We can do this also using mutate():

beispieldaten_red <- beispieldaten_red %>% mutate(ID = as.factor(ID))

We can also use mutate_at with the helper function vars() and starts_with to select only those variables we really want to convert into percentages:

beispieldaten_red %>% 
    mutate_at(vars(starts_with("unt")), change_scale)
## # A tibble: 286 × 8
##    ID    region gender     age unteltern1 unteltern2 unteltern3 unteltern_1to3
##    <fct> <fct>  <fct>    <dbl>      <dbl>      <dbl>      <dbl>          <dbl>
##  1 1     West   weiblich    13      100         85.7       85.7           90.5
##  2 2     West   männlich    14       85.7       85.7       42.9           71.4
##  3 10    West   weiblich    14      100        100         85.7           95.2
##  4 11    West   weiblich    14      100        100        100            100  
##  5 12    West   weiblich    14       85.7      100        100             95.2
##  6 14    West   männlich    14       85.7      100         85.7           90.5
##  7 15    West   männlich    14       85.7       71.4      100             85.7
##  8 16    West   weiblich    15       71.4       71.4       85.7           76.2
##  9 17    West   männlich    15       85.7       85.7      100             90.5
## 10 18    West   männlich    14       57.1       71.4       85.7           71.4
## # … with 276 more rows

5.3.6 Grouping data frames: group_by()

We often want to perform operations on subsets of a data frame, given by levels of a grouping factor. The function group_by() groups the data frame according to one or more grouping variables, applies a function to every group and then recombines the data frame (split-apply-combine). group_by() is usually used in combination with other functions.

Syntax:

df <- group_by(grouping_1, grouping_2, grouping_3)

Example

Let’s import a new dataset. In this example, 12 subjects were allocated to 4 experimental conditions, and the level of aggressive behaviour was measured. Suppose that we want to add a group-centred aggression variable.

library(readr)
alk_aggr <- read_csv("data/alkohol-aggression.csv") %>% 
            mutate(alkoholbedingung = as.factor(alkoholbedingung))
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): alkoholbedingung
## dbl (1): aggressivitaet
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can use group_by() to split the data frame into two separate pieces, compute the group mean for each, and put the pieces back together again. Before we do that, we will select only the relevant variables and remove all cases with missing values on those:

beispieldaten_red %>%
  select(ID, gender, unteltern_1to3) %>% 
  drop_na() %>% 
  group_by(gender) %>% 
  mutate(group_mean = mean(unteltern_1to3),
         unteltern_c = unteltern_1to3 - group_mean)
## # A tibble: 283 × 5
## # Groups:   gender [2]
##    ID    gender   unteltern_1to3 group_mean unteltern_c
##    <fct> <fct>             <dbl>      <dbl>       <dbl>
##  1 1     weiblich           6.33       5.60       0.732
##  2 2     männlich           5          5.88      -0.882
##  3 10    weiblich           6.67       5.60       1.07 
##  4 11    weiblich           7          5.60       1.40 
##  5 12    weiblich           6.67       5.60       1.07 
##  6 14    männlich           6.33       5.88       0.451
##  7 15    männlich           6          5.88       0.118
##  8 16    weiblich           5.33       5.60      -0.268
##  9 17    männlich           6.33       5.88       0.451
## 10 18    männlich           5          5.88      -0.882
## # … with 273 more rows

5.3.7 Recoding values: recode()

Together with mutate(), recode() can be used to change values of a variable. This can be done with numeric or character variables.

Syntax:

recode(variable,
       old_value_1 = "new_value_1",
       old_value_2 = "new_value_2")

If we want to change the levels of a factor, the forcats package provides more functionality. To manually recode the factor, we use fct_recode().

Syntax:

fct_recode(variable,
       new_value_1 = "old_value_1",
       new_value_2 = "old_value_2")

Let us recode the German factor levels of region and gender to English:

5.3.8 Summarizing variables: summarize()

Using the functions summarize() or summarise() (both spellings are possible) we can summarize variables and compute descriptive statistics. In contrast to mutate(), summarize() returns a value for each group, not each observation.

Syntax:

df %>% summarize(summary = FUNCTION(variable))

FUNCTION is a placeholder for any function that can be used to summarize data, e.g. mean(), sd(). If we want the mean and the standard error, we could use the function that we created earlier:

se <- function(x) {
    n <- length(x)
    sd(x) / sqrt(n)
}

Let’s compute the group mean, standard deviation and standard error of unteltern_1to3 for male and female as well as for east and west German adolescents separately:

beispieldaten_red %>%
  drop_na() %>% 
  group_by(region, gender) %>%
  summarize(mean = mean(unteltern_1to3),
              sd = sd(unteltern_1to3),
              se = se(unteltern_1to3))
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 4 × 5
## # Groups:   region [2]
##   region gender  mean    sd     se
##   <fct>  <fct>  <dbl> <dbl>  <dbl>
## 1 west   male    5.68 1.10  0.120 
## 2 west   female  5.51 1.22  0.158 
## 3 east   male    6.13 0.711 0.0868
## 4 east   female  5.68 1.10  0.129

A more advanced example of this uses summarize_if. We can apply a list of functions to all variables that pass the test is.numeric. This gives the same output as the code above.

beispieldaten_red %>%
  drop_na() %>% 
  group_by(region, gender) %>%
  summarize_if(is.numeric, list(mean = mean, sd = sd, se = se))
## # A tibble: 4 × 17
## # Groups:   region [2]
##   region gender age_mean unteltern1_mean unteltern2_mean unteltern3_mean
##   <fct>  <fct>     <dbl>           <dbl>           <dbl>           <dbl>
## 1 west   male       14.5            5.63            5.45            5.96
## 2 west   female     14.5            5.73            5.2             5.58
## 3 east   male       15.0            6.13            5.97            6.30
## 4 east   female     14.7            5.86            5.29            5.89
## # … with 11 more variables: unteltern_1to3_mean <dbl>, age_sd <dbl>,
## #   unteltern1_sd <dbl>, unteltern2_sd <dbl>, unteltern3_sd <dbl>,
## #   unteltern_1to3_sd <dbl>, age_se <dbl>, unteltern1_se <dbl>,
## #   unteltern2_se <dbl>, unteltern3_se <dbl>, unteltern_1to3_se <dbl>

5.4 Row-wise operations

We already saw above how we can create a scale mean across multiple items using mutate() together with a FORMULA. Unfortunately this did not work with missing values present. We could think that we could just use mean() together with its argument na.rm = TRUE. However, operations in R (and the tidyverse packages) are intended to operate on columns of data frames, not on rows, so this will also not work.

Let’s create a toy example to illustrate this:

library(tidyverse)

df <- tribble(
    ~name,   ~n1, ~n2,  ~n3,  ~n4,
    "Lydia",  7,   8,    9,    4, 
    "Todd",   3,   2,    3,    3)

We have two subjects, who responded to four questions n1 to n4 that represent a common n factor.

Like above, as long as there are no missing values, this will work:

df %>% 
    mutate(n = (n1 + n2 + n3 + n4)/4)
## # A tibble: 2 × 6
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4  7   
## 2 Todd      3     2     3     3  2.75

Let’s add a missing value:

dfNA <- tribble(
    ~name,   ~n1, ~n2,  ~n3,  ~n4,
    "Lydia",  7,   8,    9,    4, 
    "Todd",   NA,  2,    3,    3)
dfNA
## # A tibble: 2 × 5
##   name     n1    n2    n3    n4
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4
## 2 Todd     NA     2     3     3

This approach will no longer work (at least not for Todd):

dfNA %>% 
    mutate(n = (n1 + n2 + n3 + n4)/4)
## # A tibble: 2 × 6
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4     7
## 2 Todd     NA     2     3     3    NA

One can argue that someone who did not answer to one of four n items should not be included in the analysis of an n scale. However, often we have much more than 4 items and anyone who did not answer to only one of - say - 10 items would not be included.

We could use a multiple imputation approach that predicts missing values from other variables in the data set and where the uncertainty associated with the imputation is taken into account by creating multiple versions of imputed data sets, analyzing them separately and combining their results afterwards.

This is not always possible, though, and one might rather compute the scale mean for this person by averaging across the existing items (i.e. imputing/replacing missing values with the mean of existing values of a scale). This is what we try to do here…

Unfortunately, this does also not work:

dfNA %>% 
  mutate(n = mean(c(n1, n2, n3, n4), na.rm = TRUE))
## # A tibble: 2 × 6
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4  5.14
## 2 Todd     NA     2     3     3  5.14

Here we get the mean of all 7 values for both of our participants…

Fortunately, though, dplyr does provide a function, rowwise() that will help us here:

dfNA %>%
  rowwise() %>%
 mutate(n = mean(c(n1, n2, n3, n4), na.rm = TRUE))
## # A tibble: 2 × 6
## # Rowwise: 
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4  7   
## 2 Todd     NA     2     3     3  2.67

We could even recreate some missing functionality from SPSS here. Our summary should only be created for rows that do not exceed the maximum permissible number of missing values.

The goal is to compute some sort of row-wise score, i.e. a sum or an average, over several columns of a data frame, while being able to specify how many missing values you will allow per row.

We can create a function that does this:

score_items <- function(v, max_na = NULL) {
    if (is.null(max_na)) {
        max_na <- length(v)
    }

    n_na <- length(v[is.na(v)])

    if (n_na <= max_na) {
        out <- mean(v, na.rm = TRUE)
    } else {
        out <- NA
    }
    out
}

This function will compute the mean of a vector, but only if that vector contains fewer than a pre-determined number of missing values.

dfNA %>%
  rowwise() %>%
  mutate(n = score_items(c(n1, n2, n3, n4), max_na = 1))
## # A tibble: 2 × 6
## # Rowwise: 
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4  7   
## 2 Todd     NA     2     3     3  2.67

This will compute the mean over the specified columns, but only if the subject has at most 1 missing value.

If we are not willing to accept any NAs:

dfNA %>%
  rowwise() %>%
  mutate(n = score_items(c(n1, n2, n3, n4), max_na = 0))
## # A tibble: 2 × 6
## # Rowwise: 
##   name     n1    n2    n3    n4     n
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lydia     7     8     9     4     7
## 2 Todd     NA     2     3     3    NA