9 Functions

So far, we have neglected one of the most useful aspects of using R: the ability to automate our workflow and to let the computer do our work for us. Learning to write your own functions is one of the most important steps in this direction. Fortunately, R makes it very easy to create your own functions.

The basic idea of a function is that it takes one or more inputs, does something with these, and then returns an output. The inputs are called the function arguments, the code that does something with the arguments is called the function body, and the output is called the return value.

Suppose you want to write a function that squares a number and then adds 5. Let’s call the input argument x. First, we have to choose a name for our new function. We’ll call it square_add_five. Then, the function definition looks like this:

square_add_five <- function(x) {
    x^2 + 5
}

Once we have evaluated this code in the R console, we can call our function:

square_add_five(4)
#> [1] 21

or

square_add_five(x = 4)
#> [1] 21

The code x^2 + 5 is the function body. This takes the argument x, computes x^ and then adds 5 to the results. The result of this computation is returned, and can be assigned to a new variable:

result <- square_add_five(4)

That’s the basic idea. Now we will take a closer look at how to translate a formula into R code, and then define a new function.

9.1 Skewness

In psychological research, we often measure response times. We have already looked at an example data set.

library(tidyverse)
rt_wide <- read_csv("data/RTdata.csv")
#> Parsed with column specification:
#> cols(
#>   ID = col_double(),
#>   control = col_double(),
#>   slow = col_double(),
#>   fast = col_double()
#> )

rt_long <- rt_wide %>%
    gather(key = condition, value = rt,
           control, slow, fast, -ID) %>%
    mutate(ID = factor(ID),
           condition = factor(condition))
rt_long %>%
    ggplot(aes(x = rt)) +
    geom_histogram(bins = 30, fill = "steelblue") +
    facet_grid(condition ~ ID) +
    xlab("Respone time") +
    ylab("") +
    theme_classic()

These data are right-skewed. We can calculate the skewness

\[ \mathrm{skewness} = \frac{\sum_{m=1}^{n}{(x_m - \bar{x})^3}} {n \cdot s_X^3} \]

\(x\) is a vector of observations with length \(n\), \(\bar{x}\) is its mean and \(s_X\) its empirical standard deviation \(s_X = \sqrt{\frac{1}{n}\sum_{m=1}^n{(x_m - \bar{x})^2}}\). For symmetrical distributions, the skewness is \(0\). For left-skewed distributions, the skewness is negative, for right-skewed distributions it’s positive. For RT data, we expect a value \(> 0\).

Unfortunately, there is no function in base R that can calculate this, but we can easily do so ourselves.

First, let’s translate the formula into R code. The formula can be simplified:

\[ \mathrm{skewness} = \frac{1}{n} \sum_{m=1}^{n} {z_m^3} \] with \[ z_m = \frac{x_m - \bar{x}} {s_X} \] To compute the skewness of a vector, we first standardize it, i.e. subtract its mean and divide by its standard deviation. Then we compute its third power and divide by \(n\).

Let’s generate some random numbers from a Gamma distribution. The Gamma distribition is only defined for positive numbers, and is right skewed.

set.seed(43772)
x1 <- 0.5 + rgamma(100, shape = 1.0, scale = .9)

ggplot(data = NULL, aes(x = x1)) +
    geom_histogram(bins = 30) + theme_classic()

We need the lenght:

n <- length(x1)

and the mean and standard deviation:

mean_x1 <- mean(x1)
sd_x1 <- sqrt(sum((x1 - mean_x1)^2) / (n))

Then we can standardize:

z1 <- (x1 - mean_x1) / sd_x1

Why don’t we use the built-in sd() function? The reason for this is that this function computes the sample standard deviation, by dividing by \(n-1\). We want to divive by \(n\), because we don’t want to estimate the population standard deviation. We are merely using the standard deviation descriptively.

Now we can compute the skewness:

skewness <- sum(z1^3) / n
skewness
#> [1] 1.599983

The value we get is \(> 0\).

All the above steps in one code chunk:

n <- length(x1)
mean_x1 <- mean(x1)
sd_x1 <- sqrt(sum((x1 - mean_x1)^2) / (n))
z1 <- (x1 - mean_x1) / sd_x1
skewness <- sum(z1^3) / n
skewness
#> [1] 1.599983

So far so good, but what if we want to to this for another vector?

set.seed(43772)
x2 <- 0.6 + rgamma(120, shape = 0.8, scale = .95)

ggplot(data = NULL, aes(x = x2)) +
    geom_histogram(bins = 30) + theme_classic()

An obvious solution is to copy/paste our code from above, substituting x2 for x1.

n <- length(x2)
mean_x2 <- mean(x2)
sd_x2 <- sqrt(sum((x2 - mean_x2)^2) / (n))
z2 <- (x2 - mean_x2) / sd_x2
skewness <- sum(z2^3) / n
skewness
#> [1] 1.748647

This can be very error-prone.

Now we have a third vector:

set.seed(43772)
x3 <- 0.6 + rgamma(120, shape = 1.3, scale = 0.7)

ggplot(data = NULL, aes(x = x3)) +
    geom_histogram(bins = 30) + theme_classic()

n <- length(x3)
mean_x3 <- mean(x3)
sd_x3 <- sqrt(sum((x2 - mean_x3)^2) / (n))
z3 <- (x3 - mean_x3) / sd_x3
skewness <- sum(z2^3) / n
skewness
#> [1] 1.748647

Can you spot two typos in the last code chunk?

It would be better to write a function. Then we only write the code once, and we can reuse it again and again without rewriting anything.

9.2 Defining a function

A function has a name and arguments:

function_name(arg1, arg2 = val2)

The function function_name has two arguments, arg1 and arg2. Only arg2 has a default value. This means that arg1 is required.

Let’s give our skewness function the very descriptive name skewness. It will take one argument, x.

We use the function function() to define a new function:

skewness <- function(x) {
    # Code
}

Now we need the function body. This is the code between the curly braces {}. The last statement in the function body that is evaluated is automatically returned.

9.3 Skewness function

Now we are ready to define the skewness function:

skewness <- function(x) {

    n <- length(x)

    mean_x <- mean(x)
    sd_x <- sqrt(sum((x - mean_x)^2) / (n))

    # z-transform x and assign to z
    z <- (x - mean_x) / sd_x

    # this is not an evaluation, the result is assigned to skewness
    skewness <- sum(z^3) / n

    # we need to evaulate skewness
    # we could write 
    # print(skewness)
    # or
    # return(skewness)
    skewness
}

You can send the code to the R console, or evaulate the function definition using a keyboard short cut.

Now you call the function:

skewness(x1)
#> [1] 1.599983
skewness(x2)
#> [1] 1.748647

And you call assign the returned value to another variable:

skew_x3 <- skewness(x3)
skew_x3
#> [1] 1.241044

9.4 Further examples

Let’s look at some more examples.

9.4.1 Multiplication

We want a function that takes two numbers as arguments, and multiplies both numbers. We will choose a default value of 1 for the second argument. This means that, unless we specify otherwise, the first arguments will be multiplied by \(1\), i.e. the function will return that number itself.

multiply <- function(x, k = 1) {
    k * x
}
# mit default value für k
value <- multiply(2)
value
#> [1] 2
multiply(2, 3)
#> [1] 6
multiply(333, 43)
#> [1] 14319

We can also call the function with named arguments:

multiply(x = 4, k = 2)
#> [1] 8

9.4.2 Standard deviation

In our next example, we will turn our standard devation code into a function. We also want to handle any missing values, so we will use the function na.omit().

standard_deviation <- function(x) {
    # remove NAs
    x <- na.omit(x)
    
    sqrt(sum((x - mean(x))^2) / (length(x)))
}
standard_deviation(x1)
#> [1] 1.003002

Does our function handle NAs correctly?

x <- rnorm(12)
# the first element of x is replaced by NA
x[1] <- NA
x
#>  [1]         NA -0.2933334 -1.1924669  3.4490925 -0.6391923  0.4956852
#>  [7] -1.5414534  0.9317232 -2.2546778 -0.4523381  0.5655404 -1.2481360
standard_deviation(x)
#> [1] 1.478227

9.5 Exercise

9.5.1 Calculate skewness for RT data set

We can use our skewness() function just as any other function in R. This means that we can combine it with group_by() and summarise(), for example.

Let’s compute the skewness for each person in each condition in the rt_long data set.

rt_long %>%
    group_by(ID, condition) %>%
    summarise(skewness = skewness(rt))
#> # A tibble: 9 x 3
#> # Groups:   ID [3]
#>   ID    condition skewness
#>   <fct> <fct>        <dbl>
#> 1 1     control      0.559
#> 2 1     fast         1.45 
#> 3 1     slow         0.987
#> 4 2     control      1.15 
#> 5 2     fast         0.777
#> 6 2     slow         1.02 
#> # … with 3 more rows