4 Functions

One of the most useful aspects of using a programming language is the ability to automate your workflow and to let the computer do our work for you. Learning to write your own functions is one of the most important steps in this direction. Fortunately, R makes it very easy to create your own functions.

A further motivation for learning how to write functions is that it is often necessary to do so when you need something that is missing in base R, and you do not want to rely one some third-party package. For instance, there is no function in R to compute the standard error of the mean. Although it is pretty easy to do so manually, there are good reasons to automate calculations that have to be performed repeatedly.

The basic idea of a function is that it takes one or more inputs, does something with these, and then returns an output. The inputs are called the function arguments, the code that does something with the arguments is called the function body, and the output is called the return value.

Suppose you want to write a function that squares a number and then adds 5. Let’s call the input argument x. First, we have to choose a name for our new function. We’ll call it square_add_five. Then, the function definition looks like this:

square_add_five <- function(x) {
    x^2 + 5
}

Once we have evaluated this code in the R console, we can call our function:

square_add_five(4)
#> [1] 21

square_add_five(x = 4)
#> [1] 21

The code x^2 + 5 is the function body. This takes the argument x, computes x^2 and then adds 5 to the result. The result of this computation is returned, and can be assigned to a new variable:

result <- square_add_five(4)

That’s the basic idea. Now we will take a closer look at how to translate a formula into R code, and then define a new function.

4.1 Standard error

When summarizing data sets, we often need to compute the standard error \(\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation and \(n\) is the sample size.

Translating this formula into R code is fairly straightforward. If we have an imput vector x, we just need the standard deviation and the sample size. The standard deviation can be computed using the function sd(), and the sample size the number of observations in x, which is just the length of x.

We can compute \(\frac{s}{\sqrt{n}}\):

sd(x)/sqrt(length(x))

Let’s try this out with an example: we have a sample x consisting of ten observations.

x <- rnorm(10, mean = 30, sd = 3)

mean(x)
#> [1] 28.67591
sd(x)
#> [1] 3.352982

se <- sd(x) / sqrt(length(x))
se
#> [1] 1.060306

Now suppose that we want to compute the standard error for several samples, or for several groups within a data set. We would have to duplicate our code sd(x)/sqrt(length(x)) and adapt it to each new input vector. Even though this code looks manageable, code duplication is bad practice, because it can lead to errors. Another common source of errors are copy/paste errors: you might copy sd(x)/sqrt(length(x)) with the intention of changing the x to y , but you would have to remember to change each instance of x into a y. For more complex code, this is certainly not a good idea.

A much better approach is to create a function that can apply the code to any input vector.

4.2 Defining a function

Recall that a function has a name and arguments, which may have default values:

f(arg1, arg2 = val2)

The function f has two arguments, arg1 and arg2. Only arg2 has a default value. This means that arg1 is required.

Let’s give our standard error function the descriptive name standard_error. It will take one argument, x.

We use the function function() to define a new function:

standard_error <- function(x) {
    # Code
}

Now we need the function body. This is the code between the curly braces {}. The last statement in the function body that is evaluated is automatically returned.

4.3 Standard error function

Now we are ready to define the standard error function:

standard_error <- function(x) {

    n <- length(x)
    sd(x)/sqrt(n)
}

We can use this function with any valid input vector:

standard_error(x)
#> [1] 1.060306

4.4 Standard deviation function: handling NA values

In our next example, we will learn how to handle NA’s. We will implement our own standard deviation function. R’s sd() computes the sample standard deviation, by dividing by n-1. If we merely want to use the standard deviation descriptively, we need to divide by n.

Let’s create a function. This time, we also want to be able to handle any missing values, so we will use the function na.omit().

The formula for the standard deviation is

\[\sigma = \sqrt{\frac{\sum_{i}(x_i-\bar{x})^2}{n}}\]

To compute this, we subtract the mean \(\bar{x}\) from each element in \(x\), square the differences, and sum over all elements. Translated into R code: sum((x - mean(x))^2). Next, we need the number of observations again, which is length(x).

standard_deviation <- function(x, na.rm = FALSE) {
    # remove NAs
    if (na.rm) {
        x <- na.omit(x)
    }
    sqrt(sum((x - mean(x))^2) / (length(x)))
}

standard_deviation(x)
#> [1] 3.180918

Does our function handle NAs correctly?

y <- rnorm(12)

# let's replace the elements 1, 4 and 8 by NAs
y[c(1, 4, 8)] <- NA
y
#>  [1]         NA  0.6289820  2.0650249         NA  0.5124269 -1.8630115
#>  [7] -0.5220125         NA  0.5429963 -0.9140748  0.4681544  0.3629513

Without na.rm, we get NA.

standard_deviation(y)
#> [1] NA

With na.rm set to true, the missing values are omitted, and we get the standard deviation of the remaining elements.

standard_deviation(y, na.rm = TRUE)
#> [1] 1.052228

4.5 Exercises

4.5.1 Multiplication

Can you write a function that takes two numbers as arguments, and multiplies them. Choose a default value of 1 for the second argument. This means that, unless we specify otherwise, the first arguments will be multiplied by \(1\), i.e. the function will return that number itself.

Solution

multiply <- function(x, k = 1) {
    k * x
}

# with default value für k
value <- multiply(2)
value
#> [1] 2

multiply(2, 3)
#> [1] 6
multiply(333, 43)
#> [1] 14319

We can also call the function with named arguments:

multiply(x = 4, k = 2)
#> [1] 8

4.5.2 Skewness

In psychological research, we often measure response times. These data are often right-skewed. The skewness can be caluclated with the following formula:

\[\mathrm{skewness} = \frac{\sum_{m=1}^{n}{(x_m - \bar{x})^3}} {n \cdot s_X^3}\]

\(x\) is a vector of observations with length \(n\), \(\bar{x}\) is its mean and \(s_X\) its empirical standard deviation \(s_X = \sqrt{\frac{1}{n}\sum_{m=1}^n{(x_m - \bar{x})^2}}\). For symmetrical distributions, the skewness is \(0\). For left-skewed distributions, the skewness is negative and for right-skewed distributions it’s positive. For response time data, we expect a value \(> 0\).

The formula can be simplified: \[\mathrm{skewness} = \frac{1}{n} \sum_{m=1}^{n} {z_m^3}\] with \[z_m =\frac{x_m - \bar{x}} {s_X}\]. To compute the skewness of a vector, we first standardize it, i.e. subtract its mean and divide by its standard deviation. Then we compute its third power and divide by \(n\).

Let’s generate some random numbers from a Gamma distribution. The Gamma distribition is only defined for positive numbers, and is right skewed.

set.seed(43772)
rts <- 0.5 + rgamma(100, shape = 1.0, scale = .9)

Unfortunately, there is no function in base R that can calculate skewness. Can you write a function that does this?

Solution

First we’ll translate the formula into R code.

n <- length(rts)
mean_rt <- mean(rts)
sd_rt <- sqrt(sum((rts - mean_rt)^2) / (n))
z <- (rts - mean_rt) / sd_rt
skewness <- sum(z^3) / n
skewness
#> [1] 1.599983

skewness <- function(x) {

    n <- length(x)

    mean_x <- mean(x)
    sd_x <- sqrt(sum((x - mean_x)^2) / (n))

    # z-transform x and assign to z
    z <- (x - mean_x) / sd_x

    # this is not an evaluation, the result is assigned to skewness
    skewness <- sum(z^3) / n

    # we need to evaulate skewness
    # we could write 
    # print(skewness)
    # or
    # return(skewness)
    skewness
}

skewness(rts)
#> [1] 1.599983