# 6 Statistical Distributions

R provides a very simple and effective way of calculating distribution characteristics for a number of distributions (we only present part of it here).

An overview of all available distributions is can be found via help(“Distributions”).

## 6.1 Normal distribution

We will introduce the different statistical functions using the normal distribution and then look at other distributions. With help("Normal") we get an overview of the statistical functions for the normal distribution:

dnorm(x, mean = 0, sd = 1,      # Probability density function
log = FALSE)
pnorm(q, mean = 0, sd = 1,      # Cumulative distribution function (computing
lower.tail = TRUE,        # tail probabilities for z scores)
log.p = FALSE)
qnorm(p, mean = 0, sd = 1,      # Quantile function (computing z scores for probabilities)
lower.tail = TRUE,        # -> Inverse cumulative distribution function
log.p = FALSE)
rnorm(n, mean = 0, sd = 1)      # Pseudo-random number generator of normally distributed scores

Arguments

x, q   vector of quantiles.     # Normally distributed value(s), for which the probability density
# [dnorm()] or the distribution function [pnorm()] will be computed.
p      vector of probabilities. # Cumulative probability or probabilities, for which
# the quantile of a normal distribution will be computed
n      number of observations.  # Number of random numbers to be generated with rnorm()
mean   vector of means.         # Expected value(s) of one or more normal distributions
sd     vector of SDs.           # Standard deviation(s) of one or more normal distributions
log, log.p   logical;           # computes all statistics for lognormal distributions
are given as log(p).
lower.tail   logical;           # if FALSE: 1) with pnorm() instead of the cumulative
if TRUE (default),           #  distribution function P[X ≤ x] the complementary
# function P[X > x] will be computed;
probabilities are P[X ≤ x]   # 2) when computing quantiles using qnorm() the argument
# p is interpreted as 1-p

### Cumulative distribution function (CDF)

The distribution function pnorm() is needed to calculate p-values. The simplest case concerns the left tail (one-tailed) p-value for a given standard-normally distributed empirical z score.

If we do not specify the arguments mean and sd, the default values mean = 0 and sd = 1 are used and we get the standard normal distribution.

Example:

pnorm(q = -1.73)
#> [1] 0.04181514

# or simply

pnorm(-1.73)
#> [1] 0.04181514

The p-value for a one-tailed (left-tailed) significance test for z = -1.73 is 0.042.

If we need a right-tailed p-value (area to the right of a certain empirical z-score) we use the argument lower.tail = FALSE.

pnorm(1.645)  # this is 1-p
#> [1] 0.9500151
pnorm(1.645, lower.tail = FALSE)
#> [1] 0.04998491

The value of 1.645 of the standard normal distribution is often used as a critical value for the one-tailed significance test. So to left of this value should be 95% of the distribution and to the right of it 5% of the distribution. Here we see that this is not exactly true.

For a two-tailed p-value we’ll have to double the one-tailed p-value:

# 2-tailed p-value for a negative z-score
2*pnorm(-1.73)
#> [1] 0.08363028

# 2-tailed p-value for a negative z-score
2*pnorm(1.73, lower.tail = FALSE)
#> [1] 0.08363028
2*(1 - pnorm(1.73))
#> [1] 0.08363028

We can also use pnorm() to get the percentiles of a series of normally distributed values. For example, we want to know which percentiles correspond to a set of IQ values:

IQ_Scores <- c(74, 87, 100, 110, 128, 142)
pnorm(q = IQ_Scores, mean = 100, sd = 15)
#> [1] 0.04151822 0.19306234 0.50000000 0.74750746 0.96902592 0.99744487

The IQ scores 74, 87, 100, 110, 128, 142 thus correspond to the percentiles 4.2%, 19.3%, 50%, 74.8%, 96.9%, 99.7% of the IQ distribution.

It is also possible to compute the probability of any interval between two normally distributed scores by subtracting the two respective cumulative distribution functions.

# Probability of z-scores between -0.5 and + 0.5
pnorm(0.5) - pnorm(-0.5)
#> [1] 0.3829249

# Probability of IQ-scores between 130 and 140
pnorm(140, mean = 100, sd = 15) - pnorm(130, mean = 100, sd = 15)
#> [1] 0.01891975

### Quantile function

The quantile function qnorm() is the complement to the distribution function. It is also called inverse cumulative distribution function (ICDF). With qnorm() we obtain a z-score (i.e., a quantile of the standard normal distribution) for a given area p representing the first argument of the function. In practice, we often need this function to calculate critical values for a given $$\alpha$$-level for one-tailed and two-tailed significance tests.

The most frequently used z quantiles for $$\alpha$$ = 0.01 and $$\alpha$$ = 0.05 are:

# alpha = 0.01
qnorm(p = 0.005)    # two-tailed test, lower critical value
#> [1] -2.575829
qnorm(p = 0.995)    # two-tailed test, upper critical value
#> [1] 2.575829
qnorm(p = 0.01)     # one-tailed test, left critical value
#> [1] -2.326348
qnorm(p = 0.99)     # one-tailed test, right critical value
#> [1] 2.326348

# alpha = 0.05
qnorm(p = 0.025)    # two-tailed test, lower critical value
#> [1] -1.959964
qnorm(p = 0.975)    # two-tailed test, upper critical value
#> [1] 1.959964
qnorm(p = 0.05)     # one-tailed test, left critical value
#> [1] -1.644854
qnorm(p = 0.95)     # one-tailed test, right critical value
#> [1] 1.644854

Alternatively, the upper or right-tailed critical values can be computed with the lower quantile and the argument lower.tail = FALSE: qnorm (p = 0.975) is equivalent to qnorm (p = 0.025, lower.tail = FALSE )

We can use the quantile function also more generally, e.g., for the deciles of the IQ distribution:

deciles <- seq(0, 1, by = 0.1)
qnorm(p = deciles, mean = 100, sd = 15)
#>  [1]      -Inf  80.77673  87.37568  92.13399  96.19979 100.00000 103.80021
#>  [8] 107.86601 112.62432 119.22327       Inf

With an IQ of just under 120, you thus belong to the smartest 10% of the population!

### Probability density function (PDF)

We actually rarely need the probability density function (PDF) of a normally distributed score. As a reminder: the probability of individual values of a continuous random variable is 0, only intervals of values have a probability greater than 0 (corresponding to the area under the curve, the integral). Individual values have a probability density, which corresponds to the height of the curve of the probability distribution at the point X = x. Thus, we need the probability density function if we want to plot a normal distribution.

Before we start plotting, let’s have a look at the definition of the probability density function of the standard normal distribution: $$\phi(z) = \frac{1}{\sqrt{2\cdot\pi\cdot e^{z^{2}}}}$$. The density for the mean ($$\mu$$ = 0) is therefore $$\phi(z) = \frac{1}{\sqrt{2\cdot\pi\cdot e^{0^{2}}}} = \frac{1}{\sqrt{2\cdot\pi}}$$:

1/(sqrt(2*pi))
#> [1] 0.3989423

# and that is:

dnorm(x = 0)
#> [1] 0.3989423

We then get the probability density function for z = 1 as:$$\phi(z) = \frac{1}{\sqrt{2\cdot\pi\cdot e^{1^{2}}}} = \frac{1}{\sqrt{2\cdot\pi\cdot e}}$$:

1/(sqrt(2*pi*exp(1)))
#> [1] 0.2419707

# or as:

dnorm(1)
#> [1] 0.2419707

#### Plotting the standard normal distribution

library(tidyverse)
p <- data_frame(x = c(-4,4)) %>%
ggplot(aes(x = x))
#> Warning: data_frame() is deprecated, use tibble().
#> This warning is displayed once per session.

p + geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.4) +
stat_function(fun = dnorm, color = "#84CA72", size = 1.5) +
ggtitle("Standard Normal Distribution") +
xlab("x") +
ylab("dnorm(x)") +
theme_classic()

We can also draw a certain z score in the diagram and let us calculate a p-value for this with pnorm():

my_z <- 1.73

p_value <- round(pnorm(my_z, lower.tail = FALSE), 3)

p + geom_segment(aes(x = my_z,
y = 0,
xend = my_z,
yend = (dnorm(my_z))),
color = "grey70",
size = 1.5,
alpha = 0.8) +
stat_function(fun = dnorm, color = "#84CA72", size = 1.5) +
ggtitle("Standard Normal Distribution") +
xlab("x") +
ylab("dnorm(x)") +
annotate("text", label = paste0("p = ", p_value),
x = my_z + 0.7,
y = 0.1,
color = "black") +
theme_classic()

### Data simulation

With rnorm() we can generate normally distributed pseudo-random numbers. It is a pseudo-random number generator because the function only simulates randomness. An actual random number generator is a bit complicated and can not be easily implemented using a deterministic computer. The advantage of the pseudo-random generator is that we can get a reproducible “random” draw using the seed value function. set.seed () starts the draw algorithm at a specific location, so that the same starting numbers always return the same “random” numbers.

Let’s generate 22 “random” IQ scores:

set.seed(534)  # or any other number
r_IQ_scores <- rnorm(n = 22, mean = 100, sd = 15)
r_IQ_scores
#>  [1]  91.85366 112.20598  93.59749  95.60904  89.71878  93.37631  87.61408
#>  [8]  92.83598  93.02380  88.35326 132.49322  87.30990 107.95051  95.45687
#> [15]  92.59292  84.40998 135.42350  67.04528  91.38498 130.92657  81.36571
#> [22] 102.44571
mean(r_IQ_scores)
#> [1] 97.59062
sd(r_IQ_scores)
#> [1] 16.91699

## 6.2t distribution

With help("TDist") we obtain an overview of the functions for the t distribution:

dt(x, df, ncp               # Probability density function of the t distribution
log = FALSE)
pt(q, df, ncp,              # Cumulative distribution function (computing tail
lower.tail = TRUE,    # probabilities for t scores)
log.p = FALSE)
qt(p, df, ncp               # Quantile function (computing t scores for probabilities)
lower.tail = TRUE,    # -> Inverse cumulative distribution function
log.p = FALSE)
rt(n, df, ncp)              # Pseudo-random number generator of t distributed scores


The function arguments are the same as for the normal distribution with the following differences: The degrees of freedom (df) of the t distribution must be specified. The arguments mean and sd are omitted, but there is the non-centrality parameter ncp. This is needed (as the name suggests) for non-central t distributions that do not have a mean of 0. While the central t distribution (with $$\mu = 0$$) is symmetric and corresponds to the H0 distribution required for t tests, the non-central t distributions (H1 distributions) are asymmetric and required for power analysis. We will not look at these distributions here, so we will not use the ncp argument.

### Cumulative distribution function (CDF)

Example for a one-tailed (left-tailed) p-value:

pt(q = -1.73, df = 6)
#> [1] 0.06717745
pt(-1.73, 6)
#> [1] 0.06717745

Example for a one-tailed (right-tailed) p-value

pt(1.73, df = 6, lower.tail = FALSE)
#> [1] 0.06717745

For a two-tailed significance test we would need to double the one-tailed p-value:

# 2-tailed p-value for a negative t
2 * pt(-1.73, df = 6)
#> [1] 0.1343549

# 2-tailed p-value for a positive t
2 * pt(1.73, df = 6, lower.tail = FALSE)
#> [1] 0.1343549

### Quantile function

With qt() we get a t score for a given area p. In most t distribution tables, only specific t quantiles are tabulated. In the textbook by Eid, Gollwitzer & Schmitt (2015), e.g. the t scores for the distribution functions p of 0.6, 0.8, 0.9, 0.95, 0.975, 0.99 and 0.995 are given for t distributions with various degrees of freedom.

Let’s get these for a t-distribution with df = 6:

my_ps <- c(0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 0.995)
qt(my_ps, df = 6)
#> [1] 0.2648345 0.9057033 1.4397557 1.9431803 2.4469119 3.1426684 3.7074280

Compare these t quantiles with those from the t table! Calculate additional quantiles for t distributions with df > 30 that are not available in the table. (You will never again have to approximate a critical t score with a z score because the table does not include the specific t distribution you need…)

### Probability density function (PDF)

Let’s plot a t distribution:

library(tidyverse)
p <- data_frame(x = c(-4,4)) %>%
ggplot(aes(x = x))

p + stat_function(fun = dt, args = list(df = 6), color = "#84CA72", size = 1.5) +
ggtitle("Central t-Distribution (df = 6)") +
xlab("x") +
ylab("dt(x, df = 6)") +
theme_classic()

library(tidyverse)
p <- data_frame(x = c(-4,4)) %>%
ggplot(aes(x = x))

p + stat_function(fun = dt, args = list(df = 6), color = "#84CA72", size = 1.5) +
stat_function(fun = dnorm, color = "grey70", size = 1.5) +
ggtitle("Central t-Distribution (df = 6) compared to z-Distribution") +
xlab("x") +
ylab("dt(x, 6) / dnorm(x)") +
theme_classic()

### Data simulation

With rt() we can randomly generate t distributed scores. The functionality is the same as for the normal distribution.

## 6.3 Chi-square distribution

For the chi-square distribution (help ("chisquare")) the functionality is the same as for the t distribution. The respective degrees of freedom must be specified and there are also non-central distributions, which we will not consider further. The functions are called pchisq() (cumulative distribution function), qchisq() (quantile function), dchisq() (density function) and rchisq() (random generation of $$\chi^2$$-distributed scores).

### Cumulative distribution function (CDF)

When calculating p-values for the chi-square distribution, in most cases only the upper (right) end of the distribution is needed. This is because most statistical tests are designed in such a way that all deviations from H0 (no matter in which direction) lead to a larger chi-square value (that’s why these tests always test a non-directional hypothesis, and the p-value therefore corresponds to a two-tailed test). Thus, we always use the argument lower.tail = FALSE to calculate p-values for chi-square distributions.

There are some exceptions: One is the chi-square test for comparing a sample variance to a known population variance (one-sample variance test). For this test in some cases the lower (left) end of chi-square distribution is needed. Since the chi-square one-sample variance test is not implemented in most statistical packages, one could actually be confronted with the need to compute such a p-value by hand (using pchisq() with the default value lower.tail = TRUE)…

Example for a p-value for $$\chi^2(2) = 4.56$$

pchisq(q = 4.86, df = 2, lower.tail = FALSE)
#> [1] 0.08803683

### Quantile function

With qchisq() we obtain the quantiles of the chi-square distribution, e.g. the critical values for $$\alpha = 0.05$$. Some statisticians are very proud to know the critical values for the 1, 2, 3, 4, and 5 degrees of freedom of the chi-square distributions by heart, because these are often needed for model comparisons via likelihood ratio tests.

my_dfs <- 1:5
round(qchisq(0.95, df = my_dfs), 2)
#> [1]  3.84  5.99  7.81  9.49 11.07

### Probability density function (PDF)

Again, we can use the probability density function dchisq() to plot one or more chi-square distributions:

library(tidyverse)
p <- data_frame(x = c(0, 25)) %>%
ggplot(aes(x = x))

p + stat_function(fun = dchisq, args = list(df = 2), color = "#84CA72", size = 1.5) +
stat_function(fun = dchisq, args = list(df = 4), color = "grey70", size = 1.5) +
stat_function(fun = dchisq, args = list(df = 8), color = "blue", size = 1.5, alpha = 0.7) +
stat_function(fun = dchisq, args = list(df = 14), color = "red", size = 1.5, alpha = 0.7) +
ggtitle("Chi-Square Distributions with dfs = 2, 4, 8, and 14") +
xlab("x") +
ylab("dchisq(x, c(2, 4, 8, 14))") +
theme_classic()

### Data simulation

With rchisq() we can generate $$\chi^2$$-distributed random numbers. The functionality is the same as for the other distributions.

## 6.4F distribution

For the F distribution (help ("FDist")), the same rules apply as for the $$\chi^2$$-distribution and for the t distribution. We have to specify both numerator degrees of freedom (df1) and denominator degrees of freedom (df2). The functions are pf() (cumulative distribution function),qf() (quantile function), df() (probability density function), and rf() (random generation of F distributed scores).

### Cumulative Distribution Function

When calculating p-values for the F distribution, in most cases only the upper (right) end of the distribution is needed (as with the $$\chi^2$$-distribution). We therefore always use the argument lower.tail = FALSE to calculate p-values. Probably the most typical application for the F test is the one-factorial analysis of variance. Here we test whether there are any significant differences in the mean values between several groups. All deviations from H0 lead to a larger SSQbetween and thus to a higher empirical F-statistic. That’s also why we consider the entire $$\alpha$$ at the upper end of the distribution when computing critical F scores (see below “Quantile Function”).

Example for a p-value for F = 3.89 (df1 = 2, df2 = 40):

pf(q = 3.89, df1 = 2, df2 = 40, lower.tail = FALSE)
#> [1] 0.02859413

### Quantile function

If we compute an ANOVA by hand ;) often F distribution tables do not contain all combinations of degrees of freedoms. With qf() we can obtain the quantiles of the F distribution, thus the critical values for a significance test.

Example: Critical values for $$\alpha = 0.05$$ and $$\alpha = 0.01$$ for an F distribution with df1 = 2 and df2 = 40:

qf(c(0.95, 0.99), df1 = 2, df2 = 40)
#> [1] 3.231727 5.178508

# or also

qf(c(0.05, 0.01), df1 = 2, df2 = 40, lower.tail = FALSE)
#> [1] 3.231727 5.178508

### Probability density function (PDF)

Let’s plot some F distributions using df():

library(tidyverse)
p <- data_frame(x = c(0, 5)) %>%
ggplot(aes(x = x))

p + stat_function(fun = df, args = list(df1 = 2, df2 = 40),
color = "#84CA72", size = 1.5) +
stat_function(fun = df, args = list(df1 = 5, df2 = 80),
color = "grey70", size = 1.5) +
stat_function(fun = df, args = list(df1 = 8, df2 = 120),
color = "blue", size = 1.5, alpha = 0.7) +
ggtitle("F-Distributions with df1 = 2, 5, 8 and df2 = 40, 80, 120") +
xlab("x") +
ylab("df(x, c(2, 5, 8), c(40, 80, 120))") +
theme_classic()

### Data simulation

With rf() we can generate F distributed random numbers. The functionality is the same as for the other distributions.

## 6.5 Exercises

### Densities and plots

Compute the probability densities of the following IQ-scores: 70, 80, 90, 100, 110, 120, 130

Plot the following two normal distributions into a common plot: $$N(\mu = 0,\sigma^{2} = 1)$$ und $$N(\mu = 3,\sigma^{2} = 0.75)$$. Use different colors for the two distributions. Hint: variance $$\neq$$ SD!

library(tidyverse)
p <- data_frame(x = c(-4,7)) %>%
ggplot(aes(x = x))

p + stat_function(fun = dnorm, color = "#84CA72", size = 1.5) +
stat_function(fun = dnorm, args = list(3, sqrt(0.75)), color = "purple", size = 1.5, alpha = 0.6) +
ggtitle("2 Normal Distributions") +
xlab("x")+
ylab("dnorm(x)") +
theme_classic()`

### Cumulative distribution functions and p-values

Compute the cumulative distribution function for z = 1.

Compute the probability of values +/- 1 SD of the mean for any normal distribution (i.e. the area of this interval under the normal curve).

What is the percentange of the population that has IQs between 105 and 120?

Compute a one-tailed (right-tailed) p-value for a test statistic t= 2.045 (df = 8).

Compute a two-tailed p-value for a test statistic t= 0.73 (df = 14)

Compute a two-tailed p-value for a test statistic t= -0.73 (df = 14)

You got a test statistic of F= 4.36. Is the test significant (df1 = 2, df2 = 16)?

### Quantiles and critical values

Sixty percent of the population are below or equal to which IQ-score? Calculate the 60%-quantile of the IQ distribution.

Five percent of the population are above or equal to which IQ-score?

Advanced: Between which two IQ scores lie the middle 82% of the population? (We are looking for the 82%- Central Probability Interval)

Calculate the critical t-score (df = 28) for a one-tailed (right-tailed) test with $$\alpha = 0.05$$.

Calculate the critical t-scores (df = 28) for a two-tailed test with $$\alpha = 0.05$$.

Calculate the critical F-score for df1 = 5 and df2 = 77 ($$\alpha = 0.01$$).