10 Linear models

10.1 Multiple regression

In a linear (multiple) regression model we are interested in the joint linear prediction of a dependent variable by one or more independent variables. By taking into account the interdependencies among the predictor (independent) variables we aim to assess the unique effect of each predictor in the context of all other predictors.

The regression equation for three predictor variables is:

Population model:

\(Y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \beta_3 \cdot X_3 + E\)

Sample model:

\(Y = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + b_3 \cdot X_3 + E\)

We can also write the regression equation for the predicted values (\(\hat Y\)) by omitting the residual variable \(E\):

\(\hat Y = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + b_3 \cdot X_3\)

The regression model above is specified as follows using the lm() function:

lm(Y ~ X1 + X2 + X3, data = mydata).

The formula notation is the same as used earlier with aov() to analyze factorial ANOVAs. Connecting the predictor variables with + means that only main effects are estimated (multiple regression). If we want a model including all possible interactions (two-way and three-way), the predictors have to be linked with *. If only certain interaction effects, but not others, should be in the model, the respective predictors are connected with : and the corresponding terms are added using +.

For example, lm(Y ~ X1 + X2 + X3 + X1:X2 + X2:X3 + X1:X3, data = mydata) represents a model with all main effects and all two-way interactions, but without the three-way interaction X1:X2:X3.
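
If you are unsure which terms a formula shorthand generates, the expansion can be inspected without fitting a model. A small sketch (X1, X2, and X3 are placeholder names; terms() works on the formula symbolically):

# terms() expands the * shorthand into all main effects and interactions
attr(terms(Y ~ X1*X2*X3), "term.labels")
# expected: "X1" "X2" "X3" "X1:X2" "X1:X3" "X2:X3" "X1:X2:X3"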

Before we start with an example, a new way of loading and installing packages is introduced. The function p_load() from the pacman package allows loading a number of packages and automatically installs packages that cannot be found on the system:

install.packages("pacman")

pacman::p_load(haven, interactions, memisc, QuantPsyc, psych, tidyverse, wec)

As a first example, let us predict adolescents’ overall grade average by the three subscales from the academic self-efficacy domain. The data set should thus include the variables ID and Gesamtnote, as well as swk_neueslernen, swk_lernregulation, and swk_motivation (the three academic self-efficacy subscales). Since we will be adding the three subscales from the social self-efficacy domain in a subsequent example the scales swk_durchsetzung, swk_sozialkomp, and swk_beziehung are also included.

Loading and preparing the data set grades:

load(file = "data/adolescents.Rdata")
grades <- adolescents %>%
    dplyr::select(ID, Gesamtnote, starts_with("swk_")) %>% 
    drop_na()
# because there is also a select() function in the QuantPsyc package it is 
# safer to call dplyr::select() directly to avoid error messages

Let’s have a look at the data and get some descriptive statistics plus a bivariate correlation matrix:

head(grades)
#> # A tibble: 6 x 8
#>   ID    Gesamtnote swk_neueslernen swk_lernregulat… swk_motivation
#>   <fct>      <dbl>           <dbl>            <dbl>          <dbl>
#> 1 2              4            5                4              4.6 
#> 2 10             5            5                4.88           6   
#> 3 11             4            4.57             5.38           5.4 
#> 4 12             5            5.33             5.29           4.67
#> 5 14             4            4.86             4.5            4.6 
#> 6 15             4            4                4.38           5.6 
#> # … with 3 more variables: swk_durchsetzung <dbl>, swk_sozialkomp <dbl>,
#> #   swk_beziehung <dbl>
describe(grades, fast = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
#> Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
#>                    vars   n mean   sd  min  max range   se
#> ID                    1 279  NaN   NA  Inf -Inf  -Inf   NA
#> Gesamtnote            2 279 4.46 0.69 3.00    6  3.00 0.04
#> swk_neueslernen       3 279 5.09 0.78 2.43    7  4.57 0.05
#> swk_lernregulation    4 279 5.23 0.80 2.88    7  4.12 0.05
#> swk_motivation        5 279 5.44 0.76 3.20    7  3.80 0.05
#> swk_durchsetzung      6 279 5.28 0.99 1.00    7  6.00 0.06
#> swk_sozialkomp        7 279 5.28 0.79 1.57    7  5.43 0.05
#> swk_beziehung         8 279 5.61 0.75 3.33    7  3.67 0.05
grades %>%
    dplyr::select(-ID) %>% 
    cor() %>% 
    round(2)
#>                    Gesamtnote swk_neueslernen swk_lernregulation swk_motivation
#> Gesamtnote               1.00            0.42               0.39           0.34
#> swk_neueslernen          0.42            1.00               0.56           0.64
#> swk_lernregulation       0.39            0.56               1.00           0.72
#> swk_motivation           0.34            0.64               0.72           1.00
#> swk_durchsetzung         0.05            0.34               0.31           0.42
#> swk_sozialkomp           0.11            0.36               0.33           0.41
#> swk_beziehung            0.24            0.39               0.46           0.41
#>                    swk_durchsetzung swk_sozialkomp swk_beziehung
#> Gesamtnote                     0.05           0.11          0.24
#> swk_neueslernen                0.34           0.36          0.39
#> swk_lernregulation             0.31           0.33          0.46
#> swk_motivation                 0.42           0.41          0.41
#> swk_durchsetzung               1.00           0.59          0.32
#> swk_sozialkomp                 0.59           1.00          0.49
#> swk_beziehung                  0.32           0.49          1.00

Estimation of the regression model:

model_1 <- lm(Gesamtnote ~ swk_neueslernen + swk_lernregulation + swk_motivation,
              data = grades)

Results with summary(), confidence intervals with confint(), and standardized regression coefficients with QuantPsyc::lm.beta():

summary(model_1)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ swk_neueslernen + swk_lernregulation + 
#>     swk_motivation, data = grades)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.63155 -0.42393  0.06732  0.46759  1.27122 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         2.13666    0.28957   7.379 1.90e-12 ***
#> swk_neueslernen     0.26849    0.06352   4.227 3.23e-05 ***
#> swk_lernregulation  0.20282    0.06773   2.995    0.003 ** 
#> swk_motivation     -0.01869    0.07679  -0.243    0.808    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6176 on 275 degrees of freedom
#> Multiple R-squared:  0.2107, Adjusted R-squared:  0.2021 
#> F-statistic: 24.47 on 3 and 275 DF,  p-value: 4.582e-14
confint(model_1) %>% round(2)
#>                    2.5 % 97.5 %
#> (Intercept)         1.57   2.71
#> swk_neueslernen     0.14   0.39
#> swk_lernregulation  0.07   0.34
#> swk_motivation     -0.17   0.13
lm.beta(model_1) %>% round(2)
#>    swk_neueslernen swk_lernregulation     swk_motivation 
#>               0.30               0.23              -0.02

While the predictors swk_neueslernen \((X_1)\) and swk_lernregulation \((X_2)\) showed significant positive effects on adolescents’ grade average (\(b_1 = 0.268, p < .001\) and \(b_2 = 0.203, p = 0.003\), respectively), the effect of swk_motivation \((X_3)\) was non-significant (\(b_3 = -0.019, p = 0.808\)).

Together, the three academic self-efficacy scales explained 21.1% of the variance of Gesamtnote (\(R^2 = 0.211\)).

10.2 Hierarchical regression

In a second step we would like to find out whether the three social self-efficacy scales explain additional variance in Gesamtnote over and above the three academic self-efficacy scales. To this end, the three additional predictors are added in a second block (hierarchical regression). We thus compute a model_2 with all six predictors and then compare it to the reduced model (model_1) with only three predictors via an F test using anova().

model_2 <- lm(Gesamtnote ~ swk_neueslernen + swk_lernregulation + swk_motivation + 
                swk_durchsetzung + swk_sozialkomp + swk_beziehung, data = grades)

Using update(), the same model can be defined in a way that better reflects the blocked/nested structure of the predictors:

model_2a <- update(model_1, . ~ . + swk_durchsetzung + swk_sozialkomp + swk_beziehung)

Extracting results:

summary(model_2)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ swk_neueslernen + swk_lernregulation + 
#>     swk_motivation + swk_durchsetzung + swk_sozialkomp + swk_beziehung, 
#>     data = grades)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.6421 -0.4048  0.0764  0.4614  1.2687 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         2.25324    0.33953   6.636 1.73e-10 ***
#> swk_neueslernen     0.28045    0.06385   4.392 1.61e-05 ***
#> swk_lernregulation  0.18604    0.06907   2.693  0.00751 ** 
#> swk_motivation      0.02777    0.07831   0.355  0.72310    
#> swk_durchsetzung   -0.09844    0.04743  -2.075  0.03889 *  
#> swk_sozialkomp     -0.04015    0.06315  -0.636  0.52548    
#> swk_beziehung       0.06945    0.06012   1.155  0.24897    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6121 on 272 degrees of freedom
#> Multiple R-squared:  0.2333, Adjusted R-squared:  0.2164 
#> F-statistic:  13.8 on 6 and 272 DF,  p-value: 1.097e-13
confint(model_2) %>% round(2)
#>                    2.5 % 97.5 %
#> (Intercept)         1.58   2.92
#> swk_neueslernen     0.15   0.41
#> swk_lernregulation  0.05   0.32
#> swk_motivation     -0.13   0.18
#> swk_durchsetzung   -0.19  -0.01
#> swk_sozialkomp     -0.16   0.08
#> swk_beziehung      -0.05   0.19
lm.beta(model_2) %>% round(2)
#>    swk_neueslernen swk_lernregulation     swk_motivation   swk_durchsetzung 
#>               0.31               0.22               0.03              -0.14 
#>     swk_sozialkomp      swk_beziehung 
#>              -0.05               0.08

The only additional significant predictor is swk_durchsetzung \((X_4)\) which showed a negative effect on adolescents’ grade average (\(b_4 = -0.098, p = 0.039\)). All six self-efficacy scales together explained \(R^2 = 0.233\) of the variance of Gesamtnote.

An additional important question is whether the three social self-efficacy variables as a set or block have a significant effect over and above the three academic self-efficacy scales:

anova(model_1, model_2)
#> Analysis of Variance Table
#> 
#> Model 1: Gesamtnote ~ swk_neueslernen + swk_lernregulation + swk_motivation
#> Model 2: Gesamtnote ~ swk_neueslernen + swk_lernregulation + swk_motivation + 
#>     swk_durchsetzung + swk_sozialkomp + swk_beziehung
#>   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
#> 1    275 104.9                              
#> 2    272 101.9  3    3.0055 2.6743 0.04767 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model comparison was significant and \(\Delta{R^2} = R_2^2 - R_1^2 = 0.233 - 0.211 = 0.022\).
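
The increment \(\Delta R^2\) can also be computed directly from the two fitted model objects; a one-line sketch:

# Delta R^2 from the unrounded R^2 values of both models
summary(model_2)$r.squared - summary(model_1)$r.squared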

How could we explain the negative effect of swk_durchsetzung on Gesamtnote? The more adolescents believe in their ability to assert themselves, the more likely they are to have a (slightly) worse grade average, once the effect is considered in the context of the five other self-efficacy variables. A possible interpretation is that those parts of assertiveness that are independent of the other self-efficacy dimensions reflect a kind of non-academic orientation of the adolescent.

10.3 Leave-one-out cross-validation

Regression parameters were determined using a sample with a specific sampling error. \(R^2\) increases even when (certain) predictors possess only chance covariation with the criterion. But how does the model fare with regard to out-of-sample prediction? This can be studied using (K-fold) cross-validation. The simplest method is to split the data into two halves, fit a regression model on the first half (training sample), and then predict the \(y\) values of the second half (test sample) using the regression coefficients obtained in the training sample (2-fold cross-validation). If the correlation between the predicted and the actual values in the test sample is considerably lower than the multiple correlation \(R = \sqrt {R^2}\) in the training sample, out-of-sample prediction is poor.

A disadvantage is that the actual analysis is done with only half the sample size, leading to lower power and higher data acquisition costs. Therefore, the training sample is sometimes set to, for instance, 80% of the data and the test sample to the remaining 20%. When this is done five times, each time with a different 20% of the data serving as the test sample, this represents a 5-fold cross-validation procedure.
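
As an illustration of this idea, a K-fold cross-validation can be scripted in a few lines of base R. The following is a minimal sketch for model_2 with K = 5 (the fold assignment is random, so the exact result depends on the seed):

# 5-fold cross-validation of model_2: average out-of-sample mean squared error
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(grades)))
cv_mse <- numeric(k)
for (i in 1:k) {
  train <- grades[folds != i, ]
  test  <- grades[folds == i, ]
  fit   <- lm(formula(model_2), data = train)   # refit on the training folds
  cv_mse[i] <- mean((test$Gesamtnote - predict(fit, newdata = test))^2)
}
mean(cv_mse)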

Leave-one-out cross-validation (LOOCV) takes this idea to the extreme: leave out only one person, fit the regression model on the rest of the sample (training sample), and predict this person’s \(y\) value from that model; repeat this \(n\) times, each time leaving out a different person.

The LOOCV error of a regression model is then given by: \[LOOCVE=\frac{1}{n} \sum_{m=1}^{n}\left(y_{m}-\hat{y}_{m \backslash(m)}\right)^{2}\] with \(\hat{y}_{m \backslash(m)}\) representing the predicted value of person \(m\) obtained from a prediction using the regression estimates from a model estimated without this specific person. The LOOCVE can then be compared across several regression models (with different predictor sets) and the model with the lowest LOOCVE is the one to be preferred.

This sounds tedious, since we would have to compute \(n\) regression models for each of several predictor sets. Fortunately, there is a trick: the deleted residuals \((y_{m}-\hat{y}_{m \backslash(m)})\) can alternatively be computed as \[\left(\frac{y_{m}-\hat{y}_{m}}{1-h_{m}}\right)\] where the \(h_m\) values can be obtained from a regular fitted regression model via the hatvalues() function. The \(h_m\) values are the so-called hat values or leverage values, a measure of the influence a certain observation has on the regression model based on its multivariate outlier characteristics on the independent variables. For more on leverage see the section on regression diagnostics below.

Let’s look at the first few \(h\)s for model_2:

head(hatvalues(model_2))
#>          1          2          3          4          5          6 
#> 0.02356778 0.02210151 0.02165885 0.02163575 0.01454884 0.03668533
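
Before turning this into a general function, the shortcut can be verified for a single observation: refitting model_2 without the first person and predicting that person’s value yields the same deleted residual as the hat-value formula. A small sketch:

# Deleted residual of person 1: brute force (refit without person 1) vs. shortcut
fit_wo1 <- lm(formula(model_2), data = grades[-1, ])
grades$Gesamtnote[1] - predict(fit_wo1, newdata = grades[1, ])
residuals(model_2)[1]/(1 - hatvalues(model_2)[1])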

Using our function programming skills we can now define our own LOOCVE function:

loocve <- function(x) {
    h <- hatvalues(x)                 # leverage values
    mean((residuals(x)/(1 - h))^2)    # mean squared deleted residual
}

Let’s apply this function first to model_2 (all six predictors):

loocve(model_2)
#> [1] 0.3853007

Now let’s drop all predictors that did not reach significance in model_2:

model_3 <- lm(Gesamtnote ~ swk_neueslernen + swk_lernregulation + swk_durchsetzung, data = grades)
summary(model_3)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ swk_neueslernen + swk_lernregulation + 
#>     swk_durchsetzung, data = grades)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.61950 -0.39923  0.07386  0.48857  1.25280 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         2.37820    0.29227   8.137 1.41e-14 ***
#> swk_neueslernen     0.29416    0.05845   5.032 8.77e-07 ***
#> swk_lernregulation  0.21593    0.05604   3.853 0.000145 ***
#> swk_durchsetzung   -0.10272    0.03997  -2.570 0.010702 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6104 on 275 degrees of freedom
#> Multiple R-squared:  0.229,  Adjusted R-squared:  0.2206 
#> F-statistic: 27.23 on 3 and 275 DF,  p-value: 1.881e-15
loocve(model_3)
#> [1] 0.3789464

The LOOCVE is lower for this model. Here, we used the strategy of deleting the non-significant predictors and refitting the model without them.

The LOOCV logic prevents overfitting and will often lead to a final model without the non-significant predictors. This is in line with the fact that the significance tests already failed to establish an effect for these predictors. However, LOOCV will not always lead to this solution, since out-of-sample prediction follows a different logic than null-hypothesis testing. For a full evaluation we would have to compare the LOOCVE across all possible subsets (the power set) of our original predictors.
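
For illustration, such an exhaustive comparison can be scripted with the loocve() helper defined above; a minimal sketch over all \(2^6 - 1 = 63\) non-empty predictor subsets:

# LOOCVE for every non-empty subset of the six predictors
preds <- c("swk_neueslernen", "swk_lernregulation", "swk_motivation",
           "swk_durchsetzung", "swk_sozialkomp", "swk_beziehung")
subsets <- unlist(lapply(seq_along(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)
errors <- sapply(subsets, function(vars) {
    loocve(lm(reformulate(vars, response = "Gesamtnote"), data = grades))
})
subsets[[which.min(errors)]]  # predictor set with the lowest LOOCVE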

For demonstration purposes let’s finally drop swk_durchsetzung as well:

model_4 <- lm(Gesamtnote ~ swk_neueslernen + swk_lernregulation, data = grades)
loocve(model_4)
#> [1] 0.3849899

The LOOCVE went up again; thus swk_durchsetzung should definitely stay in the model. It enhances the prediction of adolescents’ grade average not only from a null-hypothesis testing perspective but also from a cross-validation (out-of-sample prediction) perspective.

10.4 Regression diagnostics

Diagnosing a regression model will be demonstrated using a further example: adolescents’ life satisfaction (leben_gesamt) is predicted by the three social self-efficacy scales swk_durchsetzung, swk_sozialkomp, and swk_beziehung.

lifesat <- adolescents %>%
    dplyr::select(ID, leben_gesamt, swk_durchsetzung, swk_sozialkomp, swk_beziehung) %>% 
    drop_na()

describe(lifesat, fast = TRUE)
#>                  vars   n mean   sd  min  max range   se
#> ID                  1 284  NaN   NA  Inf -Inf  -Inf   NA
#> leben_gesamt        2 284 5.42 0.69 2.91 6.91  4.00 0.04
#> swk_durchsetzung    3 284 5.28 0.99 1.00 7.00  6.00 0.06
#> swk_sozialkomp      4 284 5.28 0.79 1.57 7.00  5.43 0.05
#> swk_beziehung       5 284 5.62 0.75 3.33 7.00  3.67 0.04
lifesat %>% 
    dplyr::select(-ID) %>% 
    cor() %>% 
    round(2)
#>                  leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#> leben_gesamt             1.00             0.36           0.48          0.63
#> swk_durchsetzung         0.36             1.00           0.59          0.33
#> swk_sozialkomp           0.48             0.59           1.00          0.49
#> swk_beziehung            0.63             0.33           0.49          1.00
model_life <- lm(leben_gesamt ~ swk_durchsetzung + swk_sozialkomp + swk_beziehung, data = lifesat)
summary(model_life)
#> 
#> Call:
#> lm(formula = leben_gesamt ~ swk_durchsetzung + swk_sozialkomp + 
#>     swk_beziehung, data = lifesat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.41818 -0.30125  0.00031  0.33167  1.19368 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       1.64634    0.26153   6.295 1.18e-09 ***
#> swk_durchsetzung  0.05897    0.03911   1.508  0.13273    
#> swk_sozialkomp    0.15249    0.05330   2.861  0.00454 ** 
#> swk_beziehung     0.47285    0.04743   9.970  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5224 on 280 degrees of freedom
#> Multiple R-squared:  0.4359, Adjusted R-squared:  0.4299 
#> F-statistic: 72.12 on 3 and 280 DF,  p-value: < 2.2e-16

The overall model is significant with \(R^2 = 0.436\). Self-efficacy with regard to social competence in a peer context (swk_sozialkomp) and self-efficacy with regard to dealing with important relationships (friends, parents, teachers etc.) (swk_beziehung) both have significant positive effects on life satisfaction, with the effect of relationship self-efficacy being the strongest. In contrast, self-efficacy regarding one’s self-assertiveness (swk_durchsetzung) has no significant effect in the context of the other two predictors.

10.4.1 Multicollinearity

The multicollinearity diagnosis is performed by the function vif() (variance inflation factor) from the car package.

car::vif(mod = model_life)
#> swk_durchsetzung   swk_sozialkomp    swk_beziehung 
#>         1.543432         1.820825         1.324467

Conventionally, VIFs > 4 should lead to further investigation, and VIFs > 10 indicate serious multicollinearity in need of correction. Here, none of the predictors comes close to VIF = 4, let alone VIF = 10.

Let’s compute a VIF by hand once: the predictor for which we want the VIF is regressed on the other predictors in the model. For example, for swk_sozialkomp:

sozialkomp <- lm(swk_sozialkomp ~ swk_durchsetzung + swk_beziehung, data = lifesat)
# Tolerance = 1 - R^2 from this model:
TOL <- 1-summary(sozialkomp)$r.squared
# VIF = 1/TOL
1/TOL
#> [1] 1.820825

The \(1-R^2\) of this regression is the so-called tolerance TOL, and its inverse is the VIF. Thus, TOLs of 1/4 and 1/10 correspond to VIFs of 4 and 10, respectively.

Thus, if a specific predictor shares less than 75% or 90% of its variance with all other predictors in the model (i.e., TOL > 0.25/0.10 or VIF < 4/10), it is concluded that multicollinearity is no (serious) issue for this predictor. However, even with unproblematic TOL/VIF values, a predictor may correlate too strongly with just one or two of the other predictors, and such “local” multicollinearity can cause problems as well. It is therefore important to also inspect the bivariate correlation matrix of the predictors (a bivariate correlation of around 0.85 roughly corresponds to the VIF = 4 cut-off).
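
For intuition, in the special case of only two predictors the bivariate correlation translates directly into the VIF (because then \(R^2 = r^2\)); a one-line check with a hypothetical correlation of \(r = 0.866\):

# VIF implied by a bivariate correlation of 0.866 between two predictors
r <- 0.866
1/(1 - r^2)  # approximately 4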

There is a more sophisticated method for identifying which predictor combinations exhibit multicollinearity: the condition index. The predictor correlation matrix is first decomposed into a set of eigenvalues, from which the condition indices are computed. Condition indices > 30 are clearly considered large, but lower cut-offs of 10 or 15 are also mentioned in the literature. The identification of one or more large condition indices is, however, only a first step and not sufficient evidence of multicollinearity: only if two or more predictor coefficients have variance proportions of .90 or above on the respective large condition index do these predictor combinations indicate a multicollinearity problem. TOL/VIF and the condition index statistics are nicely implemented in the function ols_coll_diag() from the package olsrr:

install.packages("olsrr")

library(olsrr)
#> 
#> Attaching package: 'olsrr'
#> The following object is masked from 'package:MASS':
#> 
#>     cement
#> The following object is masked from 'package:datasets':
#> 
#>     rivers
ols_coll_diag(model_life)
#> Tolerance and Variance Inflation Factor
#> ---------------------------------------
#>          Variables Tolerance      VIF
#> 1 swk_durchsetzung 0.6479067 1.543432
#> 2   swk_sozialkomp 0.5492016 1.820825
#> 3    swk_beziehung 0.7550204 1.324467
#> 
#> 
#> Eigenvalue and Condition Index
#> ------------------------------
#>    Eigenvalue Condition Index    intercept swk_durchsetzung swk_sozialkomp
#> 1 3.962004323         1.00000 0.0008872043      0.001366918   0.0007516754
#> 2 0.020240873        13.99081 0.1175815508      0.661172925   0.0042713744
#> 3 0.009602574        20.31251 0.6683465017      0.141070761   0.4853345420
#> 4 0.008152230        22.04547 0.2131847432      0.196389395   0.5096424082
#>   swk_beziehung
#> 1  0.0008403094
#> 2  0.1368536902
#> 3  0.1228769499
#> 4  0.7394290506

There are condition indices > 15 in this model, but for none of them do two or more predictor variables (the intercept does not count) have variance proportions > .90. Thus, also from this perspective, multicollinearity is not an issue for this model.

10.4.2 Leverage values, residuals and influential data points

Outlier statistics on the independent variables: centered leverage values

The uncentered leverage values can be obtained with the hatvalues() function. For the more easily interpretable centered leverage values we subtract \(1/n\) from the uncentered leverage values (leveragec below).

The distribution of the centered leverage values is obtained using the quantile() function. We would like to see the minimum and maximum values \((p = 0\) and \(p = 1)\), the median \((p=0.5)\), as well as quantiles returning the cut-offs for the 5% and 1% largest leverage values \((p = 0.95\) and \(p = 0.99)\).

leverage <- hatvalues(model_life)
leveragec <- leverage - 1/(length(leverage))
quantile(leveragec, p = c(0, 0.5, 0.95, 0.99, 1)) %>% round(3)
#>    0%   50%   95%   99%  100% 
#> 0.000 0.008 0.027 0.053 0.100
ggplot(mapping = aes(x = leveragec)) +
  geom_histogram(fill=I("skyblue")) +
  scale_x_continuous(breaks = seq(0, 0.15, 0.01)) +
  scale_y_continuous(breaks = seq(0, 60, 5)) +
  xlab("Centered Leverage") +
  ylab("Frequency") +
  theme_classic()

Since the recommendations for cut-offs for regression diagnostic statistics are partly inconsistent, we will use the empirical distribution (histogram) to identify persons with extreme values in the current model.

Identification of observations with high leverage values (here we select all values > 0.05):

lifesat %>% 
  filter(leveragec > 0.05)
#> # A tibble: 4 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 21            3.45             1              1.57          4.83
#> 2 168           4.64             2.33           5             4   
#> 3 283           5.18             3              6.14          5.83
#> 4 313           2.91             1.33           2.57          3.83

These are the persons with the IDs 21, 168, 283, and 313.

Outlier statistics on the dependent variable: residuals

Raw residuals are obtained using residuals(model) and studentized residuals via rstandard(model). The most relevant studentized deleted residuals can be obtained with rstudent(model).

We want to identify thresholds of extreme residuals, thus we will look at the 0.5%, 2.5%, 50%, 97.5% and 99.5% quantiles of all residual measures. For reasons of space we will obtain the histogram with the full distribution only for studentized deleted residuals (stud_del).

res <- residuals(model_life)
stud_res <- rstandard(model_life)
stud_del <- rstudent(model_life)

ps <- c(0, 0.005, 0.025, 0.5, 0.975, 0.995, 1)
quantile(res, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -2.418 -1.586 -1.014  0.000  0.951  1.152  1.194
quantile(stud_res, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -4.671 -3.075 -1.959  0.001  1.827  2.224  2.308
quantile(stud_del, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -4.855 -3.124 -1.969  0.001  1.835  2.240  2.326
ggplot(mapping = aes(x = stud_del)) +
  geom_histogram(fill=I("skyblue")) +
  scale_x_continuous(breaks = seq(-5.5, 3, 0.5)) +
  scale_y_continuous(breaks = seq(0, 35, 5)) +
  xlab("Studentized Deleted Residual") +
  ylab("Frequency") +
  theme_classic()

To identify only persons with very large studentized deleted residuals, the rule-of-thumb threshold of absolute values \(\geq 3\) (Eid, Gollwitzer, & Schmitt, 2017) is applied here.

lifesat %>% 
  filter(abs(stud_del) >= 3)
#> # A tibble: 2 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 60            3.27             5.33           3.86          5.17
#> 2 339           3                6.33           4.71          5.67

Applying this criterion two outliers (IDs 60 and 339) are identified.

Influential data points

Influential data points for particular regression coefficients

The DfBeta values correspond to the difference between a certain regression coefficient from a model calculated including a certain person and from a model calculated without this person. They can be obtained using the function dfbeta(model). However, only the standardized DfBetaS values (obtained with dfbetas(model)) can be interpreted independently of the measurement scales of the variables.

The output of dfbeta(model) and dfbetas(model) is a matrix with persons in the rows and the variables/model effects in the columns. To be able to look at the distribution of the coefficients for each predictor (swk_durchsetzung, swk_sozialkomp, swk_beziehung), the matrix must first be converted into a dataframe.

In the following the quantiles for both DfBeta and DfBetaS will be requested, but histograms will be created only for the more relevant DfBetaS:

my_dfbeta <- data.frame(dfbeta(model_life))
my_dfbetas <- data.frame(dfbetas(model_life))

quantile(my_dfbeta$swk_durchsetzung, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.020 -0.010 -0.005  0.000  0.005  0.011  0.015
quantile(my_dfbetas$swk_durchsetzung, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.532 -0.252 -0.130  0.001  0.135  0.288  0.376
ggplot(mapping = aes(x = my_dfbetas$swk_durchsetzung)) +
  geom_histogram(fill=I("#CC79A7")) +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.1)) +
  scale_y_continuous(breaks = seq(0, 150, 10)) +
  xlab("DfBetaS swk_durchsetzung") +
  ylab("Frequency") +
  theme_classic()

quantile(my_dfbeta$swk_sozialkomp, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.010 -0.008 -0.005  0.000  0.006  0.020  0.025
quantile(my_dfbetas$swk_sozialkomp, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.190 -0.152 -0.100 -0.006  0.121  0.388  0.493
ggplot(mapping = aes(x = my_dfbetas$swk_sozialkomp)) +
  geom_histogram(fill=I("#CC79A7")) +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.1)) +
  scale_y_continuous(breaks = seq(0, 100, 5)) +
  xlab("DfBetaS swk_sozialkomp") +
  ylab("Frequency") +
  theme_classic()

quantile(my_dfbeta$swk_beziehung, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.014 -0.010 -0.007  0.000  0.005  0.010  0.014
quantile(my_dfbetas$swk_beziehung, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.303 -0.219 -0.147  0.000  0.111  0.219  0.288
ggplot(mapping = aes(x = my_dfbetas$swk_beziehung)) +
  geom_histogram(fill=I("#CC79A7")) +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.1)) +
  scale_y_continuous(breaks = seq(0, 100, 5)) +
  xlab("DfBetaS swk_beziehung") +
  ylab("Frequency") +
  theme_classic()

To identify only observations with substantial DfBetaS, we only consider those with an absolute value \(\geq 0.2\) (cut-off determined ad hoc through inspection of the histograms):

lifesat %>% 
  filter(abs(my_dfbetas$swk_durchsetzung) >= 0.2)
#> # A tibble: 6 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 60            3.27             5.33           3.86          5.17
#> 2 160           3.91             2.67           4.29          5.83
#> 3 173           3.64             6              4.43          4.17
#> 4 303           6.55             7              5.57          5.17
#> 5 313           2.91             1.33           2.57          3.83
#> 6 339           3                6.33           4.71          5.67

lifesat %>% 
  filter(abs(my_dfbetas$swk_sozialkomp) >= 0.2)
#> # A tibble: 5 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 21            3.45             1              1.57          4.83
#> 2 60            3.27             5.33           3.86          5.17
#> 3 159           3                5              3.71          4   
#> 4 239           4.64             6              4.29          6.17
#> 5 339           3                6.33           4.71          5.67

lifesat %>% 
  filter(abs(my_dfbetas$swk_beziehung) >= 0.2)
#> # A tibble: 6 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 35            5.91             5.67           5.29          4.17
#> 2 146           3.82             6              4.71          3.83
#> 3 159           3                5              3.71          4   
#> 4 172           3.64             5.67           5.14          3.33
#> 5 173           3.64             6              4.43          4.17
#> 6 180           5.36             6              4.86          3.83

Substantial DfBetaS were observed for IDs 60, 160, 173, 303, 313, and 339 (swk_durchsetzung), IDs 21, 60, 159, 239, and 339 (swk_sozialkomp), and IDs 35, 146, 159, 172, 173, and 180 (swk_beziehung).

Subjects with IDs 60, 159, 173, 339 showed conspicuous DfBetaS on two of the three predictors.

Influential data points for the overall model: DfFitS and Cook’s distance

The DfFitS (function dffits()) represent a standardized difference between a person’s predicted value from models estimated with and without that person. Cook’s distance (function cooks.distance()) is a further standardized measure of overall influence: it is obtained by multiplying DfFitS\(^2\) with the residual variance of the model without the specific observation/person and then dividing by the residual variance of the full model \(\times (k+1)\). Cook’s distances therefore cannot be negative (because of the squared DfFitS in the formula).
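
This relation can be checked numerically with base R functions (a sketch: influence()$sigma returns the residual standard deviation of the model without a given observation, sigma() that of the full model):

# Reconstruct Cook's distance from DfFitS; should match cooks.distance()
k    <- length(coef(model_life)) - 1     # number of predictors
s2_m <- influence(model_life)$sigma^2    # residual variance without observation m
d_check <- dffits(model_life)^2 * s2_m/(sigma(model_life)^2 * (k + 1))
all.equal(unname(d_check), unname(cooks.distance(model_life)))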

my_dffits <- dffits(model_life)
quantile(my_dffits, ps) %>% round(3)
#>     0%   0.5%   2.5%    50%  97.5%  99.5%   100% 
#> -0.656 -0.544 -0.311  0.000  0.216  0.314  0.338
ggplot(mapping = aes(x = my_dffits)) +
  geom_histogram(fill=I("#009E73")) +
  scale_x_continuous(breaks = seq(-1, 1, 0.1)) +
  scale_y_continuous(breaks = seq(0, 100, 5)) +
  xlab("DfFitS") +
  ylab("Frequency") +
  theme_classic()

To identify only observations with substantial DfFitS, we consider those with an absolute value \(\geq 0.4\) (cut-off determined ad hoc):

lifesat %>% 
  filter(abs(my_dffits) >= 0.4)
#> # A tibble: 6 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 21            3.45             1              1.57          4.83
#> 2 60            3.27             5.33           3.86          5.17
#> 3 159           3                5              3.71          4   
#> 4 160           3.91             2.67           4.29          5.83
#> 5 313           2.91             1.33           2.57          3.83
#> 6 339           3                6.33           4.71          5.67

Let’s look at Cook’s distance:

my_cooks <- cooks.distance(model_life)
quantile(my_cooks, p = c(0, 0.5, 0.95, 0.99, 1)) %>% round(3)
#>    0%   50%   95%   99%  100% 
#> 0.000 0.001 0.016 0.064 0.100
ggplot(mapping = aes(x = my_cooks)) +
  geom_histogram(fill=I("#009E73")) +
  scale_x_continuous(breaks = seq(0, 1, 0.01)) +
  scale_y_continuous(breaks = seq(0, 150, 10)) +
  xlab("Cook's Distance") +
  ylab("Frequency") +
  theme_classic()

To identify only observations with substantial Cook’s distances, we consider those with an absolute value \(\geq 0.05\) (cut-off determined ad hoc):

lifesat %>% 
  filter(my_cooks >= 0.05)
#> # A tibble: 6 x 5
#>   ID    leben_gesamt swk_durchsetzung swk_sozialkomp swk_beziehung
#>   <fct>        <dbl>            <dbl>          <dbl>         <dbl>
#> 1 21            3.45             1              1.57          4.83
#> 2 60            3.27             5.33           3.86          5.17
#> 3 159           3                5              3.71          4   
#> 4 160           3.91             2.67           4.29          5.83
#> 5 313           2.91             1.33           2.57          3.83
#> 6 339           3                6.33           4.71          5.67

These are the persons with the IDs 21, 60, 159, 160, 313 and 339, i.e. exactly the same as those identified above via DfFitS.

A representation for identifying the most influential data points via Cook’s Distance can be obtained using the plot() function and the selection argument which = 4.

plot(model_life, labels.id = lifesat$ID, which = 4)

The labels.id argument can be used to specify the variable whose values are used to label the most prominent points in the diagram. In addition to the IDs 21, 313, and 339 that are “flagged” here, we have identified the IDs 60, 159, and 160 as conspicuous above. These are the next highest “spikes” in this diagram (values shown in the order of the data set).

Using the selection argument which = 5 we obtain a “Residuals vs Leverage Plot” with the leverage values on the x axis and the studentized residuals on the y axis.

plot(model_life, labels.id = lifesat$ID, which = 5)

A (dashed) line is drawn at the Cook’s distance limit of 0.5 (this limit is used here independently of the data situation). In our case this limit lies within the plot area only for the negative residuals, and no observed value exceeds it (we already know this). For the positive residuals the limit also exists; it is simply not visible here because no positive residual value comes near it. This plot shows that the observations with the largest combination of leverage and (studentized) residual have the greatest influence on the model.

Estimating the model without the identified influential data points

lifesat_red <- lifesat %>% 
  filter(my_cooks < 0.05)

model_life2 <- lm(leben_gesamt ~ swk_durchsetzung + swk_sozialkomp + swk_beziehung, 
             data = lifesat_red)
summary(model_life2)
#> 
#> Call:
#> lm(formula = leben_gesamt ~ swk_durchsetzung + swk_sozialkomp + 
#>     swk_beziehung, data = lifesat_red)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.26709 -0.29278 -0.01286  0.32374  1.19086 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       2.12219    0.25195   8.423 2.09e-15 ***
#> swk_durchsetzung  0.05948    0.03682   1.616    0.107    
#> swk_sozialkomp    0.05710    0.05005   1.141    0.255    
#> swk_beziehung     0.48346    0.04329  11.169  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4712 on 274 degrees of freedom
#> Multiple R-squared:  0.4268, Adjusted R-squared:  0.4206 
#> F-statistic: 68.02 on 3 and 274 DF,  p-value: < 2.2e-16

After exclusion of the most influential data points (Cook’s Distance \(\ge 0.05\): IDs 21, 60, 159, 160, 313, 339) the effect of swk_sozialkomp is no longer significant and is now only about one third of the original effect. The effects of the two other predictors remain almost unchanged.

Inspecting the values of the excluded persons on the IV swk_sozialkomp (sample mean \(\bar x = 5.278\)) as well as on the DV leben_gesamt (sample mean \(\bar x = 5.418\)), it becomes clear that all six persons reported both below-average self-efficacy with regard to their social competence and strongly below-average life satisfaction, reflecting a positive association between the two variables. For the remaining 278 persons there no longer seems to be a significant (unique) predictive contribution of swk_sozialkomp.
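
This pattern can be checked directly by comparing the excluded persons with the rest of the sample (a quick sketch using the my_cooks vector from above):

# Means of swk_sozialkomp and leben_gesamt for excluded vs. retained persons
lifesat %>% 
  mutate(excluded = my_cooks >= 0.05) %>% 
  group_by(excluded) %>% 
  summarise(m_sozialkomp = mean(swk_sozialkomp),
            m_leben      = mean(leben_gesamt))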

10.4.3 Homoscedasticity

The homoscedasticity assumption can be tested by inspecting the distribution of the residuals over the range of the predicted values. If the residuals scatter equally (around 0) at all points of the predicted values, homoscedasticity is given. A graph can be obtained with the plot() function and the selection argument which = 1.

plot(model_life, labels.id = lifesat$ID, which = 1)

This plot shows quite nicely that the distribution of residuals is very similar over the range of fitted values. It can also be used to check the correct specification of the model in terms of the linearity assumption: The residuals should have a mean value of 0 across the whole range of predicted values. This is also the case here (red line).

10.4.4 Normality of residuals

The normality assumption can be checked using a quantile-quantile-plot (QQ-plot), which is obtained using plot() and which = 2:

plot(model_life, labels.id = lifesat$ID, which = 2)

On the x axis the theoretical normal distribution quantiles are plotted, on the y axis the studentized residuals. If the residuals are normally distributed, the empirical residuals should match the corresponding z quantiles. Here, the person with ID = 339 has a studentized residual below -4, although under the expected normal distribution of the residuals a value no more extreme than about -3 would be expected.

Except for a few outliers the residuals seem to be normally distributed.

10.4.5 Diagnostic graphics at a glance

When using the plot() function without the which argument, a panel of four diagnostic graphics is obtained, three of which we already know. The additional one is the third plot (which = 3), which may be used for a more fine-grained analysis of the homoscedasticity assumption. Here, the square root of the absolute standardized residuals is plotted against the fitted values. This allows us to assess the average absolute size of the residuals (red line). If this value is similar across the range of fitted values, the residuals have similar variance and thus homoscedasticity is given.

plot(model_life, labels.id = lifesat$ID)

10.4.6 Regression diagnostics conclusions

Overall, our regression diagnostics have shown that the assumptions of the regression model are largely met. However, there are some data points that affect the normal distribution of the residuals and show partly conspicuous values in the statistics for influential data points. An alternative model has shown that without the six most influential observations, only the variable swk_beziehung was a significant predictor for leben_gesamt, whereas in the original model, swk_sozialkomp also had a significant effect on life satisfaction.

In any case, the exclusion of observations based on regression diagnostics must be very well justified. Only if the conspicuous values are obvious mistakes or if the participants obviously did not answer seriously (e.g., uniform response patterns) can these observations/persons be excluded without further ado.

10.5 Moderated regression with two continuous variables

10.5.1 Self-efficacy as a moderator of the support \(\Rightarrow\) life satisfaction relation

We assume that the perceived emotional support by parents (unt_eltern) has a positive effect on life satisfaction (leben_gesamt). Young people who feel supported should be more satisfied with their lives than those who receive less emotional support from their parents.

Furthermore, we assume that social self-efficacy (swk_sozialkomp) moderates this effect: emotional support by parents could play a more important role for those adolescents’ life satisfaction who are rather low in social self-efficacy. On the other hand, for those with high social self-efficacy support from parents may not be that fundamental for their life satisfaction since they may experience more support from peers.

The statistical moderation hypothesis is therefore: “The higher adolescents’ social self-efficacy, the weaker the effect of parental emotional support on their life satisfaction.” Thus, a negative interaction effect of unt_eltern \(\times\) swk_sozialkomp is expected.

  • For average socially self-effective adolescents (“Mean”), a positive effect of parental support on life satisfaction is expected (since we also expect an overall positive effect).

  • The positive effect should be stronger for young people with below-average social self-efficacy (- 1 SD) and weaker for young people with above-average social self-efficacy (+ 1 SD).

Data:

mod <- adolescents %>%
  dplyr::select(unt_eltern, leben_gesamt, swk_sozialkomp) %>% 
  drop_na()

head(mod)
#> # A tibble: 6 x 3
#>   unt_eltern leben_gesamt swk_sozialkomp
#>        <dbl>        <dbl>          <dbl>
#> 1       6.67         5.91           5.14
#> 2       5.17         4.64           5.57
#> 3       6.83         5.18           5.43
#> 4       7            5.82           6   
#> 5       6.67         5.82           4.8 
#> 6       6.5          5.36           4.71

describe(mod, fast = TRUE)
#>                vars   n mean   sd  min  max range   se
#> unt_eltern        1 284 5.87 1.00 1.50 7.00  5.50 0.06
#> leben_gesamt      2 284 5.42 0.69 2.91 6.91  4.00 0.04
#> swk_sozialkomp    3 284 5.28 0.79 1.57 7.00  5.43 0.05

10.5.2 Estimating the moderated regression model

A moderation analysis is typically represented by a regression equation containing the main effects of the focal predictor (unt_eltern), the moderator (swk_sozialkomp) and the interaction term (product variable):

\(Y = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + b_3 \cdot X_1 \cdot X_2 + E\)

Statistically, the focal predictor and the moderator are exchangeable, but conceptually they should always be explicitly differentiated. We can also write the equation in its conditional-effect form, where the focal predictor (here: \(X_1\)) is factored out:

\(Y = (b_0 + b_2 \cdot X_2) + (b_1 + b_3 \cdot X_2) \cdot X_1 + E\)

Using the lm() formula notation such a model is specified by lm(Y ~ X1*X2, data = mydata) or by lm(Y ~ X1 + X2 + X1:X2, data = mydata).

For our example:

modreg <- lm(leben_gesamt ~ unt_eltern*swk_sozialkomp, data = mod)
summary(modreg)
#> 
#> Call:
#> lm(formula = leben_gesamt ~ unt_eltern * swk_sozialkomp, data = mod)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.66104 -0.33321  0.00913  0.39591  1.09442 
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -0.48129    0.92134  -0.522  0.60182    
#> unt_eltern                 0.74178    0.16280   4.557 7.77e-06 ***
#> swk_sozialkomp             0.77313    0.17776   4.349 1.91e-05 ***
#> unt_eltern:swk_sozialkomp -0.08138    0.03089  -2.635  0.00889 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5195 on 280 degrees of freedom
#> Multiple R-squared:  0.4428, Adjusted R-squared:  0.4369 
#> F-statistic: 74.18 on 3 and 280 DF,  p-value: < 2.2e-16

The interaction is significant and negative, and thus in line with our moderation hypothesis. An increase of the moderator swk_sozialkomp by one unit lowers the effect of unt_eltern on leben_gesamt by 0.081 units (\(b_3 = -0.081\)). Thus, for socially more self-efficacious adolescents the influence of parental support on life satisfaction was significantly weaker than for socially less self-efficacious adolescents.
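
The conditional-effect form shown above also makes it easy to compute the slope of unt_eltern at any moderator value by hand as \(b_1 + b_3 \cdot X_2\); a quick sketch at the mean and +/- 1 SD of swk_sozialkomp (these values anticipate the sim_slopes() output further below):

# Simple slopes of unt_eltern at mean - 1 SD, mean, and mean + 1 SD of the moderator
b <- coef(modreg)
m <- mean(mod$swk_sozialkomp) + c(-1, 0, 1) * sd(mod$swk_sozialkomp)
round(b["unt_eltern"] + b["unt_eltern:swk_sozialkomp"] * m, 2)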

Another (slightly more complicated) possibility of calculation is to first estimate a model with the two predictors as main effects (multiple regression) and then to add the interaction effect in an additional model. A model comparison then can test the significance of the interaction effect. This procedure is actually not needed here since the interaction effect is estimated by a single parameter and thus the model comparison F test is equivalent to the t test of the interaction term.

# Model without interaction
multreg <- lm(leben_gesamt ~ unt_eltern + swk_sozialkomp, data = mod)

# Add interaction term using update()
modreg <- update(multreg, . ~ . + unt_eltern:swk_sozialkomp, data = mod)

# Model comparison
anova(multreg, modreg)
#> Analysis of Variance Table
#> 
#> Model 1: leben_gesamt ~ unt_eltern + swk_sozialkomp
#> Model 2: leben_gesamt ~ unt_eltern + swk_sozialkomp + unt_eltern:swk_sozialkomp
#>   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
#> 1    281 77.453                                
#> 2    280 75.580  1    1.8737 6.9414 0.008891 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10.5.3 Moderated regression with centered predictors

Calculating the moderated regression analysis with centered predictor variables has the advantage that the main effect of the focal predictor variable (unt_eltern_a) corresponds to the conditional effect (simple slope) at the mean of the moderator variable (= at position 0 of the centered moderator variable swk_sozialkomp_a).

Centering the predictors:

mod <- mod %>%
  mutate(unt_eltern_a = unt_eltern - mean(unt_eltern),
         swk_sozialkomp_a = swk_sozialkomp - mean(swk_sozialkomp)) 

# Alternatively using the scale() function
mod <- mod %>%
  mutate(unt_eltern_a = scale(unt_eltern, scale = FALSE),
         swk_sozialkomp_a = scale(swk_sozialkomp, scale = FALSE))

describe(mod, fast = TRUE)
#>                  vars   n mean   sd   min  max range   se
#> unt_eltern          1 284 5.87 1.00  1.50 7.00  5.50 0.06
#> leben_gesamt        2 284 5.42 0.69  2.91 6.91  4.00 0.04
#> swk_sozialkomp      3 284 5.28 0.79  1.57 7.00  5.43 0.05
#> unt_eltern_a        4 284 0.00 1.00 -4.37 1.13  5.50 0.06
#> swk_sozialkomp_a    5 284 0.00 0.79 -3.71 1.72  5.43 0.05

Graphical representation of the centering (only moderator):

mod %>% 
  gather(key = "centering", value = "swk_sozialkomp", 
         swk_sozialkomp, swk_sozialkomp_a) %>% 
  mutate(zentrierung = fct_recode(centering,
                                  uncentered = "swk_sozialkomp",
                                  centered = "swk_sozialkomp_a")) %>% 
  ggplot(mapping = aes(x = zentrierung, y = swk_sozialkomp)) +
  geom_boxplot(aes(fill = "red")) +
  geom_jitter(width = 0.2, alpha = 0.6) +
  geom_hline(yintercept = mean(mod$swk_sozialkomp), linetype = "dashed") +
  geom_hline(yintercept = mean(mod$swk_sozialkomp_a), linetype = "dashed") +
  xlab("") +
  guides(fill = FALSE) +
  theme_classic()

Calculate the moderated regression with the centered variables:

modreg_a <- lm(leben_gesamt ~ unt_eltern_a*swk_sozialkomp_a, data = mod)
summary(modreg_a)
#> 
#> Call:
#> lm(formula = leben_gesamt ~ unt_eltern_a * swk_sozialkomp_a, 
#>     data = mod)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.66104 -0.33321  0.00913  0.39591  1.09442 
#> 
#> Coefficients:
#>                               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                    5.43324    0.03143 172.858  < 2e-16 ***
#> unt_eltern_a                   0.31202    0.03213   9.710  < 2e-16 ***
#> swk_sozialkomp_a               0.29530    0.04151   7.114 9.43e-12 ***
#> unt_eltern_a:swk_sozialkomp_a -0.08138    0.03089  -2.635  0.00889 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5195 on 280 degrees of freedom
#> Multiple R-squared:  0.4428, Adjusted R-squared:  0.4369 
#> F-statistic: 74.18 on 3 and 280 DF,  p-value: < 2.2e-16

The simple slope of the focal predictor at the mean value of the moderator is \(b_{mean} = 0.312\), and this effect is significant (\(p < .001\)).

10.5.4 Simple slopes and plots with interactions

With sim_slopes() the simple slopes of the focal predictor can be tested and plotted at any value of the moderator. By default, simple slopes at the mean as well as +/- 1 SD from the mean of the moderator variable are computed:

sim_slopes(modreg, pred = "unt_eltern", modx = "swk_sozialkomp", 
           johnson_neyman = FALSE)
#> SIMPLE SLOPES ANALYSIS 
#> 
#> Slope of unt_eltern when swk_sozialkomp = 4.50 (- 1 SD): 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.38   0.04     9.87   0.00
#> 
#> Slope of unt_eltern when swk_sozialkomp = 5.28 (Mean): 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.31   0.03     9.71   0.00
#> 
#> Slope of unt_eltern when swk_sozialkomp = 6.07 (+ 1 SD): 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.25   0.04     5.86   0.00

Specific moderator values can be supplied via the modx.values argument. In this way we can, for example, obtain the simple slopes at the minimum and maximum values of the moderator:

sim_slopes(modreg, pred = "unt_eltern", modx = "swk_sozialkomp",
           modx.values = c(1.57143, 7), johnson_neyman = FALSE)
#> SIMPLE SLOPES ANALYSIS 
#> 
#> Slope of unt_eltern when swk_sozialkomp = 1.57: 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.61   0.12     5.31   0.00
#> 
#> Slope of unt_eltern when swk_sozialkomp = 7.00: 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.17   0.06     2.65   0.01

The resulting simple slopes are \(b_{min} = 0.61\) and \(b_{max} = 0.17\), and both are significant.

interact_plot() provides a graphic depiction of the simple slopes at the mean and mean +/- 1 SD:

interact_plot(modreg, pred = "unt_eltern", modx = "swk_sozialkomp",
              x.label = "Support by Parents", 
              y.label = "Life Satisfaction", 
              legend.main = "Social Self-Efficacy",
              colors = "Greens")

Finally, johnson_neyman() can be used to obtain the Regions of Significance (Johnson-Neyman Interval, JNI):

johnson_neyman(modreg, pred = "unt_eltern", modx = "swk_sozialkomp", 
               alpha = 0.05)
#> JOHNSON-NEYMAN INTERVAL 
#> 
#> When swk_sozialkomp is OUTSIDE the interval [7.33, 20.88], the slope of
#> unt_eltern is p < .05.
#> 
#> Note: The range of observed values of swk_sozialkomp is [1.57, 7.00]

For all moderator values occurring in this dataset \([1.57, 7.00]\), the simple slope of unt_eltern was significantly positive at the \(\alpha = 0.05\) level, since all of these values lie OUTSIDE the \(JNI = [7.33, 20.88]\). Since the moderator values given by the JNI cannot occur (the moderator was measured on a scale from 1 to 7), the interval boundaries reported here are practically irrelevant.

10.6 Moderated regression with a categorical moderator

10.6.1 Father’s education as a moderator for the grades \(\Rightarrow\) school satisfaction relation

Is the effect of grade average on satisfaction with school higher for students with highly educated fathers? Using moderation analysis we will examine whether the effect of the variable Gesamtnote on the variable leben_schule differs significantly between adolescents whose fathers have different educational levels (bildung_vater_binaer = “niedrig” (low) versus “hoch” (high)).

Data: For demonstration purposes we will use the variable bildung_vater_binaer both as a factor (default) and as a dummy-coded variable (“niedrig” = 0, “hoch” = 1).

modkat <- adolescents %>% 
    dplyr::select(ID, leben_schule, Gesamtnote, bildung_vater_binaer) %>%
    mutate(bildung_vater_dummy = (as.numeric(bildung_vater_binaer) - 1)) %>% 
    drop_na()

head(modkat)
#> # A tibble: 6 x 5
#>   ID    leben_schule Gesamtnote bildung_vater_binaer bildung_vater_dummy
#>   <fct>        <dbl>      <dbl> <fct>                              <dbl>
#> 1 2             4.67          4 hoch                                   1
#> 2 10            3.33          5 niedrig                                0
#> 3 11            4             4 niedrig                                0
#> 4 12            5.33          5 hoch                                   1
#> 5 14            5             4 niedrig                                0
#> 6 15            4             4 niedrig                                0

describe(modkat, fast = TRUE)
#>                      vars   n mean   sd  min  max range   se
#> ID                      1 263  NaN   NA  Inf -Inf  -Inf   NA
#> leben_schule            2 263 4.64 1.06 1.33    7  5.67 0.07
#> Gesamtnote              3 263 4.45 0.70 3.00    6  3.00 0.04
#> bildung_vater_binaer    4 263  NaN   NA  Inf -Inf  -Inf   NA
#> bildung_vater_dummy     5 263 0.49 0.50 0.00    1  1.00 0.03
table(modkat$bildung_vater_binaer)
#> 
#> niedrig    hoch 
#>     133     130

10.6.2 Estimating the moderated regression model

# Using the dummy-variable bildung_vater_dummy
mod_dummy <- lm(leben_schule ~ Gesamtnote*bildung_vater_dummy, 
               data = modkat)
summary(mod_dummy)
#> 
#> Call:
#> lm(formula = leben_schule ~ Gesamtnote * bildung_vater_dummy, 
#>     data = modkat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.4072 -0.5018  0.0857  0.5928  1.8254 
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      2.7012     0.5193   5.202 4.01e-07 ***
#> Gesamtnote                       0.4079     0.1200   3.399 0.000782 ***
#> bildung_vater_dummy             -2.1612     0.7673  -2.816 0.005230 ** 
#> Gesamtnote:bildung_vater_dummy   0.5191     0.1701   3.051 0.002520 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9284 on 259 degrees of freedom
#> Multiple R-squared:  0.241,  Adjusted R-squared:  0.2323 
#> F-statistic: 27.42 on 3 and 259 DF,  p-value: 1.966e-15

# Using the factor bildung_vater_binaer
mod_factor <- lm(leben_schule ~ Gesamtnote*bildung_vater_binaer, 
               data = modkat)
summary(mod_factor)
#> 
#> Call:
#> lm(formula = leben_schule ~ Gesamtnote * bildung_vater_binaer, 
#>     data = modkat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.4072 -0.5018  0.0857  0.5928  1.8254 
#> 
#> Coefficients:
#>                                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                           2.7012     0.5193   5.202 4.01e-07 ***
#> Gesamtnote                            0.4079     0.1200   3.399 0.000782 ***
#> bildung_vater_binaerhoch             -2.1612     0.7673  -2.816 0.005230 ** 
#> Gesamtnote:bildung_vater_binaerhoch   0.5191     0.1701   3.051 0.002520 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9284 on 259 degrees of freedom
#> Multiple R-squared:  0.241,  Adjusted R-squared:  0.2323 
#> F-statistic: 27.42 on 3 and 259 DF,  p-value: 1.966e-15

The first analysis shows that the moderation is significant (\(b_3 = 0.519, p = 0.003\)) and in the expected direction. In this model, the effect of Gesamtnote is the conditional effect (simple slope) for the low education group (\(b_1 = b_{(x_2=0/low)} = 0.408, p = 0.001\)).

The results of the second analysis, with the moderator entered as a factor, do not differ from those of the first analysis. The only difference lies in the names of the effects involving the variable bildung_vater_binaer: the suffix “hoch” has been appended to indicate that this is the effect of the category ‘high’ compared to the reference category ‘low’. In other words, R dummy-codes the factor internally for the model.
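
As a quick (optional) check, we can inspect the contrasts attribute of the factor; it should show the same 0/1 coding that we created manually in bildung_vater_dummy:

# Internal dummy coding of the factor: 'niedrig' (first level) is the
# reference category (0), 'hoch' is coded 1
contrasts(modkat$bildung_vater_binaer)

# The corresponding model matrix column should equal the manual dummy variable
all(model.matrix(mod_factor)[, "bildung_vater_binaerhoch"] == modkat$bildung_vater_dummy)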

10.6.3 Simple slopes for low and high education of the father

We have already tested the simple slope of Gesamtnote for the reference category “niedrig” of the moderator directly in the moderated regression (see above). To test the simple slope for the non-reference category “hoch” in the same way, we have to define this category as the reference category of the bildung_vater_binaer factor (or reverse the ‘0-1’ coding of the dummy variable):

modkat <- modkat %>% 
    mutate(bildung_vater_binaer_rel = relevel(bildung_vater_binaer, ref = "hoch"))

mod_factor_rel <- lm(leben_schule ~ Gesamtnote*bildung_vater_binaer_rel, 
               data = modkat)
summary(mod_factor_rel)
#> 
#> Call:
#> lm(formula = leben_schule ~ Gesamtnote * bildung_vater_binaer_rel, 
#>     data = modkat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.4072 -0.5018  0.0857  0.5928  1.8254 
#> 
#> Coefficients:
#>                                            Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                                  0.5400     0.5650   0.956  0.34006
#> Gesamtnote                                   0.9269     0.1206   7.684 3.19e-13
#> bildung_vater_binaer_relniedrig              2.1612     0.7673   2.816  0.00523
#> Gesamtnote:bildung_vater_binaer_relniedrig  -0.5191     0.1701  -3.051  0.00252
#>                                               
#> (Intercept)                                   
#> Gesamtnote                                 ***
#> bildung_vater_binaer_relniedrig            ** 
#> Gesamtnote:bildung_vater_binaer_relniedrig ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9284 on 259 degrees of freedom
#> Multiple R-squared:  0.241,  Adjusted R-squared:  0.2323 
#> F-statistic: 27.42 on 3 and 259 DF,  p-value: 1.966e-15

The simple slope of the focal predictor variable Gesamtnote for the “high education” group is \(b_1 = b_{(x_2=1/high)} = 0.927, p < 0.001\).

Alternatively, we can use sim_slopes() from the interactions package:

sim_slopes(mod_factor, pred = "Gesamtnote", modx = "bildung_vater_binaer",  
           johnson_neyman = FALSE)
#> SIMPLE SLOPES ANALYSIS 
#> 
#> Slope of Gesamtnote when bildung_vater_binaer = hoch: 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.93   0.12     7.68   0.00
#> 
#> Slope of Gesamtnote when bildung_vater_binaer = niedrig: 
#> 
#>   Est.   S.E.   t val.      p
#> ------ ------ -------- ------
#>   0.41   0.12     3.40   0.00

When the moderator variable is a factor, interact_plot() recognizes this and automatically draws the simple slopes for the different factor levels:

interact_plot(mod_factor, pred = "Gesamtnote", modx = "bildung_vater_binaer",
              x.label = "Grade Average", 
              y.label = "Satisfaction with School",
              legend.main = "Education of the Father",
              colors = "Greens")

10.7 Analysis of non-linear effects using linear regression

Non-linear effects of predictors can be analyzed in the linear regression framework by adding polynomial (quadratic, cubic etc.) terms to the regression equation:

\(Y = b_0 + b_1 \cdot X + b_2 \cdot X^2 + b_3 \cdot X^3 + E\)

10.7.1 Quadratic effect of academic self-efficacy on stress symptoms

Our hypothesis is that both a very low and a very high academic self-efficacy may be associated with increased stress symptoms, while a medium level is associated with fewer stress symptoms. We therefore assume a U-shaped relationship reflecting a positive quadratic effect.

The assumption is based on the everyday observation that people with very high achievement motivation are often under particular (self-imposed) pressure, which can lead to increased psychosomatic stress. Furthermore, it can be assumed that very low self-efficacy will also be associated with increased stress symptoms.

Data:

nonlin <- adolescents %>% 
  dplyr::select(ID, stress_somatisch, swk_neueslernen) %>%
  drop_na()

head(nonlin)
#> # A tibble: 6 x 3
#>   ID    stress_somatisch swk_neueslernen
#>   <fct>            <dbl>           <dbl>
#> 1 1                 3               4.57
#> 2 2                 3.83            5   
#> 3 10                3.33            5   
#> 4 11                3.5             4.57
#> 5 12                1.33            5.33
#> 6 14                1.5             4.86

describe(nonlin, fast = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
#> Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
#>                  vars   n mean   sd  min  max range   se
#> ID                  1 283  NaN   NA  Inf -Inf  -Inf   NA
#> stress_somatisch    2 283 3.02 1.19 1.00    7  6.00 0.07
#> swk_neueslernen     3 283 5.09 0.77 2.43    7  4.57 0.05

10.7.2 Estimating the regression for a quadratic effect

To estimate the quadratic effect (DV: stress_somatisch), we first have to compute the squared IV swk_neueslernen_2:

nonlin <- nonlin %>% 
  mutate(swk_neueslernen_2 = swk_neueslernen^2)

head(nonlin)
#> # A tibble: 6 x 4
#>   ID    stress_somatisch swk_neueslernen swk_neueslernen_2
#>   <fct>            <dbl>           <dbl>             <dbl>
#> 1 1                 3               4.57              20.9
#> 2 2                 3.83            5                 25  
#> 3 10                3.33            5                 25  
#> 4 11                3.5             4.57              20.9
#> 5 12                1.33            5.33              28.4
#> 6 14                1.5             4.86              23.6

Let’s calculate a hierarchical regression model with only the predictor swk_neueslernen in the first model and the additional predictor swk_neueslernen_2 in the second model. This means that the first model contains only the linear term, while the second model contains both the linear and the quadratic term.

Are the two models significant? How do the \(R^2\) values of the two models differ?

Is the quadratic effect in the second model significant, and what is its direction? Additionally, let’s perform a model comparison with anova().

linreg <- lm(stress_somatisch ~ swk_neueslernen, data = nonlin)
summary(linreg)
#> 
#> Call:
#> lm(formula = stress_somatisch ~ swk_neueslernen, data = nonlin)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.1164 -0.8121 -0.0874  0.6879  4.1734 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      3.53663    0.47590   7.431  1.3e-12 ***
#> swk_neueslernen -0.10144    0.09245  -1.097    0.273    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.191 on 281 degrees of freedom
#> Multiple R-squared:  0.004266,   Adjusted R-squared:  0.0007228 
#> F-statistic: 1.204 on 1 and 281 DF,  p-value: 0.2735

nonlinreg <- update(linreg, . ~ . + swk_neueslernen_2)
summary(nonlinreg)
#> 
#> Call:
#> lm(formula = stress_somatisch ~ swk_neueslernen + swk_neueslernen_2, 
#>     data = nonlin)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.5549 -0.8220 -0.0865  0.6608  4.1156 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        8.99909    2.11824   4.248 2.93e-05 ***
#> swk_neueslernen   -2.32568    0.84589  -2.749  0.00636 ** 
#> swk_neueslernen_2  0.22113    0.08361   2.645  0.00863 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.179 on 280 degrees of freedom
#> Multiple R-squared:  0.02854,    Adjusted R-squared:  0.0216 
#> F-statistic: 4.113 on 2 and 280 DF,  p-value: 0.01736

anova(linreg, nonlinreg)
#> Analysis of Variance Table
#> 
#> Model 1: stress_somatisch ~ swk_neueslernen
#> Model 2: stress_somatisch ~ swk_neueslernen + swk_neueslernen_2
#>   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
#> 1    281 398.83                                
#> 2    280 389.11  1    9.7219 6.9958 0.008631 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The linear trend in model 1 (linreg) is not significant with \(b_1= -0.101, p=0.273\). For model 2 (nonlinreg) both the overall model test is significant with \(R^2 = 0.029, F_{(2, 280)}=4.113, p=0.017\), as well as the test for the quadratic term (variable: swk_neueslernen_2) with \(b_2=0.221, p=0.009\).

The positive quadratic term indicates a U-shaped relationship, as we assumed. There is also a simpler way to perform an analysis of non-linear relationships (polynomial regression) in R: the quadratic term can be entered directly into the formula, without explicitly computing a squared predictor variable first:

nonlinreg1 <- lm(stress_somatisch ~ swk_neueslernen + I(swk_neueslernen^2), 
                 data = nonlin)
summary(nonlinreg1)
#> 
#> Call:
#> lm(formula = stress_somatisch ~ swk_neueslernen + I(swk_neueslernen^2), 
#>     data = nonlin)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.5549 -0.8220 -0.0865  0.6608  4.1156 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           8.99909    2.11824   4.248 2.93e-05 ***
#> swk_neueslernen      -2.32568    0.84589  -2.749  0.00636 ** 
#> I(swk_neueslernen^2)  0.22113    0.08361   2.645  0.00863 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.179 on 280 degrees of freedom
#> Multiple R-squared:  0.02854,    Adjusted R-squared:  0.0216 
#> F-statistic: 4.113 on 2 and 280 DF,  p-value: 0.01736
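
Another option worth knowing (a sketch, not used in the remainder of this chapter) is poly(). Note that poly() creates orthogonal polynomial terms by default; to reproduce the raw coefficients shown above, raw = TRUE has to be set:

# Quadratic model via poly(); with raw = TRUE the coefficients equal those of
# swk_neueslernen + I(swk_neueslernen^2), while the default (orthogonal
# polynomials) changes the coefficients but not the fitted values or R^2
nonlinreg_poly <- lm(stress_somatisch ~ poly(swk_neueslernen, 2, raw = TRUE),
                     data = nonlin)
summary(nonlinreg_poly)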

10.7.3 Plotting the quadratic regression using ggplot2

p <- nonlin %>%
  ggplot(aes(x = swk_neueslernen, y = stress_somatisch))

p + geom_jitter(width = 0.1, alpha = 0.5) +
  geom_line(aes(y = predict(nonlinreg)), size = 1, colour = "blue") +
  theme_bw()
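
Alternatively, ggplot2 can fit and draw the quadratic curve itself with geom_smooth(); a minimal sketch that should produce essentially the same plot (the curve is refitted internally by geom_smooth rather than taken from nonlinreg):

p + geom_jitter(width = 0.1, alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE, colour = "blue") +
  theme_bw()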

In-depth interpretation of the quadratic effect: With each one-unit increase in the predictor swk_neueslernen, the slope of the regression curve (the predictor’s own effect) increases by \(2 \cdot b_2 = 2 \cdot 0.22113 = 0.44226\). The quadratic coefficient thus quantifies the extent of non-linearity between the IV and the DV. At the (fictitious) value 0 of swk_neueslernen, the linear slope of the U-shaped curve (in the sense of the first derivative) is \(b_1 = -2.326, p = 0.006\), i.e. strongly negative.

The actual scale of swk_neueslernen starts at a value of 1, where the linear slope of the curve is \(b_1 + 2 \cdot b_2 \cdot 1 = -2.32568 + 2 \cdot 0.22113 \cdot 1 = -1.88342\).

At the mean of swk_neueslernen (\(M = 5.09054\)) the linear slope is \(b_1 + 2 \cdot b_2 \cdot \bar{x} = -2.32568 + 2 \cdot 0.22113 \cdot 5.09054 = -0.0743\). To verify this, we estimate the model with a centered predictor variable swk_neueslernen_a.

In this model the linear effect is \(b_1=-0.0743\):

nonlin <- nonlin %>% 
  mutate(swk_neueslernen_a = scale(swk_neueslernen, scale = FALSE))

nonlinreg2 <- lm(stress_somatisch ~ swk_neueslernen_a + I(swk_neueslernen_a^2), 
                 data = nonlin)

summary(nonlinreg2)
#> 
#> Call:
#> lm(formula = stress_somatisch ~ swk_neueslernen_a + I(swk_neueslernen_a^2), 
#>     data = nonlin)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.5549 -0.8220 -0.0865  0.6608  4.1156 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             2.89049    0.08554  33.790  < 2e-16 ***
#> swk_neueslernen_a      -0.07430    0.09205  -0.807  0.42024    
#> I(swk_neueslernen_a^2)  0.22113    0.08361   2.645  0.00863 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.179 on 280 degrees of freedom
#> Multiple R-squared:  0.02854,    Adjusted R-squared:  0.0216 
#> F-statistic: 4.113 on 2 and 280 DF,  p-value: 0.01736
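
These hand calculations can also be reproduced directly from the coefficients of the uncentered model nonlinreg1; the small helper slope_at() (a hypothetical name, not used elsewhere) evaluates the first derivative \(b_1 + 2 \cdot b_2 \cdot x\):

# Slope of the fitted quadratic curve at arbitrary predictor values
b <- coef(nonlinreg1)    # b[1] = intercept, b[2] = linear, b[3] = quadratic term
slope_at <- function(x) b[2] + 2 * b[3] * x

slope_at(c(0, 1, mean(nonlin$swk_neueslernen)))
# should give approximately -2.326, -1.883 and -0.074 (see above)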

10.8 Regression with categorical predictors using dummy and effect coding

10.8.1 Mother’s education affects adolescents’ grade average

Educational research shows that educational success is partly determined by the educational level of the family of origin. Therefore, we want to investigate the influence of the educational level of the mothers on adolescents’ grade average.

The hypothesis is that adolescents whose mothers have a higher educational level have better grades than adolescents whose mothers have a lower educational level. We treat the mother’s educational level (bildung_mutter), measured in four categories, as a categorical variable (factor), i.e. we use only the nominal information of this ordinal variable in the analysis.

We did a very similar analysis in the chapter on one-way analysis of variance, where we analyzed grade average mean differences between different levels of father’s education.

Data:

edu <- adolescents %>%
    dplyr::select(ID, Gesamtnote, bildung_mutter) %>% 
    drop_na()

head(edu) 
#> # A tibble: 6 x 3
#>   ID    Gesamtnote bildung_mutter
#>   <fct>      <dbl> <fct>         
#> 1 2              4 Realschule    
#> 2 10             5 Hauptschule   
#> 3 11             4 Hochschule    
#> 4 12             5 Realschule    
#> 5 14             4 Realschule    
#> 6 15             4 Realschule
table(edu$bildung_mutter)
#> 
#> Hauptschule  Realschule      Abitur  Hochschule 
#>          42          95          49          81

Let’s first compute the mean values of grade average per category:

edu %>% 
  group_by(bildung_mutter) %>%
  summarise(mean_Gesamtnote = mean(Gesamtnote))
#> # A tibble: 4 x 2
#>   bildung_mutter mean_Gesamtnote
#>   <fct>                    <dbl>
#> 1 Hauptschule               4.10
#> 2 Realschule                4.37
#> 3 Abitur                    4.49
#> 4 Hochschule                4.72

10.8.2 Regression with dummy coded categorical predictors

There are several ways to dummy code a categorical variable in R. The first (though not the simplest) possibility we look at is using dummy.code() from the psych package:

dummies <- psych::dummy.code(edu$bildung_mutter) # creates dummy variables
edu <- data.frame(edu, dummies) # adds dummy variables to the data frame

head(edu)
#>   ID Gesamtnote bildung_mutter Realschule Hochschule Abitur Hauptschule
#> 1  2          4     Realschule          1          0      0           0
#> 2 10          5    Hauptschule          0          0      0           1
#> 3 11          4     Hochschule          0          1      0           0
#> 4 12          5     Realschule          1          0      0           0
#> 5 14          4     Realschule          1          0      0           0
#> 6 15          4     Realschule          1          0      0           0

edu <- edu %>% 
    dplyr::select(-Hauptschule) # delete dummy variable 'Hauptschule' (reference category)

head(edu)
#>   ID Gesamtnote bildung_mutter Realschule Hochschule Abitur
#> 1  2          4     Realschule          1          0      0
#> 2 10          5    Hauptschule          0          0      0
#> 3 11          4     Hochschule          0          1      0
#> 4 12          5     Realschule          1          0      0
#> 5 14          4     Realschule          1          0      0
#> 6 15          4     Realschule          1          0      0

dummy.code() creates a dummy variable for each of the \(c=4\) categories. Dummy variables are variables in which a certain category receives the code 1, while all other categories receive the code 0.

However, only \(c-1=4-1=3\) dummy variables are required, so we do not need a separate dummy variable for the reference category Hauptschule. One could nevertheless leave the Hauptschule dummy in the data set and simply not use it. Furthermore, it can be useful to keep all created dummy variables in case we want to use an alternative reference category in a subsequent analysis.

Now let’s do the analysis:

model_dummy1 <- lm(Gesamtnote ~ Realschule + Abitur + Hochschule, data = edu)
summary(model_dummy1)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ Realschule + Abitur + Hochschule, data = edu)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.72222 -0.37158  0.00816  0.50816  1.62842 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   4.0952     0.1022  40.089  < 2e-16 ***
#> Realschule    0.2763     0.1227   2.253  0.02511 *  
#> Abitur        0.3966     0.1392   2.849  0.00473 ** 
#> Hochschule    0.6270     0.1259   4.981 1.15e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.662 on 263 degrees of freedom
#> Multiple R-squared:  0.09391,    Adjusted R-squared:  0.08358 
#> F-statistic: 9.086 on 3 and 263 DF,  p-value: 9.605e-06

The overall model is significant with \(F(3, 263)=9.086, p < 0.001\) and \(R^2= 0.094\). The dummy variable coefficients show that the mean values of all three categories (Realschule, Abitur, and Hochschule) differ significantly from the mean value of the reference category Hauptschule (Intercept). All three regression coefficients are also positive, i.e. they indicate a higher Gesamtnote in the respective category compared to the reference category.

Alternatively, and much more simply, this model can be computed like this:

model_dummy2 <- lm(Gesamtnote ~ bildung_mutter, data = edu)
summary(model_dummy2)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ bildung_mutter, data = edu)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.72222 -0.37158  0.00816  0.50816  1.62842 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                4.0952     0.1022  40.089  < 2e-16 ***
#> bildung_mutterRealschule   0.2763     0.1227   2.253  0.02511 *  
#> bildung_mutterAbitur       0.3966     0.1392   2.849  0.00473 ** 
#> bildung_mutterHochschule   0.6270     0.1259   4.981 1.15e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.662 on 263 degrees of freedom
#> Multiple R-squared:  0.09391,    Adjusted R-squared:  0.08358 
#> F-statistic: 9.086 on 3 and 263 DF,  p-value: 9.605e-06

The results do not differ at all! In both cases, lm() estimates a model in which the DV Gesamtnote is predicted by the dummy-coded factor bildung_mutter. The factor is dummy-coded internally, with the first factor level “Hauptschule” automatically taken as the reference category. This internal dummy coding is stored in the contrasts attribute of the factor bildung_mutter; below, we will see how to change it.

contrasts(edu$bildung_mutter)
#>             Realschule Abitur Hochschule
#> Hauptschule          0      0          0
#> Realschule           1      0          0
#> Abitur               0      1          0
#> Hochschule           0      0          1

The function model.matrix() returns the internal dummy coding of all observations:

head(model.matrix(~ bildung_mutter, data = edu))
#>   (Intercept) bildung_mutterRealschule bildung_mutterAbitur
#> 1           1                        1                    0
#> 2           1                        0                    0
#> 3           1                        0                    0
#> 4           1                        1                    0
#> 5           1                        1                    0
#> 6           1                        1                    0
#>   bildung_mutterHochschule
#> 1                        0
#> 2                        0
#> 3                        1
#> 4                        0
#> 5                        0
#> 6                        0
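
If a different reference category is needed later (as mentioned above when keeping all created dummy variables), the most convenient route with the factor-based approach is relevel(), which we already used in the moderation example. A brief sketch (the object name model_dummy3 is arbitrary) with Hochschule as the new reference category:

# Dummy coding with 'Hochschule' as the reference category
edu <- edu %>% 
    mutate(bildung_mutter_rel = relevel(bildung_mutter, ref = "Hochschule"))

model_dummy3 <- lm(Gesamtnote ~ bildung_mutter_rel, data = edu)
summary(model_dummy3)
# The coefficients now compare the other three categories with 'Hochschule';
# the overall model fit (F, R^2) stays the same.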

10.8.3 Regression with effect coded categorical predictors

(Un)fortunately, there is no function in the psych package for creating effect coded variables. But as we just saw, we do not have to actually create coding variables in order to use them in the analysis. It is sufficient if the factor we use as predictor internally represents a specific coding or parametrization. As seen above, the default setting for the coding of a factor is dummy coding (contr.treatment()), with the first factor level used as the reference category.

For effect coding, the function contr.sum() can be used to define the contrasts attribute of bildung_mutter. Unfortunately, the base R version of this function is not very user-friendly, so we use the version of contr.sum() provided by the memisc package instead.
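
For comparison, here is what the base R function produces (a quick sketch, not part of the analysis): the coding columns are unlabeled and the omitted (reference) category is always the last factor level, which here would be Hochschule.

# Base R sum-to-zero (effect) coding for a factor with 4 levels:
# columns are only numbered and the last level always serves as the
# omitted (reference) category
stats::contr.sum(4)

The memisc version, in contrast, keeps the level names and lets us choose the reference category by name via its base argument: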

# Create a copy of bildung_mutter (bildung_mutter_e)
edu <- edu %>% mutate(bildung_mutter_e = bildung_mutter) 

# Change contrast coding of bildung_mutter_e to effect coding  
contrasts(edu$bildung_mutter_e) <- memisc::contr.sum(levels(edu$bildung_mutter_e), base = "Hauptschule")

# Check
contrasts(edu$bildung_mutter_e)
#>             Realschule Abitur Hochschule
#> Hauptschule         -1     -1         -1
#> Realschule           1      0          0
#> Abitur               0      1          0
#> Hochschule           0      0          1
head(model.matrix(~ bildung_mutter_e, data = edu))
#>   (Intercept) bildung_mutter_eRealschule bildung_mutter_eAbitur
#> 1           1                          1                      0
#> 2           1                         -1                     -1
#> 3           1                          0                      0
#> 4           1                          1                      0
#> 5           1                          1                      0
#> 6           1                          1                      0
#>   bildung_mutter_eHochschule
#> 1                          0
#> 2                         -1
#> 3                          1
#> 4                          0
#> 5                          0
#> 6                          0

Now let’s do the analysis:

model_effect <- lm(Gesamtnote ~ bildung_mutter_e, data = edu)
summary(model_effect)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ bildung_mutter_e, data = edu)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.72222 -0.37158  0.00816  0.50816  1.62842 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                 4.42022    0.04287 103.109  < 2e-16 ***
#> bildung_mutter_eRealschule -0.04864    0.06438  -0.756    0.451    
#> bildung_mutter_eAbitur      0.07162    0.07944   0.902    0.368    
#> bildung_mutter_eHochschule  0.30200    0.06740   4.481 1.11e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.662 on 263 degrees of freedom
#> Multiple R-squared:  0.09391,    Adjusted R-squared:  0.08358 
#> F-statistic: 9.086 on 3 and 263 DF,  p-value: 9.605e-06

The results for the overall model (\(F\) and \(R^2\)) are the same as in the dummy coded analysis. The regression coefficients now represent the comparison of the corresponding category with the (unweighted) mean of the category means (intercept \(b_0 = 4.42\)). The means of the categories Realschule and Abitur do not differ significantly from this unweighted mean of the category means (\(p = 0.451\) and \(p = 0.368\)). In contrast, the mean of Gesamtnote in the category Hochschule is significantly higher than this unweighted average (\(b_3 = 0.302\), \(t = 4.481\), \(p < 0.001\)).

Calculating the unweighted mean of the category means (for comparison with \(b_0\)):

edu %>% 
    group_by(bildung_mutter) %>%
    summarise(mean_Gesamtnote = mean(Gesamtnote)) %>% 
    ungroup() %>% 
    summarise(MM_Gesamtnote = mean(mean_Gesamtnote))
#> # A tibble: 1 x 1
#>   MM_Gesamtnote
#>           <dbl>
#> 1          4.42

10.8.4 Regression with weighted effect coding

For weighted effect coding there is a separate package called wec (weighted effects coding). The function for re-defining the contrasts is contr.wec(); its omitted argument defines the reference category.

# Create a copy of bildung_mutter (bildung_mutter_we)
edu <- edu %>% mutate(bildung_mutter_we = bildung_mutter) 

# Change contrast coding of bildung_mutter_we to weighted effect coding  
contrasts(edu$bildung_mutter_we) <- wec::contr.wec(edu$bildung_mutter_we, omitted = "Hauptschule")

# Check
contrasts(edu$bildung_mutter_we)
#>             Realschule    Abitur Hochschule
#> Hauptschule  -2.261905 -1.166667  -1.928571
#> Realschule    1.000000  0.000000   0.000000
#> Abitur        0.000000  1.000000   0.000000
#> Hochschule    0.000000  0.000000   1.000000
head(model.matrix(~ bildung_mutter_we, data = edu))
#>   (Intercept) bildung_mutter_weRealschule bildung_mutter_weAbitur
#> 1           1                    1.000000                0.000000
#> 2           1                   -2.261905               -1.166667
#> 3           1                    0.000000                0.000000
#> 4           1                    1.000000                0.000000
#> 5           1                    1.000000                0.000000
#> 6           1                    1.000000                0.000000
#>   bildung_mutter_weHochschule
#> 1                    0.000000
#> 2                   -1.928571
#> 3                    1.000000
#> 4                    0.000000
#> 5                    0.000000
#> 6                    0.000000
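
The values in the contrast matrix above can be reproduced by hand: in weighted effect coding, the row of the omitted category contains minus the ratio of each category’s group size to the group size of the omitted category. A quick check against the category frequencies (a sketch, not part of the original analysis):

# Reproduce the 'Hauptschule' row of the weighted-effect contrasts:
# -n_category / n_omitted
n <- table(edu$bildung_mutter)
-n[c("Realschule", "Abitur", "Hochschule")] / n["Hauptschule"]
# should give approximately -2.262, -1.167 and -1.929 (compare the row above)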

Now let’s do the analysis:

model_weffect <- lm(Gesamtnote ~ bildung_mutter_we, data = edu)
summary(model_weffect)
#> 
#> Call:
#> lm(formula = Gesamtnote ~ bildung_mutter_we, data = edu)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.72222 -0.37158  0.00816  0.50816  1.62842 
#> 
#> Coefficients:
#>                             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                  4.45655    0.04052 109.996  < 2e-16 ***
#> bildung_mutter_weRealschule -0.08498    0.05452  -1.559     0.12    
#> bildung_mutter_weAbitur      0.03528    0.08546   0.413     0.68    
#> bildung_mutter_weHochschule  0.26567    0.06140   4.327 2.15e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.662 on 263 degrees of freedom
#> Multiple R-squared:  0.09391,    Adjusted R-squared:  0.08358 
#> F-statistic: 9.086 on 3 and 263 DF,  p-value: 9.605e-06

The results are very similar to those of the unweighted effect coding analysis: Again, only the mean of the category Hochschule differs significantly from the overall mean (the weighted mean of the category means), with \(b_3 = 0.266\), \(t = 4.327\), \(p < 0.001\).

For comparison with \(b_0\) let’s calculate the overall mean (same as the weighted mean of the category means):

mean(edu$Gesamtnote)
#> [1] 4.456554

10.9 Analysis of Covariance

10.9.1 “Stereotype threat”?

Experimental studies have shown that stereotypes about one’s own group (e.g. one’s gender) can affect test performance, for example the stereotype that girls may be weaker in mathematics than boys. This phenomenon is known as stereotype threat.

We now want to use the math grades of the adolescents to investigate whether a possible gender difference in the math grade might be due to gender differences in the self-efficacy regarding the ability to learn math. We further assume that these differences in self-efficacy may be due to stereotypes leading girls to believe that they are less able to learn math.

Our concrete assumption is that a) boys exhibit (significantly) better math grades than girls, and b) that this difference disappears (is no longer significant) when we control for the variable “math-related self-efficacy”, i.e. include this variable in the model as a covariate. So we assume that if we calculate the gender difference as if girls and boys had the same math-related self-efficacy, then boys and girls will no longer differ with regard to their math grades.

Data:

stereo <- adolescents %>%
  dplyr::select(ID, Mathenote, geschlecht, swk1) %>%
  drop_na() %>%
  mutate(Mathenote = as.numeric(Mathenote),
         swk1_a = scale(swk1, scale = FALSE))

head(stereo)
#> # A tibble: 6 x 5
#>   ID    Mathenote geschlecht  swk1 swk1_a[,1]
#>   <fct>     <dbl> <fct>      <dbl>      <dbl>
#> 1 1             3 weiblich       2    -3.04  
#> 2 2             4 männlich       4    -1.04  
#> 3 10            5 weiblich       5    -0.0423
#> 4 11            2 weiblich       3    -2.04  
#> 5 12            4 weiblich       5    -0.0423
#> 6 14            4 männlich       3    -2.04

summary(stereo)
#>        ID        Mathenote        geschlecht       swk1      
#>  1      :  1   Min.   :2.000   männlich:151   Min.   :1.000  
#>  2      :  1   1st Qu.:4.000   weiblich:133   1st Qu.:4.000  
#>  10     :  1   Median :4.000                  Median :5.000  
#>  11     :  1   Mean   :4.173                  Mean   :5.042  
#>  12     :  1   3rd Qu.:5.000                  3rd Qu.:6.000  
#>  14     :  1   Max.   :6.000                  Max.   :7.000  
#>  (Other):278                                                 
#>       swk1_a.V1     
#>  Min.   :-4.042254  
#>  1st Qu.:-1.042254  
#>  Median :-0.042254  
#>  Mean   : 0.000000  
#>  3rd Qu.: 0.957746  
#>  Max.   : 1.957746  
#> 

Descriptive stats using summarise():

stereo %>% 
  group_by(geschlecht) %>% 
  summarise(Mean_Mathe = round(mean(Mathenote), 3),
            SD_Mathe = round(sd(Mathenote), 3)) %>%
  as.data.frame() 
#>   geschlecht Mean_Mathe SD_Mathe
#> 1   männlich      4.132    0.964
#> 2   weiblich      4.218    0.932

stereo %>% 
  group_by(geschlecht) %>% 
  summarise(Mean_swk1 = round(mean(swk1), 3),
            SD_swk1 = round(sd(swk1), 3)) %>% 
  as.data.frame()
#>   geschlecht Mean_swk1 SD_swk1
#> 1   männlich     5.212   1.081
#> 2   weiblich     4.850   1.252

10.9.2 Model 1: Gender differences in math grade (model without covariate)

model1 <- lm(Mathenote ~ geschlecht, data = stereo)
summary(model1)
#> 
#> Call:
#> lm(formula = Mathenote ~ geschlecht, data = stereo)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.2180 -0.2180 -0.1325  0.7820  1.8676 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         4.13245    0.07724  53.503   <2e-16 ***
#> geschlechtweiblich  0.08559    0.11287   0.758    0.449    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9491 on 282 degrees of freedom
#> Multiple R-squared:  0.002035,   Adjusted R-squared:  -0.001504 
#> F-statistic: 0.5751 on 1 and 282 DF,  p-value: 0.4489

The model, and thus also the effect of the single predictor geschlecht, is not significant (\(p = 0.4489\)). Descriptively, the math grade mean is \(b_1 = 0.0856\) points higher for geschlecht = "weiblich". The math grade mean for geschlecht = "männlich" (the reference category) is the intercept \(b_0 = 4.1325\). Our first hypothesis (boys have better math grades than girls) was therefore not confirmed. Nevertheless, the ANCOVA model can still be instructive for investigating the stereotype-threat hypothesis (see the next section).

10.9.3 Model 2: ANCOVA

model2 <- lm(Mathenote ~ geschlecht + swk1, data = stereo)
summary(model2)
#> 
#> Call:
#> lm(formula = Mathenote ~ geschlecht + swk1, data = stereo)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.90961 -0.47498  0.09039  0.52502  1.95966 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         1.86716    0.22422   8.328 3.66e-15 ***
#> geschlechtweiblich  0.24306    0.09681   2.511   0.0126 *  
#> swk1                0.43464    0.04115  10.563  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8044 on 281 degrees of freedom
#> Multiple R-squared:  0.2857, Adjusted R-squared:  0.2806 
#> F-statistic: 56.19 on 2 and 281 DF,  p-value: < 2.2e-16

After taking self-efficacy into account, a significant gender effect (\(p = 0.0126\)) appears: girls have a math grade that is \(b_1 = 0.2431\) points higher than that of boys. This result is consistent with the assumption of a stereotype threat: if girls had the same mathematical self-efficacy as boys, they would also have (significantly) higher math grades than boys.

Furthermore, there is a significant positive effect of the covariate mathematical self-efficacy on the math grade (\(b_2 = 0.4346, p < 0.001\)): the higher the self-efficacy, the higher the math grade.

ANCOVA with centered covariate swk1_a:

model3 <- lm(Mathenote ~ geschlecht + swk1_a, data = stereo)
summary(model3)
#> 
#> Call:
#> lm(formula = Mathenote ~ geschlecht + swk1_a, data = stereo)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.90961 -0.47498  0.09039  0.52502  1.95966 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         4.05871    0.06583  61.652   <2e-16 ***
#> geschlechtweiblich  0.24306    0.09681   2.511   0.0126 *  
#> swk1_a              0.43464    0.04115  10.563   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8044 on 281 degrees of freedom
#> Multiple R-squared:  0.2857, Adjusted R-squared:  0.2806 
#> F-statistic: 56.19 on 2 and 281 DF,  p-value: < 2.2e-16

Only the intercept changes (now: \(b_0 =4.0587\)). This corresponds to the adjusted mean of the DV Mathenote in the reference category (geschlecht = männlich).

10.9.4 Checking the homogeneity-of-slopes assumption

The homogeneity-of-slopes assumption states that the interpretation of an ANCOVA model is only valid if there is no interaction between covariate and factor. In case of a significant interaction, the effect of gender would depend on the extent of mathematical self-efficacy (and thus the slopes of swk1 would be different between the sexes, so there would be no homogeneity of slopes).

model4 <- lm(Mathenote ~ geschlecht*swk1, data = stereo)
summary(model4)
#> 
#> Call:
#> lm(formula = Mathenote ~ geschlecht * swk1, data = stereo)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.8546 -0.4507  0.1454  0.5555  1.9531 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)              2.02763    0.32372   6.263 1.41e-09 ***
#> geschlechtweiblich      -0.04379    0.42817  -0.102    0.919    
#> swk1                     0.40385    0.06083   6.639 1.63e-10 ***
#> geschlechtweiblich:swk1  0.05685    0.08265   0.688    0.492    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8052 on 280 degrees of freedom
#> Multiple R-squared:  0.2869, Adjusted R-squared:  0.2793 
#> F-statistic: 37.55 on 3 and 280 DF,  p-value: < 2.2e-16

The interaction term is not significant (\(p = 0.4921\)), so the null hypothesis of homogeneous slopes of swk1 can be maintained (the assumption is not violated).
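
Equivalently, the homogeneity-of-slopes check can be carried out as a model comparison between the ANCOVA model and the model including the interaction, using anova() just as we did for the quadratic term above. A short sketch:

# Compare the ANCOVA model (main effects only) with the model adding the
# geschlecht x swk1 interaction; the F test is equivalent to the t test of
# the interaction term in model4
anova(model2, model4)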

10.9.5 Adjusted means

Adjusted means are computed by using the overall mean of the covariate (\(\bar x_{swk1} = 5.0423\)) to predict Mathenote separately for the two groups:

\(\hat{Y}_{männlich}= 1.8672 + 0.2431 \cdot 0 + 0.4346 \cdot 5.0423 = 4.059\)

\(\hat{Y}_{weiblich}= 1.8672 + 0.2431 \cdot 1 + 0.4346 \cdot 5.0423 = 4.302\)

In R, this can be done using predict():

# Männlich
predict(model2, newdata = data.frame(geschlecht = "männlich", 
                                     swk1 = mean(stereo$swk1))) %>% round(3)
#>     1 
#> 4.059

# Weiblich
predict(model2, newdata = data.frame(geschlecht = "weiblich", 
                                     swk1 = mean(stereo$swk1))) %>% round(3)
#>     1 
#> 4.302
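
Adjusted (estimated marginal) means can also be obtained directly with the emmeans package. Note that emmeans is not among the packages loaded above, so the following is only a sketch of an alternative route; by default, emmeans holds covariates at their means:

# Adjusted means of Mathenote for both genders, with swk1 held at its mean
pacman::p_load(emmeans)
emmeans(model2, ~ geschlecht)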

10.9.6 Graphical representation of the ANCOVA model