11 Data simulation
pacman::p_load(tidyverse, lme4, simr)
11.1 Basics
Most common distributions are implemented in base R.
For a good general overview see this chapter of another R course.
Each distribution also has a pseudo-random data generating function starting with r, e.g. rnorm() for normally distributed numbers, or runif() for random uniform numbers.
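Beyond the r* generators, each distribution also comes with d (density), p (distribution function) and q (quantile) variants; a minimal sketch for the normal distribution (our addition, not part of the original example):
dnorm(0)     # density of the standard normal at 0
pnorm(1.96)  # cumulative probability up to 1.96
qnorm(.975)  # quantile (inverse of pnorm)
rnorm(3)     # three random draws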
We can specify parameters for the number generators to generate data according to our requirements.
The sample() function is also a very useful tool. If we choose the setting replace = TRUE in that function and leave the prob argument at its default, we sample according to a discrete uniform distribution over all entries in our sampling vector.
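As a small sketch (the die example is ours): with replace = TRUE and the default prob, all entries are equally likely; supplying prob gives weighted sampling instead:
sample(1:6, size = 10, replace = TRUE) # ten fair die rolls
sample(1:6, size = 10, replace = TRUE,
       prob = c(.5, .1, .1, .1, .1, .1)) # a loaded die, 1 comes up half the time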
We can, for example, randomly sample 50 values from a normal distribution with mean \(\mu = 2\) and standard deviation \(\sigma = 2\). But first, we make the example reproducible by setting a seed:
set.seed(1413)
rnorm(n = 50, mean = 2, sd = 2) %>%
  round(4) %>%
  print() %>%   # print out rounded values
  density() %>% # smooth density estimation
  plot()        # plot the density
abline(v = 2, col = "red") # add vertical line at the true mean
## [1] 2.4067 1.4031 2.9782 0.2701 -1.2127 0.1227 2.6192 4.0750 2.8920
## [10] 6.6612 4.6134 2.0064 0.6730 1.3692 0.1816 2.4681 3.1938 4.2389
## [19] 4.7378 1.4919 3.2757 2.0376 3.3145 3.1323 0.1044 0.6688 -1.9733
## [28] 0.4641 0.8027 -2.6118 1.0727 3.8566 2.5002 0.8582 0.0440 3.8491
## [37] 6.0439 5.5044 1.1016 4.1413 1.3772 5.4016 3.7061 1.2309 -0.1826
## [46] 2.4296 5.6817 1.6379 0.1860 -0.5932
Or we can simulate 10 coin tosses with various functions.
# easiest probably with the sample function:
samp <- sample(c("heads", "tails"), 10, replace = TRUE)

# rbinom
rbino <- rbinom(n = 10, size = 1, prob = .5)

# runif
runi <- runif(n = 10, min = 0, max = 1)
ifelse(runi < 0.5, "heads", "tails")
## [1] "heads" "heads" "heads" "heads" "tails" "heads" "tails" "heads" "tails"
## [10] "tails"
11.2 Under H0 (Simulate test statistic)
The idea here is to simulate a test statistic under \(H_0\), whatever \(H_0\) may be. If we can sample under \(H_0\), we can also compare an empirical observation to the distribution under \(H_0\). We can then judge how likely the observed data would be if \(H_0\) were true.
Idea of example taken from the Computational Statistics lecture by Marloes Maathuis.
Panini: one box contains 100 packs with 5 stickers each; there are 678 different stickers in total for UEFA Euro 2020.
Now we bought half a box and got only 23 duplicates.
But how many duplicated cards would we expect if all stickers are equally common and independently distributed (this is our \(H_0\))? How likely are 23 duplicates in 250 cards given our \(H_0\)?
set.seed(315)
n.sim <- 1e5       # number of simulations we want to conduct
half.box <- 50*5   # how many stickers in half a box
# n.packs <- 50    # how many packs in half a box
# cards.pack <- 5  # how many cards in a pack
n.cards <- 678     # how many different cards are there?
emp.dupl <- 23     # observed number of duplicated cards in half a box

# let's simulate one draw:
sample.cards <- sample(1:n.cards, size = half.box, replace = TRUE)
sample.stat <- sum(duplicated(sample.cards))

# how many did we get?
sample.stat
## [1] 41
We can now simulate a distribution under \(H_0\) by writing a function that simulates one draw and returns the test statistic.
card.fun <- function(){
  sample.cards <- sample(1:n.cards, size = half.box, replace = TRUE)
  sample.stat <- sum(duplicated(sample.cards))
  return(sample.stat)
}
Then we repeat this process n.sim
times and store the vector of results.
card.h0 <- replicate(n = n.sim, card.fun())
summary(card.h0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 37.00 41.00 40.79 44.00 63.00
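As a closed-form sanity check (our addition, not part of the original analysis): under \(H_0\) the expected number of distinct cards among 250 uniform, independent draws is \(n_{cards}\left(1 - (1 - 1/n_{cards})^{250}\right)\), so the expected number of duplicates is the remainder:
# closed-form expectation of the test statistic under H_0 (sanity check)
half.box - n.cards * (1 - (1 - 1/n.cards)^half.box)
This comes out at roughly 40.8, in line with the simulated mean above.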
A histogram of the test statistic under \(H_0\) shows us how far away the observed outcome was compared to expected outcomes given \(H_0\).
hist(card.h0, xlim = c(0, 60))
abline(v = emp.dupl, col = "red")
We can now calculate a p-value as the proportion of simulated values under \(H_0\) that are at least as extreme as (here: as small as or smaller than) our observed value.
To make the estimate more robust, and so that it can never be exactly zero, we add + 1
to both the numerator and the denominator.
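Written out, the estimator implemented below is
\[\hat{p} = \frac{\#\{i : t_i \le t_{obs}\} + 1}{n_{sim} + 1},\]
where \(t_1, \dots, t_{n_{sim}}\) are the simulated statistics and \(t_{obs}\) is the observed one.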
(sum(emp.dupl >= card.h0) + 1)/(n.sim + 1)
## [1] 0.0001299987
Such an extreme value would be very unlikely, given that the \(H_0\) is true. We would therefore reject \(H_0\).
11.3 Under H1 (Power)
pacman::p_load(
  # for data wrangling and stuff...
  tidyverse,
  lme4,
  simr
)
11.3.1 Background
Power is the probability of detecting an effect given that the effect truly exists. In practice, the concept of power is used to determine an appropriate sample size for an experiment under the assumption that the effect of interest (generally a [set of] model parameter[s]) is real and of a particular size.
In simple cases one can derive an analytic solution for such a probability. This is not always feasible as models grow more complex. In such cases it can be very handy to rely on Monte Carlo simulations.
We would then like to know how probable it is under the alternative hypothesis (\(H_1\)) to detect a significant effect for the parameter of interest. Monte Carlo simulations produce random data according to the model we are interested in. Such a model has to be clearly specified. We then simulate data based on our \(H_1\), fit the model, and evaluate whether the parameter of interest was significant (i.e. detected). To approximate the probability of detecting an effect, we calculate the frequency of significant effects over a predefined number of simulations.
11.3.1.1 Example (two independent samples t-test)
Let’s use the model:
\(Y = \beta_0 + \beta_1*X + \epsilon\)
Where Y is the dependent variable, X indicates the control (x = 0) or experimental (x = 1) group, and \(\epsilon\) is the random error. \(\beta_0\) is then the mean of group 0, and \(\beta_1\) the difference in means.
This is the model formulation for a t-test.
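To see this correspondence concretely, here is a quick sketch with hypothetical demo data (the names x.demo and y.demo are ours): the equal-variance two-sample t-test and the lm() slope test return the same t value (up to sign) and the same p-value.
set.seed(1) # hypothetical demo, separate from the example below
x.demo <- rep(0:1, each = 10)
y.demo <- 0.4*x.demo + rnorm(20)
t.test(y.demo ~ factor(x.demo), var.equal = TRUE) # classical two-sample t-test
summary(lm(y.demo ~ x.demo))$coefficients         # same t (up to sign) and p for x.demo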
Traditionally, we have the null hypothesis \(H_0: \beta_1 = 0\). To calculate power we need to assume a specific effect size under the alternative hypothesis, e.g. \(H_1: \beta_1 = 0.4\). Of course, we also need to specify a significance level.
For the t-test the analytic solution is straightforward. Here, we explore a Monte Carlo simulation for the same problem and subsequently compare the results.
We start with simulating one data set of x and y, with a fixed n of 100 and \(\beta_1 = 0.4\).
set.seed(237894) # to make results reproducible
n <- 100
beta.1 <- .4

# If the experimental group is randomly assigned:
# x <- sample(0:1, n, replace = TRUE)
# If done like this, the group sizes are not necessarily equal.
# For this example, we can use a fixed x; the order does not matter.
x <- rep(0:1, times = n/2)
x
## [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
## [38] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
## [75] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
# under H_1 we use the specified beta.1:
y <- beta.1*x + rnorm(n) # effect based on x + error
# under H_0 we would replace beta.1 with 0 and be left with only the random error:
# y <- 0*x + rnorm(n)
Now this looks as follows:
plot(y~as_factor(x))
Here we can see that group 1 has a higher y value on average, but is it statistically significant?
summary(lm(y ~ x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27453 -0.65678 0.04281 0.49153 2.69333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1990 0.1445 -1.377 0.17156
## x 0.6884 0.2043 3.370 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 98 degrees of freedom
## Multiple R-squared: 0.1038, Adjusted R-squared: 0.09469
## F-statistic: 11.35 on 1 and 98 DF, p-value: 0.001078
Yes, it is.
We have now one instance of simulated data under \(H_1\). We would like to know how probable it is under \(H_1\) that a randomly simulated dataset yields a significant effect for the \(\beta_1\) coefficient. To approximate the probability we calculate the frequency of finding a significant \(\beta_1\) effect given a set number of simulations.
First, we write a function that always simulates a new set of data, fits a linear model and checks if the model was significant. For later use we let n be specified in the function call.
simulate_and_check <- function(n){
  x <- rep(0:1, n/2)
  y <- beta.1*x + rnorm(n) # effect based on x + error
  temp.dat <- data.frame(x, y)

  # fit model
  temp.lm <- lm(data = temp.dat, y ~ x)

  # check if effect was discovered in this one simulation:
  # access p value and return TRUE if smaller than 0.05
  summary(temp.lm)$coefficients["x", 4] < .05
}
Let's see if the function works:
set.seed(237894) # same seed, same code, should yield TRUE
simulate_and_check(n = n)
## [1] TRUE
We can now repeat this simulation many times (n.sim
times, to be precise).
We do this in a function, so we can do the same for different sample sizes.
n.sim <- 1000

# we want to get, for each simulation, the information
# whether we discovered the effect or not
power <- function(n){
  # now we replicate this procedure n.sim times
  result <- replicate(n = n.sim, simulate_and_check(n))

  # we can now calculate power as the proportion of successful
  # discoveries over total simulations (n.sim)
  sum(result)/n.sim
}
power(n = n)
## [1] 0.517
Even better, we can now make the function take a whole vector of n’s and find out how many people we need to reach a power of .8.
power <- Vectorize(power) # this allows the power function to take a vector!

# specify the n's of interest:
n.vec <- seq(from = 10, to = 250, by = 20)

# get one result for each n
res.vec <- power(n.vec)

plot(x = n.vec, y = res.vec, type = "l")
Analytic solution:
an.power <- power.t.test(n = n.vec/2, delta = beta.1, sd = 1, strict = TRUE)

# bind together
comp.dat <- data.frame(n = n.vec, simulated = res.vec, analytic = an.power$power)

# wide to long
comp.dat <- comp.dat %>%
  pivot_longer(names_to = "type",
               values_to = "power",
               cols = simulated:analytic)
head(comp.dat)
## # A tibble: 6 × 3
## n type power
## <dbl> <chr> <dbl>
## 1 10 simulated 0.074
## 2 10 analytic 0.0866
## 3 30 simulated 0.175
## 4 30 analytic 0.185
## 5 50 simulated 0.287
## 6 50 analytic 0.283
# plot
ggplot(aes(x = n, y = power, color = type), data = comp.dat) +
geom_line()
Pretty close!
Some things to consider: simulation takes much longer than the analytic solution, which can be a problem for complex models and large data sets. Also, the more simulations you run, the longer it takes; but with more simulations you get a more accurate power estimate.
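A rough way to quantify that accuracy (our addition): treating the power estimate as a binomial proportion, its Monte Carlo standard error is \(\sqrt{\hat{p}(1-\hat{p})/n_{sim}}\). For example:
# Monte Carlo standard error of a power estimate around .5,
# for increasing numbers of simulations:
sqrt(.5 * (1 - .5) / c(100, 1000, 10000))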
We can see this by running the simulation with different numbers of simulations.
Here, we first do the simulation n.sim times, and then take various sub-samples of those simulations. E.g. n.sim.5 uses 0.5 times n.sim samples, so 500.
power.alternative <- function(n){
  # now we replicate this procedure n.sim times
  result <- replicate(n = n.sim, simulate_and_check(n))

  # we can now calculate power as the proportion of successful discoveries
  # over total simulations, once for all n.sim simulations and once for
  # sub-samples of them
  c(n.sim = sum(result)/n.sim,
    n.sim.5 = sum(sample(result, size = n.sim*.5)) / (n.sim*.5),
    n.sim.2 = sum(sample(result, size = n.sim*.2)) / (n.sim*.2),
    n.sim.1 = sum(sample(result, size = n.sim*.1)) / (n.sim*.1),
    n.sim.05 = sum(sample(result, size = n.sim*.05)) / (n.sim*.05))
}
power.alternative <- Vectorize(power.alternative) # this allows the power function to take a vector!

# specify the n's of interest:
n.vec <- seq(from = 10, to = 250, by = 20)

# get one result for each n
res.vec.alternative <- power.alternative(n.vec)
res.vec.alternative <- rbind(n = n.vec, analytic = an.power$power, res.vec.alternative)
comp.dat.alternative <- as.data.frame(t(res.vec.alternative))

comp.dat.alternative <- comp.dat.alternative %>% gather(
  key = "type", value = "sim.power", n.sim:n.sim.05
)
ggplot(data = comp.dat.alternative, aes(x = n, y = sim.power, color = type)) +
geom_line() +
geom_line(aes(x = n, y = analytic), color = "black") +
theme_classic() +
geom_abline(slope = 0, intercept = .8, size = .3)
11.3.2 Power for mixed effects models
We compute power for mixed effects models using the simr package. It allows for power simulations of lmer and glmer objects. The package is relatively well documented: official information can be found here and here, and this blogpost (https://humburg.github.io/Power-Analysis/simr_power_analysis.html) is useful as well.
The data simulations are always based on a fitted lmer object. Generally, we have two options to get such an object: either we base it on available data, e.g. a pilot study or simulated data, or we create the model from scratch using the makeLmer() function provided by the simr package. Additionally, the simr package provides functionality to adjust sample sizes and parameters of a given fitted model.
11.3.2.1 Salary example
df <- read_csv(
  url("https://raw.githubusercontent.com/methodenlehre/data/master/salary-data2.csv")) %>%
  mutate(company = as.factor(company),
         sector = as.factor(sector),
         id = as_factor(1:nrow(.)))
## Rows: 600 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): company, sector
## dbl (2): experience, salary
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Fit the model:
random.intercept.model <- lmer(salary ~ experience + (1 | company),
                               data = df, REML = TRUE)
We want to vary the sample size in this model. To verify that the sample size has changed in exactly the way we want, we constructed a custom function to summarise the new data.
The easy version is to just look at the output of the getData() function from the simr package. We do this here too, but then continue to summarise the data in a specific way.
custom.summary <- function(model){
  getmode <- function(v) {
    uniqv <- unique(v)
    uniqv[which.max(tabulate(match(v, uniqv)))]
  }
  getData(model) %>% group_by(company) %>%
    summarise(n = n(),
              mean.exp = mean(experience),
              mean.salary = mean(salary),
              sector = getmode(sector)
    ) %>% group_by(sector) %>%
    print() %>%
    summarise(n = n())
}
The custom.summary() function takes a fitted model as input and outputs a summary of the getData() function output.
We can then, for example, directly look at a summary of the originally fitted data:
custom.summary(random.intercept.model)
## # A tibble: 20 × 5
## # Groups: sector [2]
## company n mean.exp mean.salary sector
## <fct> <int> <dbl> <dbl> <fct>
## 1 Company 01 30 6.37 9588. Private
## 2 Company 02 30 6.13 9927. Private
## 3 Company 03 30 6.09 10808. Private
## 4 Company 04 30 4.71 7514. Public
## 5 Company 05 30 6.34 9764. Private
## 6 Company 06 30 5.1 7705. Private
## 7 Company 07 30 4.02 8727. Public
## 8 Company 08 30 5.71 8463. Private
## 9 Company 09 30 5.15 8036. Private
## 10 Company 10 30 5.21 9565. Public
## 11 Company 11 30 5.40 8193. Private
## 12 Company 12 30 4.91 9750. Public
## 13 Company 13 30 5.23 7938. Public
## 14 Company 14 30 4.70 7687. Public
## 15 Company 15 30 4.15 7487. Public
## 16 Company 16 30 4.88 8632. Public
## 17 Company 17 30 5.61 9450. Private
## 18 Company 18 30 5.03 8557. Public
## 19 Company 19 30 4.02 9116. Public
## 20 Company 20 30 5.04 7844. Private
## # A tibble: 2 × 2
## sector n
## <fct> <int>
## 1 Private 10
## 2 Public 10
We see that we get 20 companies with 30 employees each; 10 companies are in sector Public, 10 in sector Private.
We can then do a power simulation.
The powerSim() function chooses a parameter to calculate power for. We could specify it ourselves, although in this case the function chooses correctly.
We set the parameter nsim (number of simulations) very low, as otherwise this would take too long.
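If we wanted to name the tested fixed effect explicitly, simr's fixed() test specification can be used; a sketch (not run here):
# hypothetical explicit alternative to letting powerSim() choose:
# powerSim(random.intercept.model, nsim = 20, test = fixed("experience"))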
powerSim(random.intercept.model, nsim = 20, test = fcompare(salary ~ 1))
## Warning in observedPowerWarning(sim): This appears to be an "observed power"
## calculation
## Power for model comparison, (95% confidence interval):
## 100.0% (83.16, 100.0)
##
## Test: Likelihood ratio
## Comparison to salary ~ 1 + [re]
##
## Based on 20 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 600
##
## Time elapsed: 0 h 0 m 1 s
##
## nb: result might be an observed power calculation
Now, we have a huge sample size, so no wonder the power is very large.
We can use the extend() function to either extend or shrink the data within a model along a variable.
Before, we had 20 companies; we can extend this to 60 companies:
random.intercept.model.2 <- extend(random.intercept.model,
                                   along = "company", n = 60)
custom.summary(random.intercept.model.2)
## # A tibble: 60 × 5
## # Groups: sector [2]
## company n mean.exp mean.salary sector
## <fct> <int> <dbl> <dbl> <fct>
## 1 a 30 6.37 9588. Private
## 2 b 30 6.13 9927. Private
## 3 c 30 6.09 10808. Private
## 4 d 30 4.71 7514. Public
## 5 e 30 6.34 9764. Private
## 6 f 30 5.1 7705. Private
## 7 g 30 4.02 8727. Public
## 8 h 30 5.71 8463. Private
## 9 i 30 5.15 8036. Private
## 10 j 30 5.21 9565. Public
## # … with 50 more rows
## # A tibble: 2 × 2
## sector n
## <fct> <int>
## 1 Private 30
## 2 Public 30
We now have 60 companies with 30 employees each.
On this extended model, we could again run the powerSim() function.
Similarly, we can extend the total number of employees, if we have an id variable specified:
random.intercept.model.3 <- extend(random.intercept.model,
                                   along = "id", n = 800)
custom.summary(random.intercept.model.3)
## # A tibble: 20 × 5
## # Groups: sector [2]
## company n mean.exp mean.salary sector
## <fct> <int> <dbl> <dbl> <fct>
## 1 Company 01 60 6.37 9588. Private
## 2 Company 02 60 6.13 9927. Private
## 3 Company 03 60 6.09 10808. Private
## 4 Company 04 60 4.71 7514. Public
## 5 Company 05 60 6.34 9764. Private
## 6 Company 06 60 5.1 7705. Private
## 7 Company 07 50 4.03 8814. Public
## 8 Company 08 30 5.71 8463. Private
## 9 Company 09 30 5.15 8036. Private
## 10 Company 10 30 5.21 9565. Public
## 11 Company 11 30 5.40 8193. Private
## 12 Company 12 30 4.91 9750. Public
## 13 Company 13 30 5.23 7938. Public
## 14 Company 14 30 4.70 7687. Public
## 15 Company 15 30 4.15 7487. Public
## 16 Company 16 30 4.88 8632. Public
## 17 Company 17 30 5.61 9450. Private
## 18 Company 18 30 5.03 8557. Public
## 19 Company 19 30 4.02 9116. Public
## 20 Company 20 30 5.04 7844. Private
## # A tibble: 2 × 2
## sector n
## <fct> <int>
## 1 Private 10
## 2 Public 10
We now have 800 employees and still 20 companies, but the employees are not uniformly distributed amongst the companies. It looks like the companies are topped up one after another by duplicating their original observations until the new total is reached.
We can also extend within id and company with n = 2, and then have each id twice in each company:
# each id is used twice
random.intercept.model.4 <- extend(random.intercept.model,
                                   within = "id + company", n = 2)
custom.summary(random.intercept.model.4)
## # A tibble: 20 × 5
## # Groups: sector [2]
## company n mean.exp mean.salary sector
## <fct> <int> <dbl> <dbl> <fct>
## 1 Company 01 60 6.37 9588. Private
## 2 Company 02 60 6.13 9927. Private
## 3 Company 03 60 6.09 10808. Private
## 4 Company 04 60 4.71 7514. Public
## 5 Company 05 60 6.34 9764. Private
## 6 Company 06 60 5.1 7705. Private
## 7 Company 07 60 4.02 8727. Public
## 8 Company 08 60 5.71 8463. Private
## 9 Company 09 60 5.15 8036. Private
## 10 Company 10 60 5.21 9565. Public
## 11 Company 11 60 5.40 8193. Private
## 12 Company 12 60 4.91 9750. Public
## 13 Company 13 60 5.23 7938. Public
## 14 Company 14 60 4.70 7687. Public
## 15 Company 15 60 4.15 7487. Public
## 16 Company 16 60 4.88 8632. Public
## 17 Company 17 60 5.61 9450. Private
## 18 Company 18 60 5.03 8557. Public
## 19 Company 19 60 4.02 9116. Public
## 20 Company 20 60 5.04 7844. Private
## # A tibble: 2 × 2
## sector n
## <fct> <int>
## 1 Private 10
## 2 Public 10
Or, much more usefully, we can just extend the number of entries within companies:
# get more ids within company (I hope)
random.intercept.model.5 <- extend(random.intercept.model,
                                   within = "company", n = 35)
custom.summary(random.intercept.model.5)
## # A tibble: 20 × 5
## # Groups: sector [2]
## company n mean.exp mean.salary sector
## <fct> <int> <dbl> <dbl> <fct>
## 1 Company 01 35 6.39 9595. Private
## 2 Company 02 35 6.24 10116. Private
## 3 Company 03 35 6.07 10827. Private
## 4 Company 04 35 4.71 7554. Public
## 5 Company 05 35 6.29 9649. Private
## 6 Company 06 35 5.16 7671. Private
## 7 Company 07 35 3.99 8619. Public
## 8 Company 08 35 5.76 8524. Private
## 9 Company 09 35 5.02 7969. Private
## 10 Company 10 35 5.23 9551. Public
## 11 Company 11 35 5.31 8045. Private
## 12 Company 12 35 4.88 9775. Public
## 13 Company 13 35 5.18 8020. Public
## 14 Company 14 35 4.55 7700. Public
## 15 Company 15 35 4.22 7514. Public
## 16 Company 16 35 4.81 8539. Public
## 17 Company 17 35 5.59 9444. Private
## 18 Company 18 35 5.04 8456. Public
## 19 Company 19 35 4.06 9046. Public
## 20 Company 20 35 5.08 7849. Private
## # A tibble: 2 × 2
## sector n
## <fct> <int>
## 1 Private 10
## 2 Public 10
11.3.2.1.1 Adjust effect sizes
When we have a fitted model, we can adjust the effect sizes according to our needs.
We use random.intercept.model.5 as an example.
fixef(random.intercept.model.5)["experience"] <- 150
summary(random.intercept.model.5)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ experience + (1 | company)
## Data: df
##
## REML criterion at convergence: 10127.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8109 -0.6884 0.0005 0.5980 3.8833
##
## Random effects:
## Groups Name Variance Std.Dev.
## company (Intercept) 614367 783.8
## Residual 1184502 1088.3
## Number of obs: 600, groups: company, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5964.18 229.41 48.00 25.997 < 2e-16 ***
## experience 150.00 27.21 589.48 5.514 5.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## experience -0.615
We can now check what the power would be for such a small effect size:
powerSim(random.intercept.model.5, nsim = 20)
## Power for predictor 'experience', (95% confidence interval):
## 100.0% (83.16, 100.0)
##
## Test: unknown test
## Effect size for experience is 1.5e+02
##
## Based on 20 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 700
##
## Time elapsed: 0 h 0 m 5 s
It is still reasonably large.
But how many employees should we sample within the companies to get a power of 80%?
random.int.5.pc <- powerCurve(random.intercept.model.5, nsim = 60,
                              within = "company", progress = FALSE)
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
The powerCurve() function automatically selects 10 positions along the specified variable and simulates the power accordingly.
We can plot the object:
plot(random.int.5.pc)
And see that we need 15 employees per company to get 80% power.
To get a better resolution around the number of observations within company, we can specify break points in the powerCurve() function:
random.int.5.pc2 <- powerCurve(random.intercept.model.5, nsim = 60,
                               within = "company", breaks = 8:12, progress = FALSE)
plot(random.int.5.pc2)
11.3.2.2 Repeated measures (from scratch)
We have a 2x2 within-subjects design.
timepoint <- 1:2
condition <- 1:2
participant <- letters[1:10]
X <- expand.grid(timepoint = timepoint, participant = participant, condition = condition)
X
## timepoint participant condition
## 1 1 a 1
## 2 2 a 1
## 3 1 b 1
## 4 2 b 1
## 5 1 c 1
## 6 2 c 1
## 7 1 d 1
## 8 2 d 1
## 9 1 e 1
## 10 2 e 1
## 11 1 f 1
## 12 2 f 1
## 13 1 g 1
## 14 2 g 1
## 15 1 h 1
## 16 2 h 1
## 17 1 i 1
## 18 2 i 1
## 19 1 j 1
## 20 2 j 1
## 21 1 a 2
## 22 2 a 2
## 23 1 b 2
## 24 2 b 2
## 25 1 c 2
## 26 2 c 2
## 27 1 d 2
## 28 2 d 2
## 29 1 e 2
## 30 2 e 2
## 31 1 f 2
## 32 2 f 2
## 33 1 g 2
## 34 2 g 2
## 35 1 h 2
## 36 2 h 2
## 37 1 i 2
## 38 2 i 2
## 39 1 j 2
## 40 2 j 2
Specify some fixed and random parameters.
b <- c(6, -.15, .002, .5) # fixed effects: intercept, condition, timepoint, interaction
V1 <- matrix(c(1, 0.5, 0.5, 1), 2) # random intercept and slope variance-covariance matrix
s <- 1 # residual standard deviation

scratch1 <- makeLmer(y ~ condition*timepoint + (timepoint|participant),
                     fixef=b, VarCorr=V1, sigma=s, data=X)
scratch1
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ condition * timepoint + (timepoint | participant)
## Data: X
## REML criterion at convergence: 117.1171
## Random effects:
## Groups Name Std.Dev. Corr
## participant (Intercept) 1
## timepoint 1 0.50
## Residual 1
## Number of obs: 40, groups: participant, 10
## Fixed Effects:
## (Intercept) condition timepoint
## 6.000 -0.150 0.002
## condition:timepoint
## 0.500
Note that the y variable was not defined outside of the makeLmer() function; makeLmer() generates the response itself from the specified parameters.
We can calculate power for the interaction effect by:
powerSim(scratch1, nsim = 20, test = fcompare(y ~ condition + timepoint), progress = FALSE)
## Power for model comparison, (95% confidence interval):
## 10.00% ( 1.23, 31.70)
##
## Test: Likelihood ratio
## Comparison to y ~ condition + timepoint + [re]
##
## Based on 20 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 40
##
## Time elapsed: 0 h 0 m 1 s
We estimate the power as 10%. Again, we can also expand the data to estimate power for more participants:
scratch2 <- extend(scratch1, along = "participant", n = 150)
# look at the data to verify:
summary(getData(scratch2))
## timepoint participant condition y
## Min. :1.0 a : 4 Min. :1.0 Min. : 4.303
## 1st Qu.:1.0 b : 4 1st Qu.:1.0 1st Qu.: 5.940
## Median :1.5 c : 4 Median :1.5 Median : 6.629
## Mean :1.5 d : 4 Mean :1.5 Mean : 7.037
## 3rd Qu.:2.0 e : 4 3rd Qu.:2.0 3rd Qu.: 7.718
## Max. :2.0 f : 4 Max. :2.0 Max. :11.642
## (Other):576
# redo power simulation
powerSim(scratch2, nsim = 50, fcompare(y ~ condition + timepoint), progress = FALSE)
## Power for model comparison, (95% confidence interval):
## 80.00% (66.28, 89.97)
##
## Test: Likelihood ratio
## Comparison to y ~ condition + timepoint + [re]
##
## Based on 50 simulations, (50 warnings, 0 errors)
## alpha = 0.05, nrow = 600
##
## Time elapsed: 0 h 0 m 6 s
We get around 80% power.
Again, we can investigate this with powerCurve():
scratch.pc <- powerCurve(scratch2, test = fcompare(y ~ condition + timepoint),
                         nsim = 50, along = "participant", progress = FALSE)
plot(scratch.pc)
11.3.2.2.1 Adjusting random variance
We can modify the random-effect variance-covariance matrix similarly to the fixed effects. However, we must assign a matrix, not just a vector: the matrix we assign is the variance-covariance matrix, while what the model prints out are the standard deviations and the correlation.
See also https://cran.r-project.org/web/packages/simr/vignettes/fromscratch.html. Standard deviations and correlation relate to the covariance via
\[\operatorname{cor}(x, y)=\frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x) \times \operatorname{sd}(y)}\]
# define a new VarCorr matrix
V2 <- matrix(c(0.5, 0.05, 0.05, 0.1), 2)
V2
## [,1] [,2]
## [1,] 0.50 0.05
## [2,] 0.05 0.10
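As a quick cross-check (base R, not part of the original workflow), cov2cor() performs this conversion in one step:
cov2cor(V2)    # implied correlation matrix
sqrt(diag(V2)) # implied standard deviations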
# assign it
VarCorr(scratch2) <- V2
We should now see the following values in the model printout:
# random intercept sd:
(int.sd <- sqrt(0.5) %>% round(3))
## [1] 0.707
# random slope sd:
(slope.sd <- sqrt(0.1) %>% round(3))
## [1] 0.316
# correlation
(0.05/(int.sd*slope.sd)) %>% round(3)
## [1] 0.224
# let's see:
scratch2
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ condition * timepoint + (timepoint | participant)
## Data: X
## REML criterion at convergence: 117.1171
## Random effects:
## Groups Name Std.Dev. Corr
## participant (Intercept) 0.7071
## timepoint 0.3162 0.22
## Residual 1.0000
## Number of obs: 40, groups: participant, 10
## Fixed Effects:
## (Intercept) condition timepoint
## 6.000 -0.150 0.002
## condition:timepoint
## 0.500
Looks correct.
We can also modify the residual standard deviation using the scale() function. Note in the printout below that the random-effect standard deviations are rescaled by the same factor.
scale(scratch2) <- 1.3
scratch2
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ condition * timepoint + (timepoint | participant)
## Data: X
## REML criterion at convergence: 117.1171
## Random effects:
## Groups Name Std.Dev. Corr
## participant (Intercept) 0.9192
## timepoint 0.4111 0.22
## Residual 1.3000
## Number of obs: 40, groups: participant, 10
## Fixed Effects:
## (Intercept) condition timepoint
## 6.000 -0.150 0.002
## condition:timepoint
## 0.500