
Download the template notebook here.

Let’s load in the packages we’d been using previously.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(palmerpenguins)

Let’s also create a dataset that doesn’t have NA values for the factor sex, to simplify our task later. This also gets rid of all cases with missing numerical measurements.

penguins.complete <- penguins %>% filter(!is.na(sex))
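If you want to confirm that no missing values remain, here is a quick check (a sketch using dplyr's across(), which is part of the tidyverse version loaded above):

# count remaining NAs in every column; these should all be 0
penguins.complete %>% 
  summarise(across(everything(), ~ sum(is.na(.))))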

1 Linear model components

Based on this model output, what can we say about the effects of species and sex on flipper length, and the interaction of the two?

model.1 <- lm(flipper_length_mm ~ species + sex + species:sex, data=penguins.complete) # equivalent to species * sex
summary(model.1)
## 
## Call:
## lm(formula = flipper_length_mm ~ species + sex + species:sex, 
##     data = penguins.complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7945  -3.4110   0.0882   3.4590  17.5890 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              187.7945     0.6619 283.721  < 2e-16 ***
## speciesChinstrap           3.9408     1.1742   3.356 0.000884 ***
## speciesGentoo             24.9124     0.9947  25.044  < 2e-16 ***
## sexmale                    4.6164     0.9361   4.932  1.3e-06 ***
## speciesChinstrap:sexmale   3.5600     1.6606   2.144 0.032782 *  
## speciesGentoo:sexmale      4.2176     1.3971   3.019 0.002737 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.655 on 327 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.8372 
## F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16

1.1 Main effects

What does it mean that there are stars at the end of the row that starts sexmale?

penguins.complete %>% 
  ggplot(aes(x = sex,
             y = flipper_length_mm,
             fill = sex)) +
  theme_bw() +
  geom_boxplot() +
  ggtitle("sexmale                     4.616      0.936    4.93  1.3e-06 ***")

# the title is copied from the lm output for convenience. (don't do this for real)

What does it mean that there are stars at the end of the rows beginning with speciesChinstrap and speciesGentoo?

penguins.complete %>% 
  filter(species != "Gentoo") %>% 
  ggplot(aes(x = species,
             y = flipper_length_mm,
             fill = species)) +
  theme_bw() +
  geom_boxplot() +
  ggtitle("speciesChinstrap            3.941      1.174    3.36  0.00088 ***")

penguins.complete %>% 
  filter(species != "Chinstrap") %>% 
  ggplot(aes(x = species,
             y = flipper_length_mm,
             fill = species)) +
  theme_bw() +
  geom_boxplot() +
  ggtitle("speciesGentoo              24.912      0.995   25.04  < 2e-16 ***")

But wait: what about comparing Chinstrap to Gentoo? Each of these coefficients is just a pairwise comparison against the baseline (Adelie), so we only get two of the three possible pairings.

First: relevel (reorder the levels) so that the baseline is one of the other options.

model.2 <- lm(flipper_length_mm ~ species + sex + species:sex, 
              data=penguins.complete %>% 
                      mutate(species = fct_relevel(species, c("Chinstrap", "Adelie", "Gentoo")))) 
summary(model.2)
## 
## Call:
## lm(formula = flipper_length_mm ~ species + sex + species:sex, 
##     data = penguins.complete %>% mutate(species = fct_relevel(species, 
##         c("Chinstrap", "Adelie", "Gentoo"))))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7945  -3.4110   0.0882   3.4590  17.5890 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           191.7353     0.9699 197.692  < 2e-16 ***
## speciesAdelie          -3.9408     1.1742  -3.356 0.000884 ***
## speciesGentoo          20.9716     1.2215  17.169  < 2e-16 ***
## sexmale                 8.1765     1.3716   5.961 6.48e-09 ***
## speciesAdelie:sexmale  -3.5600     1.6606  -2.144 0.032782 *  
## speciesGentoo:sexmale   0.6576     1.7196   0.382 0.702394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.655 on 327 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.8372 
## F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16

This gives us the remaining pair-wise comparison.

penguins.complete %>% 
  filter(species != "Adelie") %>% 
  ggplot(aes(x = species,
             y = flipper_length_mm,
             fill = species)) +
  theme_bw() +
  geom_boxplot() +
  ggtitle("speciesGentoo           20.972      1.221   17.17  < 2e-16 ***")

But it still doesn’t tell us about the main effect of species. For this, we need to get metaphorical for a moment.

When you need to know how much your small pet weighs, but you only have a bathroom scale, you can’t just put the pet on the scale, because the scale isn’t accurate for such small weights. To weigh your pet, you need to weigh yourself holding the pet, then put the pet down and weigh yourself alone. Subtract your own weight from the combined weight and you get the weight of your pet.

This is the same strategy behind what we call model comparison. Model comparison is great for evaluating the contribution of any effect to the overall fit of the model, but it is especially important for effects that involve complex interactions or multiple levels.

To assess the contribution of species to the overall fit of the model (that is, does adding species to the model improve its description of the variance in a meaningful way), we need to compare the model with it in, and with it out.

For simplicity, let’s get rid of the interaction term for now.

model.full <- lm(flipper_length_mm ~ 1 +
                                     species + 
                                     sex, 
                 data=penguins.complete)

model.no.species <- lm(flipper_length_mm ~ 1 +
                                           #species + 
                                           sex, 
                       data=penguins.complete)

anova(model.full, model.no.species)
## Analysis of Variance Table
## 
## Model 1: flipper_length_mm ~ 1 + species + sex
## Model 2: flipper_length_mm ~ 1 + sex
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    329 10787                                 
## 2    331 60972 -2    -50185 765.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This anova() function isn’t always an ANOVA (as we’ll see soon), but it always provides the appropriate model comparison. Here, we can see that species contributes significantly to the overall model fit: the model that includes it describes substantially more of the variance than the model without it.
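As an aside, base R’s drop1() runs the same kind of single-term deletion test for every fixed effect in one call; it’s only a convenience, and under the hood it’s doing the same comparison as the explicit anova() above.

# drop each term from model.full in turn and test the change in fit
drop1(model.full, test = "F")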

1.2 Interactions

Interactions can be very tricky to interpret. Looking back at our first model, we have the same problem as before: because species has three levels, the interaction terms only give us two of the three pairwise contrasts. But this time, it’s less clear what each estimate is referring to.

summary(model.1)
## 
## Call:
## lm(formula = flipper_length_mm ~ species + sex + species:sex, 
##     data = penguins.complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7945  -3.4110   0.0882   3.4590  17.5890 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              187.7945     0.6619 283.721  < 2e-16 ***
## speciesChinstrap           3.9408     1.1742   3.356 0.000884 ***
## speciesGentoo             24.9124     0.9947  25.044  < 2e-16 ***
## sexmale                    4.6164     0.9361   4.932  1.3e-06 ***
## speciesChinstrap:sexmale   3.5600     1.6606   2.144 0.032782 *  
## speciesGentoo:sexmale      4.2176     1.3971   3.019 0.002737 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.655 on 327 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.8372 
## F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16

Let’s also look at the releveled analysis so we can see how the Chinstrap-Gentoo comparison is evaluated:

summary(model.2)
## 
## Call:
## lm(formula = flipper_length_mm ~ species + sex + species:sex, 
##     data = penguins.complete %>% mutate(species = fct_relevel(species, 
##         c("Chinstrap", "Adelie", "Gentoo"))))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7945  -3.4110   0.0882   3.4590  17.5890 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           191.7353     0.9699 197.692  < 2e-16 ***
## speciesAdelie          -3.9408     1.1742  -3.356 0.000884 ***
## speciesGentoo          20.9716     1.2215  17.169  < 2e-16 ***
## sexmale                 8.1765     1.3716   5.961 6.48e-09 ***
## speciesAdelie:sexmale  -3.5600     1.6606  -2.144 0.032782 *  
## speciesGentoo:sexmale   0.6576     1.7196   0.382 0.702394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.655 on 327 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.8372 
## F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16

Note that in the releveled model, the speciesGentoo:sexmale interaction (the Chinstrap-to-Gentoo contrast, since Chinstrap is now the baseline) is not significant. So, is the overall contribution of the interaction term significant, or is it only describing a very small part of the variance in the data?

Here is where data visualisation becomes imperative for interpretation:

penguins.complete %>% 
  group_by(species, sex) %>% 
  summarise(flipper.mean = mean(flipper_length_mm),
            flipper.se = sd(flipper_length_mm)/sqrt(n())) %>% 
  #mutate(species = fct_relevel(species, c("Gentoo", "Adelie", "Chinstrap"))) %>% 
  ggplot(aes(x = sex,
             y = flipper.mean,
             group = species,
             colour = species)) +
  theme_bw() +
  geom_point() +
  geom_errorbar(aes(ymin = flipper.mean-flipper.se,
                    ymax = flipper.mean+flipper.se),
                width=.1) +
  geom_path() +
  NULL

Well, it’s possible to eyeball that Adelies have a different slope from female to male than the other two species do (this is what we saw in model.1). It’s a little harder, but not impossible, to see that the slopes for Chinstraps and Gentoos are almost exactly the same, which would explain why there is no significant interaction for that comparison.

But the simple model output doesn’t tell us anything about the interaction overall. For that, we need model comparison again. Remember, model.1 is our complete version of this model. (And model.2 simply reorders the species levels.)

model.no.interaction <- lm(flipper_length_mm ~ 1 + sex + species, data = penguins.complete)

anova(model.1, model.no.interaction)
## Analysis of Variance Table
## 
## Model 1: flipper_length_mm ~ species + sex + species:sex
## Model 2: flipper_length_mm ~ 1 + sex + species
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1    327 10458                                
## 2    329 10787 -2   -329.04 5.1442 0.006314 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even though the interaction is tiny (as we see in the plot), it’s enough to significantly contribute to the overall model fit.

2 Mixed effects

Fixed effects are the main and interaction effects we’ve been working with up until now. Random effects are the type of effect I’ll be introducing shortly. The use of both fixed and random effects in a single linear model is what we call mixed effects modeling.

# load package for mixed effects modeling
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# read in your data, however you like
simdat <- read.csv("../data/simulated-data.csv", header = TRUE)

In short, random effects help you account for noise that you suspect is being introduced into your data in a consistent way, but from a source that has nothing to do with your research question and/or that you can’t control.

For instance, each person will have a different reading style, different exposure to vocabulary, different muscle responsiveness, different vocal tract shapes, and a different brain.

Each sentence they read (no matter how carefully you craft them) will have different word frequencies, different syllable structures, different meanings, and different contexts.

These two sources of variation are unavoidable, and they might interact in unexpected ways. One person might consistently read familiar words more slowly, but they might also be more familiar with some of the “low frequency” vocabulary. Another person might consistently read familiar words more quickly, but they might also be more familiar with the “low frequency” vocabulary. Here is how different participants’ responses appear – it’s very messy!

simdat %>% 
  group_by(subj,freq,gram) %>% 
  summarise(rt.mean = mean(rt),
            rt.se = sd(rt)/sqrt(n())) %>% 
  ggplot(aes(x=freq,colour=gram,y=rt.mean)) +
  geom_path(aes(group=gram)) +
  geom_errorbar(aes(ymin=rt.mean-rt.se,ymax=rt.mean+rt.se), width=.1) +
  theme_bw() +
  facet_wrap(~subj)

Rather than trying to establish a baseline for each individual on each stimulus item (which isn’t possible, nor relevant to the actual research question, which is about the population in aggregate), we can add subj (participant) and item to our random effects. We can also add freq and gram (and anything else) to our random effects structure, because we aren’t sure how each person-item-condition-etc pairing will vary.

mdl.max <- lmer(rt ~ 1 +
                  freq +
                  gram +
                  freq:gram +
                  age +
                  (1 + freq + gram | subj) + 
                  (1 + age | item),
                data = simdat)
## boundary (singular) fit: see ?isSingular
summary(mdl.max)
## Linear mixed model fit by REML ['lmerMod']
## Formula: rt ~ 1 + freq + gram + freq:gram + age + (1 + freq + gram | subj) +  
##     (1 + age | item)
##    Data: simdat
## 
## REML criterion at convergence: 48965.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.9870 -0.6372 -0.0143  0.6032  4.2258 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr       
##  subj     (Intercept) 1.624e+01   4.0294            
##           freqlow     1.684e+01   4.1041 -1.00      
##           gramyes     5.540e+01   7.4434 -0.72  0.73
##  item     (Intercept) 1.782e-01   0.4221            
##           age         1.930e-02   0.1389 -1.00      
##  Residual             1.215e+04 110.2225            
## Number of obs: 4000, groups:  subj, 40; item, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)     342.3039     7.9133  43.257
## freqlow           1.8710     6.0369   0.310
## gramyes         -11.7403     6.1175  -1.919
## age               1.1763     0.1614   7.289
## freqlow:gramyes   4.8117     8.4871   0.567
## 
## Correlation of Fixed Effects:
##             (Intr) freqlw gramys age   
## freqlow     -0.385                     
## gramyes     -0.379  0.503              
## age         -0.840 -0.001 -0.005       
## frqlw:grmys  0.268 -0.703 -0.694  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular

If we get messages about boundary (singular) fit: see ?isSingular, this is a good indication that our data don’t support the complexity of the model, even if it is conceptually sound. (Practitioners of Bayesian statistics would say ‘this is why you should be doing Bayesian statistics instead of frequentist statistics!’ but Bayesian statistics has its own problems, so it’s not always better!)
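You can also check for a singular fit directly with lme4’s isSingular() function:

# returns TRUE when the fitted random-effects structure is singular
isSingular(mdl.max)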

We can simplify the model until it “converges” or is appropriate for our data. Unfortunately, this might mean removing terms that are conceptually important. Ideally, you would run simulations of your data before you even started your experiment, so you would know how much data you would need to support the models you want to run. This is a good practice, but very difficult to plan for when you’re first starting out with these sorts of analytical tools. Just something to keep in mind for the future.
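For example, one way to begin simplifying is to drop random slopes one at a time and refit. This is only a sketch of one possible step (which slopes to drop first is a judgment call):

# same fixed effects as mdl.max, but with a reduced random effects structure
mdl.simpler <- lmer(rt ~ 1 +
                      freq +
                      gram +
                      freq:gram +
                      age +
                      (1 + freq | subj) + # by-subject gram slope removed
                      (1 | item),         # by-item age slope removed
                    data = simdat)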

To do model comparison, copy the maximal model and comment out the interaction first. This will tell you the “weight” or contribution of the interaction term to the overall fit. This will also provide the “missing” p-values, if you want them. (In the comparisons below, the random effects have also been reduced to intercepts only, which is why both formulas in the anova output show (1 | subj) + (1 | item).)

mdl.int <- lmer(rt ~ 1 +
                  freq +
                  gram +
                  #freq:gram +
                  age +
                  (1 | subj) + 
                  (1 | item),
                data = simdat)

anova(mdl.max, mdl.int)
## refitting model(s) with ML (instead of REML)
## Data: simdat
## Models:
## mdl.int: rt ~ 1 + freq + gram + age + (1 | subj) + (1 | item)
## mdl.max: rt ~ 1 + freq + gram + freq:gram + age + (1 | subj) + (1 | item)
##         npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
## mdl.int    7 48997 49041 -24492    48983                     
## mdl.max    8 48999 49049 -24491    48983 0.3228  1     0.5699

Since the interaction term contains both freq and gram, we can’t have it in the model when we test the main effects of freq or gram, so we can use the mdl.int model for our new ‘baseline’ model.

First, let’s test the contribution of frequency.

mdl.frq <- lmer(rt ~ 1 +
                  #freq +
                  gram +
                  #freq:gram +
                  age +
                  (1 | subj) + 
                  (1 | item),
                data = simdat)

anova(mdl.int, mdl.frq)
## refitting model(s) with ML (instead of REML)
## Data: simdat
## Models:
## mdl.frq: rt ~ 1 + gram + age + (1 | subj) + (1 | item)
## mdl.int: rt ~ 1 + freq + gram + age + (1 | subj) + (1 | item)
##         npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
## mdl.frq    6 48997 49034 -24492    48985                     
## mdl.int    7 48997 49041 -24492    48983 1.7421  1     0.1869

Now let’s test grammaticality. (Note: See this page for convergence help, although the warning below will disappear when the maximal model is fixed.)

# note: this overwrites the mdl.frq object from above; a clearer name would be mdl.gram
mdl.frq <- lmer(rt ~ 1 +
                  freq +
                  #gram +
                  #freq:gram +
                  age +
                  (1 | subj) + 
                  (1 | item),
                #control=lmerControl(optimizer="bobyqa"),
                data = simdat)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00214572 (tol = 0.002, component 1)
anova(mdl.int, mdl.frq)
## refitting model(s) with ML (instead of REML)
## Data: simdat
## Models:
## mdl.frq: rt ~ 1 + freq + age + (1 | subj) + (1 | item)
## mdl.int: rt ~ 1 + freq + gram + age + (1 | subj) + (1 | item)
##         npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)  
## mdl.frq    6 49001 49039 -24495    48989                       
## mdl.int    7 48997 49041 -24492    48983 6.3763  1    0.01157 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since age isn’t part of the interaction, we can use the maximal model for a comparison baseline again.

mdl.age <- lmer(rt ~ 1 +
                  freq +
                  gram +
                  freq:gram +
                  #age +
                  (1 | subj) + 
                  (1 | item),
                data = simdat)

anova(mdl.max, mdl.age)
## refitting model(s) with ML (instead of REML)
## Data: simdat
## Models:
## mdl.age: rt ~ 1 + freq + gram + freq:gram + (1 | subj) + (1 | item)
## mdl.max: rt ~ 1 + freq + gram + freq:gram + age + (1 | subj) + (1 | item)
##         npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## mdl.age    7 49033 49077 -24510    49019                         
## mdl.max    8 48999 49049 -24491    48983 36.372  1  1.631e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you’d like to learn more about why we do this, when it’s appropriate, and how to design the best mixed effects model for your data, I recommend these two papers. They’re a bit technical and dense, but if you are patient with yourself and chip away slowly, they are treasure troves of information (even when they disagree)!

I can also offer you some tips and strategies for reporting these sorts of statistics in prose, which I have written up here.

3 Challenge questions

Using the binomial (logistic) model in lme4, construct and interpret a model that examines accuracy as a function of age, frequency, grammaticality, and the interaction of frequency and grammaticality. Include random effects for intercepts, and figure out whether any random effects for slopes will converge. (A starting sketch is given after the placeholder chunk below.)

# maximal model
# add depleted models to test the contribution of each fixed effect
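If you want a starting point, here is one possible shape for the maximal model. This is a sketch only: it assumes the accuracy column in simdat is coded 0/1, and you would still need to explore which random slopes the data can support.

# a possible maximal model for accuracy (random intercepts only to begin with)
acc.max <- glmer(accuracy ~ 1 +
                   age +
                   freq +
                   gram +
                   freq:gram +
                   (1 | subj) +
                   (1 | item),
                 family = binomial,
                 data = simdat)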

To help you interpret the interaction term, here is a plot of the interaction (without age or random effects accounted for).

simdat %>% 
  filter(rating==1) %>% 
  group_by(freq,gram) %>% 
  summarise(n=n(),
            acc=sum(accuracy),
            correct= acc/n) %>% 
  ggplot(aes(x = freq, colour=gram, y = correct)) +
  geom_point() +
  geom_path(aes(group=gram)) +
  theme_bw() +
  NULL