Download course materials from here.
Dataset
sleep
:
head(sleep)
What does this do?
sleep[1,]
How does this differ from [1,]
?
sleep[2,] # so that means this is row 2
How does this differ from [3,]
?
sleep[,3] # what do you think this is? (column 3)
…which makes it dataset[row,column]
You can also navigate with column names:
How would you view the column extra
?
Use str()
to get a summary of the structure of the dataset
What are all the unique values in ID
?
What’s the value in the first row, third column?
What’s the first element in the column ID
?
You can also view the dataset as a spreadsheet (although it can’t be altered).
View(sleep)
library(tidyverse)
A tibble
is different from a table
.
Pipes are like toy funnels
sleep %>%
head()
# or
sleep %>%
head
How many attestations of each type of group
?
How can you make a new column? Duplicate group
into group2
:
Let’s rename the levels in group
to be “one” and “two”:
What’s wrong with this one?
sleep %>%
mutate(group2 = case_when(group==1 ~ as.factor("one"),
group==2 ~ as.factor("two")))
Now, let’s recreate the count
function with group_by
and summarise
:
We cal use group_by
and summarise
to do a lot more than just count:
# mean value of `extra` by `group2`
Dataset
quakes
:
What does quakes
look like?
How many observations are there per “level” of magnitude?
Let’s create a table of the means, standard deviations, and standard errors for both stations reporting and depths:
This dataset is wide. Let’s make it long using gather
(compare to spread
):
# https://bookdown.org/ndphillips/YaRrr/pirateplot.html
library(tidyverse)
library(ggbeeswarm)
Dataset
iris
:
The function ggplot
layers different geometries and aesthetics to build up a plot:
iris %>%
mutate(Sepal.Width = Sepal.Width+rnorm(length(Sepal.Width),0,.1))%>%
ggplot(aes(x=Species,y=Sepal.Width,fill=Species)) +
geom_violin(lty=0,alpha=.5)+
geom_boxplot(alpha=0.5,lwd=.5) +
geom_quasirandom(dodge.width=1, alpha=.5)
In what ways might we change this plot?
Going back to quakes
, let’s pipe our table into a ggplot (and fill in missing values):
quakes %>%
group_by(•••) %>%
summarise(n=•••,
sta=•••,
staSD=•••,
staSE=•••,
dep=•••,
depSD=•••,
depSE=•••) %>%
ggplot(aes(x=•••)) +
geom_point(aes(y=•••),colour="red")+
geom_path(aes(y=•••),colour="red")+
geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="red",alpha=.2) +
geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="red",alpha=.5) +
geom_point(aes(y=•••),colour="blue")+
geom_path(aes(y=•••),colour="blue")+
geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="blue",alpha=.2) +
geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="blue",alpha=.5) +
theme_bw() + ylab("depth OR number of stations reporting")
See Simulated Data notebook
Plot boxplots and violin plots for the ratings. Subset by participant.
Group by region, word, frequency, and grammaticality. Summarise mean and standard error.
data %>%
group_by(•••) %>%
summarise(meanRT = •••,
seRT = •••) %>%
ggplot(aes(x=•••,y=•••,colour=•••)) +
geom_point(•••) +
geom_path(aes(lty=•••)) +
geom_errorbar(aes(ymin=•••, ymax=•••),width=.1) +
scale_x_continuous(breaks=c(1:5),labels = c("the","old","VERB","the","boat")) +
ylab("reading time (miliseconds)")
Well, check out the Datasaurus Dozen, which all have the same x/y means and standard deviations:
library(lme4)
Build a simple linear model to examine region 3 (the verb):
# summary(
# lm ( DV ~ IV1 * IV2 + IV3 , data = data )
# )
Add in mixed effects for a linear mixed effects model:
#summary(
# lmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ) , data = data)
# )
Bodo Winter telling it like it is. (Photo credit: Adam Schembri 27/02/2019)
We should do model comparison to assess the contribution of each of the factors to the overall fit. But, read the Bates et al and Barr et al papers for an overview of the debates around how to design and test models.
Let’s do model comparison for region 3:
mdl1.max <- lmer(rt ~ freq*gram + age + accuracy + (1|subj),data[data$region==3,])
mdl1.age <- lmer()
mdl1.acc <- lmer()
mdl1.int <- lmer()
mdl1.frq <- lmer()
mdl1.grm <- lmer()
anova() # max vs age
anova() # max vs acc
anova() # max vs int
anova() # int vs frq
anova() # int vs grm
How do regions 4 and 5 compare?:
First, how could we go about using lmer
for rating data?
library(ordinal)
For the ratings, build models like above, but using clmm()
(these take a little longer to run):
See visualising_ordinal_data.R for Predicted Curves script
data %>%
filter(region==1) %>%
mutate(age_group = case_when(age<35~"young",
age>=35&age<=55~"middle",
age>55~"old")) %>%
mutate(age_group = factor(age_group,levels=c("young","middle","old")))%>%
group_by(freq,gram,age_group) %>%
summarise(accuracy=sum(accuracy)/n()) %>%
ggplot(aes(x=freq,fill=gram,y=accuracy))+
geom_bar(stat="identity",position="dodge")+
facet_wrap(~age_group)
Does accuracy change as a function of age?
# summary(
# glmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ), family="binomial", data = data)
# )
Do model comparison to assess the contribution of each of the factors to the overall fit: