Data Visualisation and Analysis

Download course materials from here.

Base R functionality

Dataset sleep:

head(sleep)
  • column 1 = how many extra hours of sleep were recorded
  • column 2 = which drug was administered
  • column 3 = which participant

Indices, or: finding the index of the value you want

What does this do?

sleep[1,]

How does this differ from [1,]?

sleep[2,] # so that means this is row 2

How does this differ from [3,]?

sleep[,3] # what do you think this is? (column 3)

…which makes it dataset[row,column]

More descriptive methods

You can also navigate with column names:

How would you view the column extra?

Use str() to get a summary of the structure of the dataset

What are all the unique values in ID?

What’s the value in the first row, third column?

What’s the first element in the column ID?

You can also view the dataset as a spreadsheet (although it can’t be altered).

View(sleep)

Tidyverse functionality

library(tidyverse)

A tibble is different from a table.

Pipes are like toy funnels

sleep %>%
  head()

# or

sleep %>%
  head

How many attestations of each type of group?

How can you make a new column? Duplicate group into group2:

Let’s rename the levels in group to be “one” and “two”:

What’s wrong with this one?

sleep %>%
  mutate(group2 = case_when(group==1 ~ as.factor("one"),
                            group==2 ~ as.factor("two")))

Now, let’s recreate the count function with group_by and summarise:

We cal use group_by and summarise to do a lot more than just count:

# mean value of `extra` by `group2`

Processing into tables

Dataset quakes:

What does quakes look like?

How many observations are there per “level” of magnitude?

Let’s create a table of the means, standard deviations, and standard errors for both stations reporting and depths:

This dataset is wide. Let’s make it long using gather (compare to spread):

Basic ggplot2

# https://bookdown.org/ndphillips/YaRrr/pirateplot.html
library(tidyverse)
library(ggbeeswarm)

Dataset iris:

The function ggplot layers different geometries and aesthetics to build up a plot:

iris %>%
  mutate(Sepal.Width = Sepal.Width+rnorm(length(Sepal.Width),0,.1))%>%
  ggplot(aes(x=Species,y=Sepal.Width,fill=Species)) +
  geom_violin(lty=0,alpha=.5)+
  geom_boxplot(alpha=0.5,lwd=.5) +
  geom_quasirandom(dodge.width=1, alpha=.5)

In what ways might we change this plot?

Going back to quakes, let’s pipe our table into a ggplot (and fill in missing values):

quakes %>%
  group_by(•••) %>%
  summarise(n=•••,
            sta=•••,
            staSD=•••,
            staSE=•••,
            dep=•••,
            depSD=•••,
            depSE=•••) %>%
  ggplot(aes(x=•••)) +
  geom_point(aes(y=•••),colour="red")+
  geom_path(aes(y=•••),colour="red")+
  geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="red",alpha=.2) +
  geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="red",alpha=.5) +
  geom_point(aes(y=•••),colour="blue")+
  geom_path(aes(y=•••),colour="blue")+
  geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="blue",alpha=.2) +
  geom_ribbon(aes(ymin=•••,ymax=•••),width=.05,fill="blue",alpha=.5) +
  theme_bw() + ylab("depth OR number of stations reporting")

Storytelling with data

See Simulated Data notebook

The Ten Commandments of Statistical Inference

Exploration

Plot boxplots and violin plots for the ratings. Subset by participant.

Group by region, word, frequency, and grammaticality. Summarise mean and standard error.

data %>%
  group_by(•••) %>%
  summarise(meanRT = •••,
            seRT = •••) %>%
  ggplot(aes(x=•••,y=•••,colour=•••)) +
    geom_point(•••) +
    geom_path(aes(lty=•••)) +
    geom_errorbar(aes(ymin=•••, ymax=•••),width=.1) +
  scale_x_continuous(breaks=c(1:5),labels = c("the","old","VERB","the","boat")) +
  ylab("reading time (miliseconds)")

Why?

Well, check out the Datasaurus Dozen, which all have the same x/y means and standard deviations:

Continuous data

library(lme4)

Build a simple linear model to examine region 3 (the verb):

# summary(
#     lm ( DV ~ IV1 * IV2 + IV3 , data = data )
#        )

Add in mixed effects for a linear mixed effects model:

#summary(
#  lmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ) , data = data)
#        )

Bodo Winter telling it like it is. (Photo credit: Adam Schembri 27/02/2019)

We should do model comparison to assess the contribution of each of the factors to the overall fit. But, read the Bates et al and Barr et al papers for an overview of the debates around how to design and test models.

Let’s do model comparison for region 3:

mdl1.max <- lmer(rt ~ freq*gram + age + accuracy + (1|subj),data[data$region==3,])
mdl1.age <- lmer()
mdl1.acc <- lmer()
mdl1.int <- lmer()
mdl1.frq <- lmer()
mdl1.grm <- lmer()

anova() # max vs age
anova() # max vs acc
anova() # max vs int
anova() # int vs frq
anova() # int vs grm

How do regions 4 and 5 compare?:

Ordinal data

First, how could we go about using lmer for rating data?

Better ordinal data

library(ordinal)

For the ratings, build models like above, but using clmm() (these take a little longer to run):

See visualising_ordinal_data.R for Predicted Curves script

Binomial data

data %>%
  filter(region==1) %>%
  mutate(age_group = case_when(age<35~"young",
                               age>=35&age<=55~"middle",
                               age>55~"old")) %>%
  mutate(age_group = factor(age_group,levels=c("young","middle","old")))%>%
  group_by(freq,gram,age_group) %>%
  summarise(accuracy=sum(accuracy)/n()) %>%
  ggplot(aes(x=freq,fill=gram,y=accuracy))+
  geom_bar(stat="identity",position="dodge")+
  facet_wrap(~age_group)

Does accuracy change as a function of age?

# summary(
#   glmer( DV ~ IV1 * IV2 + IV3 + ( 1 | RV1 ) + ( 1 | RV2 ), family="binomial", data = data)
# )

Do model comparison to assess the contribution of each of the factors to the overall fit: