---
title: "Dealing with ordinal data"
author: "Lauren M Ackerman"
date: "Last updated: 03 May 2018"
output:
  html_notebook:
    toc: yes
    toc_depth: 2
    toc_float: yes
---
[🔙 Home](https://verbingnouns.github.io/notebooks/)

```{r echo=FALSE,message=FALSE}
library(ggplot2)
library(lme4)
library(ordinal)
library(formatR)
```

> ★ DRAFT ★  
> (A work in progress…)

# What is "ordinal" data?

Data like reaction times are *continuous* because there are no discrete boundaries: RTs can be measured in seconds, fractions of seconds, fractions of milliseconds, or even finer grained than that.

Data like counts of words in corpora are *discrete*, but hypothetically infinite. You could have 0 attestations, 1, or one million… but you could not have 13.5 attestations of a word.

*Binary* data is also discrete, and can come from looking at hits vs misses (e.g., correct or incorrect responses) among other things. This data type is binary because there are only two values you can observe: 1 (hit) or 0 (miss).

But there is another type of data we use often, too. *Ordinal* data refers to data that is discrete but also ranked, as in a Likert scale task. What makes this type of data unique is that it has fixed minimum and maximum endpoints, with a limited number of (ordered) values in between. In this respect, binary data and ordinal data are similar, but when you have more than two possible responses, analysis can get quite complicated.

In particular, what makes analysis of ordinal data different from continuous, count, and binary data is that the perceptual size of each level may vary. That is, on a 7-point scale, the difference between a rating of 1 and a rating of 2 may not represent the same distance as the difference between a rating of 2 and a rating of 3, or between a rating of 6 and a rating of 7. The distance between each level of the scale can also vary across participants and items, so when we analyse such data, we should take this dimension of variability into account.
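
To make this concrete, here is a minimal sketch of the idea behind cumulative link models, using made-up numbers rather than fitted values: a handful of cut-points divides a latent continuum into ordered response categories, and nothing forces those cut-points to be evenly spaced.

```{r}
# Hypothetical cut-points on the latent scale for a 7-point task.
# They are deliberately NOT evenly spaced: category 5 spans a much
# wider slice of the continuum than category 2 does.
theta <- c(-6.7, -5.5, -4.1, -2.7, -0.9, 0.5)

# Probability of each rating for a condition sitting at 0 on the latent
# scale: successive differences of the cumulative probabilities.
p.cum <- c(plogis(theta), 1)   # P(rating <= j)
p.cat <- diff(c(0, p.cum))     # P(rating == j)
round(p.cat, 3)
```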

That's where the `ordinal` package comes in.

```{r}
library(ordinal)

# read in the rating data: one row per trial, with a subject ID, an item ID,
# two two-level factors, and a 1-7 rating
dataOrd <- read.csv("data/ordinal_data.csv",header=TRUE)
head(dataOrd)
```
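
One practical note: `clmm()` (used below) expects the response to be a factor, and the model calls in this section handle that inline with `as.factor(rating)`. If you prefer, you can store an explicitly ordered copy of the response once; the `rating.ord` column name here is just an illustration.

```{r}
# Declare the levels explicitly so that 1 < 2 < ... < 7 is guaranteed
# (hypothetical column name; the models below use as.factor(rating) inline)
dataOrd$rating.ord <- factor(dataOrd$rating, levels = 1:7, ordered = TRUE)
str(dataOrd$rating.ord)
```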

When we graph our rating data in a box plot, we see there is likely to be a main effect of *Factor 1* and possibly an interaction between *Factor 1* and *Factor 2*. It's hard to tell more at this point.

```{r}
ggplot(dataOrd, aes(x=factor1,y=rating)) + geom_boxplot(aes(fill=factor2)) + scale_y_continuous(breaks=c(1:7)) + ggtitle("Latin square rating data")
```

```{r}
ggplot(dataOrd, aes(x=factor1,y=rating)) + geom_boxplot(aes(fill=factor1)) + scale_y_continuous(breaks=c(1:7)) + ggtitle("Main effect of Factor 1")
ggplot(dataOrd, aes(x=factor2,y=rating)) + geom_boxplot(aes(fill=factor2)) + scale_y_continuous(breaks=c(1:7)) + ggtitle("Main effect of Factor 2")
```

# Ordinal regressions

When we summarise a linear mixed-effects model of this rating data, treating the ratings as if they were values on a continuous scale, this is what we get.

```{r}
# note: lmer() treats the 1-7 ratings as numeric, i.e. as equally spaced
lmer.rate.max <- lmer(rating ~ factor1*factor2 + (1|subject)+(1|item),data=dataOrd, REML=FALSE)
summary(lmer.rate.max)
```

The following model is a cumulative link mixed model, which on the surface operates in a very similar way to a linear mixed-effects model, but takes into account the possibility that each level of the rating scale could be a different size or distance from its neighbours.

```{r}
# the response must be a factor; as.factor(rating) keeps the levels in the order 1-7
clmm.max <- clmm(as.factor(rating) ~ factor1*factor2 + (1|subject)+(1|item),data=dataOrd)
summary(clmm.max)
```
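
The threshold coefficients in this summary are the estimated cut-points between adjacent rating categories on the latent (logit) scale, and they need not be evenly spaced — exactly the flexibility the linear model lacks. As a quick sanity check, you can convert them back into predicted rating probabilities. The sketch below does this for the reference condition (Baseline, Absent), whose linear predictor is 0 under the default treatment coding.

```{r}
# Cumulative P(rating <= j) for the reference condition, then the
# per-category probabilities as successive differences
p.cum <- c(plogis(clmm.max$Theta), 1)
round(diff(c(0, p.cum)), 3)
```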

Just as you might with an `lmer()` model, you can use model comparison to determine the contribution of each main effect and their interaction.

```{r}
clmm.int <- clmm(as.factor(rating) ~ factor1+factor2 + (1|subject)+(1|item),data=dataOrd) # no interaction
clmm.one <- clmm(as.factor(rating) ~ factor2 + (1|subject)+(1|item),data=dataOrd)         # drops factor1
clmm.two <- clmm(as.factor(rating) ~ factor1 + (1|subject)+(1|item),data=dataOrd)         # drops factor2
```
```{r}
anova(clmm.max,clmm.int) # tests the interaction
```

```{r}
anova(clmm.int,clmm.one) # tests the main effect of factor1
```

```{r}
anova(clmm.int,clmm.two) # tests the main effect of factor2
```

These comparisons suggest that both main effects and their interaction each contribute significantly to the fit of the model.
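
If you also want effect sizes rather than only likelihood ratio tests, the fixed-effect coefficients can be exponentiated into odds ratios. Under the parametrisation used by the `ordinal` package (`logit(P(rating <= j)) = theta_j - eta`), `exp(beta)` is the multiplicative change in the odds of responding in a *higher* rating category; values below 1 indicate lower ratings relative to the reference condition.

```{r}
# Odds ratios for the fixed effects of the maximal model
round(exp(clmm.max$beta), 3)
```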

# Visualising the ratings

If we visualise the ratings as probabilities, we can see subtle differences that aren't visible in box plots or bar charts. In the graph below, the coloured lines indicate the estimated probability that each of the conditions will be assigned a certain rating.

```{r, eval=TRUE,collapse=TRUE,tidy=TRUE}
# script adapted from Jalal Al Tamimi
par(oma=c(0, 0, 0, 1)) # ,mfrow=c(1,1))
# position of each condition on the latent scale under the default treatment
# coding: Baseline+Absent is the reference (0), and Treatment+Present is the
# sum of all three coefficients
cond <- c(0,
          clmm.max$beta["factor2Present"],
          clmm.max$beta["factor1Treatment"],
          sum(clmm.max$beta))
names(cond) <- c("Baseline   \nAbsent","Baseline   \nPresent","Treatment   \nAbsent","Treatment   \nPresent")
xlim = c(min(cond)-.5,max(cond)+.5)
ylim = c(0,1)
plot(0,0,xlim=xlim, ylim=ylim, type="n", ylab="Probability", xlab="", xaxt = "n",main="Predicted curves")
axis(side = 1, labels = names(cond), at = cond,las=2,cex.axis=.7)
xs = seq(xlim[1], xlim[2], length.out=100)
# P(rating == j) at latent position x is plogis(theta_j - x) - plogis(theta_(j-1) - x)
lines(xs, plogis(clmm.max$Theta[1] - xs), col='red',lwd=3)
lines(xs, plogis(clmm.max$Theta[2] - xs)-plogis(clmm.max$Theta[1] - xs), col='orange',lwd=3)
lines(xs, plogis(clmm.max$Theta[3] - xs)-plogis(clmm.max$Theta[2] - xs), col='yellow',lwd=3)
lines(xs, plogis(clmm.max$Theta[4] - xs)-plogis(clmm.max$Theta[3] - xs), col='green',lwd=3)
lines(xs, plogis(clmm.max$Theta[5] - xs)-plogis(clmm.max$Theta[4] - xs), col='blue',lwd=3)
lines(xs, plogis(clmm.max$Theta[6] - xs)-plogis(clmm.max$Theta[5] - xs), col='purple',lwd=3)
lines(xs, 1-(plogis(clmm.max$Theta[6] - xs)), col='black',lwd=3)
abline(v=cond,lty=3)
abline(h=0, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("red", "orange", "yellow", "green", "blue", "purple", "black"),
       legend=c("1", "2", "3", "4", "5", "6", "7"),cex=0.75)
```

In this graph, we can see the probability curves for each of the seven ratings. In particular, we can now see that the *Baseline+Present* condition and the *Baseline+Absent* condition are not rated in precisely the same manner: *Baseline+Absent* is slightly more likely to receive a rating of 7, and slightly less likely to receive a rating of 5, than *Baseline+Present*. In some cases, when a difference like this turns out to be significant, it will not be visible in a box plot but will be visible in this sort of probability curve plot.

