Return Home

Welcome to Day 1!

Today, we’ll be breezing through some of the basics of R. Hopefully, some of this starts off as review. If you find yourself struggling with it at any time, that’s okay! It’s all written down, so you can come back to it later, when it’s had more time to sink in.

Learning R and statistics is an iterative process. If you attended the exact same workshop four times in a row, you’d learn something new each time. Depending on your skill level and confidence coming into this workshop, you are likely to learn different things.

If you’re more skilled, think of the familiar parts as a chance to refresh your knowledge or see it from a slightly different perspective (as R and statistics are a lot like natural language – there are dialects and communities of practice that vary across the population)!

If this is your first formal foray into R and linear models, it may seem overwhelming at times. That’s normal and good, actually. This is a chance to get a crash course in some aspects of R and statistics, hold on to a few bits of information, and build on them as you learn more from other sources. I think of it like trying to get a drink of water from a garden hosepipe turned all the way on. Most of the water will miss your mouth, splash everywhere, and go down your front. Some of the water will spray into your mouth but bounce out (or get spit out!), and a very tiny bit will actually go down your throat. Every time you return to the hosepipe of information, your thirst for knowledge will be sated a little more, but it will require being on the receiving end of much more information than you can possibly take in before your thirst is quenched.

After this Summer School, you can always return to these materials. Once they’re online, I won’t be taking them down. I encourage you to take notes and try them out alongside me, though, as I will be saying much more than is written. I will also be uploading the “complete” files which are what we end each day with. These will not be linked directly from the main page, but you will be able to access them any time to see the final state of the materials we worked on as a group. Your own materials may differ, so it’s good to have them handy as well.

The last thing I want to assert before we dive in is that no one writes their code from scratch. Even though it might sometimes look like I’m starting with a (nearly) blank page, or that I’ve memorized all the functions and arguments we’ll be using, this is an illusion. The best way to write code is by copying and pasting code from other people’s (reliable) sources, and adapting it. This is what professional programmers do, and it’s what you’ll do too.

1 Base R

Hopefully, you’re already familiar enough with R to know that it can act like a calculator. You might already know that you can store values in things called variables. These values can be single objects, vectors, lists, arrays, matrices, or other things.
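For instance (a quick sketch with throwaway names), R does arithmetic directly, and values of several shapes can be stored in variables:

```r
# R as a calculator: standard operator precedence applies
2 + 3 * 4
## [1] 14

# store a single value in a variable
myAnswer <- 2 + 3 * 4

# a list can mix classes; a matrix is two-dimensional and single-class
myInfo <- list(language = "R", version = 4)
myMat  <- matrix(1:6, nrow = 2)
```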

# A vector of class `character` containing four "strings":
myList <- c("English", "Spanish", "Mandarin", "Arabic")

# A vector of class `integer` containing four numbers:
myNums <- 1:4

# How can we identify which items are where?
myList[2:3]
## [1] "Spanish"  "Mandarin"
# What operations can we do to the vector object?
# How does that affect the members of the vector?
myNums * .5 -> myNumsHalf

# What are "classes" and why do we care?
as.factor(myList)
## [1] English  Spanish  Mandarin Arabic  
## Levels: Arabic English Mandarin Spanish
class(myList)
## [1] "character"
class(myNums)
## [1] "integer"
class(myNumsHalf)
## [1] "numeric"
class(as.factor(myList))
## [1] "factor"

Libraries are like recipe books. Once we’ve installed the library (like purchasing the recipe book), we need to take it off the shelf and open it to access the recipes inside.
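If the recipe book isn’t on your shelf yet, you first need to buy it. This only needs to happen once per machine, which is why the line below is usually left commented out:

```r
# purchase the recipe book: a one-time download per machine
# (commented out so it doesn't reinstall every time this script runs)
# install.packages("palmerpenguins")
```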

library(palmerpenguins)

One of the “recipes” in the palmerpenguins library is a dataset called penguins.

head(penguins)
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

1.1 Simple operations

Some base R operations will be useful throughout this workshop. Here are some of the most important things to know.

penguins$species
##   [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##   [8] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [15] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [22] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [29] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [36] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [43] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [50] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [57] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [64] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [71] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [78] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [85] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [92] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [99] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [106] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [113] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [120] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [127] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [134] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [141] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [148] Adelie    Adelie    Adelie    Adelie    Adelie    Gentoo    Gentoo   
## [155] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [162] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [169] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [176] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [183] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [190] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [197] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [204] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [211] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [218] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [225] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [232] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [239] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [246] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [253] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [260] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [267] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [274] Gentoo    Gentoo    Gentoo    Chinstrap Chinstrap Chinstrap Chinstrap
## [281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [344] Chinstrap
## Levels: Adelie Chinstrap Gentoo
x <- unique(penguins$species)

class(x)
## [1] "factor"

One note of caution: you don’t want to irreparably alter your data (which is why we don’t want to open it in Excel if we can help it). This also means you don’t want to overwrite your original data file. Let’s create a new dataset that we can overwrite and modify so that we don’t change the original.

penguins_edited <- penguins
penguins_edited$species <- as.character(penguins_edited$species)
head(penguins_edited)
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

Finally, let’s look at an overview of the structure of the dataset.

str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

1.2 Summary statistics

Before we get into inferential statistics, we should review summary statistics.

max(penguins$year)
## [1] 2009
min(penguins$year)
## [1] 2007
range(penguins$year)
## [1] 2007 2009
mean(penguins$year)
## [1] 2008.029
median(penguins$year)
## [1] 2008

Some of these functions can optionally take an additional argument that tells them what to do with missing data.

max(   penguins$body_mass_g, na.rm = TRUE)
## [1] 6300
min(   penguins$body_mass_g, na.rm = TRUE)
## [1] 2700
range( penguins$body_mass_g, na.rm = TRUE)
## [1] 2700 6300
mean(  penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
median(penguins$body_mass_g, na.rm = TRUE)
## [1] 4050

We can also polish up the output so it’s more reader-friendly.

print("Adelie stats")
## [1] "Adelie stats"
max(   penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 4775
min(   penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 2850
mean(  penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 3700.662
median(penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 3700
print("Chinstrap stats")
## [1] "Chinstrap stats"
paste("max body mass =",       max(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "max body mass = 4800"
paste("min body mass =",       min(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "min body mass = 2700"
paste("mean body mass =",     mean(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "mean body mass = 3733.08823529412"
paste("median body mass =", median(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "median body mass = 3700"
print("Gentoo stats")
## [1] "Gentoo stats"
paste("max body mass =",                    max(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "max body mass = 6300"
paste("min body mass =",                    min(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "min body mass = 3950"
paste("mean body mass =",                  mean(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "mean body mass = 5076.0162601626"
paste("median body mass =",              median(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "median body mass = 5000"
paste("standard deviation of body mass =",   sd(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "standard deviation of body mass = 504.116236657092"
paste("standard error of body mass =", round(sd(penguins$body_mass_g[penguins$species=="Gentoo"], 
              na.rm = TRUE)/sqrt(length(na.omit(penguins$body_mass_g[penguins$species=="Gentoo"]))), 2))
## [1] "standard error of body mass = 45.45"

2 Tidyverse

The tidyverse is a suite of libraries that work well together and are designed to wrangle, manipulate, and visualise data cleanly and easily.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

They’re held together with something called a ‘pipe’, which is more of a funnel than a pipe:
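As a sketch of the idea with a plain vector: the pipe pours the value on its left into the first argument of the function on its right, so these two lines do the same thing. (The `%>%` pipe comes from the magrittr package, which the tidyverse loads for you.)

```r
library(magrittr)  # provides %>%; loaded automatically by the tidyverse

x <- c(4, 1, 3)
sort(x)       # classic nested style
## [1] 1 3 4
x %>% sort()  # piped style: same result
## [1] 1 3 4
```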

penguins %>% head()
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

There are some quirks in tidyverse packages, but ultimately they make it easier for humans to read code.

penguins %>% 
  pull(species) %>% 
  unique()
## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

Here are two ways to do the same thing:

penguins %>% 
  pull(body_mass_g) %>% 
  mean(na.rm = TRUE)
## [1] 4201.754
penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  pull(body_mass_g) %>% 
  mean()
## [1] 4201.754

We can also polish up the output so it’s formatted nicely.

penguins %>% 
  pull(body_mass_g) %>% 
  mean(na.rm = TRUE) %>% 
  round(2) %>% 
  paste("mean body mass =", .)
## [1] "mean body mass = 4201.75"

2.1 Summaries and tables

Typically, datasets are giant and unwieldy. We usually don’t want to look at the whole thing; it’s more informative to summarise it. There are functions to summarise whole datasets automatically, but they don’t know what you care about in the dataset. Let’s summarise penguins in the ways we care about.

penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  group_by(species) %>% 
  summarise(max =       max(body_mass_g) %>% round(2),
            min =       min(body_mass_g) %>% round(2),
            mean =     mean(body_mass_g) %>% round(2),
            median = median(body_mass_g) %>% round(2),
            stdev =      sd(body_mass_g) %>% round(2),
            se =       (stdev/sqrt(n())) %>% round(2))
## # A tibble: 3 x 7
##   species     max   min  mean median stdev    se
##   <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 Adelie     4775  2850 3701.   3700  459.  37.3
## 2 Chinstrap  4800  2700 3733.   3700  384.  46.6
## 3 Gentoo     6300  3950 5076.   5000  504.  45.4

As a side note, you can output these sorts of tables formatted for publication so you don’t have to copy and paste individual values to a document!

penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  group_by(species) %>% 
  summarise(`maximum` =           max(body_mass_g),
            `minimum` =           min(body_mass_g),
            `mean value` =       mean(body_mass_g),
            `median value` =   median(body_mass_g),
            `standard deviation` = sd(body_mass_g),
            `standard error` = (`standard deviation`/sqrt(n())))  %>% 
  knitr::kable(caption = "Table 1: Summary of penguin body mass (g) by species",
               align = "c",
               digits = 2)
Table 1: Summary of penguin body mass (g) by species

species     maximum   minimum   mean value   median value   standard deviation   standard error
Adelie         4775      2850      3700.66         3700               458.57            37.32
Chinstrap      4800      2700      3733.09         3700               384.34            46.61
Gentoo         6300      3950      5076.02         5000               504.12            45.45

3 Data wrangling

Data wrangling is the process of taking raw data from whatever output and making it ready to be analysed. It’s time-consuming and finicky, but it can also be pretty fun once you get the hang of it!

3.1 Column manipulations

You can create or overwrite columns using mutate(), vertically subset the data (keep certain columns) using select(), and horizontally subset it (keep certain rows) using filter().
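Taken one verb at a time (a minimal sketch, assuming palmerpenguins and the tidyverse are loaded as above):

```r
library(palmerpenguins)
library(dplyr)

penguins %>% mutate(body_mass_kg = body_mass_g / 1000)  # create a new column
penguins %>% select(species, island, body_mass_g)       # keep only these columns
penguins %>% filter(species == "Gentoo")                # keep only matching rows
```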

# area of a triangle = height * base / 2
# density = mass / volume; volume = length * width * height
penguins %>% 
  mutate(bill_area_mm2     = bill_length_mm * bill_depth_mm / 2,
         flipper_length_cm = flipper_length_mm/100,
         bill_length_cm    = bill_length_mm/100,
         bill_depth_cm     = bill_depth_mm/100,
         penguin_volume    = (.75*flipper_length_cm) * bill_length_cm * (10*bill_depth_cm),
         penguin_density   = body_mass_g / penguin_volume) %>% 
  select(year, species, sex, island, bill_area_mm2, penguin_density) %>% 
  filter(!is.na(sex)) %>% 
  group_by(species, sex) %>% 
  summarise(mean_bill_area_mm2 = mean(bill_area_mm2),
            mean_density       = mean(penguin_density))
## # A tibble: 6 x 4
## # Groups:   species [3]
##   species   sex    mean_bill_area_mm2 mean_density
##   <fct>     <fct>               <dbl>        <dbl>
## 1 Adelie    female               328.        3658.
## 2 Adelie    male                 385.        3656.
## 3 Chinstrap female               410.        3008.
## 4 Chinstrap male                 492.        2673.
## 5 Gentoo    female               325.        4534.
## 6 Gentoo    male                 389.        4268.

3.2 Long vs wide data

Sometimes, your data is too long or too wide for the analysis you want to do. pivot_wider() and pivot_longer() make this easy to fix.

penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  group_by(species) %>% 
  summarise(max =       max(body_mass_g) %>% round(2),
            min =       min(body_mass_g) %>% round(2),
            mean =     mean(body_mass_g) %>% round(2),
            median = median(body_mass_g) %>% round(2),
            stdev =      sd(body_mass_g) %>% round(2),
            se =       (stdev/sqrt(n())) %>% round(2)) -> wide_penguins

wide_penguins
## # A tibble: 3 x 7
##   species     max   min  mean median stdev    se
##   <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 Adelie     4775  2850 3701.   3700  459.  37.3
## 2 Chinstrap  4800  2700 3733.   3700  384.  46.6
## 3 Gentoo     6300  3950 5076.   5000  504.  45.4
wide_penguins %>% pivot_longer(cols = c("max", "min", "mean", "median", "stdev", "se"), names_to = "measure", values_to = "values")
## # A tibble: 18 x 3
##    species   measure values
##    <fct>     <chr>    <dbl>
##  1 Adelie    max     4775  
##  2 Adelie    min     2850  
##  3 Adelie    mean    3701. 
##  4 Adelie    median  3700  
##  5 Adelie    stdev    459. 
##  6 Adelie    se        37.3
##  7 Chinstrap max     4800  
##  8 Chinstrap min     2700  
##  9 Chinstrap mean    3733. 
## 10 Chinstrap median  3700  
## 11 Chinstrap stdev    384. 
## 12 Chinstrap se        46.6
## 13 Gentoo    max     6300  
## 14 Gentoo    min     3950  
## 15 Gentoo    mean    5076. 
## 16 Gentoo    median  5000  
## 17 Gentoo    stdev    504. 
## 18 Gentoo    se        45.4
wide_penguins %>% pivot_longer(cols = c("max", "min", "mean", "median", "stdev", "se"), names_to = "measure", values_to = "values") -> long_penguins
long_penguins %>% 
  pivot_wider(names_from = "measure", values_from = "values")
## # A tibble: 3 x 7
##   species     max   min  mean median stdev    se
##   <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 Adelie     4775  2850 3701.   3700  459.  37.3
## 2 Chinstrap  4800  2700 3733.   3700  384.  46.6
## 3 Gentoo     6300  3950 5076.   5000  504.  45.4

3.3 Clean up messy values

The package palmerpenguins also comes with a messier ‘raw’ version of the data.

head(penguins_raw)
## # A tibble: 6 x 17
##   studyName `Sample Number` Species Region Island Stage `Individual ID`
##   <chr>               <dbl> <chr>   <chr>  <chr>  <chr> <chr>          
## 1 PAL0708                 1 Adelie… Anvers Torge… Adul… N1A1           
## 2 PAL0708                 2 Adelie… Anvers Torge… Adul… N1A2           
## 3 PAL0708                 3 Adelie… Anvers Torge… Adul… N2A1           
## 4 PAL0708                 4 Adelie… Anvers Torge… Adul… N2A2           
## 5 PAL0708                 5 Adelie… Anvers Torge… Adul… N3A1           
## 6 PAL0708                 6 Adelie… Anvers Torge… Adul… N3A2           
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## #   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>, `Flipper Length
## #   (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N (o/oo)` <dbl>,
## #   `Delta 13 C (o/oo)` <dbl>, Comments <chr>

3.3.1 Case when

The function case_when() allows us to isolate certain rows for manipulation or cleaning and apply different rules to different situations. It’s very handy but can take some practice to get used to.
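In miniature: case_when() tries each condition from top to bottom and uses the right-hand value of the first condition that matches, with TRUE acting as a catch-all. Here is a toy sketch with made-up numbers:

```r
library(dplyr)

x <- c(-27, -25.5, -24)
case_when(x <= -26  ~ "low",
          x > -24.8 ~ "high",
          TRUE      ~ "mid")
## [1] "low"  "mid"  "high"
```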

penguins_raw %>% pull(`Comments`) %>% unique()
##  [1] "Not enough blood for isotopes."                                      
##  [2] NA                                                                    
##  [3] "Adult not sampled."                                                  
##  [4] "Nest never observed with full clutch."                               
##  [5] "No blood sample obtained."                                           
##  [6] "No blood sample obtained for sexing."                                
##  [7] "Nest never observed with full clutch. Not enough blood for isotopes."
##  [8] "Sexing primers did not amplify. Not enough blood for isotopes."      
##  [9] "Sexing primers did not amplify."                                     
## [10] "Adult not sampled. Nest never observed with full clutch."            
## [11] "No delta15N data received from lab."
penguins_raw %>% pull(`Delta 13 C (o/oo)`) %>% range(na.rm = TRUE)
## [1] -27.01854 -23.78767
penguins_raw %>% 
  mutate(keeper = case_when(Comments == "Adult not sampled." ~                                       "remove",
                            Comments == "No blood sample obtained." ~                                "remove",
                            Comments == "Nest never observed with full clutch." ~                    "keep",
                            Comments == "No blood sample obtained for sexing." ~                     "remove",
                            Comments == "Adult not sampled. Nest never observed with full clutch." ~ "remove",
                            TRUE ~ "evaluate"),
         D13.level = case_when(`Delta 13 C (o/oo)` <= -26  ~ "low",
                               `Delta 13 C (o/oo)` > -24.8 ~ "high",
                               TRUE ~ "mid")) %>% 
  group_by(D13.level, keeper) %>% summarise(count = n())
## # A tibble: 7 x 3
## # Groups:   D13.level [3]
##   D13.level keeper   count
##   <chr>     <chr>    <int>
## 1 high      evaluate    55
## 2 high      keep        11
## 3 low       evaluate   141
## 4 low       keep        11
## 5 mid       evaluate   108
## 6 mid       keep        12
## 7 mid       remove       6

4 Workshop activities

Practice reshaping long-data.csv and wide-data.csv, which you can download by right-clicking or copying/pasting the links.

You may also want to try to look up new functions to use for some of these tasks.

  • Make long-data.csv wider so that each type of measurement is its own column.
  • Once you’ve created a Savings column, find a way to separate or split the numeric and string information.
    • This might require you to do a search online or in the Help window.
  • Change the class of data in columns, e.g., strings containing only numbers should be numeric.
  • Convert all values in the Savings column into a single currency. This might include tasks such as:
    • searching for the current conversion rate online
    • using case_when() to manipulate certain rows in different ways
  • Make wide-data.csv longer so that each survey question (column) except for name is on a separate row.
  • Interpret and debug error messages.
longdat <- read_csv("../data/long-data.csv")
## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   subject = col_double(),
##   measure = col_character(),
##   value = col_character()
## )
str(longdat)
## tibble [15 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ subject: num [1:15] 1 1 1 2 2 2 3 3 3 4 ...
##  $ measure: chr [1:15] "Nationality" "Age" "Savings" "Nationality" ...
##  $ value  : chr [1:15] "English" "21" "85 GBP" "Scottish" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   subject = col_double(),
##   ..   measure = col_character(),
##   ..   value = col_character()
##   .. )
longdat %>% pivot_wider(names_from = "measure", values_from = "value")
## # A tibble: 5 x 4
##   subject Nationality Age   Savings
##     <dbl> <chr>       <chr> <chr>  
## 1       1 English     21    85 GBP 
## 2       2 Scottish    22    200 GBP
## 3       3 N.Irish     18    125 GBP
## 4       4 Welsh       24    300 GBP
## 5       5 American    20    105 USD
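If you get stuck on splitting the Savings column, one possible approach (a sketch, not the only solution) is tidyr’s separate(), which splits one column into several at a separator; convert = TRUE turns the amount into a number:

```r
library(tidyverse)

longdat <- read_csv("../data/long-data.csv")

longdat %>% 
  pivot_wider(names_from = "measure", values_from = "value") %>% 
  separate(Savings, into = c("amount", "currency"), sep = " ", convert = TRUE)
```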
widedat <- read_csv("../data/wide-data.csv")
## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   name = col_character(),
##   age = col_double(),
##   nationality = col_character(),
##   gender = col_character(),
##   L1 = col_character(),
##   L2 = col_character(),
##   favColour = col_character()
## )
str(widedat)
## tibble [3 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name       : chr [1:3] "James" "Jenny" "Jordan"
##  $ age        : num [1:3] 21 20 22
##  $ nationality: chr [1:3] "English" "American" "Canadian"
##  $ gender     : chr [1:3] "male" "female" "female"
##  $ L1         : chr [1:3] "English" "English" "French"
##  $ L2         : chr [1:3] NA "Spanish" "English"
##  $ favColour  : chr [1:3] "blue" "green" "purple"
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   age = col_double(),
##   ..   nationality = col_character(),
##   ..   gender = col_character(),
##   ..   L1 = col_character(),
##   ..   L2 = col_character(),
##   ..   favColour = col_character()
##   .. )
widedat %>% 
  pivot_longer(cols = 3:7, names_to = "question", values_to = "values")
## # A tibble: 15 x 4
##    name     age question    values  
##    <chr>  <dbl> <chr>       <chr>   
##  1 James     21 nationality English 
##  2 James     21 gender      male    
##  3 James     21 L1          English 
##  4 James     21 L2          <NA>    
##  5 James     21 favColour   blue    
##  6 Jenny     20 nationality American
##  7 Jenny     20 gender      female  
##  8 Jenny     20 L1          English 
##  9 Jenny     20 L2          Spanish 
## 10 Jenny     20 favColour   green   
## 11 Jordan    22 nationality Canadian
## 12 Jordan    22 gender      female  
## 13 Jordan    22 L1          French  
## 14 Jordan    22 L2          English 
## 15 Jordan    22 favColour   purple

4.1 If you have time

Explore the simulated dataset. We’ll be using it more in the future, so you can start to get a feel for it now.

simdat <- read_csv("../data/simulated-data.csv")
## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   subj = col_double(),
##   age = col_double(),
##   item = col_double(),
##   freq = col_character(),
##   gram = col_character(),
##   rating = col_double(),
##   accuracy = col_double(),
##   region = col_double(),
##   word = col_character(),
##   rt = col_double()
## )
str(simdat)
## tibble [4,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ subj    : num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ age     : num [1:4000] 51 51 51 51 51 51 51 51 51 51 ...
##  $ item    : num [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ freq    : chr [1:4000] "high" "high" "high" "high" ...
##  $ gram    : chr [1:4000] "yes" "yes" "yes" "yes" ...
##  $ rating  : num [1:4000] 4 5 5 5 5 4 3 1 4 1 ...
##  $ accuracy: num [1:4000] 1 1 1 1 1 1 1 0 1 1 ...
##  $ region  : num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ word    : chr [1:4000] "the" "the" "the" "the" ...
##  $ rt      : num [1:4000] 446 318 204 464 242 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   subj = col_double(),
##   ..   age = col_double(),
##   ..   item = col_double(),
##   ..   freq = col_character(),
##   ..   gram = col_character(),
##   ..   rating = col_double(),
##   ..   accuracy = col_double(),
##   ..   region = col_double(),
##   ..   word = col_character(),
##   ..   rt = col_double()
##   .. )

Here is some basic background about the design of the simulated data. It’s meant to imitate some experimental linguistics data.

4.1.1 Background

We believe high-frequency words are generally read and comprehended faster than low-frequency words. We also believe that sentences which are temporarily ambiguous cause reading time slowdowns.

Compare:

  1. The coach praised the player tossed the frisbee.
  2. The coach praised the player thrown the frisbee.

Sentence (1) is more difficult to interpret because “tossed” is ambiguous between a verb and a past participle.
Sentence (2) is unambiguous.
Sentence (1) takes longer to read, especially after the original interpretation (“The coach praised the player.”) has been disconfirmed (“tossed”).
Sentence (2) also has a disconfirmation, but the solution is made clear so the reinterpretation is straightforward.

The ambiguity here relies primarily on a reduced relative clause. Changing it to an unreduced relative clause removes all ambiguity and comprehension difficulty. Compare:

  1. The coach praised the player who was tossed the frisbee.
  2. The coach praised the player who was thrown the frisbee.

The experimental sentences follow the template “The old VERB the boat,” where VERB marks the manipulation of interest. All of the verbs are ambiguous with nouns.

Condition Number   Sample sentence        Verb frequency   Grammaticality
       1           the old man the boat   high             grammatical
       2           the old put the boat   high             ungrammatical
       3           the old run the boat   low              grammatical
       4           the old owe the boat   low              ungrammatical

4.1.2 Data collection

4.1.2.1 Procedure

The data came from a simulated experiment. No real people were involved. This is fake data.

BUT: if it were real data, this is the experiment it would have come from:

  1. Participant signs a consent form and is assigned an identifying number, the two of which can only be linked using information stored securely (i.e., not in a laptop, notebook, or email).
  2. Participant sits at a computer and reads instructions for the task:
    1. Read a sentence word by word at a natural pace.
    2. Only one word will be visible at any time.
    3. To see the next word, press the space bar or other button.
    4. After the last word, press the button to answer two follow-up questions.
    5. First, answer a question about the sentence you just read.
    6. Second, rate the preceding sentence on a scale of 1 (unnatural, unacceptable) to 5 (natural, acceptable).
    7. Once all sentences and associated questions have been answered, the participant answers demographic questions (e.g., age).
  3. Participant consents to participating.
  4. Participant completes a few practice trials to become familiar with the task.
  5. Participant completes the task as instructed.
  6. Participant is debriefed about the purpose of the experiment.
  7. Participant is compensated for their time and labour in a manner approved by the Ethics Committee.

4.1.3 Data structure

This dataset was specifically designed to teach linguists about data visualisation and analysis. Let’s take a look at what it contains to understand its specifically linguistic properties.

Now we can read it in to our R session:

# read in the data
data <- read.csv("../data/simulated-data.csv")
# check out what it contains
str(data)
## 'data.frame':    4000 obs. of  10 variables:
##  $ subj    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age     : int  51 51 51 51 51 51 51 51 51 51 ...
##  $ item    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ freq    : chr  "high" "high" "high" "high" ...
##  $ gram    : chr  "yes" "yes" "yes" "yes" ...
##  $ rating  : int  4 5 5 5 5 4 3 1 4 1 ...
##  $ accuracy: int  1 1 1 1 1 1 1 0 1 1 ...
##  $ region  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ word    : chr  "the" "the" "the" "the" ...
##  $ rt      : num  446 318 204 464 242 ...

This data contains the following columns:

  • subj: unique participant ID numbers to anonymise each person (integers)
  • age: the participant’s age in years (whole numbers, discrete data)
  • item: the ID number of each sentence item; each participant was shown every item
  • freq: whether the participant was shown the high or low frequency version of the specific item
  • gram: whether the participant was shown the grammatical or ungrammatical version of the specific item
  • rating: the acceptability rating that the participant gave this particular version of the item
  • accuracy: whether the participant answered a comprehension question correctly or not
  • region: the order in which each word in the sentence occurred
  • word: the lexical content of each position in the sentence
  • rt: the reaction time; the time it took for the participant to read the word and click a button
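As a first exploration of these columns, you might compare reading times across the two manipulations (a sketch assuming the tidyverse is loaded and the same file path as above):

```r
library(tidyverse)

simdat <- read_csv("../data/simulated-data.csv")

simdat %>% 
  group_by(freq, gram) %>% 
  summarise(mean_rt      = mean(rt),
            mean_rating  = mean(rating),
            prop_correct = mean(accuracy))
```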