Welcome to Day 1!
Today, we’ll be breezing through some of the basics of R. Hopefully, some of this starts off as review. If you find yourself struggling with it at any time, that’s okay! It’s all written down, so you can come back to it later, when it’s had more time to sink in.
Learning R and statistics is an iterative process. If you attended the exact same workshop four times in a row, you’d learn something new each time. Depending on your skill level and confidence coming into this workshop, you are likely to learn different things.
If you’re more skilled, think of the familiar parts as a chance to refresh your knowledge or see it from a slightly different perspective (as R and statistics are a lot like natural language – there are dialects and communities of practice that vary across the population)!
If this is your first formal foray into R and linear models, it may seem overwhelming at times. That’s normal and good, actually. This is a chance to get a crash course in some aspects of R and statistics, hold on to a few bits of information, and build on them as you learn more from other sources. I think of it like trying to get a drink of water from a garden hosepipe turned all the way on. Most of the water will miss your mouth, splash everywhere, and go down your front. Some of the water will spray into your mouth but bounce out (or get spit out!), and a very tiny bit will actually go down your throat. Every time you return to the hosepipe of information, your thirst for knowledge will be sated a little more, but it will require being on the receiving end of much more information than you can possibly take in before your thirst is quenched.
After this Summer School, you can always return to these materials. Once they’re online, I won’t be taking them down. I encourage you to take notes and try them out alongside me, though, as I will be saying much more than is written. I will also be uploading the “complete” files which are what we end each day with. These will not be linked directly from the main page, but you will be able to access them any time to see the final state of the materials we worked on as a group. Your own materials may differ, so it’s good to have them handy as well.
The last thing I want to assert before we dive in is that no one writes their code from scratch. Even though it might sometimes look like I’m starting with a (nearly) blank page, or that I’ve memorized all the functions and arguments we’ll be using, this is an illusion. The best way to write code is by copying and pasting code from other people’s (reliable) sources, and adapting it. This is what professional programmers do, and it’s what you’ll do too.
Hopefully, you’re already familiar enough with R to know that it can act like a calculator. You might already know that you can store values in things called variables. These values can be single objects, vectors, lists, arrays, matrices, or other things.
# A vector of class `character` containing four "strings":
myList <- c("English", "Spanish", "Mandarin", "Arabic")
# A vector of class `integer` containing four numbers:
myNums <- 1:4
# How can we identify which items are where?
myList[2:3]
## [1] "Spanish" "Mandarin"
# What operations can we do to the vector object?
# How does that affect the members of the vector?
myNums * .5 -> myNumsHalf
# What are "classes" and why do we care?
as.factor(myList)
## [1] English Spanish Mandarin Arabic
## Levels: Arabic English Mandarin Spanish
class(myList)
## [1] "character"
class(myNums)
## [1] "integer"
class(myNumsHalf)
## [1] "numeric"
class(as.factor(myList))
## [1] "factor"
Libraries are like recipe books. Once we’ve installed the library (like purchasing the recipe book), we need to take it off the shelf and open it to access the recipes inside.
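Concretely, the “purchase” happens once per machine with `install.packages()`, while `library()` must be called again in every new R session. A minimal sketch of that one-time-setup pattern:

```r
# One-time purchase: download the package from CRAN.
# (Commented out so it doesn't reinstall on every run.)
# install.packages("palmerpenguins")

# Every session: take the book off the shelf.
library(palmerpenguins)
```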
library(palmerpenguins)
One of the “recipes” in the `palmerpenguins` library is a dataset called `penguins`.
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
Some base R operations will be useful throughout this workshop. Here are some of the most important things to know.
penguins$species
## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [8] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [15] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [22] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [29] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [36] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [43] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [50] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [57] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [64] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [71] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [78] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [85] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [92] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [99] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [106] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [113] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [120] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [127] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [134] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [141] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [148] Adelie Adelie Adelie Adelie Adelie Gentoo Gentoo
## [155] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [162] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [169] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [176] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [183] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [190] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [197] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [204] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [211] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [218] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [225] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [232] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [239] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [246] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [253] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [260] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [267] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [274] Gentoo Gentoo Gentoo Chinstrap Chinstrap Chinstrap Chinstrap
## [281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [344] Chinstrap
## Levels: Adelie Chinstrap Gentoo
x <- unique(penguins$species)
class(x)
## [1] "factor"
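Since we’ll be leaning on `$` and `[` throughout, here is a quick self-contained sketch of the main base R extraction tools, using a small made-up data frame (the names and numbers are illustrative, not real data):

```r
# A toy data frame standing in for a real dataset:
df <- data.frame(lang = c("English", "Spanish", "Arabic"),
                 speakers = c(373, 475, 362))

df$speakers                 # `$` pulls one column out as a vector
df[["lang"]]                # `[[` does the same, by quoted name
df[1, ]                     # `[` with row, column indices keeps the data frame shape
df$lang[df$speakers > 400]  # logical indexing, like penguins$body_mass_g[...] below
```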
One note of caution: you don’t want to irreparably alter your data (which is why we don’t want to open it in Excel if we can help it). This also means you don’t want to overwrite your original data file. Let’s create a new dataset that we can overwrite and modify so that we don’t change the original.
penguins_edited <- penguins
penguins_edited$species <- as.character(penguins_edited$species)
head(penguins_edited)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <chr> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
Finally, let’s look at an overview of the structure of the dataset.
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Before we get into inferential statistics, we should review summary statistics.
max(penguins$year)
## [1] 2009
min(penguins$year)
## [1] 2007
range(penguins$year)
## [1] 2007 2009
mean(penguins$year)
## [1] 2008.029
median(penguins$year)
## [1] 2008
Some of these functions can optionally take an additional argument that tells them what to do with missing data.
max( penguins$body_mass_g, na.rm = TRUE)
## [1] 6300
min( penguins$body_mass_g, na.rm = TRUE)
## [1] 2700
range( penguins$body_mass_g, na.rm = TRUE)
## [1] 2700 6300
mean( penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
median(penguins$body_mass_g, na.rm = TRUE)
## [1] 4050
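Why does that argument matter? By default, missing values propagate: any summary of a vector containing `NA` is itself `NA`. A small sketch with made-up masses:

```r
masses <- c(2700, 3500, NA, 6300)  # one missing measurement

max(masses)                # NA: R refuses to guess what the missing value was
max(masses, na.rm = TRUE)  # 6300: drop the NAs first, then take the max
mean(masses, na.rm = TRUE) # the mean of the three observed values
```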
We can also polish up the output so it’s more reader-friendly.
print("Adelie stats")
## [1] "Adelie stats"
max( penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 4775
min( penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 2850
mean( penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 3700.662
median(penguins$body_mass_g[penguins$species=="Adelie"], na.rm = TRUE)
## [1] 3700
print("Chinstrap stats")
## [1] "Chinstrap stats"
paste("max body mass =", max(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "max body mass = 4800"
paste("min body mass =", min(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "min body mass = 2700"
paste("mean body mass =", mean(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "mean body mass = 3733.08823529412"
paste("median body mass =", median(penguins$body_mass_g[penguins$species=="Chinstrap"], na.rm = TRUE))
## [1] "median body mass = 3700"
print("Gentoo stats")
## [1] "Gentoo stats"
paste("max body mass =", max(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "max body mass = 6300"
paste("min body mass =", min(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "min body mass = 3950"
paste("mean body mass =", mean(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "mean body mass = 5076.0162601626"
paste("median body mass =", median(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "median body mass = 5000"
paste("standard deviation of body mass =", sd(penguins$body_mass_g[penguins$species=="Gentoo"], na.rm = TRUE))
## [1] "standard deviation of body mass = 504.116236657092"
paste("standard error of body mass =", round(sd(penguins$body_mass_g[penguins$species=="Gentoo"],
na.rm = TRUE)/sqrt(length(na.omit(penguins$body_mass_g[penguins$species=="Gentoo"]))), 2))
## [1] "standard error of body mass = 45.45"
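That standard-error line is getting long. Base R has no built-in `se()` function, but we can wrap the same formula (standard deviation divided by the square root of the number of non-missing values) in a helper of our own; `se` here is a name we’re inventing for convenience, not part of base R:

```r
# Hypothetical helper: standard error of the mean, ignoring NAs.
se <- function(x) {
  x <- na.omit(x)          # drop missing values first
  sd(x) / sqrt(length(x))  # sd over the square root of n
}

se(c(10, 12, 14, NA))  # same as sd(c(10, 12, 14)) / sqrt(3)
```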
The `tidyverse` is a series of libraries that function well together and are designed to wrangle, manipulate, and visualise data cleanly and easily.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
They’re held together with something called a ‘pipe’, which is more of a funnel than a pipe:
penguins %>% head()
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
There are some quirks in `tidyverse` packages, but ultimately they make code easier for humans to read.
penguins %>%
pull(species) %>%
unique()
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
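As an aside, since version 4.1 base R also has a native pipe, `|>`, which behaves like `%>%` for simple cases and needs no packages at all. A self-contained sketch of why pipes are easier to read:

```r
nums <- c(4, 16, 25)

sqrt(mean(nums))          # nested calls read inside-out
nums |> mean() |> sqrt()  # the native pipe reads left to right, like %>%
```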
Here are two ways to do the same thing:
penguins %>%
pull(body_mass_g) %>%
mean(na.rm = TRUE)
## [1] 4201.754
penguins %>%
filter(!is.na(body_mass_g)) %>%
pull(body_mass_g) %>%
mean()
## [1] 4201.754
We can also polish up the output so it’s formatted nicely.
penguins %>%
pull(body_mass_g) %>%
mean(na.rm = TRUE) %>%
round(2) %>%
paste("mean body mass =", .)
## [1] "mean body mass = 4201.75"
Typically, datasets are giant and unwieldy. We usually don’t want to look at the whole thing; it’s more informative if we can summarise it. There are functions that summarise whole datasets automatically, but they don’t know what you care about in the dataset. Let’s summarise `penguins` in the ways we care about.
penguins %>%
filter(!is.na(body_mass_g)) %>%
group_by(species) %>%
summarise(max = max(body_mass_g) %>% round(2),
min = min(body_mass_g) %>% round(2),
mean = mean(body_mass_g) %>% round(2),
median = median(body_mass_g) %>% round(2),
stdev = sd(body_mass_g) %>% round(2),
se = (stdev/sqrt(n())) %>% round(2))
## # A tibble: 3 x 7
## species max min mean median stdev se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 4775 2850 3701. 3700 459. 37.3
## 2 Chinstrap 4800 2700 3733. 3700 384. 46.6
## 3 Gentoo 6300 3950 5076. 5000 504. 45.4
As a side note, you can output these sorts of tables formatted for publication so you don’t have to copy and paste individual values to a document!
penguins %>%
filter(!is.na(body_mass_g)) %>%
group_by(species) %>%
summarise(`maximum` = max(body_mass_g),
`minimum` = min(body_mass_g),
`mean value` = mean(body_mass_g),
`median value` = median(body_mass_g),
`standard deviation` = sd(body_mass_g),
`standard error` = (`standard deviation`/sqrt(n()))) %>%
knitr::kable(caption = "Table 1: Summary of penguin body mass (g) by species",
align = "c",
digits = 2)
| species | maximum | minimum | mean value | median value | standard deviation | standard error |
|---|---|---|---|---|---|---|
| Adelie | 4775 | 2850 | 3700.66 | 3700 | 458.57 | 37.32 |
| Chinstrap | 4800 | 2700 | 3733.09 | 3700 | 384.34 | 46.61 |
| Gentoo | 6300 | 3950 | 5076.02 | 5000 | 504.12 | 45.45 |
Data wrangling is the process of taking raw data in whatever form it was output and making it ready to be analysed. It’s time-consuming and finicky, but it can also be pretty fun once you get the hang of it!
You can create or overwrite columns using `mutate()`, vertically subset data using `select()`, and horizontally subset data using `filter()`.
# area of a triangle = height * base / 2
# density = mass / volume; volume = length * width * height
penguins %>%
mutate(bill_area_mm2 = bill_length_mm * bill_depth_mm / 2,
flipper_length_cm = flipper_length_mm/100,
bill_length_cm = bill_length_mm/100,
bill_depth_cm = bill_depth_mm/100,
penguin_volume = (.75*flipper_length_cm) * bill_length_cm * (10*bill_depth_cm),
penguin_density = body_mass_g / penguin_volume) %>%
select(year, species, sex, island, bill_area_mm2, penguin_density) %>%
filter(!is.na(sex)) %>%
group_by(species, sex) %>%
summarise(mean_bill_area_mm2 = mean(bill_area_mm2),
mean_density = mean(penguin_density))
## # A tibble: 6 x 4
## # Groups: species [3]
## species sex mean_bill_area_mm2 mean_density
## <fct> <fct> <dbl> <dbl>
## 1 Adelie female 328. 3658.
## 2 Adelie male 385. 3656.
## 3 Chinstrap female 410. 3008.
## 4 Chinstrap male 492. 2673.
## 5 Gentoo female 325. 4534.
## 6 Gentoo male 389. 4268.
Sometimes, your data is too long or too wide for the analysis you want to do. This is made easy to fix with `pivot_wider()` and `pivot_longer()`.
penguins %>%
filter(!is.na(body_mass_g)) %>%
group_by(species) %>%
summarise(max = max(body_mass_g) %>% round(2),
min = min(body_mass_g) %>% round(2),
mean = mean(body_mass_g) %>% round(2),
median = median(body_mass_g) %>% round(2),
stdev = sd(body_mass_g) %>% round(2),
se = (stdev/sqrt(n())) %>% round(2)) -> wide_penguins
wide_penguins
## # A tibble: 3 x 7
## species max min mean median stdev se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 4775 2850 3701. 3700 459. 37.3
## 2 Chinstrap 4800 2700 3733. 3700 384. 46.6
## 3 Gentoo 6300 3950 5076. 5000 504. 45.4
wide_penguins %>% pivot_longer(cols = c("max", "min", "mean", "median", "stdev", "se"), names_to = "measure", values_to = "values")
## # A tibble: 18 x 3
## species measure values
## <fct> <chr> <dbl>
## 1 Adelie max 4775
## 2 Adelie min 2850
## 3 Adelie mean 3701.
## 4 Adelie median 3700
## 5 Adelie stdev 459.
## 6 Adelie se 37.3
## 7 Chinstrap max 4800
## 8 Chinstrap min 2700
## 9 Chinstrap mean 3733.
## 10 Chinstrap median 3700
## 11 Chinstrap stdev 384.
## 12 Chinstrap se 46.6
## 13 Gentoo max 6300
## 14 Gentoo min 3950
## 15 Gentoo mean 5076.
## 16 Gentoo median 5000
## 17 Gentoo stdev 504.
## 18 Gentoo se 45.4
wide_penguins %>% pivot_longer(cols = c("max", "min", "mean", "median", "stdev", "se"), names_to = "measure", values_to = "values") -> long_penguins
long_penguins %>%
pivot_wider(names_from = "measure", values_from = "values")
## # A tibble: 3 x 7
## species max min mean median stdev se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 4775 2850 3701. 3700 459. 37.3
## 2 Chinstrap 4800 2700 3733. 3700 384. 46.6
## 3 Gentoo 6300 3950 5076. 5000 504. 45.4
The package `palmerpenguins` also comes with a messier ‘raw’ version of the data.
head(penguins_raw)
## # A tibble: 6 x 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie… Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie… Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie… Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie… Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie… Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie… Anvers Torge… Adul… N3A2
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>, `Flipper Length
## # (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N (o/oo)` <dbl>,
## # `Delta 13 C (o/oo)` <dbl>, Comments <chr>
The function `case_when()` allows us to isolate certain rows for manipulation or cleaning and apply different rules to different situations. It’s very handy but can take some practice to get used to.
penguins_raw %>% pull(`Comments`) %>% unique()
## [1] "Not enough blood for isotopes."
## [2] NA
## [3] "Adult not sampled."
## [4] "Nest never observed with full clutch."
## [5] "No blood sample obtained."
## [6] "No blood sample obtained for sexing."
## [7] "Nest never observed with full clutch. Not enough blood for isotopes."
## [8] "Sexing primers did not amplify. Not enough blood for isotopes."
## [9] "Sexing primers did not amplify."
## [10] "Adult not sampled. Nest never observed with full clutch."
## [11] "No delta15N data received from lab."
penguins_raw %>% pull(`Delta 13 C (o/oo)`) %>% range(na.rm = TRUE)
## [1] -27.01854 -23.78767
penguins_raw %>%
mutate(keeper = case_when(Comments == "Adult not sampled." ~ "remove",
Comments == "No blood sample obtained." ~ "remove",
Comments == "Nest never observed with full clutch." ~ "keep",
Comments == "No blood sample obtained for sexing." ~ "remove",
Comments == "Adult not sampled. Nest never observed with full clutch." ~ "remove",
TRUE ~ "evaluate"),
D13.level = case_when(`Delta 13 C (o/oo)` <= -26 ~ "low",
`Delta 13 C (o/oo)` > -24.8 ~ "high",
TRUE ~ "mid")) %>%
group_by(D13.level, keeper) %>% summarise(count = n())
## # A tibble: 7 x 3
## # Groups: D13.level [3]
## D13.level keeper count
## <chr> <chr> <int>
## 1 high evaluate 55
## 2 high keep 11
## 3 low evaluate 141
## 4 low keep 11
## 5 mid evaluate 108
## 6 mid keep 12
## 7 mid remove 6
Practice reshaping `long-data.csv` and `wide-data.csv`, which you can download by right-clicking or copying/pasting the links. You may also want to look up new functions to use for some of these tasks.
- Pivot `long-data.csv` wider so that each type of measurement is its own column.
- In the `Savings` column, find a way to separate or split the numeric and string information.
- Convert the `Savings` column into a single currency. This might include tasks such as using `case_when()` to manipulate certain rows in different ways.
- Pivot `wide-data.csv` longer so that each survey question (column) except for `name` is on a separate row.

longdat <- read_csv("../data/long-data.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## subject = col_double(),
## measure = col_character(),
## value = col_character()
## )
str(longdat)
## tibble [15 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ subject: num [1:15] 1 1 1 2 2 2 3 3 3 4 ...
## $ measure: chr [1:15] "Nationality" "Age" "Savings" "Nationality" ...
## $ value : chr [1:15] "English" "21" "85 GBP" "Scottish" ...
## - attr(*, "spec")=
## .. cols(
## .. subject = col_double(),
## .. measure = col_character(),
## .. value = col_character()
## .. )
longdat %>% pivot_wider(names_from = "measure", values_from = "value")
## # A tibble: 5 x 4
## subject Nationality Age Savings
## <dbl> <chr> <chr> <chr>
## 1 1 English 21 85 GBP
## 2 2 Scottish 22 200 GBP
## 3 3 N.Irish 18 125 GBP
## 4 4 Welsh 24 300 GBP
## 5 5 American 20 105 USD
widedat <- read_csv("../data/wide-data.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## name = col_character(),
## age = col_double(),
## nationality = col_character(),
## gender = col_character(),
## L1 = col_character(),
## L2 = col_character(),
## favColour = col_character()
## )
str(widedat)
## tibble [3 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:3] "James" "Jenny" "Jordan"
## $ age : num [1:3] 21 20 22
## $ nationality: chr [1:3] "English" "American" "Canadian"
## $ gender : chr [1:3] "male" "female" "female"
## $ L1 : chr [1:3] "English" "English" "French"
## $ L2 : chr [1:3] NA "Spanish" "English"
## $ favColour : chr [1:3] "blue" "green" "purple"
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. age = col_double(),
## .. nationality = col_character(),
## .. gender = col_character(),
## .. L1 = col_character(),
## .. L2 = col_character(),
## .. favColour = col_character()
## .. )
widedat %>%
pivot_longer(cols = 3:7, names_to = "question", values_to = "values")
## # A tibble: 15 x 4
## name age question values
## <chr> <dbl> <chr> <chr>
## 1 James 21 nationality English
## 2 James 21 gender male
## 3 James 21 L1 English
## 4 James 21 L2 <NA>
## 5 James 21 favColour blue
## 6 Jenny 20 nationality American
## 7 Jenny 20 gender female
## 8 Jenny 20 L1 English
## 9 Jenny 20 L2 Spanish
## 10 Jenny 20 favColour green
## 11 Jordan 22 nationality Canadian
## 12 Jordan 22 gender female
## 13 Jordan 22 L1 French
## 14 Jordan 22 L2 English
## 15 Jordan 22 favColour purple
Explore the simulated dataset. We’ll be using it more in the future, so you can start to get a feel for it now.
simdat <- read_csv("../data/simulated-data.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## subj = col_double(),
## age = col_double(),
## item = col_double(),
## freq = col_character(),
## gram = col_character(),
## rating = col_double(),
## accuracy = col_double(),
## region = col_double(),
## word = col_character(),
## rt = col_double()
## )
str(simdat)
## tibble [4,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ subj : num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
## $ age : num [1:4000] 51 51 51 51 51 51 51 51 51 51 ...
## $ item : num [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
## $ freq : chr [1:4000] "high" "high" "high" "high" ...
## $ gram : chr [1:4000] "yes" "yes" "yes" "yes" ...
## $ rating : num [1:4000] 4 5 5 5 5 4 3 1 4 1 ...
## $ accuracy: num [1:4000] 1 1 1 1 1 1 1 0 1 1 ...
## $ region : num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
## $ word : chr [1:4000] "the" "the" "the" "the" ...
## $ rt : num [1:4000] 446 318 204 464 242 ...
## - attr(*, "spec")=
## .. cols(
## .. subj = col_double(),
## .. age = col_double(),
## .. item = col_double(),
## .. freq = col_character(),
## .. gram = col_character(),
## .. rating = col_double(),
## .. accuracy = col_double(),
## .. region = col_double(),
## .. word = col_character(),
## .. rt = col_double()
## .. )
Here is some basic background about the design of the simulated data. It’s meant to imitate some experimental linguistics data.
We believe high frequency words are generally read and comprehended faster than low frequency words. We also believe that sentences which are temporarily ambiguous cause reading time slowdowns.
Compare:
Sentence (1) is more difficult to interpret because “tossed” is ambiguous between a simple past verb and a past participle.
Sentence (2) is unambiguous.
Sentence (1) takes longer to read, especially after the original interpretation (“The coach praised the player.”) has been disconfirmed (“tossed”).
Sentence (2) also has a disconfirmation, but the solution is made clear so the reinterpretation is straightforward.
The ambiguity here relies primarily on a reduced relative clause. Changing it to an unreduced relative clause removes all ambiguity and comprehension difficulty. Compare:
The experimental items follow the template “The old VERB the boat.”, where VERB marks the manipulation of interest. All of the verbs used are ambiguous with nouns.
| Condition Number | Sample sentence | Verb frequency | Grammaticality |
|---|---|---|---|
| 1 | the old man the boat | high | grammatical |
| 2 | the old put the boat | high | ungrammatical |
| 3 | the old run the boat | low | grammatical |
| 4 | the old owe the boat | low | ungrammatical |
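As a side note, a fully crossed design like this can be generated in base R with `expand.grid()`, which builds one row per combination of its arguments. A small sketch:

```r
# One row per cell of the 2 x 2 (frequency x grammaticality) design:
design <- expand.grid(freq = c("high", "low"),
                      gram = c("grammatical", "ungrammatical"))
design  # four condition rows
```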
The data came from a simulated experiment. No real people were involved. This is fake data.
BUT: if it were real data, this is the experiment it would have come from:
This dataset was specifically designed to teach linguists about data visualisation and analysis. Let’s take a look at what it contains to understand its specifically linguistic properties.
Now we can read it in to our R session:
# read in the data
data <- read.csv("../data/simulated-data.csv")
# check out what it contains
str(data)
## 'data.frame': 4000 obs. of 10 variables:
## $ subj : int 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 51 51 51 51 51 51 51 51 51 51 ...
## $ item : int 1 2 3 4 5 6 7 8 9 10 ...
## $ freq : chr "high" "high" "high" "high" ...
## $ gram : chr "yes" "yes" "yes" "yes" ...
## $ rating : int 4 5 5 5 5 4 3 1 4 1 ...
## $ accuracy: int 1 1 1 1 1 1 1 0 1 1 ...
## $ region : int 1 1 1 1 1 1 1 1 1 1 ...
## $ word : chr "the" "the" "the" "the" ...
## $ rt : num 446 318 204 464 242 ...
This data contains the following columns:

- `subj`: unique participant ID numbers to anonymise each person (integers)
- `age`: the participant’s age in years (whole numbers, discrete data)
- `item`: each participant was shown each of these items
- `freq`: whether the participant was shown the high- or low-frequency version of the specific item
- `gram`: whether the participant was shown the grammatical or ungrammatical version of the specific item
- `rating`: the acceptability rating that the participant gave this particular version of the item
- `accuracy`: whether the participant answered a comprehension question correctly or not
- `region`: the order in which each word in the sentence occurred
- `word`: the lexical content of each position in the sentence
- `rt`: the reaction time; the time it took for the participant to read the word and click a button
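With the column meanings in mind, a quick sanity check is to count observations per design cell. Here is a self-contained sketch using a tiny made-up stand-in for `simdat` (the real file has 4,000 rows):

```r
# Tiny hypothetical stand-in for simdat: 2 subjects x 2 freq x 2 gram levels.
toy <- data.frame(
  subj = rep(1:2, each = 4),
  freq = rep(c("high", "high", "low", "low"), times = 2),
  gram = rep(c("yes", "no"), times = 4)
)

table(toy$freq, toy$gram)  # cross-tabulate the design: 2 observations per cell
```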