Learning Outcomes
1 You will be able to load tidyverse and use basic functions
2 You will be able to relabel and reorganise a dataset
3 You will be able to summarise a dataset
4 You will be able to generate a simple plot

Datasets to download:

  1. binomial-data.csv
  2. long-data.csv
  3. wide-data.csv

1 What is Tidyverse?

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The tidyverse is a collection of packages (sets of add-on tools) that you can optionally use in R to easily and clearly process and visualise your data. You do not need to use all of the included packages, nor do you need to load them all, but for simplicity’s sake, it’s easier to load the whole thing and then not worry about it.

The most important (and exciting!) difference between the way base R functions work and the way tidyverse functions work is the pipe: %>%

In short, the pipe (%>%) takes the result of whatever has been done in the preceding line(s) and passes it to the function on the next line as its first argument. This means complex operations, including changing or manipulating the data.frame, can be chained together, but the changes exist only within the piped lines and do not alter the original data unless you assign the result. Each line that you pipe to will have a function, and the functions defined inside the tidyverse packages are typically referred to as verbs. I will not use this terminology strictly, but it is good to know.
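To make the idea concrete, here is a minimal sketch on a made-up numeric vector. The name scores and its values are purely illustrative and are not part of the tutorial data; the sketch assumes the pipe is made available by loading dplyr, which library(tidyverse) also attaches.

```r
library(dplyr)  # provides the %>% pipe (also attached by library(tidyverse))

# Hypothetical values, used only for illustration
scores <- c(4, 9, 16)

# Nested version: has to be read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# Piped version: each line receives the result of the line before it
piped <- scores %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)  # both versions compute the same value
scores                    # the original vector has not been changed
```

Both forms do the same work; the piped form simply lists the steps in the order they happen.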

We’ll get into more complex examples later on.

1.1 Main packages

The most useful packages for general data manipulation and visualisation are discussed in this section. To start, let’s read in a data.frame so we can practice setting it up. You can download this dataset to put in your “data” folder (as we discussed in best practices last time).

Once you’ve got your dataset in your data folder, you can read the data into this R session with the following code:

data <- read.csv("data/binomial-data.csv", header=TRUE, as.is=TRUE)

We can view the data a few different ways. The function View() (note the capital “V”) will open the data as a data.frame in a new tab and it will look like a spreadsheet. Try that now:

View(data)

Note that you can’t edit the data.frame but you can sort the data by column. This doesn’t change anything about the structure of the data.frame, which you can see because the row numbers stay with their original rows when you sort the data.

We can also view our data in the console. The functions head() and tail() show us the first or last six rows of a data.frame, respectively. We can view more rows if we add a number as a second argument:

head(data, 10)
##    experiment subject item condition selection selectCode
## 1       first       1    1  Baseline  Option 2          0
## 2       first       1    2  Baseline  Option 1          1
## 3       first       1    3  Baseline  Option 2          0
## 4       first       1    4  Baseline  Option 2          0
## 5       first       1    5  Baseline  Option 1          1
## 6       first       1    6  Baseline  Option 1          1
## 7       first       1    7  Baseline  Option 2          0
## 8       first       1    8  Baseline  Option 1          1
## 9       first       1    9  Baseline  Option 2          0
## 10      first       1   10  Baseline  Option 2          0

1.1.1 tibble


When you import a data.frame, there are two arguments I’ve added in: header=TRUE and as.is=TRUE. The first one says that the first row is the header row that names the columns. This is optional, but good to specify explicitly. The second argument tells R to not change the class of the column. The most common data classes are:

  • int: integers
  • dbl: double (a continuous numerical value)
  • chr: character (letters and numbers stored as a string)
  • fct: factor (a categorical value)

If as.is=FALSE, then all non-numerical values will be imported as factors. This was the default behaviour before R 4.0.0; from R 4.0.0 onwards, text columns are kept as character by default. Importing as factors often doesn’t matter, but it can potentially cause issues later on, so I typically assign the value TRUE explicitly.
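The effect of as.is can be checked directly. The sketch below reads a tiny made-up CSV from a text string rather than from a file (the string and the object names are assumptions for illustration only) and compares the resulting column classes.

```r
# Made-up CSV text standing in for a file on disk (illustration only)
csv_text <- "subject,condition,selectCode
1,Baseline,0
2,Treatment,1"

# as.is = TRUE keeps text columns as character strings
kept_as_is <- read.csv(text = csv_text, header = TRUE, as.is = TRUE)

# stringsAsFactors = TRUE converts text columns to factors instead
as_factors <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

sapply(kept_as_is, class)  # class of each column when read as-is
sapply(as_factors, class)  # 'condition' is now a factor
```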

A tibble is a massively simplified data.frame. You can read in detail here about the way tibbles differ from data.frames. Importantly, using the tidyverse to interact with your data will typically convert your data.frame to a tibble. This won’t make much of a difference 99.9% of the time, but you might notice some differences in the way it displays.

as_tibble(data)
## # A tibble: 1,440 x 6
##    experiment subject  item condition selection selectCode
##    <chr>        <int> <int> <chr>     <chr>          <int>
##  1 first            1     1 Baseline  Option 2           0
##  2 first            1     2 Baseline  Option 1           1
##  3 first            1     3 Baseline  Option 2           0
##  4 first            1     4 Baseline  Option 2           0
##  5 first            1     5 Baseline  Option 1           1
##  6 first            1     6 Baseline  Option 1           1
##  7 first            1     7 Baseline  Option 2           0
##  8 first            1     8 Baseline  Option 1           1
##  9 first            1     9 Baseline  Option 2           0
## 10 first            1    10 Baseline  Option 2           0
## # … with 1,430 more rows

1.1.2 tidyr

The tidyr package is designed to help you tidy your data without having to go through it by hand. This cuts down on typos and the amount of time and effort needed to put your data into an analysable form. In particular, gather() and spread() will be useful for turning wide data into long data and vice versa (newer versions of tidyr also offer pivot_longer() and pivot_wider() for the same jobs). It also provides tools for splitting and combining columns, such as separate().

1.1.3 dplyr

Like tidyr, the package dplyr provides tools for organising and manipulating your data without having to go in and alter anything by hand. Some of the most useful components are mutate() and filter(), which add columns and reduce rows, respectively. This package also provides a number of ways to summarise your data for ease of display.

1.1.4 ggplot2


Finally, ggplot2 is probably the best known package in the tidyverse. It provides a clean, highly flexible, highly customisable way of visualising data. It does so by layering attributes one at a time and synthesising them as a whole so that the appearance and content can be tweaked and adjusted with high granularity.

1.2 Additional packages

The following packages are also included in tidyverse, but we will not discuss them.

  1. purrr: supports functional programming in R
  2. stringr: simplifies working with data of class character (i.e., text strings)
  3. forcats: helps deal with data of class factor
  4. readr: an alternative method for reading in data
  • Not included in tidyverse: tidytext (see this page for more information)

2 Basic data manipulation

When you initially read data into your R session, it may not yet be in the most useful or appropriate form. Therefore, we must pre-process the data before we can analyse or visualise it. This is where basic data manipulation can come in handy.

2.1 Piping

“Piping” data from one line to the next does not actually alter the data permanently unless you overwrite your original dataset.
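As a quick check of this point, the sketch below uses the built-in iris data. The object name setosa_only is hypothetical; the point is only that the piped result must be assigned if you want to keep it.

```r
library(dplyr)

# Piping into filter() returns a new, smaller data.frame ...
iris %>%
  filter(Species == "setosa") %>%
  nrow()

# ... but the built-in iris data itself still has all of its rows
nrow(iris)

# To keep the result, assign it to a (hypothetical) new name
setosa_only <- iris %>%
  filter(Species == "setosa")
```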

As a quick demonstration, the following two chunks of code do the exact same thing, but one has many embedded functions, whereas the tidyverse version is much more legible. Each chunk of code finds within the iris dataset the rows that contain irises of species “versicolor” with sepal lengths of less than 5 and petal widths of greater than .3 (of which there is one).

Base R:

head(iris[iris$Sepal.Length<5 & iris$Species=="versicolor" & iris$Petal.Width > 0.3,])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 58          4.9         2.4          3.3           1 versicolor

Tidyverse:

iris %>%
  filter(Sepal.Length < 5,
         Species == "versicolor",
         Petal.Width > 0.3) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          4.9         2.4          3.3           1 versicolor

In the first chunk (base R), the functions and logical operators are embedded within each other. This can cause problems if the brackets mismatch in type or number. It also can get confusing as the row specification is very long and complex so it can be easy to lose your place when reading (or for that matter, when writing it).

In the second chunk (tidyverse), the functions are ordered and laid out neatly. Because there is very little embedding, it is easier to keep track of brackets. It is also much easier to read because the verbs (i.e., functions) are explicitly named rather than having to remember what the square brackets [] do in contrast with the round parentheses ().

Let’s say we’re working with our imported data which has three experiments in it…

data %>%
  head() # show the first 6 lines of the data.frame
##   experiment subject item condition selection selectCode
## 1      first       1    1  Baseline  Option 2          0
## 2      first       1    2  Baseline  Option 1          1
## 3      first       1    3  Baseline  Option 2          0
## 4      first       1    4  Baseline  Option 2          0
## 5      first       1    5  Baseline  Option 1          1
## 6      first       1    6  Baseline  Option 1          1
  • NB: I’ll be using the hash symbol ###### to set off the new code we’re adding in as we build up our data manipulation workflow.

2.2 Filter

What if we only want to work with a subset of our data.frame? This is where filter() comes in. As the name implies, this function filters your dataset based on a specified set of criteria. Suppose we only want to work with the second experiment. In that case, we can do the following:

data %>%
######
  filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
######
  head() # show the first 6 lines of the data.frame
##   experiment subject item condition selection selectCode
## 1     second       1    1  Baseline  Option 1          1
## 2     second       1    2  Baseline  Option 1          1
## 3     second       1    3  Baseline  Option 2          0
## 4     second       1    4  Baseline  Option 2          0
## 5     second       1    5  Baseline  Option 1          1
## 6     second       1    6  Baseline  Option 1          1

Now, we can “pipe” this data to our next operations for visualisation or analysis and only the second experiment will be available to the next operation.
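filter() can also combine several criteria. The following sketch uses the built-in iris data rather than our imported data, and the object names are assumptions for illustration; it shows the three most common ways of combining conditions.

```r
library(dplyr)

# Conditions separated by commas must all be true (same as using &)
both <- iris %>%
  filter(Species == "virginica", Sepal.Length > 7)

# | means "or": keep rows that satisfy either condition
either <- iris %>%
  filter(Sepal.Length > 7 | Petal.Width < 0.2)

# %in% keeps rows whose value matches any entry in a set
two_species <- iris %>%
  filter(Species %in% c("setosa", "virginica"))

nrow(both)
nrow(either)
nrow(two_species)
```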

2.3 Mutate

The verb mutate() allows us to add and change columns. The current dataset we’re working with doesn’t have a lot going on, but we can add a column renaming the conditions contingent on the experiment they’re in.

data %>%
######
  mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
                               experiment == "first" & condition == "Treatment" ~ "RP",
                               experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
                               experiment == "second" & condition == "Treatment" ~ "New form",
                               TRUE ~ as.character(condition) # exp3 conditions stay the same
                               )
         ) %>%
######
  head() # show the first 6 lines of the data.frame
##   experiment subject item condition selection selectCode
## 1      first       1    1       Gap  Option 2          0
## 2      first       1    2       Gap  Option 1          1
## 3      first       1    3       Gap  Option 2          0
## 4      first       1    4       Gap  Option 2          0
## 5      first       1    5       Gap  Option 1          1
## 6      first       1    6       Gap  Option 1          1
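Because head() only shows rows from the first experiment, it can be hard to see what case_when() does with rows that match none of the conditions. Here is a small sketch on a made-up four-row data.frame (the object names demo and relabelled are hypothetical) in which the fallback line is visible.

```r
library(dplyr)

# Made-up rows covering all three experiments (illustration only)
demo <- data.frame(
  experiment = c("first", "first", "second", "third"),
  condition  = c("Baseline", "Treatment", "Baseline", "Baseline"),
  stringsAsFactors = FALSE
)

relabelled <- demo %>%
  mutate(condition = case_when(
    experiment == "first" & condition == "Baseline"  ~ "Gap",
    experiment == "first" & condition == "Treatment" ~ "RP",
    experiment == "second" & condition == "Baseline" ~ "Old form",
    TRUE ~ condition  # anything not matched above keeps its label
  ))

relabelled$condition
```

The conditions are checked in order, and the first one that is true for a row decides its new value, so the catch-all TRUE line should come last.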

The other common way to use mutate() is when you have a numerical operation (such as calculating residualised value by, say, word length). In this case, we can quickly add some fake columns to demonstrate how this would work (while adding back in our filter() operation from before):

set.seed(12345)

data %>%
  filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
######
  mutate(fakeValue = rnorm(n = n(), mean = 10, sd = 3), # create fake continuous data (one value per row)
         fakeLength = round(rnorm(n = n(), mean = 6, sd = .5), 0), # create fake discrete data
         fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
######
  mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
                               experiment == "first" & condition == "Treatment" ~ "Resumptive Pronoun",
                               experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
                               experiment == "second" & condition == "Treatment" ~ "New form",
                               TRUE ~ as.character(condition) # exp3 conditions stay the same
                               )
         ) %>%
  head() # show the first 6 lines of the data.frame
##   experiment subject item condition selection selectCode fakeValue fakeLength
## 1     second       1    1  Old form  Option 1          1 11.756586          6
## 2     second       1    2  Old form  Option 1          1 12.128398          6
## 3     second       1    3  Old form  Option 2          0  9.672090          6
## 4     second       1    4  Old form  Option 2          0  8.639508          6
## 5     second       1    5  Old form  Option 1          1 11.817662          5
## 6     second       1    6  Old form  Option 1          1  4.546132          6
##   fakeResid
## 1 1.9594311
## 2 2.0213997
## 3 1.6120150
## 4 1.4399181
## 5 2.3635325
## 6 0.7576887

2.4 Transmute

Not every column in this dataset is useful to us right now. Let’s say we only care about “subject”, “condition” and “fakeResid” now that we’ve applied our filter. In this case, we can use transmute() to reorganise and manipulate only the columns we want to keep. (We could also reorder the columns if we wanted by changing the order in which we call them as arguments.)

set.seed(12345)

data %>%
  filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
  mutate(fakeValue = rnorm(n = n(), mean = 10, sd = 3), # create fake continuous data (one value per row)
         fakeLength = round(rnorm(n = n(), mean = 6, sd = .5), 0), # create fake discrete data
         fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
  mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
                               experiment == "first" & condition == "Treatment" ~ "Resumptive Pronoun",
                               experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
                               experiment == "second" & condition == "Treatment" ~ "New form",
                               TRUE ~ as.character(condition) # exp3 conditions stay the same
                               )
         ) %>%
######
  transmute(subject = subject, # subject stays the same
            condition = as.factor(condition), # convert condition to a factor
            residualValue = fakeResid) %>% # rename 'fakeResid'
######
  head() # show the first 6 lines of the data.frame
##   subject condition residualValue
## 1       1  Old form     1.9594311
## 2       1  Old form     2.0213997
## 3       1  Old form     1.6120150
## 4       1  Old form     1.4399181
## 5       1  Old form     2.3635325
## 6       1  Old form     0.7576887

Excellent! Now we’ve pre-processed our data and can pipe it to our analysis or visualisation. (But we won’t, not quite yet.) One thing to note here is that head() is used only as a convention for display on this page. When we use head() during our actual pre-processing stage, we are cutting out every row beyond the 6th. When you do this on your own, you should leave out the head() function.

2.5 Gather

A different kind of pre-processing operation changes the shape of the data.frame. Often, what can happen is that you receive un-processed data that has something like the following structure (one subject per row, many columns).

wide <- read.csv("data/wide-data.csv")

wide %>%
  head()
##     name age nationality gender      L1      L2 favColour
## 1  James  21     English   male English    <NA>      blue
## 2  Jenny  20    American female English Spanish     green
## 3 Jordan  22    Canadian female  French English    purple

This can sometimes be useful, but for data analysis and visualisation in R, it’s best to have one observation per row with very few columns. In this case, we can use gather(). The first argument is typically the data.frame we are working with, but since we can start with the name of our data frame (in this case, wide), we can pipe it directly into the gather() verb and leave the argument defining the data.frame implicit. The next two arguments are the names of the columns we’re creating. By default, these are “key” and “value”, but we can call them anything we’d like. The last argument picks out which columns to gather. In this case, I want names to stay separate, and I want to gather all other columns into one long column.

wide %>%
######
  gather("demographic","answer",age:favColour) %>%
######
  head()
##     name demographic   answer
## 1  James         age       21
## 2  Jenny         age       20
## 3 Jordan         age       22
## 4  James nationality  English
## 5  Jenny nationality American
## 6 Jordan nationality Canadian
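In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below shows a roughly equivalent call on a small made-up data.frame that mimics part of the wide data (the object names wide_demo and long_demo are hypothetical). It assumes a version of tidyr that has the values_transform argument, which is needed here because the gathered columns mix numbers and text.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for a few columns of the wide data (illustration only)
wide_demo <- data.frame(
  name = c("James", "Jenny", "Jordan"),
  age = c(21L, 20L, 22L),
  nationality = c("English", "American", "Canadian"),
  favColour = c("blue", "green", "purple"),
  stringsAsFactors = FALSE
)

long_demo <- wide_demo %>%
  pivot_longer(cols = age:favColour,       # which columns to gather
               names_to = "demographic",   # new column holding the old column names
               values_to = "answer",       # new column holding the values
               values_transform = list(answer = as.character))  # mix of types, so store as text

long_demo
```

Note that pivot_longer() keeps each person’s rows together, so the row order differs from the gather() output above even though the content is the same.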

2.6 Spread

This can also go the other direction. If we have long data but we want it to be wide (though this is less common), we can use the spread verb.

long <- read.csv("data/long-data.csv",as.is = TRUE)

long %>%
  head()
##   subject     measure    value
## 1       1 Nationality  English
## 2       1         Age       21
## 3       1     Savings   85 GBP
## 4       2 Nationality Scottish
## 5       2         Age       22
## 6       2     Savings  200 GBP

Now, I want the column called “measure” to be spread across several columns.

long %>%
######
  spread(key=measure,value=value) %>%
######
  head()
##   subject Age Nationality Savings
## 1       1  21     English  85 GBP
## 2       2  22    Scottish 200 GBP
## 3       3  18     N.Irish 125 GBP
## 4       4  24       Welsh 300 GBP
## 5       5  20    American 105 USD

Although it’s less common, this verb is useful for displaying small tables and sometimes it can help generate graphs. In any case, it’s good to be aware of.
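The newer counterpart of spread() in tidyr 1.0.0 and later is pivot_wider(). A minimal sketch, using a made-up two-subject data.frame in the same shape as the long data (the object names here are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the first rows of the long data (illustration only)
long_demo <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  measure = rep(c("Nationality", "Age", "Savings"), times = 2),
  value = c("English", "21", "85 GBP", "Scottish", "22", "200 GBP"),
  stringsAsFactors = FALSE
)

wide_demo <- long_demo %>%
  pivot_wider(names_from = measure,  # column whose entries become column names
              values_from = value)   # column whose entries fill the new columns

wide_demo
```

Unlike spread(), which sorts the new columns alphabetically, pivot_wider() creates them in the order in which they first appear in the data.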

2.7 Separate

The last bit of tidyr we’ll discuss for now is separate, which takes a column and splits it into two based on some character (which should be present in most if not all rows of that column). Below, we can take our long dataset, once it’s been spread, and split the column called “Savings” into the numeric amount and the currency.

long %>%
  spread(key=measure,value=value) %>%
######
  separate(Savings,into=c("amount","currency"),sep=" ") %>%
######
  head()
##   subject Age Nationality amount currency
## 1       1  21     English     85      GBP
## 2       2  22    Scottish    200      GBP
## 3       3  18     N.Irish    125      GBP
## 4       4  24       Welsh    300      GBP
## 5       5  20    American    105      USD

Note quickly that “Age” and “amount” contain numbers but are still of class character. We can easily change this with mutate(), as we did before:

long %>%
  spread(key=measure,value=value) %>%
  separate(Savings,into=c("amount","currency"),sep=" ") %>%
######
  mutate(Age = as.integer(Age),
         amount = as.numeric(amount)) %>%
######
  head()
##   subject Age Nationality amount currency
## 1       1  21     English     85      GBP
## 2       2  22    Scottish    200      GBP
## 3       3  18     N.Irish    125      GBP
## 4       4  24       Welsh    300      GBP
## 5       5  20    American    105      USD
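As an alternative to a separate mutate() step, separate() has a convert argument that tries to guess the type of the new columns. This is a sketch on a made-up data.frame (the names savings and split_savings are hypothetical), not a replacement for the workflow above.

```r
library(dplyr)
library(tidyr)

# Made-up savings data (illustration only)
savings <- data.frame(
  subject = 1:3,
  Savings = c("85 GBP", "200 GBP", "105 USD"),
  stringsAsFactors = FALSE
)

split_savings <- savings %>%
  separate(Savings,
           into = c("amount", "currency"),
           sep = " ",
           convert = TRUE)  # let separate() convert 'amount' to a number

sapply(split_savings, class)  # check the class of each column
```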

2.8 Combining datasets

If you have multiple datasets that you wish to combine, dplyr and tidyverse provide a number of elegant ways of doing so. As this is a more advanced procedure, I’ll leave you with this link: https://dplyr.tidyverse.org/reference/bind.html
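For a first impression, here is a minimal sketch of two common cases: stacking datasets with the same columns using bind_rows(), and adding columns from a lookup table using left_join(). All of the data and object names below are made up for illustration.

```r
library(dplyr)

# Two made-up experiments with the same columns
exp1 <- data.frame(subject = 1:2, score = c(10, 12))
exp2 <- data.frame(subject = 3:4, score = c(9, 14))

# Stack the rows; .id records which dataset each row came from
stacked <- bind_rows(list(first = exp1, second = exp2), .id = "experiment")

# A made-up lookup table with one row per subject
ages <- data.frame(subject = 1:4, age = c(21, 20, 22, 19))

# Add the matching 'age' to every row, matching on 'subject'
joined <- left_join(stacked, ages, by = "subject")

joined
```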

3 Summarising a data table

Sometimes, we don’t actually care about our raw data but rather we want to visualise or analyse a summary of the raw data. This is particularly useful in visualisation, as bar charts and error bars can be difficult for the visualisation tools to calculate on the fly.

3.1 Group by

The function group_by() doesn’t appear to do anything on its own. What it does is flag columns for summary when passed to the next verb (i.e., summarise(), described below). This means we can use the group_by() verb to select which categories are relevant to our analysis. Let’s take our fake data from earlier:

set.seed(12345)

data %>%
  mutate(fakeValue = rnorm(n = n(), mean = 10, sd = 3), # create fake continuous data (one value per row)
         fakeLength = round(rnorm(n = n(), mean = 6, sd = .5), 0), # create fake discrete data
         fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
  head() # show the first 6 lines of the data.frame
##   experiment subject item condition selection selectCode fakeValue fakeLength
## 1      first       1    1  Baseline  Option 2          0 11.756586          6
## 2      first       1    2  Baseline  Option 1          1 12.128398          6
## 3      first       1    3  Baseline  Option 2          0  9.672090          6
## 4      first       1    4  Baseline  Option 2          0  8.639508          6
## 5      first       1    5  Baseline  Option 1          1 11.817662          6
## 6      first       1    6  Baseline  Option 1          1  4.546132          6
##   fakeResid
## 1 1.9594311
## 2 2.0213997
## 3 1.6120150
## 4 1.4399181
## 5 1.9696104
## 6 0.7576887

Now, let’s group by experiment and condition.

set.seed(12345)

data %>%
  mutate(fakeValue = rnorm(n = n(), mean = 10, sd = 3), # create fake continuous data (one value per row)
         fakeLength = round(rnorm(n = n(), mean = 6, sd = .5), 0), # create fake discrete data
         fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
  group_by(experiment,condition) %>% # flag 'experiment' and 'condition' as categories of interest
  head() # show the first 6 lines of the data.frame
## # A tibble: 6 x 9
## # Groups:   experiment, condition [1]
##   experiment subject  item condition selection selectCode fakeValue fakeLength
##   <chr>        <int> <int> <chr>     <chr>          <int>     <dbl>      <dbl>
## 1 first            1     1 Baseline  Option 2           0     11.8           6
## 2 first            1     2 Baseline  Option 1           1     12.1           6
## 3 first            1     3 Baseline  Option 2           0      9.67          6
## 4 first            1     4 Baseline  Option 2           0      8.64          6
## 5 first            1     5 Baseline  Option 1           1     11.8           6
## 6 first            1     6 Baseline  Option 1           1      4.55          6
## # … with 1 more variable: fakeResid <dbl>

Nothing has changed (yet). However…

3.2 Summarise

When we summarise our data, we can see how group_by() has flagged the categories we’re interested in.

set.seed(12345)

data %>%
  mutate(fakeValue = rnorm(n = n(), mean = 10, sd = 3), # create fake continuous data (one value per row)
         fakeLength = round(rnorm(n = n(), mean = 6, sd = .5), 0), # create fake discrete data
         fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
  group_by(experiment,condition) %>% # flag 'experiment' and 'condition' as categories of interest
  summarise(mean = mean(fakeResid), # summarise by calculating the mean, standard deviation, and standard error for the categories of interest
            sd = sd(fakeResid),
            se = sd / sqrt(length(unique(item)))) %>%
  head() # show the first 6 lines of the data.frame
## `summarise()` regrouping output by 'experiment' (override with `.groups` argument)
## # A tibble: 6 x 5
## # Groups:   experiment [3]
##   experiment condition  mean    sd    se
##   <chr>      <chr>     <dbl> <dbl> <dbl>
## 1 first      Baseline   1.73 0.550 0.159
## 2 first      Treatment  1.71 0.504 0.145
## 3 second     Baseline   1.70 0.553 0.160
## 4 second     Treatment  1.68 0.537 0.155
## 5 third      Baseline   1.66 0.529 0.153
## 6 third      Treatment  1.65 0.546 0.157

We can visualise this summary table very easily, now!
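As a rough sketch of how such a plot might be built, the code below summarises the built-in iris data and draws the group means with error bars. The object names are hypothetical, and the standard error here uses the number of rows per group, which is one common choice and differs from the item-based calculation above. The .groups = "drop" argument simply silences the regrouping message.

```r
library(dplyr)
library(ggplot2)

iris_summary <- iris %>%
  group_by(Species) %>%                     # one summary row per species
  summarise(mean = mean(Sepal.Length),
            sd = sd(Sepal.Length),
            n = n(),                        # number of rows in each group
            se = sd / sqrt(n),
            .groups = "drop")               # return an ungrouped table

summary_plot <- iris_summary %>%
  ggplot(aes(x = Species, y = mean)) +
    geom_col() +                            # bar height taken from the 'mean' column
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2)

summary_plot
```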


4 Ingredients of a ggplot

Since ggplot2 combines different layers into a complex plot, we need to go through what each layer and its respective components do. Many are optional and you will have to experiment to see what components are most important to you.

Crucially, ggplot allows inherited features. This means we can specify something important in the base plot, and each following layer will be aware of it. This is a common refrain in the tidyverse (see: piping). Not only does this cut down on typing, but it also cuts down on places for potential errors and conflicts in your code.

4.1 Base plot

Required

The base plot is a required layer of any ggplot. Not every base plot will contain the same information, but there are a few elements that must be specified. Some of these can be overwritten in later layers, but since that is a relatively advanced operation, we can ignore it for now.

The base plot does not generate a graph. All it does is instantiate the plotting function, specify what data.frame is being used, and list the required aesthetics aes().

ggplot(data, aes(x=condition))

4.1.0.1 Tidyverse and ggplot

Since we are learning the whole tidyverse at once, the following examples will all use the pipe operator instead of calling the data.frame as an argument of ggplot. That means instead of the code in the previous chunk (which is sort of a default syntax), we can use this:

data %>%
  ggplot(aes(x=condition))

It is functionally identical, but it keeps the same tidyverse format as the stuff we were doing previously. This will make it easier to combine data manipulation and visualisation into one sleek operation later on.
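One way to convince yourself that a base plot contains no geometric layers yet is to store it in an object and inspect it. The sketch below uses the built-in iris data; the object names are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# The same base plot, written both ways
base_direct <- ggplot(iris, aes(x = Species))
base_piped  <- iris %>%
  ggplot(aes(x = Species))

# Neither base plot has any layers yet, so nothing is drawn inside the axes
length(base_direct$layers)
length(base_piped$layers)

# Adding a geom adds a layer to the plot
with_bars <- base_piped + geom_bar()
length(with_bars$layers)
```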

4.1.1 Aesthetics

Optional in base plot, required somewhere

Aesthetics are things like your x-axis and y-axis that are mapped directly from columns of your data.frame. That is, if we are plotting a scatterplot from the built-in data iris, we should specify what the axes are so ggplot can figure out what the appropriate range and scale of the plot should be.

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))

4.1.1.1 X-axis

Optional in base plot, required somewhere

The x-axis is required somewhere in the aesthetics of the plot. It does not necessarily need to be in the base plot, but if it’s not in the base plot, it will not be inherited by other components (and thus would need to be re-specified each time).

4.1.1.2 Y-axis

Optional, depending on plot type

The y-axis is optional because some types of plots like histograms and barcharts do not necessarily need it to be specified. However, for most plots, a y-axis will need to be specified either in the base plot or later in a different component’s aesthetics.

4.1.1.3 Colour

Optional

Colour refers to the line or point colour in a plot. For plots that do not need coloured lines to properly visualise the data, the colour aesthetic can remain unspecified. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing colour to a factor in the dataset.

4.1.1.4 Fill

Optional

Fill refers to the colour of a region in a plot, such as the interior of a point, a bar or boxplot. For plots that do not need coloured areas to properly visualise the data, the fill aesthetic can remain unspecified. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing fill to a factor in the dataset.

4.1.1.5 Size

Optional

Size is the point or line size. It can be discrete or continuous, but is larger than 0 and frequently an integer. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing size to a factor in the dataset.

4.1.1.6 Alpha

Optional

Alpha is the transparency of a colour or fill, specified as a real number between 0 and 1. It is particularly useful when visualising a large number of points that overlap. Although it can be specified in aesthetics as a factor in the dataset, it is most often used as a ‘hard coded’ value outside of the aesthetics.
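The difference between mapping an aesthetic inside aes() and fixing it outside aes(), described in the subsections above, can be sketched as follows. The object names and the particular colour and size values are arbitrary choices for illustration, and geom_point() is one of the geoms introduced in the next section.

```r
library(ggplot2)

# Inside aes(): colour and size are mapped from columns of the data,
# so they vary from point to point and a legend is drawn
mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(colour = Species, size = Petal.Width))

# Outside aes(): colour, size and alpha are fixed values,
# so every point looks the same and no legend is needed
fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(colour = "steelblue", size = 3, alpha = 0.5)

mapped
fixed
```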

4.2 Geom

The geometric object (‘geom’) of a plot is effectively what kind of plot you want to make. There is a comprehensive list of geoms on the official website. We’ll explore a few of the most common here.

In order to create a geometric object and layer it on your base plot, you add the new layer to what you already have with the + operator (note that ggplot layers are joined with +, not with the pipe).

4.2.1 Point

Scatter plots can be generated with geom_point() layers.

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point()

We can now combine this type of plot with colours for each of the species (a factor in the dataframe).

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(colour = Species))

However, we can’t tell if there are multiple points overlapping or not, so we can change the alpha so that when multiple points are overlapping, they appear darker (their opacity is compounded). Since this isn’t an explicit property of the data.frame (but rather a quirk of the visualisation), we can keep alpha outside of the aesthetics.

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(colour = Species), alpha=.5)

NB: The colours are automatically chosen from the ggplot palette. You can also change what colours are used, which is discussed in Danielle Turton’s lesson from Adventures in R.

4.2.2 Boxplot

Boxplots, unlike scatterplots, have a factor on the x-axis and continuous numerical data on the y-axis.

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
    geom_boxplot()

This is a simple plot, but it doesn’t have to be. We can add in visual cues to highlight that these three groups differ.

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length, fill=Species)) +
    geom_boxplot()

Much nicer! But now we have redundant information in the legend and the x-axis. We can easily get rid of the legend by adding a theme layer that sets the position of the legend to “none”:

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length, fill=Species)) +
    geom_boxplot() +
    theme(legend.position = "none")

4.2.3 Histogram

For our histogram, a simple version is not very visually pleasing.

iris %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We also get a message that we should choose a better “binwidth”:

iris %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_histogram(binwidth = .5)
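As a hedged alternative, geom_histogram() also accepts a bins argument, which sets the number of bins rather than their width. The value 8 below is an arbitrary choice for illustration, and `p_bins` is a hypothetical name.

```r
library(tidyverse)

p_bins <- iris %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_histogram(bins = 8) # number of bins instead of bin width; 8 is arbitrary

p_bins
```

Specifying either bins or binwidth is enough to stop the message from appearing.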

But what about demonstrating how the different species are distributed in this data?

iris %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_histogram(aes(fill=Species), binwidth = .5)

This isn’t very helpful because the counts for each species are stacked on top of one another. We can see that they seem to have different distributions because the data are quite simple, but we really can’t draw any inferences based on this visualisation by itself. Let’s first only look at ‘setosa’ and ‘virginica’ to simplify the data, then graph them so they are overlapping histograms.

iris %>%
  filter(Species != "versicolor") %>% # only plot 'setosa' and 'virginica', could also be written as the following line:
# filter(Species == "setosa" | Species == "virginica") %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_histogram(aes(fill=Species), binwidth = .5, 
                   alpha = .75, position = "identity")

Now it’s clear that these two only overlap a little bit.

Exercise: On your own, add back in the third group and see if you can choose a binwidth and alpha that make the visualisation clear without being too confusing (this is very hard!). One potential answer is in the .Rmd.

4.2.3.1 Density

Maybe in this case, a density plot would be best because it simplifies and outlines the data (but this is your call in the end):

iris %>%
  ggplot(aes(x = Sepal.Length)) +
    geom_density(aes(fill=Species), alpha = .5, position = "identity")

4.2.4 Smooth

If we want to plot a regression or trend line, we can use geom_smooth(). Here, I’ll overlay the smooth layer on a point layer, but this is optional.

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour=Species)) +
    geom_point() +
    geom_smooth(method = "lm") # using a linear model for the regression line
## `geom_smooth()` using formula 'y ~ x'

Other options for method (for example "loess" or "glm") are described in the geom_smooth() documentation.
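As one example of a different method, here is a minimal sketch using a local regression ("loess") instead of a linear model. Turning off the confidence band with se = FALSE is a choice made here only to keep the plot readable, and `p_loess` is a hypothetical name.

```r
library(tidyverse)

p_loess <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) # curved trend line, no confidence band

p_loess
```

Because colour is mapped to Species in the main ggplot() call, a separate smooth is fitted for each species, just as in the linear version.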

4.2.5 Bar

Bar charts are slightly unusual in the ggplot world since they don’t necessarily need a y-axis specified, but often you will want to specify one anyway.

Let’s go back to the data.frame we’ve called data.

data %>%
  ggplot(aes(x=condition, fill=selection)) +
    geom_bar()

This just counts observations per factor, so we can see a very general overview of the data. However, we aren’t looking at the three experiments separately, so we’re losing some important information. One way to add another dimension to a plot like this is with facet_wrap.

data %>%
  ggplot(aes(x=condition, fill=selection)) +
    geom_bar() +
    facet_wrap(~experiment)

The problem here is that the y-axis is raw counts and we can’t tell what proportion of the data is in which category. We can now combine our summarising skills with our plotting skills to get a clearer picture!

data %>%
  group_by(experiment,condition,selection) %>%
  summarise(count = n(),
            length = 240) %>%
  mutate(proportion = count/length)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)
## # A tibble: 12 x 6
## # Groups:   experiment, condition [6]
##    experiment condition selection count length proportion
##    <chr>      <chr>     <chr>     <int>  <dbl>      <dbl>
##  1 first      Baseline  Option 1    130    240     0.542 
##  2 first      Baseline  Option 2    110    240     0.458 
##  3 first      Treatment Option 1    181    240     0.754 
##  4 first      Treatment Option 2     59    240     0.246 
##  5 second     Baseline  Option 1    106    240     0.442 
##  6 second     Baseline  Option 2    134    240     0.558 
##  7 second     Treatment Option 1    217    240     0.904 
##  8 second     Treatment Option 2     23    240     0.0958
##  9 third      Baseline  Option 1     84    240     0.35  
## 10 third      Baseline  Option 2    156    240     0.65  
## 11 third      Treatment Option 1     54    240     0.225 
## 12 third      Treatment Option 2    186    240     0.775
data %>%
  group_by(experiment,condition,selection) %>%
  summarise(count = n(),
            length = 240) %>%
  mutate(proportion = count/length) %>%
  ggplot(aes(x=condition, fill=selection)) +
    geom_bar(aes(y=proportion), stat="identity") + # now it won't count observations but will create a bar of the height of 'proportion'
    facet_wrap(~experiment)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)
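The hard-coded length = 240 works here because every combination of experiment and condition contains exactly 240 observations (for example, 130 + 110 in the first row pair of the table). As a hedged alternative, the denominator can be taken from the data itself by dividing each count by the total within its group. The sketch below uses a small made-up tibble called `toy` (hypothetical values, not the tutorial's dataset) so that it runs on its own.

```r
library(tidyverse)

# Hypothetical toy data standing in for 'data' (made-up values)
toy <- tibble(
  experiment = rep(c("first", "second"), each = 8),
  condition  = rep(rep(c("Baseline", "Treatment"), each = 4), times = 2),
  selection  = c("Option 1", "Option 1", "Option 2", "Option 2",  # first, Baseline
                 "Option 1", "Option 1", "Option 1", "Option 2",  # first, Treatment
                 "Option 1", "Option 2", "Option 2", "Option 2",  # second, Baseline
                 "Option 1", "Option 1", "Option 1", "Option 1")  # second, Treatment
)

toy_props <- toy %>%
  count(experiment, condition, selection, name = "count") %>% # shortcut for group_by() + summarise(n())
  group_by(experiment, condition) %>%
  mutate(proportion = count / sum(count)) %>% # denominator is the total within each experiment/condition
  ungroup()

toy_props
```

With this approach the proportions within each experiment and condition always sum to 1, even if the groups have different numbers of observations.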

But we can make our graph even nicer! Let’s change the theme so it is a little prettier, outline the bars in grey and add a nice title.

data %>%
  group_by(experiment,condition,selection) %>%
  summarise(count = n(),
            length = 240) %>%
  mutate(proportion = count/length) %>%
  ggplot(aes(x=condition, fill=selection)) +
    geom_bar(aes(y=proportion), stat="identity", colour="grey40") + 
    theme_bw() +
    ggtitle("Example graph") +
    facet_wrap(~experiment)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)

5 Challenge

Can you recreate this pre-processed data.frame and graph from what you know now? Take your time, list what information you need to calculate, list what aesthetics you see. This is a huge challenge and I would not expect anyone to be able to do it on their own. Error bars for this kind of (binomial) data are notoriously difficult! Ask for help from colleagues and the internet. See if you can develop your problem-solving skills in advance of next week’s tutorial. Start easy – change the colours and the y-axis label. Next, try to figure out how to make the y-axis display percentages rather than proportions. If you’re feeling bored, only then try to tackle the error bars.

## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)
## # A tibble: 12 x 7
## # Groups:   experiment, condition [6]
##    experiment condition selection     n      y ciLower ciUpper
##    <chr>      <chr>     <chr>     <int>  <dbl>   <dbl>   <dbl>
##  1 first      Baseline  Option 1    130 0.542  NA       NA    
##  2 first      Baseline  Option 2    110 0.458   0.394    0.524
##  3 first      Treatment Option 1    181 0.754  NA       NA    
##  4 first      Treatment Option 2     59 0.246   0.194    0.306
##  5 second     Baseline  Option 1    106 0.442  NA       NA    
##  6 second     Baseline  Option 2    134 0.558   0.493    0.622
##  7 second     Treatment Option 1    217 0.904  NA       NA    
##  8 second     Treatment Option 2     23 0.0958  0.0630   0.142
##  9 third      Baseline  Option 1     84 0.35   NA       NA    
## 10 third      Baseline  Option 2    156 0.65    0.586    0.710
## 11 third      Treatment Option 1     54 0.225  NA       NA    
## 12 third      Treatment Option 2    186 0.775   0.716    0.825
