What is Tidyverse?
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Tidyverse is a package, or a set of add-on tools, that you can optionally use in R to easily and clearly process and visualise your data. In the tidyverse, there are a number of included packages. You do not need to use them all, nor do you need to load them all, but for simplicity’s sake, it’s easier to load the whole thing and then not worry about it.
The most important (and exciting!) difference between the way base R functions work and the way tidyverse functions work is the pipe: %>%
In short, the pipe (%>%
) takes whatever has already been done in the preceding line(s) and funnels it into the next line. This means complex operations can be performed, including changing or manipulating the data.frame, but it is temporary within the piped lines and will not permanently alter the data. Each line that you pipe to will have a function, and the functions defined inside the tidyverse package are typically referred to as verbs
. I will not use this terminology strictly, but it is good to know.
We’ll get into more complex examples later on.
Main packages
The most useful packages for general data manipulation and visualisation are discussed in this section. To start, let’s read in a data.frame so we can practice setting it up. You can download this dataset to put in your “data” folder (as we discussed in best practices last time).
Once you’ve got your dataset in your data folder, you can read the data into this R session with the following code:
data <- read.csv("data/binomial-data.csv", header=TRUE, as.is=TRUE)
We can view the data a few different ways. The function View()
(note the capidal “V”) will open the data as a data.frame
in a new tab and it will look like a spreadsheet. Try that now:
View(data)
Note that you can’t edit the data.frame
but you can sort the data by column. This doesn’t change anything about the structure of the data.frame
, which you can see because the row numbers stay with their original rows when you sort the data.
We can also view our data in the console. The function head()
and tail()
show us the first or last six rows of a data.frame, respectively. We can view more if you add a number as a second argument:
head(data, 10)
## experiment subject item condition selection selectCode
## 1 first 1 1 Baseline Option 2 0
## 2 first 1 2 Baseline Option 1 1
## 3 first 1 3 Baseline Option 2 0
## 4 first 1 4 Baseline Option 2 0
## 5 first 1 5 Baseline Option 1 1
## 6 first 1 6 Baseline Option 1 1
## 7 first 1 7 Baseline Option 2 0
## 8 first 1 8 Baseline Option 1 1
## 9 first 1 9 Baseline Option 2 0
## 10 first 1 10 Baseline Option 2 0
tibble
When you import a data.frame, there are two arguments I’ve added in: header=TRUE
and as.is=TRUE
. The first one says that the first row is the header row that names the columns. This is optional, but good to specify explicitly. The second argument tells R to not change the class of the column. The most common data classes are:
int
: integers
dbl
: double (a continuous numerical value)
chr
: character (letters and numbers stored as a string)
fctr
: factor (a categorical value)
If you don’t specify as.is=TRUE
(that is, if as.is=FALSE
), then all non-numerical values will imported as factors. This often doesn’t matter, but it can potentially cause issues later on, so I typically assign the value TRUE
.
A tibble is a massively simplified data.frame. You can read in detail here about the way tibbles differ from data.frames. Importantly, using the tidyverse to interact with your data will typically convert your data.frame to a tibble. This won’t make much of a difference 99.9% of the time, but you might notice some differences in the way it displays.
as.tibble(data)
## Warning: `as.tibble()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1,440 x 6
## experiment subject item condition selection selectCode
## <chr> <int> <int> <chr> <chr> <int>
## 1 first 1 1 Baseline Option 2 0
## 2 first 1 2 Baseline Option 1 1
## 3 first 1 3 Baseline Option 2 0
## 4 first 1 4 Baseline Option 2 0
## 5 first 1 5 Baseline Option 1 1
## 6 first 1 6 Baseline Option 1 1
## 7 first 1 7 Baseline Option 2 0
## 8 first 1 8 Baseline Option 1 1
## 9 first 1 9 Baseline Option 2 0
## 10 first 1 10 Baseline Option 2 0
## # … with 1,430 more rows
tidyr
The tidyr
package is designed to help you tidy your data without having to go through it by hand. This cuts down on typos and the amount of time and effort needed to put your data into an analysable form. In particular, gather()
and spread()
will be useful for turning wide data into long data and vice versa. It also provides a number of tools for tidying your data.
dplyr
Like tidyr
, the package dplyr
provide tools for organising and manipulating your data without having to go in and alter anything by hand. Some of the most useful components are mutate()
and filter()
, which add columns and reduce rows, respectively. This package also provides a number of ways to summarise your data for ease of display.
ggplot2
Finally, ggplot2
is probably the best known package in the tidyverse. It provides a clean, highly flexible, highly customisable way of visualising data. It does so by layering attributes one at a time and synthesising them as a whole so that the appearance and content can be tweaked and adjusted with high granularity.
Basic data manipulation
When you initially read data into your R session, it may not yet be in the most useful or appropriate form. Therefore, we must pre-process the data before we can analyse or visualise it. This is where basic data manipulation can come in handy.
Piping
“Piping” data from one line to the next does not actually alter the data permanently unless you overwrite your original dataset.
As a quick demonstration, the following two chunks of code do the exact same thing, but one has many embedded functions, whereas the tidyverse version is much more legible. Each chunk of code finds within the iris
dataset the rows that contain irises of species “versicolor” with sepal lengths of less than 5 and petal widths of greater than .3 (of which there is one).
Base R:
head(iris[iris$Sepal.Length<5 & iris$Species=="versicolor" & iris$Petal.Width > 0.3,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 58 4.9 2.4 3.3 1 versicolor
Tidyverse:
iris %>%
filter(Sepal.Length < 5,
Species == "versicolor",
Petal.Width > 0.3) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.9 2.4 3.3 1 versicolor
In the first chunk (base R), the functions and logical operators are embedded within each other. This can cause problems if the brackets mismatch in type or number. It also can get confusing as the row specification is very long and complex so it can be easy to lose your place when reading (or for that matter, when writing it).
In the second chunk (tidyverse), the functions are ordered and laid out neatly. Because there is very little embedding, it is easier to keep track of brackets. It is also much easier to read because the verbs (i.e., functions) are explicitly named rather than having to remember what the square brackets []
do in contrast with the round parentheses ()
.
Let’s say we’re working with our imported data which has three experiments in it…
data %>%
head() # show the first 6 lines of the data.frame
## experiment subject item condition selection selectCode
## 1 first 1 1 Baseline Option 2 0
## 2 first 1 2 Baseline Option 1 1
## 3 first 1 3 Baseline Option 2 0
## 4 first 1 4 Baseline Option 2 0
## 5 first 1 5 Baseline Option 1 1
## 6 first 1 6 Baseline Option 1 1
- NB: I’ll be using the hash symbol
######
to set off the new code we’re adding in as we build up our data manipulation workflow.
Filter
What if we only want to work with a subset of our data.frame? This is where filter() comes in. As the name implies, this function filters your dataset based on the specified set of criteria, but we only want to work with the second experiment. In that case, we can do the following:
data %>%
######
filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
######
head() # show the first 6 lines of the data.frame
## experiment subject item condition selection selectCode
## 1 second 1 1 Baseline Option 1 1
## 2 second 1 2 Baseline Option 1 1
## 3 second 1 3 Baseline Option 2 0
## 4 second 1 4 Baseline Option 2 0
## 5 second 1 5 Baseline Option 1 1
## 6 second 1 6 Baseline Option 1 1
Now, we can “pipe” this data to our next operations for visualisation or analysis and only the second experiment will available to the next operation.
Mutate
The verb mutate() allows us to add and change columns. The current dataset we’re working with doesn’t have a lot going on, but we can add a column renaming the conditions contingent on the experiment they’re in.
data %>%
######
mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
experiment == "first" & condition == "Treatment" ~ "RP",
experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
experiment == "second" & condition == "Treatment" ~ "New form",
TRUE ~ as.character(condition) # exp3 conditions stay the same
)
) %>%
######
head() # show the first 6 lines of the data.frame
## experiment subject item condition selection selectCode
## 1 first 1 1 Gap Option 2 0
## 2 first 1 2 Gap Option 1 1
## 3 first 1 3 Gap Option 2 0
## 4 first 1 4 Gap Option 2 0
## 5 first 1 5 Gap Option 1 1
## 6 first 1 6 Gap Option 1 1
The other common way to use mutate() is when you have a numerical operation (such as calculating residualised value by, say, word length). In this case, we can quickly add some fake columns to demonstrate how this would work (while adding back in our filter() operation from before):
set.seed(12345)
data %>%
filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
######
mutate(fakeValue = rnorm(n=selectCode, mean = 10, sd = 3), # create fake continuous data
fakeLength = round(rnorm(n=selectCode, mean = 6, sd = .5), 0), # create fake discrete data
fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
######
mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
experiment == "first" & condition == "Treatment" ~ "Resumptive Pronoun",
experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
experiment == "second" & condition == "Treatment" ~ "New form",
TRUE ~ as.character(condition) # exp3 conditions stay the same
)
) %>%
head() # show the first 6 lines of the data.frame
## experiment subject item condition selection selectCode fakeValue fakeLength
## 1 second 1 1 Old form Option 1 1 11.756586 6
## 2 second 1 2 Old form Option 1 1 12.128398 6
## 3 second 1 3 Old form Option 2 0 9.672090 6
## 4 second 1 4 Old form Option 2 0 8.639508 6
## 5 second 1 5 Old form Option 1 1 11.817662 5
## 6 second 1 6 Old form Option 1 1 4.546132 6
## fakeResid
## 1 1.9594311
## 2 2.0213997
## 3 1.6120150
## 4 1.4399181
## 5 2.3635325
## 6 0.7576887
Transmute
Not every column in this dataset is useful to us right now. Let’s say we only care about “subject”, “condition” and “fakeResid” now that we’ve applied our filter. In this case, we can use transmute() to reorganise and manipulate only the columns we want to keep. (We could also reorder the columns if we wanted by changing the order in which we call them as arguments.)
set.seed(12345)
data %>%
filter(experiment == "second") %>% # only include the rows in which "experiment" contains the string "second"
mutate(fakeValue = rnorm(n=selectCode, mean = 10, sd = 3), # create fake continuous data
fakeLength = round(rnorm(n=selectCode, mean = 6, sd = .5), 0), # create fake discrete data
fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
mutate(condition = case_when(experiment == "first" & condition == "Baseline" ~ "Gap", # exp1 has two conditions: Gap and RP
experiment == "first" & condition == "Treatment" ~ "Resumptive Pronoun",
experiment == "second" & condition == "Baseline" ~ "Old form", # exp 2 has two conditions: Old and New
experiment == "second" & condition == "Treatment" ~ "New form",
TRUE ~ as.character(condition) # exp3 conditions stay the same
)
) %>%
######
transmute(subject = subject, # subject stays the same
condition = as.factor(condition), # convert condition to a factor
residualValue = fakeResid) %>% # rename 'fakeResid'
######
head() # show the first 6 lines of the data.frame
## subject condition residualValue
## 1 1 Old form 1.9594311
## 2 1 Old form 2.0213997
## 3 1 Old form 1.6120150
## 4 1 Old form 1.4399181
## 5 1 Old form 2.3635325
## 6 1 Old form 0.7576887
Excellent! Now we’ve pre-processed our data and can pipe it to our analysis or visualisation. (But we won’t, not quite yet.) One thing to note here is that head() is used only as a convention for display on this page. When we use head() during our actual pre-processing stage, we are cutting out every row beyond the 6th. When you do this on your own, you should leave out the head() function.
Gather
A different kind of pre-processing operation changes the shape of the data.frame. Often, what can happen is that you receive un-processed data that has something like the following structure (one subject per row, many columns).
wide <- read.csv("ExampleProject/data/wide-data.csv")
wide %>%
head()
## name age nationality gender L1 L2 favColour
## 1 James 21 English male English <NA> blue
## 2 Jenny 20 American female English Spanish green
## 3 Jordan 22 Canadian female French English purple
This can sometimes be useful, but for data analysis and visuatisation in R, it’s best to have one observation per row with very few columns. In this case, we can use gather()
. The first argument is typically the data.frame we are working with, but since we can start with the name of our data frame (in this case, wide
), we can pipe it directly into the gather()
verb and leave the argument defining the data.frame implicit. The next two arguments are the names of the columns we’re creating. By default, these are “key” and “value”, but we can call them anything we’d like. The last argument picks out which columns to gather. In this case, I want names to stay separate, and I want to gather all other columns into one long column.
wide %>%
######
gather("demographic","answer",age:favColour) %>%
######
head()
## name demographic answer
## 1 James age 21
## 2 Jenny age 20
## 3 Jordan age 22
## 4 James nationality English
## 5 Jenny nationality American
## 6 Jordan nationality Canadian
Spread
This can also go the other direction. If we have long data but we want it to be wide (though this is less common), we can use the spread
verb.
long <- read.csv("ExampleProject/data/long-data.csv",as.is = TRUE)
long %>%
head()
## subject measure value
## 1 1 Nationality English
## 2 1 Age 21
## 3 1 Savings 85 GBP
## 4 2 Nationality Scottish
## 5 2 Age 22
## 6 2 Savings 200 GBP
Now, I want the column called “measure” to be spread across several columns.
long %>%
######
spread(key=measure,value=value) %>%
######
head()
## subject Age Nationality Savings
## 1 1 21 English 85 GBP
## 2 2 22 Scottish 200 GBP
## 3 3 18 N.Irish 125 GBP
## 4 4 24 Welsh 300 GBP
## 5 5 20 American 105 USD
Although it’s less common, this verb is useful for displaying small tables and sometimes it can help generate graphs. In any case, it’s good to be aware of.
Separate
The last bit of tidyr
we’ll discuss for now is separate, which takes a column and splits it into two based on some character (which should be present in most if not all rows of that column). Below, we can take our long
dataset, once it’s been spread, and split the column called “Savings” into the numeric amount and the currency.
long %>%
spread(key=measure,value=value) %>%
######
separate(Savings,into=c("amount","currency"),sep=" ") %>%
######
head()
## subject Age Nationality amount currency
## 1 1 21 English 85 GBP
## 2 2 22 Scottish 200 GBP
## 3 3 18 N.Irish 125 GBP
## 4 4 24 Welsh 300 GBP
## 5 5 20 American 105 USD
Note quickly that “Age” and “amount” contain numbers but are still of class character
. We can easily change this with mutate(), as we did before:
long %>%
spread(key=measure,value=value) %>%
separate(Savings,into=c("amount","currency"),sep=" ") %>%
######
mutate(Age = as.integer(Age),
amount = as.numeric(amount)) %>%
######
head()
## subject Age Nationality amount currency
## 1 1 21 English 85 GBP
## 2 2 22 Scottish 200 GBP
## 3 3 18 N.Irish 125 GBP
## 4 4 24 Welsh 300 GBP
## 5 5 20 American 105 USD
Combining datasets
If you have multiple datasets that you wish to combine, dplyr and tidyverse provide a number of elegant ways of doing so. As this is a more advanced procedure, I’ll leave you with this link: https://dplyr.tidyverse.org/reference/bind.html
Summarising a data table
Sometimes, we don’t actually care about our raw data but rather we want to visualise or analyse a summary of the raw data. This is particularly useful in visualisation, as bar charts and error bars can be difficult for the visualisation tools to calculate on the fly.
Group by
The function group_by() doesn’t appear to do anything on its own. What it does is flag columns for summary when passed to the next verb (i.e., summarise(), described below). This means we can use the group_by() verb to select which categories are relevant to our analysis. Let’s take our fake data from earlier:
set.seed(12345)
data %>%
mutate(fakeValue = rnorm(n=selectCode, mean = 10, sd = 3), # create fake continuous data
fakeLength = round(rnorm(n=selectCode, mean = 6, sd = .5), 0), # create fake discrete data
fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
head() # show the first 6 lines of the data.frame
## experiment subject item condition selection selectCode fakeValue fakeLength
## 1 first 1 1 Baseline Option 2 0 11.756586 6
## 2 first 1 2 Baseline Option 1 1 12.128398 6
## 3 first 1 3 Baseline Option 2 0 9.672090 6
## 4 first 1 4 Baseline Option 2 0 8.639508 6
## 5 first 1 5 Baseline Option 1 1 11.817662 6
## 6 first 1 6 Baseline Option 1 1 4.546132 6
## fakeResid
## 1 1.9594311
## 2 2.0213997
## 3 1.6120150
## 4 1.4399181
## 5 1.9696104
## 6 0.7576887
Now, let’s group by experiment and condition.
set.seed(12345)
data %>%
mutate(fakeValue = rnorm(n=selectCode, mean = 10, sd = 3), # create fake continuous data
fakeLength = round(rnorm(n=selectCode, mean = 6, sd = .5), 0), # create fake discrete data
fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
group_by(experiment,condition) %>% # flag 'experiment' and 'condition' as categories of interest
head() # show the first 6 lines of the data.frame
## # A tibble: 6 x 9
## # Groups: experiment, condition [1]
## experiment subject item condition selection selectCode fakeValue fakeLength
## <chr> <int> <int> <chr> <chr> <int> <dbl> <dbl>
## 1 first 1 1 Baseline Option 2 0 11.8 6
## 2 first 1 2 Baseline Option 1 1 12.1 6
## 3 first 1 3 Baseline Option 2 0 9.67 6
## 4 first 1 4 Baseline Option 2 0 8.64 6
## 5 first 1 5 Baseline Option 1 1 11.8 6
## 6 first 1 6 Baseline Option 1 1 4.55 6
## # … with 1 more variable: fakeResid <dbl>
Nothing has changed (yet). However…
Summarise
When we summarise our data, we can see how group_by() has flagged the categories we’re interested in.
set.seed(12345)
data %>%
mutate(fakeValue = rnorm(n=selectCode, mean = 10, sd = 3), # create fake continuous data
fakeLength = round(rnorm(n=selectCode, mean = 6, sd = .5), 0), # create fake discrete data
fakeResid = fakeValue / fakeLength) %>% # calculate fake value residualised by fake length
group_by(experiment,condition) %>% # flag 'experiment' and 'condition' as categories of interest
summarise(mean = mean(fakeResid), # summarise by calculating the mean, standard deviation, and standard error for the categories of interest
sd = sd(fakeResid),
se = sd / sqrt(length(unique(item)))) %>%
head() # show the first 6 lines of the data.frame
## `summarise()` regrouping output by 'experiment' (override with `.groups` argument)
## # A tibble: 6 x 5
## # Groups: experiment [3]
## experiment condition mean sd se
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 first Baseline 1.73 0.550 0.159
## 2 first Treatment 1.71 0.504 0.145
## 3 second Baseline 1.70 0.553 0.160
## 4 second Treatment 1.68 0.537 0.155
## 5 third Baseline 1.66 0.529 0.153
## 6 third Treatment 1.65 0.546 0.157
We can visualise this summary table very easily, now!
## `summarise()` regrouping output by 'experiment' (override with `.groups` argument)
Ingredients of a ggplot
Since ggplot2 combines different layers into a complex plot, we need to go through what each layer and its respective components do. Many are optional and you will have to experiment to see what components are most important to you.
Crucially, ggplot allows inherited features. This means we can specify something important in the base plot, and each following layer will be aware of it. This is a common refrain in the tidyverse (see: piping). Not only does this cut down on typing, but it also cuts down on places for potential errors and conflicts in your code.
Base plot
Required
The base plot is a required layer of any ggplot. Not every base plot will contain the same information, but there are a few elements that must be specified. Some of these can be overwritten in later layers, but since that is a relatively advanced operation, we can ignore it for now.
The base plot does not generate a graph. All it does is instantiate the plotting function, specify what data.frame is being used, and list the required aesthetics aes()
.
ggplot(data, aes(x=condition))
Tidyverse and ggplot
Since we are learning the whole tidyverse at once, the following examples will all use the pipe operator instead of calling the data.frame as an argument of ggplot. That means instead of the code in the previous chunk (which is sort of a default syntax), we can use this:
data %>%
ggplot(aes(x=condition))
It is functionally identical, but it keeps the same tidyverse format as the stuff we were doing previously. This will make it easier to combine data manipulation and visualisation into one sleek operation later on.
Aesthetics
Optional in base plot, required somewhere
Aesthetics are things like your x-axis, y-axis that are directly inferred from your data.frame. That is, if we are plotting a scatterplot from the built-in data iris
, we should specify what the axes are so ggplot can figure out what the appropriate range and scale of the plot should be.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))
X-axis
Optional in base plot, required somewhere
The x-axis is required somewhere in the aesthetics of the plot. It does not necessarily need to be in the base plot, but if it’s not in the base plot, it might not be inherited to other components (thus would need to be re-specified each time).
Y-axis
Optional, depending on plot type
The y-axis is option because some types of plots like histograms and barcharts do not necessarily need it to be specified. However, for most plots, a y-axis will need to be specified either in the base plot or later in a different component’s aesthetics.
Colour
Optional
Colour refers to the line or point colour in a plot. For plots that do not need coloured lines to properly visualise the data, the colour aesthetic can remain unspecified. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing colour to a factor in the dataset.
Fill
Optional
Fill refers to the colour of a region in a plot, such as the interior of a point, a bar or boxplot. For plots that do not need coloured areas to properly visualise the data, the fill aesthetic can remain unspecified. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing fill to a factor in the dataset.
Size
Optional
Size is the point or line size. It can be discrete or continuous, but is larger than 0 and frequently an integer. It can also be specified outside of the aesthetics to change the overall appearance of the plot without attributing size to a factor in the dataset.
Alpha
Optional
Alpha is the transparency of a colour or fill, specified as a real number between 0 and 1. It is particularly useful when visualising a large number of points that overlap. Although it can be specified in aesthetics as a factor in the dataset, it is most often used as a ‘hard coded’ value outside of the aesthetics.
Geom
The geometric object (‘geom’) of a plot is effectively what kind of plot you want to make. There is a comprehensive list of geoms on the official website. We’ll explore a few of the most common here.
In order to create a geometric object and layer it on your base plot, you will have to add the new layer to what you already have.
Point
Scatter plots can be generated with geom_point()
layers.
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
We can now combine this type of plot with colours for each of the species (a factor in the dataframe).
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(colour = Species))
However, we can’t tell if there are multiple points overlapping or not, so we can change the alpha so that when multiple points are overlapping, they appear darker (their opacity is compounded). Since this isn’t an explicit property of the data.frame (but rather a quirk of the visualisation), we can keep alpha outside of the aesthetics.
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(colour = Species), alpha=.5)
NB: The colours are automatically chosen from the ggplot palette. You can also change what colours are used, which is discussed in Danielle Turton’s lesson from Adventures in R.
Boxplot
Boxplots, unlike scatterplots, have a factor on the x-axis and continuous numerical data on the y-axis.
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
This is a simple plot, but it doesn’t have to be. We can add in visual cues to highlight that these three groups differ.
iris %>%
ggplot(aes(x = Species, y = Sepal.Length, fill=Species)) +
geom_boxplot()
Much nicer! But now we have redundant information in the legend and the x-axis. We can easily get rid of the legend by adding a theme
layer that says the position of the legend is null
:
iris %>%
ggplot(aes(x = Species, y = Sepal.Length, fill=Species)) +
geom_boxplot() +
theme(legend.position = "null")
Histogram
For our histogram, a simple version is not very visually pleasing.
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We also get a warning that we shoud choose a better “binwidth”:
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(binwidth = .5)
But what about demonstrating how the different species are distributed in this data?
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(aes(fill=Species), binwidth = .5)
This isn’t very helpful because the numbers are stacked. We can see that they seem to have different distributions because the data are quite simple, but we really can’t draw any inferences based on this visualisation by itself. Let’s first only look at ‘setosa’ and ‘virginica’ to simplify the data, then graph them so they are overlapping histograms.
iris %>%
filter(Species != "versicolor") %>% # only plot 'setosa' and 'virginica', could also be written as the following line:
# filter(Species == "setosa" | Species == "virginica") %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(aes(fill=Species), binwidth = .5,
alpha = .75, position = "identity")
Now it’s clear that these two only overlap a little bit.
Exercise: On your own, add back in the third group and see if you can choose a binwidth
and alpha
that make the visualisation clear without being too confusing (this is very hard!). One potential answer is in the .Rmd.
Density
Maybe in this case, a density plot would be best because it simplifies and outlines the data (but this is your call in the end):
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_density(aes(fill=Species), alpha = .5, position = "identity")
Smooth
If we want to plot a regression or trend line, we can use geom_smooth(). Here, I’ll overlay the smooth later on a point layer, but this is optional.
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour=Species)) +
geom_point() +
geom_smooth(method = "lm") # using a linear model for the regression line
## `geom_smooth()` using formula 'y ~ x'
Other options for method
are described here.
Bar
Bar charts are slightly unusual in the ggplot world since they don’t necessarily need a y-axis specified, but often you will want to specify one anyway.
Let’s go back to the data.frame we’ve called data
.
data %>%
ggplot(aes(x=condition, fill=selection)) +
geom_bar()
This just counts observations per factor, so we can see a very general overview of the data. However, we aren’t looking at the three experiments separately, so we’re losing some important information. One way to add another dimension to a plot like this is with facet_wrap
.
data %>%
ggplot(aes(x=condition, fill=selection)) +
geom_bar() +
facet_wrap(~experiment)
The problem here is that the y-axis is raw counts and we can’t tell what proportion of the data is in which category. We can now combine our summarising skills with our plotting skills to get a clearer picture!
data %>%
group_by(experiment,condition,selection) %>%
summarise(count = n(),
length = 240) %>%
mutate(proportion = count/length)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)
## # A tibble: 12 x 6
## # Groups: experiment, condition [6]
## experiment condition selection count length proportion
## <chr> <chr> <chr> <int> <dbl> <dbl>
## 1 first Baseline Option 1 130 240 0.542
## 2 first Baseline Option 2 110 240 0.458
## 3 first Treatment Option 1 181 240 0.754
## 4 first Treatment Option 2 59 240 0.246
## 5 second Baseline Option 1 106 240 0.442
## 6 second Baseline Option 2 134 240 0.558
## 7 second Treatment Option 1 217 240 0.904
## 8 second Treatment Option 2 23 240 0.0958
## 9 third Baseline Option 1 84 240 0.35
## 10 third Baseline Option 2 156 240 0.65
## 11 third Treatment Option 1 54 240 0.225
## 12 third Treatment Option 2 186 240 0.775
data %>%
group_by(experiment,condition,selection) %>%
summarise(count = n(),
length = 240) %>%
mutate(proportion = count/length) %>%
ggplot(aes(x=condition, fill=selection)) +
geom_bar(aes(y=proportion), stat="identity") + # now it won't count observations but will create a bar of the height of 'n'
facet_wrap(~experiment)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)
But we can make our y-axis even nicer! Let’s change the theme so it is a little prettier and add a nice title.
data %>%
group_by(experiment,condition,selection) %>%
summarise(count = n(),
length = 240) %>%
mutate(proportion = count/length) %>%
ggplot(aes(x=condition, fill=selection)) +
geom_bar(aes(y=proportion), stat="identity", colour="grey40") +
theme_bw() +
ggtitle("Example graph") +
facet_wrap(~experiment)
## `summarise()` regrouping output by 'experiment', 'condition' (override with `.groups` argument)