⇠ RfficeHours Home || Tutorial 2 ⇢
Learning Outcomes | |
---|---|
1 | You will be able to open RStudio and know where to find basic functions |
2 | You will be able to create and organise a new project |
3 | You will be able to load your data into R and view it |
4 | You will know the purpose and of code annotation and commenting |
Before you can use R and RStudio, it’s important to understand what you’re looking at and where to find things.
R is the program and programming language that allows you to input commands and get the computer to do things. In order to interact with R, some people use a simple R interface, some people use the command line, and some people use RStudio.
RStudio is a GUI (Graphical User Interface) that allows you to interact with R and keep everything organised.
We’ll be using RStudio because it is the easiest to use, with some point-and-click commands, but still with the full functionality and power of R. R runs in the background when you run RStudio, but RStudio takes care of that on its own so all you need to do is open RStudio.
The RStudio interface has (up to) four panes that you can rearrange and customise to suit your needs. Here is the default configuration:
When you open RStudio for the first time, there may only be three panes.
The console is your direct line of communication with R. It operates a bit like a chat window (if you’re familiar with that) because you can type things into the console, hit ENTER
, and R will do something (and sometimes, depending on what you type, it’ll respond). You know the console is listening and ready to accept a new command if you see a >
on the left edge of the window on the lowest line of text.
The environment is a set of three tabs that effectively lets us see into “R’s brain” (thanks Danielle!). This window lets you see what variables you’ve created, what datasets are loaded in, what packages and libraries are loaded, among many other things. Right now, it’s empty.
If we click on Global Environment
, we can see what packages have automatically be loaded into this R session. Once you load other packages in, you will see them here too.
The pane in the lower right has five tabs by default.
Zoom
to pop the graph out into a separate window and resize it.Export
to save the plot as a PNG, PDF, EPS, etc.How do you keep all your related files organised?
Keeping your files organised will make your life infinitely easier and will help you ease back into using R if you come back to it after a hiatus. The RStudio interface will help with this, but there is no better foundation than good file management.
Danielle Turton and I suggest the following as a best practice:
RStudio offers a neat feature called a project (file type .Rproj
). A project keeps all your scripts and datasets handy and can save variables for use later, even if you quit and restart RStudio. This can be useful if you are running complex models, for instance.
To create a new project, click on the small triangle in the upper right corner of the RStudio window.
This will bring up a menu with a number of useful options, but right now, we want to create a New Project…
Since we have already created a folder (i.e., a directory) for our project, we can click on the Existing Directory option. If you haven’t created a folder for your project, you can create a new directory, but in this case you should still follow best practices for file organisation.
From this window, we can browse our computers for the location of the folder we’ve set up.
I have noticed that sometimes students whose computers are set up with a non-English language, particularly with a non-Latin alphabet, can run into problems with setting up these folders. It is important to instruct them to use only Latin characters for the folders that R and RStudio will access, as non-Latin characters use a type of encoding that not all programs can read.
Scripts will appear in the fourth pane (top left by default).
Scripts are simply text files, they are not R and they don’t do anything unless you perform specific actions on them. They’re a bit like instruction manuals, but R can only read them if you manually send the instructions to the console (more on this later).
A script is a way to save your work so you only need to write the code once. Once you’ve written a script once, you can execute it as many times as you like, but you won’t need to write it again. You can copy and paste other people’s code into your script and tweak it to fit your needs. You can debug your code without having to type it in new each time. You can share your code with others, and you can leave comments to yourself in your code so that you can leave it sit for a while and then remember what you were doing when you come back to it.
On all platforms: the green + symbol in the top left corner will let you create a new script.
On a Mac: command
+shift
+N
On a PC:
The symbol #
hides the text that follows on that line from R, that is, it “comments it out”. This lets you write comments to yourself and to anyone else who might read your code. You’ll want to do this so you can remember what you were doing (trust me, you will not remember) and so other people can replicate what you did (even if it’s just to help you debug your code later).
If you type print("Hello World!")
(with a nice comment) and then hit ENTER
in your console, you will see something like this:
> print("Hello World!") # this line will produce the text between the quotes
[1] "Hello World!"
If you type print("Hello World!")
and hit ENTER
in a script, you will see something like this:
print("Hello World!")
(Note that nothing happens. There is no output. A script is just a text file.)
Open a new script and type print("Hello World!")
into it. How do you get R to execute your code? There are a couple ways:
Run
button on the top right portion of the script pane.
Command
+Enter
Command
+Enter
Command
+Shift
+Enter
When you run a script, the text in the script is sent, line by line, to the console. Once in the console, R executes the code. You can watch the code progress in the console. If there is any text or numerical output, it will appear in the console. If there is a graphical output, it will appear in the Plots tab of the lower right pane (Files etc).
print("Hello world!")
3 + 4
7^2
Your console should end up looking like this:
> print("Hello world!")
[1] "Hello world!"
> 3 + 4
[1] 7
> 7^2
[1] 49
>
(In the rest of this document, instead of output starting with >
, it may start with ##
. This is due to how R compiles a script to an HTML document and does not change anything in the script, code, or contentful output.)
[1]
mean?The number in square brackets to the left of the output indicates the number of values that have been printed by index number. That is, if there are multiple outputs on one line, [1] will appear. If the seventh item in the output wraps around and appears on the second line, both [1] and [7] will appear. This is useful when trying to make sense of long lists of numbers, for instance.
Below, I saved two longer strings of text into one variable called longText
. Then when it’s printed, it appears on two lines. The [1]
indicates the first item in the list is the first item on the line and the [2]
indicates the second item in the list is the first item that appears on that line.
longText <- c("i don't think we'll be able to fit this text on one line of the output console","so that it might wrap around and display on two lines")
print(longText)
## [1] "i don't think we'll be able to fit this text on one line of the output console"
## [2] "so that it might wrap around and display on two lines"
One of the most important components of (almost) any programming language is a variable. A variable is an object that can be assigned a value, a list of values, a matrix of values, or something along those lines. A variable in R can be named almost anything with a few exceptions:
A variable name…
.
or _
but cannot contain other non-alphanumeric charactersTo assign a value (etc) to a variable, there are three possible operators: =
, <-
, or occasionally ->
(the latter two look like arrows).
First we assign the value of 3 to the variable x
. Notice that spaces between the numbers and the operator are optional. I like to use them to help see each component more clearly.
x = 3 # or x=3
x
## [1] 3
We can also use one of the arrow operators. The two characters that comprise the arrow cannot have space between them. They act as a single unit. For the leftward-facing arrow, the value(s) on the right are assigned to the variable on the left.
y <- 4 # or y<-4
y
## [1] 4
Now that we’ve assigned values to x
and y
, we can do things to them. Below, I demonstrate some basic arthimetic operators that allow R to act like a calculator.
x + y
## [1] 7
x * y
## [1] 12
x / y
## [1] 0.75
x ^ y
## [1] 81
Finally, the rightward-pointing arrow is used very rarely, but functions in a similar way to the other two assignment operators. The value(s) on the left are assigned to the variable on the right.
x + y -> z # or z <- x + y or z=x+y
z
## [1] 7
To assign more than one value to a variable, there are several functions we can use, but the most common and easiest is c()
. To read more about the function c()
, you can type c
into the search bar in the help tab in the lower right pane, or you can input the following into your console:
?c
In short, this function combines a series of values into a vector or list.
c(1,2,3)
## [1] 1 2 3
c("1","2","3")
## [1] "1" "2" "3"
c("one","two","three")
## [1] "one" "two" "three"
c(one,two,three)
## Error in eval(expr, envir, enclos): object 'one' not found
Note that bare numbers are green and are output without quotation marks, numbers and words in quotation marks are output as such, but words without quotation marks produce an error. This is because R assumes that all words without quotation marks are variables, but we haven’t created variables named one
, two
, or three
.
one = 1
two = 2
three = 3
c(one,two,three)
## [1] 1 2 3
Now that we’ve gone through the fundamentals of organisation and basic R operators, we can get into the more important task of reading data and writing output (i.e., saving our work).
Once you’ve created a project (.Rproj), you can open the project and it should open right to where you left off. This means your scripts will open as well, as long as you did not close them the last time you worked on the project. However, if you want to open a script that isn’t currrently open but does exist, you can do so in the Files tab of the lower right pane. Click on the name of the script, and it will open in a new tab of your scripts pane (top left by default).
In your script (or, if you don’t take my advice, in your console), you can open and manipulate a dataset.
Right click and save this file to your folder named data
.
data <- read.csv("data/binomial-data.csv")
Now, if you want to view the first six rows of the dataset to see what types of things are in it, you can use the function head()
.
head(data)
## experiment subject item condition selection selectCode
## 1 first 1 1 Baseline Option 2 0
## 2 first 1 2 Baseline Option 1 1
## 3 first 1 3 Baseline Option 2 0
## 4 first 1 4 Baseline Option 2 0
## 5 first 1 5 Baseline Option 1 1
## 6 first 1 6 Baseline Option 1 1
You can also look at the bottom six rows with tail()
tail(data)
## experiment subject item condition selection selectCode
## 1435 third 20 19 Treatment Option 1 1
## 1436 third 20 20 Treatment Option 1 1
## 1437 third 20 21 Treatment Option 2 0
## 1438 third 20 22 Treatment Option 1 1
## 1439 third 20 23 Treatment Option 2 0
## 1440 third 20 24 Treatment Option 1 1
If we want to find out what each of the experiments are called, or how many conditions there are, we can use the function unique()
.
Note that we must specify what variable name we’ve given our dataset (data
) and what column we want to query (experiment
). The relationship between the dataset and the column in the dataset is specified with the $
operator.
unique(data$experiment)
## [1] "first" "second" "third"
There are three experiments, and their names are "first"
, "second"
, and "third"
.
If we want to know how many unique values there are in a column, we can embed the unique()
function in the length()
function.
length(unique(data$experiment))
## [1] 3
There are three different experiments in this dataset.
Finally, if we want to find out how many unique items there are in experiment “first” only, we can specify a subset of the dataset using square brackets []
. Before we do that, though, let’s go through what the arguments for the square brackets are.
An index refers to the location of an item in a list, vector, or matrix.
If you have a list that looks like the one below, green is at index 4
(in the R programming language).
["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
If you have a matrix like the one below, you must specify the index of the row and/or column ([row,column]
).
colour | wordLength | waveLengthNM |
---|---|---|
red | 3 | 700 |
orange | 6 | 630 |
yellow | 6 | 600 |
green | 5 | 550 |
blue | 4 | 470 |
indigo | 6 | 425 |
violet | 6 | 400 |
The location of green is in the fourth row and the first column, so we can identify it as [4,1]
. If we only specify the row ([4,]
), we will get everything in the fourth row ("green", 4, 470
). If we only specify the first column, we will get a list of colours, like we did above:
("red", "orange", "yellow", "green", "blue", "indigo", "violet"
)
Now, going back to our dataset, if we want to view the content of one column, such as experiment
, we can specify that column with the $
operator. But if you want to view a subset of a column, we can also specify what criteria we want to view (or don’t want to view).
For instance, if we want to view only the second experiment (which is called second
in the experiment
column, helpfully enough), we can use the code below:
data[data$experiment=="second",]
## experiment subject item condition selection selectCode
## 481 second 1 1 Baseline Option 1 1
## 482 second 1 2 Baseline Option 1 1
## 483 second 1 3 Baseline Option 2 0
## 484 second 1 4 Baseline Option 2 0
## 485 second 1 5 Baseline Option 1 1
## 486 second 1 6 Baseline Option 1 1
## 487 second 1 7 Baseline Option 2 0
## 488 second 1 8 Baseline Option 1 1
## 489 second 1 9 Baseline Option 2 0
## 490 second 1 10 Baseline Option 2 0
Let’s break the syntax of this code down.
First, we write data
because we want to look within our dataset that we’ve named “data”. Then we put in the square brackets []
after data
because we want to look at a subset of the dataset. Inside of the square brackets, we want to specify what column has the information that we can use for identifying the target subset. Do to that, we write data$experiment
to say that the “experiment”column in “data” has the important information, then ==
to check each value to see if it’s the same as "second"
. We finally put a comma ,
as the last character within the square brackets to specify we want to see all the rows that match these criteria (remember: [row,column]
).
I won’t go into this more now, but ==
is a very different operator to =
. While =
assigns a value, ==
asks whether or not two things are equal. The opposite of ==
is !=
(pronounced “bang-equals” or “not-equal-to”).
Going back to the unique()
function, if we want to query how many unique items are in the second experiment only, we can use the following code:
length(unique(data$item[data$experiment=="second"]))
## [1] 24
This only differs from the previous example in a small number of ways. Instead of querying a two dimensional matrix (data
has rows and columns), we want to query a one dimensional vector (item
), which is a subset of data
. This is why there is no comma at the end of the expression in the square brackets. That is, we want to query a subset of a subset, so we specify the column with our target information and the column and subset information for the column that will help us narrow down what we’re looking at. Then, by using unique()
, we won’t count multiple tokens of an item, just the type count. Finally, by using length()
, we can find out how many types there are.
It starts to look messy and hard to read pretty quickly though. The next tutorial will introduce you to a different way of doing this that is easier to read.
Tab delimited files use white space (specifically, the tab character, unicode: U+0009 sometimes coded as \t
) to separate information in columns and line breaks (either “carriage return” or “new line” characters, sometimes coded as \r
and \n
respectively). This could result in a text file that looks something like this:
Note how the columns don’t necessarily line up neatly. This is because the tab character is not a fixed width, but rather refers to a location along the horizontal axis of the page (at least in this file type). For a human, this is harder to read than a spreadsheet, but for a computer it’s easier.
Importantly, text files can be encoded as UTF-8 (or UTF-16), which will accommodate most characters, even from other languages and alphabets. They can also be encoded as ASCII (a very basic encoding that only allows alphanumeric Latin characters and some punctuation), or certain types of old encodings that are not compatible across platforms. In particular, Microsoft Word will not produce a useful encoding.
DO NOT USE MICROSOFT WORD (OR SIMILAR) TO EDIT A DATASET
SERIOUSLY, DO NOT DO IT
On Windows, Notepad++ is one of the best plain text editors, and it’s free. On Macintosh, BBEdit is my personal favorite plain text editor. For example, in BBEdit, you can change the encoding of a file (in case the wrong one was chosen initially, or in case you wish to ensure your file has a compatible encoding). I recommend UTF-8.
It is possible to create a .txt tab delimited file by copying and pasting a spreadsheet (e.g., from Excel or Google Sheets) into a plain text file. This is also useful if you are more comfortable with spreadsheets than R at the moment, although you will soon find R to be easier and faster than Excel for most data organisation!
Once you have created and saved your dataset as a .txt
file, you can load it into R and save it with a variable name using the following code:
data <- read.delim("data/20180908-my_data_file.txt", header=TRUE, as.is=TRUE)
The function read.delim()
takes several arguments, but only one is obligatory. The bare minimum you must give it is the name of your .txt
file, within quotation marks to signify it is a string of characters and not a variable name. Since our folder for saving datasets is called “data”, we preface our dataset’s name with data
and a /
to indicate that the target file is inside the folder called data
. This is guaranteed to work on a Macintosh computer, but you may need to replace the /
character with a \
character if you are working on a Windows computer.
Another type of file is the .csv
or “comma-separated values” file. This type of file is functionally equivalent to a tab-delimited file, but instead of each value in a row being separated with tabs, they are separated with commas. Furthermore, a .csv file can distinguish between values that are supposed to be text and values that are numeric. It typically does this by putting text in quotation marks.
This kind of file might be generated automatically by your data gathering software, or you can save an Excel spreadsheet (or most other kinds of spreadsheets) as a .csv from within the native program interface. In many ways, the .csv is better than a tab delimited file. For instance, it is less likely to have problems with white space if, say, one of the columns contains sentences or longer strings of text. However, because the comma is the element that separates each value in a column, a .csv can sometimes have difficulty with punctuation in sentences. This is why enclosing strings of text in quotation marks is crucial. It indicates that any comma within quotation marks is meant to be treated as text and not the separating marker.
A .csv can be loaded into R as a variable using the following code:
data <- read.csv("data/20180908-my_data_file.csv", header=TRUE, as.is=TRUE)
As before, the only obligatory argument is the name and location of the file. However, I prefer to specify that the first row of the dataset is the header row, and I also prefer to specify as.is=TRUE
so that text is not automatically treated like a multi-level factor (but instead, as text characters).
There are ways to load Excel spreadsheets directly into R, but there are lots of good reasons not to. First of all, any formatting that you save in your spreadsheet will need to be filtered out. This means that colour-coding, fonts, wrapping text, dates, times, or any other cell-specific formatting will have to be removed. This can unintentionally alter the data (for instance, dates may not be properly converted to interpretable text or numbers). Moreover, R cannot tell you if it has misinterpreted something from an Excel file, so you would have to go through the file and check it anyway to make sure your columns and values have been correctly imported. Finally, because Microsoft Excel (.xlsx
) files are a proprietary brand that continuously updates and changes the program used to generate spreadsheets, there is no guarantee that your operating system will upload the file the same way someone else’s will, which makes your data less replicable. To give you an idea what this means, here is the “plain text” version of an .xslx spreadsheet generated by Microsoft Excel.
Rather than worrying about any of this potential complications, it is easy to save your .xlsx file as a .csv or paste the content into a .txt tab-delimited file. It takes very little time, the file size will be smaller than a .xlsx file, and the data file will be readable by software well into the future, even after Excel updates its file formatting again.
Literate programming is a useful way to annotate your code or navigate someone else’s code. In short, it allows you to write prose, insert pictures, view output, and produce tables within your code. In particular, R markdown (.Rmd
) files and R notebooks allow you to write R in a literate programming environment. In fact, this webpage was generated as an .Rmd file!
This section will briefly discuss why literate programming, and specifically .Rmd files, are useful to know about and learn to use.
Why do we annotate code?
Fundamentally, code can be difficult to read. We are not computers, we forget what we were doing or trying to do, we lose our place. Annotating code is one way to help us remember what our though process was at the time of writing. It also allows other people to use, adapt, and debug our code. This increases replicability and allows more open, transparent science, too.
As mentioned earlier, we can leave comments in our code by writing text after a #
symbol.
paste("Hello","world!") # the paste() function connects text values with one space.
## [1] "Hello world!"
paste0("Hello"," ","world","!") # the paste0() function connects text values with no intervening character(s).
## [1] "Hello world!"
But commenting code may only get you so far. It can be difficult to read, there is no formatting in case you want to make some text bold or more salient, and you can still lose your place in a long script.
This is where .Rmd files improve upon commenting.
In a .Rmd file, we can still comment code line-by-line, but we can also include longer (more visually pleasing) explanations of what we are doing and what we intend to do. This means you can leave yourself reminders! You can also indicate what doesn’t work (so that someone else may help), and leave notes to yourself or to your audience about your interpretations of the data. Most importantly, you can view output such as graphs in-line, which means they are saved adjacent to the code that produced them. This makes it much easier to return to an analysis or share it with someone who is not intimately familiar with your data.
In this graph, I’m using the automatically available dataset called cars
. The plot()
function is moderately flexible, so I can just tell it what columns in cars
I want for the x-axis and the y-axis, and it’ll generate a graph.
plot(cars$speed,cars$dist)
However, this graph is ugly and boring. I want to change the colour and shape of the points because I know my audience is actually interested in certain propterties of the graph and I want to make it as easy as possible for them to navigate it quickly.
Below, I’ve generated a graph that splits the data into four quadrants. I tried other numerical values for the horizontal and vertical divisions, but reducing the horizontal division meant the lower right (orange) quadrant became too sparse and shifting the vertical division to the left meant the upper left (blue) quadrant became too sparse. I have not tried increasing the vertical division, but I suspect it would also result in the lower right (orange) quandrant becoming too sparse.
plot(cars$speed,cars$dist,col="white") # generate the plot background at an appropriate size and scale
abline(h=41,lty="dotted",col="grey") # include the horizontal line
abline(v=15.5,lty="dotted",col="grey") # include the vertical line
lines(lowess(cars),type="l",lwd=2) # plot the black trend line
points(cars$speed[cars$speed<15.5&cars$dist<41],cars$dist[cars$speed<15.5&cars$dist<41],col="red", pch=15) # lower left quadrant
points(cars$speed[cars$speed>15.5&cars$dist<41],cars$dist[cars$speed>15.5&cars$dist<41],col="orange", pch=16) # lower right quadrant
points(cars$speed[cars$speed>15.5&cars$dist>41],cars$dist[cars$speed>15.5&cars$dist>41],col="green", pch=17) # upper right quadrant
points(cars$speed[cars$speed<15.5&cars$dist>41],cars$dist[cars$speed<15.5&cars$dist>41],col="blue", pch=18) # upper left quadrant
See how much clearer this is? There are still gaps in the explanation. I haven’t told you what pch=15
means or what it does. There is a lot of text and it’s difficult to read. These are things you can annotate and explain to yourself when you find and adapt other people’s code. Eventually, they’ll become second nature, and you won’t need to annotate everything, but at first, it can be helpful if you expect to forget what everything means by the next time you use your code.
pch
is the point type. Search “pch” in the help tab or type ?pch
into the console to learn more.col
lets you change the colour of the line or point. There are a large number of pre-named colours.lty
means “line type”. Typical values are “dashed”, “dotted”, or “solid”.lwd
means “line width”. It takes a numerical integer value.An R Notebook will allow you to easily create a document (e.g., an HTML document) that can be compiled for easy sharing or reporting analysis. It differs slightly from a regular .Rmd file, but can be easier to learn to use and update.
Create a new R Notebook by clicking the green +
in the upper left corner and selecting “R Notebook”. It will automatically give you a little information about how to write, run, and compile the document. Make sure to save it with a short, descriptive name!
Here are links that contain good cheat sheets and tips for formatting an R Notebook:
When you write your code, it should be kept mostly chronological. This way, the steps you take are in the correct order.
However, there are some circumstances where it makes more sense to go back and add code in the middle rather than at the end of a document. If you re-do a statistical analysis or want to compare two methods, those may be more coherent if placed adjacent in the document rather than in the order you performed them. This is especially the case if you learn a new technique and want to go back and re-do an older analysis. The same thing goes for graphs and other outputs.
If you are working directly in an R script (which I do not recommend if you can use an .Rmd or notebook instead), you can create section headings. To do so, you can type #### Section Title ####
with at least four hashes on either side of the text, and then you will be able to navigate directly to that heading using the small orange hash symbol in the lower left corner of the scripts pane.
An R Notebook also allows you to create section headings and navigate between them in a similar way, but when you compile and save the document, the section headings are also visually salient.
# Heading 1
## Heading 2
### Heading 3
#### Heading 4
Furthermore, there are options for creating a table of contents for your notebook, which makes it even easier. Here is an example of what you might put in the document header to create a nice, simple table of contents for your notebook.
---
title: "Untitled R Notebook"
output:
html_notebook:
toc: true
toc_float: true
---