Appendix D — Loading and Saving Data in R
This appendix will show you how to load and save data into R from plain-text files, R files, and Excel spreadsheets. It will also show you the R packages that you can use to load data from databases and other common programs, like SAS and MATLAB.
D.1 Data Sets in Base R
R comes with many data sets preloaded in the datasets
package, which comes with base R. These data sets are not very interesting, but they give you a chance to test code or make a point without having to load a data set from outside R. You can see a list of R’s data sets as well as a short description of each by running:
help(package = "datasets")
To use a data set, just type its name. Each data set is already presaved as an R object. For example:
iris## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
However, R’s data sets are no substitute for your own data, which you can load into R from a wide variety of file formats. But before you load any data files into R, you’ll need to determine where your working directory is.
D.2 Working Directory
Each time you open R, it links itself to a directory on your computer, which R calls the working directory. This is where R will look for files when you attempt to load them, and it is where R will save files when you save them. The location of your working directory will vary on different computers. To determine which directory R is using as your working directory, run:
getwd()
## "/Users/garrettgrolemund"
You can place data files straight into the folder that is your working directory, or you can move your working directory to where your data files are. You can move your working directory to any folder on your computer with the function setwd()
. Just give setwd()
the file path to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. That way I can keep all of my data, scripts, graphs, and reports in the same place. For example:
setwd("~/Users/garrettgrolemund/Documents/Book_Project")
If the file path does not begin with your root directory, R will assume that it begins at your current working directory.
You can also change your working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar. The Windows and Mac GUIs have similar options. If you start R from a UNIX command line (as on Linux machines), the working directory will be whichever directory you were in when you called R.
You can see what files are in your working directory with list.files()
. If you see the file that you would like to open in your working directory, then you are ready to proceed. How you open files in your working directory will depend on which type of file you would like to open.
D.3 Plain-text Files
Plain-text files are one of the most common ways to save data. They are very simple and can be read by many different computer programs—even the most basic text editors. For this reason, public data often comes as plain-text files. For example, the Census Bureau, the Social Security Administration, and the Bureau of Labor Statistics all make their data available as plain-text files.
Here’s how the royal flush data set from Section 3 would appear as a plain-text file (I’ve added a value column):
"card", "suit", "value"
"ace", "spades", 14
"king", "spades", 13
"queen", "spades", 12
"jack", "spades", 11
"ten", "spades", 10
A plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple convention is used to separate the cells within a row. Often cells are separated by a comma, but they can also be separated by a tab, a pipe delimiter (i.e., |
), or any other character. Each file only uses one method of separating cells, which minimizes confusion. Within each cell, data appears as you’d expect to see it, as words and numbers.
All plain-text files can be saved with the extension .txt (for text), but sometimes a file will receive a special extension that advertises how it separates data-cell entries. Since entries in the data set mentioned earlier are separated with a comma, this file would be a comma-separated-values file and would usually be saved with the extension .csv.
D.3.1 read.table
To load a plain-text file, use read.table()
. The first argument of read.table()
should be the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory). If the file path does not begin with your root directory, R will append it to the end of the file path that leads to your working directory.You can give read.table()
other arguments as well. The two most important are sep
and header
.
If the royal flush data set was saved as a file named poker.csv in your working directory, you could load it with:
<- read.table("poker.csv", sep = ",", header = TRUE) poker
D.3.1.1 sep
Use sep
to tell read.table()
what character your file uses to separate data entries. To find this out, you might have to open your file in a text editor and look at it. If you don’t specify a sep
argument, read.table()
will try to separate cells whenever it comes to white space, such as a tab or space. R won’t be able to tell you if read.table()
does this correctly or not, so rely on it at your own risk.
D.3.1.2 header
Use header
to tell read.table()
whether the first line of the file contains variable names instead of values. If the first line of the file is a set of variable names, you should set header = TRUE
.
D.3.1.3 na.strings
Oftentimes data sets will use special symbols to represent missing information. If you know that your data uses a certain symbol to represent missing entries, you can tell read.table()
(and the preceding functions) what the symbol is with the na.strings
argument. read.table()
will convert all instances of the missing information symbol to NA
, which is R’s missing information symbol (see Missing Information).
For example, your poker data set contained missing values stored as a .
, like this:
## "card","suit","value"
## "ace"," spades"," 14"
## "king"," spades"," 13"
## "queen",".","."
## "jack",".","."
## "ten",".","."
You could read the data set into R and convert the missing values into NAs as you go with the command:
<- read.table("poker.csv", sep = ",", header = TRUE, na.string = ".") poker
R would save a version of poker
that looks like this:
## card suit value
## ace spades 14
## king spades 13
## queen <NA> NA
## jack <NA> NA
## ten <NA> NA
D.3.1.4 skip and nrow
Sometimes a plain-text file will come with introductory text that is not part of the data set. Or, you may decide that you only wish to read in part of a data set. You can do these things with the skip
and nrow
arguments. Use skip
to tell R to skip a specific number of lines before it starts reading in values from the file. Use nrow
to tell R to stop reading in values after it has read in a certain number of lines.
For example, imagine that the complete royal flush file looks like this:
This data was collected by the National Poker Institute.
We accidentally repeated the last row of data.
"card", "suit", "value"
"ace", "spades", 14
"king", "spades", 13
"queen", "spades", 12
"jack", "spades", 11
"ten", "spades", 10
"ten", "spades", 10
You can read just the six lines that you want (five rows plus a header) with:
read.table("poker.csv", sep = ",", header = TRUE, skip = 3, nrow = 5)
## card suit value
## 1 ace spades 14
## 2 king spades 13
## 3 queen spades 12
## 4 jack spades 11
## 5 ten spades 10
Notice that the header row doesn’t count towards the total rows allowed by nrow
.
D.3.1.5 stringsAsFactors
R reads in numbers just as you’d expect, but when R comes across character strings (e.g., letters and words) it begins to act strangely. R wants to convert every character string into a factor. This is R’s default behavior, but I think it is a mistake. Sometimes factors are useful. At other times, they’re clearly the wrong data type for the job. Also factors cause weird behavior, especially when you want to display data. This behavior can be surprising if you didn’t realize that R converted your data to factors. In general, you’ll have a smoother R experience if you don’t let R make factors until you ask for them. Thankfully, it is easy to do this.
Setting the argument stringsAsFactors
to FALSE
will ensure that R saves any character strings in your data set as character strings, not factors. To use stringsAsFactors
, you’d write:
read.table("poker.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
If you will be loading more than one data file, you can change the default factoring behavior at the global level with:
options(stringsAsFactors = FALSE)
This will ensure that all strings will be read as strings, not as factors, until you end your R session, or rechange the global default by running:
options(stringsAsFactors = TRUE)
D.3.2 The read Family
R also comes with some prepackaged short cuts for read.table()
, shown in Table D.1.
Function | Defaults | Use |
---|---|---|
read.table() |
sep = ” “, header = FALSE | General-purpose read function |
read.csv() |
sep = “,”, header = TRUE | Comma-separated-variable (CSV) files |
read.delim() |
sep = “, header = TRUE | Tab-delimited files |
read.csv2() |
sep = “;”, header = TRUE, dec = “,” | CSV files with European decimal format |
read.delim2() |
sep = “, header = TRUE, dec =”,” | Tab-delimited files with European decimal format |
The first shortcut, read.csv()
, behaves just like read.table()
but automatically sets sep = ","
and header = TRUE
, which can save you some typing:
<- read.csv("poker.csv") poker
read.delim()
automatically sets sep
to the tab character, which is very handy for reading tab delimited files. These are files where each cell is separated by a tab. read.delim()
also sets header = TRUE
by default.
read.delim2()
and read.csv2()
exist for European R users. These functions tell R that the data uses a comma instead of a period to denote decimal places. (If you’re wondering how this works with CSV files, CSV2 files usually separate cells with a semicolon, not a comma.)
D.3.3 read.fwf
One type of plain-text file defies the pattern by using its layout to separate data cells. Each row is placed in its own line (as with other plain-text files), and then each column begins at a specific number of characters from the lefthand side of the document. To achieve this, an arbitrary number of character spaces is added to the end of each entry to correctly position the next entry. These documents are known as fixed-width files and usually end with the extension .fwf.
Here’s one way the royal flush data set could look as a fixed-width file. In each row, the suit entry begins exactly 10 characters from the start of the line. It doesn’t matter how many characters appeared in the first cell of each row:
card suit value
ace spades 14
king spades 13
queen spades 12
jack spades 11 10 spades 10
Fixed-width files look nice to human eyes (but no better than a tab-delimited file); however, they can be difficult to work with. Perhaps because of this, R comes with a function for reading fixed-width files, but no function for saving them. Unfortunately, US government agencies seem to like fixed-width files, and you’ll likely encounter one or more during your career.
You can read fixed-width files into R with the function read.fwf()
. The function takes the same arguments as read.table()
but requires an additional argument, widths
, which should be a vector of numbers. Each _i_th entry of the widths
vector should state the width (in characters) of the _i_th column of the data set.
If the aforementioned fixed-width royal flush data was saved as poker.fwf in your working directory, you could read it with:
<- read.fwf("poker.fwf", widths = c(10, 7, 6), header = TRUE) poker
D.3.4 HTML Links
Many data files are made available on the Internet at their own web address. If you are connected to the Internet, you can open these files straight into R with read.table()
, read.csv()
, etc. You can pass a web address into the file name argument for any of R’s data-reading functions. As a result, you could read in the poker data set from a web address like http://…/poker.csv with:
<- read.csv("http://.../poker.csv") poker
That’s obviously not a real address, but here’s something that would work—if you can manage to type it!
<- read.csv("https://gist.githubusercontent.com/garrettgman/9629323/raw/ee5dfc039fd581cb467cc69c226ea2524913c3d8/deck.csv") deck
Just make sure that the web address links directly to the file and not to a web page that links to the file. Usually, when you visit a data file’s web address, the file will begin to download or the raw data will appear in your browser window.
Note that websites that begin with _https://_ are secure websites, which means R may not be able to access the data provided at these links.
D.3.5 Saving Plain-Text Files
Once your data is in R, you can save it to any file format that R supports. If you’d like to save it as a plain-text file, you can use the +write+ family of functions. The three basic write functions appear in Table D.2. Use write.csv()
to save your data as a .csv file and write.table()
to save your data as a tab delimited document or a document with more exotic separators.
File format | Function and syntax |
---|---|
.csv | write.csv(r_object, file = filepath, row.names = FALSE) |
.csv (with European decimal notation) | write.csv2(r_object, file = filepath, row.names = FALSE) |
tab delimited | write.table(r_object, file = filepath, sep = "\t", row.names=FALSE) |
The first argument of each function is the R object that contains your data set. The file
argument is the file name (including extension) that you wish to give the saved data. By default, each function will save your data into your working directory. However, you can supply a file path to the file argument. R will oblige by saving the file at the end of the file path. If the file path does not begin with your root directory, R will append it to the end of the file path that leads to your working directory.
For example, you can save the (hypothetical) poker data frame to a subdirectory named data within your working directory with the command:
write.csv(poker, "data/poker.csv", row.names = FALSE)
Keep in mind that write.csv()
and write.table()
cannot create new directories on your computer. Each folder in the file path must exist before you try to save a file with it.
The row.names()
argument prevents R from saving the data frame’s row names as a column in the plain-text file. You might have noticed that R automatically names each row in a data frame with a number. For example, each row in our poker data frame appears with a number next to it:
poker## card suit value
## 1 ace spades 14
## 2 king spades 13
## 3 queen spades 12
## 4 jack spades 11
## 5 10 spades 10
These row numbers are helpful, but can quickly accumulate if you start saving them. R will add a new set of numbers by default each time you read the file back in. Avoid this by always setting row.names = FALSE
when you use a function in the write
family.
D.3.6 Compressing Files
To compress a plain-text file, surround the file name or file path with the function bzfile()
, gzfile()
, or xzfile()
. For example:
write.csv(poker, file = bzfile("data/poker.csv.bz2"), row.names = FALSE)
Each of these functions will compress the output with a different type of compression format, shown in Table D.3.
Function | Compression type |
---|---|
bzfile() |
bzip2 |
gzfile() |
gnu zip (gzip) |
xzfile() |
xz compression |
It is a good idea to adjust your file’s extension to reflect the compression. R’s read
functions will open plain-text files compressed in any of these formats. For example, you could read a compressed file named poker.csv.bz2 with:
read.csv("poker.csv.bz2")
or:
read.csv("data/poker.csv.bz2")
depending on where the file is saved.
D.4 R Files
R provides two file formats of its own for storing data, .RDS and .RData. RDS files can store a single R object, and RData files can store multiple R objects.
You can open a RDS file with readRDS()
. For example, if the royal flush data was saved as poker.RDS, you could open it with:
<- readRDS("poker.RDS") poker
Opening RData files is even easier. Simply run the function load()
with the file:
load("file.RData")
There’s no need to assign the output to an object. The R objects in your RData file will be loaded into your R session with their original names. RData files can contain multiple R objects, so loading one may read in multiple objects. load()
doesn’t tell you how many objects it is reading in, nor what their names are, so it pays to know a little about the RData file before you load it.
If worse comes to worst, you can keep an eye on the environment pane in RStudio as you load an RData file. It displays all of the objects that you have created or loaded during your R session. Another useful trick is to put parentheses around your load command like so, (load("poker.RData"))
. This will cause R to print out the names of each object it loads from the file.
Both readRDS()
and load()
take a file path as their first argument, just like R’s other read and write functions. If your file is in your working directory, the file path will be the file name.
D.4.1 Saving R Files
You can save an R object like a data frame as either an RData file or an RDS file. RData files can store multiple R objects at once, but RDS files are the better choice because they foster reproducible code.
To save data as an RData object, use the save()
function. To save data as a RDS object, use the saveRDS()
function. In each case, the first argument should be the name of the R object you wish to save. You should then include a file argument that has the file name or file path you want to save the data set to.
For example, if you have three R objects, a
, b
, and c
, you could save them all in the same RData file and then reload them in another R session:
<- 1
a <- 2
b <- 3
c save(a, b, c, file = "stuff.RData")
load("stuff.RData")
However, if you forget the names of your objects or give your file to someone else to use, it will be difficult to determine what was in the file—even after you (or they) load it. The user interface for RDS files is much more clear. You can save only one object per file, and whoever loads it can decide what they want to call their new data. As a bonus, you don’t have to worry about load()
overwriting any R objects that happened to have the same name as the objects you are loading:
saveRDS(a, file = "stuff.RDS")
<- readRDS("stuff.RDS") a
Saving your data as an R file offers some advantages over saving your data as a plain-text file. R automatically compresses the file and will also save any R-related metadata associated with your object. This can be handy if your data contains factors, dates and times, or class attributes. You won’t have to reparse this information into R the way you would if you converted everything to a text file.
On the other hand, R files cannot be read by many other programs, which makes them inefficient for sharing. They may also create a problem for long-term storage if you don’t think you’ll have a copy of R when you reopen the files.
D.5 Excel Spreadsheets
Microsoft Excel is a popular spreadsheet program that has become almost industry standard in the business world. There is a good chance that you will need to work with an Excel spreadsheet in R at least once in your career. You can read spreadsheets into R and also save R data as a spreadsheet in a variety of ways.
D.5.1 Export from Excel
The best method for moving data from Excel to R is to export the spreadsheet from Excel as a .csv or .txt file. Not only will R be able to read the text file, so will any other data analysis software. Text files are the lingua franca of data storage.
Exporting the data solves another difficulty as well. Excel uses proprietary formats and metadata that will not easily transfer into R. For example, a single Excel file can include multiple spreadsheets, each with their own columns and macros. When Excel exports the file as a .csv or .txt, it makes sure this format is transferred into a plain-text file in the most appropriate way. R may not be able to manage the conversion as efficiently.
To export data from Excel, open the Excel spreadsheet and then go to Save As in the Microsoft Office Button menu. Then choose CSV in the Save as type box that appears and save the files. You can then read the file into R with the read.csv()
function.
D.5.2 Copy and Paste
You can also copy portions of an Excel spreadsheet and paste them into R. To do this, open the spreadsheet and select the cells you wish to read into R. Then select Edit > Copy in the menu bar—or use a keyboard shortcut—to copy the cells to your clipboard.
On most operating systems, you can read the data stored in your clipboard into R with:
read.table("clipboard")
On Macs you will need to use:
read.table(pipe("pbpaste"))
If the cells contain values with spaces in them, this will disrupt read.table()
. You can try another read
function (or just formally export the data from Excel) before reading it into R.
D.5.3 XLConnect
Many packages have been written to help you read Excel files directly into R. Unfortunately, many of these packages do not work on all operating systems. Others have been made out of date by the .xlsx file format. One package that does work on all file systems (and gets good reviews) is the XLConnect package. To use it, you’ll need to install and load the package:
install.packages("XLConnect")
library(XLConnect)
XLConnect relies on Java to be platform independent. So when you first open XLConnect, RStudio may ask to download a Java Runtime Environment if you do not already have one.
D.5.4 Reading Spreadsheets
You can use XLConnect to read in an Excel spreadsheet with either a one- or a two-step process. I’ll start with the two-step process. First, load an Excel workbook with loadWorkbook()
. loadWorkbook()
can load both .xls and .xlsx files. It takes one argument: the file path to your Excel workbook (this will be the name of the workbook if it is saved in your working directory):
<- loadWorkbook("file.xlsx") wb
Next, read a spreadsheet from the workbook with readWorksheet()
, which takes several arguments. The first argument should be a workbook object created with loadWorkbook()
. The next argument, sheet
, should be the name of the spreadsheet in the workbook that you would like to read into R. This will be the name that appears on the bottom tab of the spreadsheet. You can also give sheet
a number, which specifies the sheet that you want to read in (one for the first sheet, two for the second, and so on).
readWorksheet()
then takes four arguments that specify a bounding box of cells to read in: startRow
, startCol
, endRow
, and endCol
. Use startRow
and startCol
to describe the cell in the top-left corner of the bounding box of cells that you wish to read in. Use endRow
and endCol
to specify the cell in the bottom-right corner of the bounding box. Each of these arguments takes a number. If you do not supply bounding arguments, readWorksheet()
will read in the rectangular region of cells in the spreadsheet that appears to contain data. readWorksheet()
will assume that this region contains a header row, but you can tell it otherwise with header = FALSE
.
So to read in the first worksheet from wb
, you could use:
<- readWorksheet(wb, sheet = 1, startRow = 0, startCol = 0,
sheet1 endRow = 100, endCol = 3)
R will save the output as a data frame. All of the arguments in readWorkbook()
except the first are vectorized, so you can use it to read in multiple sheets from the same workbook at once (or multiple cell regions from a single worksheet). In this case, readWorksheet()
will return a list of data frames.
You can combine these two steps with readWorksheetFromFile()
. It takes the file argument from loadWorkbook()
and combines it with the arguments from readWorksheet()
. You can use it to read one or more sheets straight from an Excel file:
<- readWorksheetFromFile("file.xlsx", sheet = 1, startRow = 0,
sheet1 startCol = 0, endRow = 100, endCol = 3)
D.5.5 Writing Spreadsheets
Writing to an Excel spreadsheet is a four-step process. First, you need to set up a workbook object with loadWorkbook()
. This works just as before, except if you are not using an existing Excel file, you should add the argument create = TRUE
. XLConnect will create a blank workbook. When you save it, XLConnect will write it to the file location that you specified here with loadWorkbook()
:
<- loadWorkbook("file.xlsx", create = TRUE) wb
Next, you need to create a worksheet inside your workbook object with createSheet()
. Tell createSheet()
which workbook to place the sheet in and which to use for the sheet.
createSheet(wb, "Sheet 1")
Then you can save your data frame or matrix to the sheet with writeWorksheet()
. The first argument of writeWorksheet()
, object
, is the workbook to write the data to. The second argument, data
, is the data to write. The third argument, sheet
, is the name of the sheet to write it to. The next two arguments, startRow
and startCol
, tell R where in the spreadsheet to place the upper-left cell of the new data. These arguments each default to 1. Finally, you can use header
to tell R whether your column names should be written with the data:
writeWorksheet(wb, data = poker, sheet = "Sheet 1")
Once you have finished adding sheets and data to your workbook, you can save it by running saveWorkbook()
on the workbook object. R will save the workbook to the file name or path you provided in loadWorkbook()
. If this leads to an existing Excel file, R will overwrite it. If it leads to a new file, R will create it.
You can also collapse these steps into a single call with writeWorksheetToFile()
, like this:
writeWorksheetToFile("file.xlsx", data = poker, sheet = "Sheet 1",
startRow = 1, startCol = 1)
The XLConnect package also lets you do more advanced things with Excel spreadsheets, such as writing to a named region in a spreadsheet, working with formulas, and assigning styles to cells. You can read about these features in XLConnect’s vignette, which is accessible by loading XLConnect and then running:
vignette("XLConnect")
D.6 Loading Files from Other Programs
You should follow the same advice I gave you for Excel files whenever you wish to work with file formats native to other programs: open the file in the original program and export the data as a plain-text file, usually a CSV. This will ensure the most faithful transcription of the data in the file, and it will usually give you the most options for customizing how the data is transcribed.
Sometimes, however, you may acquire a file but not the program it came from. As a result, you won’t be able to open the file in its native program and export it as a text file. In this case, you can use one of the functions in Table D.4 to open the file. These functions mostly come in R’s foreign
package. Each attempts to read in a different file format with as few hiccups as possible.
File format | Function | Library |
---|---|---|
ERSI ArcGIS | read.shapefile() |
shapefiles |
Matlab | readMat() |
R.matlab |
minitab | read.mtp() |
foreign |
SAS (permanent data set) | read.ssd() |
foreign |
SAS (XPORT format) | read.xport() |
foreign |
SPSS | read.spss() |
foreign |
Stata | read.dta() |
foreign |
Systat | read.systat() |
foreign |
D.6.1 Connecting to Databases
You can also use R to connect to a database and read in data.
Use the RODBC package to connect to databases through an ODBC connection.
Use the DBI package to connect to databases through individual drivers. The DBI package provides a common syntax for working with different databases. You will have to download a database-specific package to use in conjunction with DBI. These packages provide the API for the native drivers of different database programs. For MySQL use RMySQL, for SQLite use RSQLite, for Oracle use ROracle, for PostgreSQL use RPostgreSQL, and for databases that use drivers based on the Java Database Connectivity (JDBC) API use RJDBC. Once you have loaded the appropriate driver package, you can use the commands provided by DBI to access your database.