This is a brief guide to importing various data file formats into R. The file formats covered here are: .CSV, .TSV, .XLSX, .SAV, .DTA, and .SAS7BDAT.
The following R packages are used in this guide: readr, data.table, readxl, googlesheets4, and haven.
These can be installed using the install.packages()
command:
install.packages("readr")
install.packages("data.table")
install.packages("readxl")
install.packages("googlesheets4")
install.packages("haven")
Data in R are typically stored as data frames, where values are arranged into rows and columns akin to a table. Below are the first five rows of the example iris data frame that comes packaged with R:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
Data frames have a number of characteristics worth mentioning. The columns of a data frame are R vectors, so each must contain only a single data type, e.g., numeric, factor, logical, or character. Columns can be accessed using the dollar sign ($) notation; for example, iris$Sepal.Width returns the Sepal.Width column of the iris data frame as a vector. Finally, many R functions are built to accept data frames as input.
In the following examples, data is imported into R as data frames.
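As a quick illustration of dollar-sign access, using the iris data frame that ships with base R (no import needed):

```r
# Extract the Sepal.Width column of iris as a vector using $
widths <- iris$Sepal.Width

head(widths)   # 3.5 3.0 3.2 3.1 3.6 ...
class(widths)  # "numeric" -- a single data type per column
```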
Before going further it is worth considering the arrangement of the values in your dataset as you import it. By this we mean what do the rows and columns of your dataset represent? In particular, I’d like to commend to you the so-called tidy data format.
Tidy data is tabular data that is organised such that each variable has its own column, each observation has its own row, and each value has its own cell.
This is sometimes referred to as ‘tall’ or ‘long’ form data because of its shape. By contrast, you may have ‘wide’ data which, for example, has columns containing measurements of the same variable taken at different time points.
Both wide and tidy forms are valid ways to store data, but I’d argue that tidy data is easier to manipulate. This is the philosophy behind the tidyverse bundle of R packages.
We will see examples of both kinds of data in this guide, but the R commands for transforming between these forms are best covered in a separate tutorial.
The dataset used in this guide comes from Exploring Longitudinal Data on Change by Judith D. Singer and John B. Willett (2003), as accessed via the Textbook Examples page hosted by UCLA IDRE. It comprises survey data assessing the tolerance of teenage participants towards various deviant (!) behaviours.
Plain-text files are human-readable when opened in a simple text editor. The most commonly encountered plain-text data files are comma-separated values (.CSV) and tab-separated values (.TSV). These are highly portable and often the de facto file format for sharing datasets.
The readr
package has three useful functions for reading plain-text data files:
read_csv()
read_tsv()
read_delim()
Below is a preview of the contents of the .CSV file that we’ll be importing. Notice the commas (,) delimiting the values.
## "ID","TOL11","TOL12","TOL13","TOL14","TOL15","MALE","EXPOSURE"
## 9,2.23,1.79,1.9000000000000001,2.12,2.66,0,1.54
## 45,1.12,1.45,1.45,1.45,1.99,1,1.16
## 268,1.45,1.34,1.99,1.79,1.34,1,0.9
## 314,1.22,1.22,1.55,1.12,1.12,0,0.81
## 442,1.45,1.99,1.45,1.67,1.9000000000000001,0,1.1300000000000001
## 514,1.34,1.67,2.23,2.12,2.44,1,0.9
## 569,1.79,1.9000000000000001,1.9000000000000001,1.99,1.99,0,1.99
## 624,1.12,1.12,1.22,1.12,1.22,1,0.98
## 723,1.22,1.34,1.12,1,1.12,0,0.81
## 918,1,1,1.22,1.99,1.22,0,1.21
## 949,1.99,1.55,1.12,1.45,1.55,1,0.93
## 978,1.22,1.34,2.12,3.46,3.3200000000000003,1,1.59
## 1105,1.34,1.9000000000000001,1.99,1.9000000000000001,2.12,1,1.3800000000000001
## 1542,1.22,1.22,1.99,1.79,2.12,0,1.44
## 1653,1.11,1.11,1.34,1.55,2.12,0,1.25
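The import step itself is a single call to read_csv(). The file path below is a placeholder (substitute wherever you saved the file); for a fully self-contained demonstration, read_csv() also accepts literal text:

```r
library(readr)

# Read the comma-separated file into a tibble; the path is a
# placeholder -- point it at your own copy of the file
# tol_csv <- read_csv("data/tolerance1.txt")

# read_csv() also accepts literal text -- a self-contained demo
tol_csv <- read_csv("ID,TOL11,MALE\n9,2.23,0\n45,1.12,1")
```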
## Parsed with column specification:
## cols(
## ID = col_double(),
## TOL11 = col_double(),
## TOL12 = col_double(),
## TOL13 = col_double(),
## TOL14 = col_double(),
## TOL15 = col_double(),
## MALE = col_double(),
## EXPOSURE = col_double()
## )
The read_csv() function and all its read_X() cousins print out a summary of the data types they have detected and assigned to each loaded column. In this instance they are all doubles (short for ‘double-precision floating-point number’), otherwise known as real numbers on the number line.
## # A tibble: 6 x 8
## ID TOL11 TOL12 TOL13 TOL14 TOL15 MALE EXPOSURE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 2.23 1.79 1.9 2.12 2.66 0 1.54
## 2 45 1.12 1.45 1.45 1.45 1.99 1 1.16
## 3 268 1.45 1.34 1.99 1.79 1.34 1 0.9
## 4 314 1.22 1.22 1.55 1.12 1.12 0 0.81
## 5 442 1.45 1.99 1.45 1.67 1.9 0 1.13
## 6 514 1.34 1.67 2.23 2.12 2.44 1 0.9
It would be worth doing some wrangling to correct some of the data types in this dataset. For example, ID, MALE, and (in the tidy dataset) time are probably more sensibly treated as categorical variables, or factors, in R.
The sample .TSV file has similar content but is of course delimited by tab characters:
## ID TOL11 TOL12 TOL13 TOL14 TOL15 MALE EXPOSURE
## 9 2.23 1.79 1.9 2.12 2.66 0 1.54
## 45 1.12 1.45 1.45 1.45 1.99 1 1.16
## 268 1.45 1.34 1.99 1.79 1.34 1 0.9
## 314 1.22 1.22 1.55 1.12 1.12 0 0.81
## 442 1.45 1.99 1.45 1.67 1.9 0 1.13
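Importing it works exactly as before, swapping in read_tsv(). Again the path below is a placeholder, and literal text (with \t for the tab characters) works for a quick self-contained test:

```r
library(readr)

# Read the tab-separated file; the path is a placeholder
# tol_tsv <- read_tsv("data/tolerance1.tsv")

# Literal text with \t tab characters -- a self-contained demo
tol_tsv <- read_tsv("ID\tTOL11\n9\t2.23\n45\t1.12")
```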
## # A tibble: 6 x 8
## ID TOL11 TOL12 TOL13 TOL14 TOL15 MALE EXPOSURE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 2.23 1.79 1.9 2.12 2.66 0 1.54
## 2 45 1.12 1.45 1.45 1.45 1.99 1 1.16
## 3 268 1.45 1.34 1.99 1.79 1.34 1 0.9
## 4 314 1.22 1.22 1.55 1.12 1.12 0 0.81
## 5 442 1.45 1.99 1.45 1.67 1.9 0 1.13
## 6 514 1.34 1.67 2.23 2.12 2.44 1 0.9
Lastly, the delim = argument to the read_delim() function allows us to specify a custom delimiting character. Below are the first seven lines of a mocked-up file using dollar sign ($) delimiters:
## "id"$"age"$"tolerance"$"male"$"exposure"$"time"
## 9$11$2.23$0$1.54$0
## 9$12$1.79$0$1.54$1
## 9$13$1.9$0$1.54$2
## 9$14$2.12$0$1.54$3
## 9$15$2.66$0$1.54$4
## 45$11$1.12$1$1.16$0
# Read a text file delimited by dollar signs
tol_delim <- read_delim("data/tolerance_pp.txt", delim = "$")
head(tol_delim)
## # A tibble: 6 x 6
## id age tolerance male exposure time
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 11 2.23 0 1.54 0
## 2 9 12 1.79 0 1.54 1
## 3 9 13 1.9 0 1.54 2
## 4 9 14 2.12 0 1.54 3
## 5 9 15 2.66 0 1.54 4
## 6 45 11 1.12 1 1.16 0
Notice that of the three datasets imported above, only the third (tol_delim
) is in tidy format. The first two data frames have information stored in both their cells and their headers. For example, the column label TOL11
stands for tolerance value taken at age 11; it encodes the two variables ‘tolerance’ and ‘time’. In contrast, the tidy data frame splits out values for each time point into distinct rows and adds an extra column: ‘time’.
The data.table package
Another function for importing plain-text data files is fread() in the data.table package. It is extremely fast and powerful and can often identify the delimiter characters automatically. Because of its speed I highly recommend it for reading large data files.
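A minimal sketch of fread() in use; the path below is a placeholder, and note that fread() works out the delimiter itself:

```r
library(data.table)

# fread() auto-detects the delimiter; the path is a placeholder
# tol_dt <- fread("data/tolerance1.txt")

# fread() also accepts literal text -- a self-contained demo
tol_dt <- fread("ID,TOL11\n9,2.23\n45,1.12")
```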
## ID TOL11 TOL12 TOL13 TOL14 TOL15 MALE EXPOSURE
## 1: 9 2.23 1.79 1.90 2.12 2.66 0 1.54
## 2: 45 1.12 1.45 1.45 1.45 1.99 1 1.16
## 3: 268 1.45 1.34 1.99 1.79 1.34 1 0.90
## 4: 314 1.22 1.22 1.55 1.12 1.12 0 0.81
## 5: 442 1.45 1.99 1.45 1.67 1.90 0 1.13
## 6: 514 1.34 1.67 2.23 2.12 2.44 1 0.90
In some disciplines Microsoft Excel spreadsheets are the standard file format for storing and processing data. However, since R offers many more complex analyses, it may be worthwhile bringing that data over.
The readxl
package contains the function read_excel()
for reading in .XLS
and .XLSX
files.
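A minimal call might look like the following. The tolerance spreadsheet path is a placeholder; for a runnable example, readxl also bundles sample spreadsheets accessible via readxl_example():

```r
library(readxl)

# Read the first worksheet of an Excel file; placeholder path
# tol_xlsx <- read_excel("data/tolerance.xlsx")

# readxl ships example files, so this line runs anywhere
datasets <- read_excel(readxl_example("datasets.xlsx"))
head(datasets)
```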
## # A tibble: 6 x 8
## ID TOL11 TOL12 TOL13 TOL14 TOL15 MALE EXPOSURE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 2.23 1.79 1.9 2.12 2.66 0 1.54
## 2 45 1.12 1.45 1.45 1.45 1.99 1 1.16
## 3 268 1.45 1.34 1.99 1.79 1.34 1 0.9
## 4 314 1.22 1.22 1.55 1.12 1.12 0 0.81
## 5 442 1.45 1.99 1.45 1.67 1.9 0 1.13
## 6 514 1.34 1.67 2.23 2.12 2.44 1 0.9
One important distinction between Excel spreadsheets and ‘flat’ plain-text files is that the former may comprise several worksheets. To see the worksheets in an Excel spreadsheet use the excel_sheets()
function:
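For example (the tolerance spreadsheet path is a placeholder; the second call uses readxl's bundled example file, so it runs anywhere):

```r
library(readxl)

# List the worksheet names in a spreadsheet; placeholder path
# excel_sheets("data/tolerance.xlsx")

# The same call against readxl's bundled example file
excel_sheets(readxl_example("datasets.xlsx"))
```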
## [1] "tolerance" "tolerance_pp"
By default the read_excel()
function reads the first worksheet in the .XLSX
file. The sheet =
argument specifies which worksheet to import, either the number or the name of the sheet.
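For example (the tolerance spreadsheet path is a placeholder; the second call selects a sheet by position in readxl's bundled example file):

```r
library(readxl)

# Import a worksheet by name; the path is a placeholder
# tol_pp <- read_excel("data/tolerance.xlsx", sheet = "tolerance_pp")

# Selecting a worksheet by position instead
sheet2 <- read_excel(readxl_example("datasets.xlsx"), sheet = 2)
```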
## # A tibble: 6 x 6
## id age tolerance male exposure time
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 11 2.23 0 1.54 0
## 2 9 12 1.79 0 1.54 1
## 3 9 13 1.9 0 1.54 2
## 4 9 14 2.12 0 1.54 3
## 5 9 15 2.66 0 1.54 4
## 6 45 11 1.12 1 1.16 0
One convenient way to store and share (non-sensitive) data is using Google Sheets. The googlesheets4 package is built for downloading online Google Sheets into local data frames and handles any Google authentication and sign-in.
The sheets_get()
function retrieves the metadata about a Google Sheet.
# URL to the gapminder Google Sheet
url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
sheets_deauth() # publicly shared sheets don't need authentication
sheets_get(url)
## Spreadsheet name: gapminder
## ID: 1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY
## Locale: en_US
## Time zone: America/Los_Angeles
## # of sheets: 5
##
## (Sheet name): (Nominal extent in rows x columns)
## Africa: 625 x 6
## Americas: 301 x 6
## Asia: 397 x 6
## Europe: 361 x 6
## Oceania: 25 x 6
##
## (Named range): (A1 range)
## canada: 'Americas'!A38:F49
Knowing the names of the worksheets we can import them by name or number using sheets_read()
:
# 4th worksheet of the gapminder Google Sheet corresponds to European data
gapminder_eu <- sheets_read(url, sheet = 4)
head(gapminder_eu)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Albania Europe 1952 55.2 1282697 1601.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Albania Europe 1962 64.8 1728137 2313.
## 4 Albania Europe 1967 66.2 1984060 2760.
## 5 Albania Europe 1972 67.7 2263554 3313.
## 6 Albania Europe 1977 68.9 2509048 3533.
The haven package has functions to import SPSS, SAS, and Stata files.
Use read_sav()
or read_spss()
to import SPSS .SAV
files.
Use read_sas()
to import SAS .SAS7BDAT
files.
Use read_dta() or read_stata() to import Stata .DTA files.
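Minimal sketches of each, with placeholder file paths; the last line uses the small SPSS example file that ships with haven, so it runs anywhere:

```r
library(haven)

# Placeholder paths -- substitute your own files
# tol_spss  <- read_sav("data/tolerance.sav")       # SPSS
# tol_stata <- read_dta("data/tolerance.dta")       # Stata
# tol_sas   <- read_sas("data/tolerance.sas7bdat")  # SAS

# haven bundles an SPSS example file, so this line runs anywhere
iris_sav <- read_sav(system.file("examples", "iris.sav", package = "haven"))
head(iris_sav)
```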
This guide has only covered a few of the data formats that R can accommodate. Other common sources of data include SQL databases, cloud-based APIs, and even scraping directly from webpages. Take a look at the ‘R Data Import/Export’ manual for a broader summary of what’s available.
Anderson, Edgar (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2–5.
Singer, Judith D. and Willett, John B. (2003). Exploring Longitudinal Data on Change. DOI: 10.1093/acprof:oso/9780195152968.003.0002.
R Data Import/Export manual.