This is a short tutorial outlining the syntax of the four basic data tidying functions of the
tidyr package, namely:
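Recall that tidy data is tabular data that is organised such that each variable has its own column, each observation has its own row, and each value has its own cell.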
This is sometimes referred to as ‘tall’ or ‘long’ form data because of its shape. By contrast, ‘wide’ data has, for example, several columns containing measurements of the same variable taken at different time points.
Both wide and tidy formats are valid ways to store data, but I’d argue that tidy data is easier to manipulate. This is the philosophy behind the
tidyverse bundle of R packages.
The example dataset used in this tutorial is curated by the Center for Systems Science and Engineering at Johns Hopkins University, Whiting School of Engineering.
It is time series data updated nightly with the count of COVID-19 cases globally, grouped by various geographical regions. We can read the
.CSV directly into a data frame.
In this tutorial we’ll examine the case counts for Australia only.
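As a rough sketch of this step, the snippet below reads CSV text into a data frame and keeps only the Australian rows. To stay self-contained it uses a small inline stand-in rather than the real file, and the counts in it are invented for illustration. The column names Province/State, Country/Region, Lat and Long, and the renaming of the first column to State, are assumptions about how the raw file is laid out and how it was trimmed for this tutorial.

```r
# Toy stand-in for the downloaded CSV; the values are invented for illustration only.
csv_text <- "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
New South Wales,Australia,-33.87,151.21,0,0,3
Victoria,Australia,-37.81,144.96,0,1,1
,Austria,47.52,14.55,0,0,0"

# check.names = FALSE keeps headers such as "1/22/20" exactly as written
# instead of converting them to syntactic names like "X1.22.20".
covid_raw <- read.csv(text = csv_text, check.names = FALSE,
                      stringsAsFactors = FALSE)

# Keep the Australian rows, then drop the columns we do not need (assumed names).
is_aus <- covid_raw[["Country/Region"]] == "Australia"
drop_cols <- c("Country/Region", "Lat", "Long")
covid_aus <- covid_raw[is_aus, !(names(covid_raw) %in% drop_cols)]

# Rename the first column to State and reset the row names.
names(covid_aus)[1] <- "State"
rownames(covid_aus) <- NULL

print(covid_aus)
```

With the real data, the text = argument would be replaced by the path or URL of the .CSV file.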
Notice that the column headers (aside from the first) are dates in
m/d/yy format. This is a clue that this data frame is in ‘wide’ format since the column names, rather than being generic variable names, are themselves encoded with data about the observations.
In other words, these headers are not just labels indicating which column contains dates; they are the dates themselves.
Let’s reorganise this data frame into ‘long’ format using the
pivot_longer() function in tidyr. The
cols = argument specifies which columns we want to pivot. Here we want to pivot all columns except the
State column. This is achieved using the exclamation mark prefix
! before the column name we want to exclude. The
names_to = argument indicates the name of the new column for storing what were previously column names. The third argument
values_to = is the name of another new column for storing the observation’s value.
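The call described above might look something like the following sketch. It uses a tiny made-up wide data frame so that it runs on its own, and the new column names Date and Cases are hypothetical choices for illustration, not necessarily the ones used elsewhere in this tutorial. The final conversion of the old headers to Date objects is an optional extra step.

```r
library(tidyr)

# Toy wide data frame; the counts are invented for illustration only.
covid_wide <- data.frame(
  State = c("New South Wales", "Victoria"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 1),
  `1/24/20` = c(3, 1),
  check.names = FALSE  # keep the date headers as they are
)

covid_long <- pivot_longer(
  covid_wide,
  cols = !State,        # pivot every column except State
  names_to = "Date",    # the old column headers are stored here
  values_to = "Cases"   # the old cell values are stored here
)

# The old headers arrive as character strings in m/d/yy format;
# optionally convert them to proper Date objects.
covid_long$Date <- as.Date(covid_long$Date, format = "%m/%d/%y")

print(covid_long)
```

Each State now appears once per date, so the two-row, four-column wide table becomes a six-row, three-column long table.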