This is a short tutorial outlining the syntax of the four basic data tidying functions of the tidyr package, namely:

  • pivot_longer(),
  • pivot_wider(),
  • separate(), and
  • unite().

1 Tidy data

Recall that tidy data is tabular data that is organised such that:

  1. each observation has its own row,
  2. each variable has its own column, and
  3. each value has its own cell.

This is sometimes referred to ‘tall’ or ‘long’ form data because of its shape. By contrast you may have ‘wide’ data which, for example, has columns containing measurements of the same variable but taken at different time points.

Both wide or tidy are valid ways to store data but I’d argue that tidy data is easier to manipulate. This is the philosophy behind the omnibus of packages that constitutes the tidyverse bundle of R packages.

2 COVID-19 time-series

The example dataset used in this tutorial is curated by the Center for Systems Science and Engineering at Johns Hopkins University, Whiting School of Engineering.

It is time series data updated nightly with the count of COVID-19 cases globally, grouped by various geographical regions. We can read the .CSV directly into a data frame.

In this tutorial we’ll just examine the counts for Australian cases only.

Notice that the column headers (aside from the first) are dates in m/d/yy format. This is a clue that this data frame is in ‘wide’ format since the column names, rather than being generic variable names, are themselves encoded with data about the observations.

In other words, these headers are not just labels indicating which column contains dates, they themselves are the dates.

3 Pivoting Longer

Let’s reorganise this data frame into ‘long’ format using the pivot_longer() function in tidyr.

The cols = argument specifies which columns we want to pivot. Here we want to pivot all columns except the State column. This is achieved using the exclaimation prefix ! before the column name we want to exclude. The names_to = argument indicates the name of the new column for storing what were previously column names. The third argument values_to = is the name of another new column for storing the observation’s value.