4 Wrangling Data

Setup Code (Click to Expand)

# packages needed to run the code in this section
# install.packages(c("tidyverse", "here", "janitor", "remotes"))
# remotes::install_github("NHS-South-Central-and-West/scwplot")
# remotes::install_github("rOpenSci/fingertipsR")

The first step in any analysis is to find the right data for the problem at hand, and to transform that data so that it is structured appropriately for the method of analysis. Finding the right data can be challenging, sometimes requiring getting data from multiple sources, and even then, the data might be a mess, with missing values, duplicates, formatting that makes it difficult to work with in R or Python, and a myriad of other problems. The process of transforming data to make it suitable for analysis is often referred to as data wrangling, or data munging. The data wrangling process can take many forms, from simple data cleaning to more complex transformations like combining multiple data sets, or creating new variables to enrich the data.

4.1 R Packages for Data Wrangling

Data wrangling is a task that R is extremely well suited for, particularly thanks to the tidyverse ecosystem, which provides several packages that can be used to do a wide variety of data wrangling tasks, including:

readr - package for reading tabular/rectangular data (e.g. csv files).
readxl - package for reading Excel files.
dplyr - set of functions for data manipulation (filtering rows, selecting columns, re-ordering rows, adding new variables with functions of existing variables, and collapsing many values down to a single summary).
tidyr - set of functions for data transformation (changing the representation of a dataset from wide to long, or vice versa, and splitting and combining columns).
stringr - tools for working with strings.
lubridate - tools for dealing with dates and times.

In addition to the tidyverse packages, there are a number of other packages that come in handy when dealing with data, including:

here - package that simplifies how you access files on your local machine by using the relative path based on the location of your R project.
janitor - set of tools for data cleaning.

Finally, there are a number of packages that help you work with particularly large data sets:

vroom - package that reads files quickly using the ALTREP framework.
data.table - package that provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.
arrow - package for using Apache Arrow, which handles in-memory and larger-than-memory data very quickly.
sparklyr - R interface for Apache Spark.

Although the tidyverse packages are very popular, they are not required to work with data in R. There will be equivalent ways to do things using only base R functions too. However, I think the tidyverse is particularly user-friendly, especially for those that are coming to R from SQL, so I will focus on data wrangling using a tidyverse workflow in this guide.

4.2 Fingertips Data

We will use Public Health England’s Fingertips API (for which there are packages available in R and Python) as the data source for this guide, as it represents a good opportunity to give a brief overview of this data and how to use it. Fingertips is a repository of public health data on a wide range of topics. The data included in this tutorial relates to the wider determinants of health.

First, we can look at what is available. The fingertipsR::indicators_unique() function returns a list of all unique indicators from the Fingerips API, as well as their indicator IDs. We can use the View() function to open the data in a new window and take a closer look at what is available.

library(fingertipsR)

inds <- fingertipsR::indicators_unique()

# View(inds)

If we were interested in a particular indicator, this would be a good way of finding the ID and getting the data. However, if we’re looking for a group of indicators about a particular topic, then the Fingertips profiles are going to be more useful.

profiles <- fingertipsR::profiles()

# View(profiles)

From the profiles, we can identify topics that are of particular interest, for example the Public Health Outcomes Framework (profile ID = 19).

phof_inds <- fingertipsR::indicators(ProfileID=19)

# View(phof_inds)

There are lots of indicators available from this profile. We can also check the area types that the indicators in the Public Health Outcomes Framework are available for, using either the fingerprintsR::area_types() function, with the ProfileID argument set to the profile we are interested in, or the fingertipsR::indicator_area_types() function, with the IndicatorID argument set to one of the indicators that we are interested in.

fingertipsR::area_types(ProfileID=19) |>
  dplyr::select(AreaTypeID, AreaTypeName)

   AreaTypeID                             AreaTypeName
1           6                 Government Office Region
2         301 Lower tier local authorities (4/20-3/21)
3         301 Lower tier local authorities (4/20-3/21)
4         301 Lower tier local authorities (4/20-3/21)
5         301 Lower tier local authorities (4/20-3/21)
6         301 Lower tier local authorities (4/20-3/21)
7         301 Lower tier local authorities (4/20-3/21)
8         301 Lower tier local authorities (4/20-3/21)
9         302 Upper tier local authorities (4/20-3/21)
10        302 Upper tier local authorities (4/20-3/21)
11        302 Upper tier local authorities (4/20-3/21)
12        302 Upper tier local authorities (4/20-3/21)
13         15                                  England
14        501 Lower tier local authorities (post 4/23)
15        501 Lower tier local authorities (post 4/23)
16        501 Lower tier local authorities (post 4/23)
17        501 Lower tier local authorities (post 4/23)
18        501 Lower tier local authorities (post 4/23)
19        502 Upper tier local authorities (post 4/23)
20        502 Upper tier local authorities (post 4/23)
21        502 Upper tier local authorities (post 4/23)
22        502 Upper tier local authorities (post 4/23)

fingertipsR::indicator_areatypes(IndicatorID=phof_inds$IndicatorID[1])

# A tibble: 10 × 2
   IndicatorID AreaTypeID
         <int>      <int>
 1       90362        502
 2       90362        501
 3       90362        402
 4       90362        401
 5       90362        302
 6       90362        301
 7       90362        202
 8       90362        102
 9       90362         15
10       90362          6

The list of area types is quite long, but from the indicator_area_types() function, we can see that there are six area types available for the first indicator in the list of profile indicators. The area types ending in 2 are all upper tier local authorities (the different types being different local authority boundaries from different points in time), while the other two area types refer to Government Office Regions and England as a whole.

We will use the most recent upper tier local authority boundaries, which have the AreaTypeID, 502.

Having identified the profile and area type we are interested in, we can now pull the data from the Fingertips API using the fingertipsR::fingertips_data() function.

# pull wider determinants raw data from fingertips
phof_raw <-
  fingertipsR::fingertips_data(
    ProfileID = 19,
    AreaTypeID = 502
  )

dplyr::glimpse(phof_raw)

Rows: 453,531
Columns: 27
$ IndicatorID                         <int> 90362, 90362, 90362, 90362, 90362, 90362, 90362, 90362, 90362, 90362, 9036…
$ IndicatorName                       <chr> "A01a - Healthy life expectancy at birth", "A01a - Healthy life expectancy…
$ ParentCode                          <chr> NA, NA, "E92000001", "E92000001", "E92000001", "E92000001", "E92000001", "…
$ ParentName                          <chr> NA, NA, "England", "England", "England", "England", "England", "England", …
$ AreaCode                            <chr> "E92000001", "E92000001", "E06000001", "E06000002", "E06000003", "E0600000…
$ AreaName                            <chr> "England", "England", "Hartlepool", "Middlesbrough", "Redcar and Cleveland…
$ AreaType                            <chr> "England", "England", "Counties & UAs (from Apr 2023)", "Counties & UAs (f…
$ Sex                                 <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
$ Age                                 <chr> "All ages", "All ages", "All ages", "All ages", "All ages", "All ages", "A…
$ CategoryType                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Category                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Timeperiod                          <chr> "2009 - 11", "2009 - 11", "2009 - 11", "2009 - 11", "2009 - 11", "2009 - 1…
$ Value                               <dbl> 63.02647, 64.03794, 56.73376, 58.01944, 59.20328, 61.39797, 64.08556, 59.3…
$ LowerCI95.0limit                    <dbl> 62.87787, 63.88135, 55.13755, 56.43198, 57.51456, 59.75381, 62.56485, 57.6…
$ UpperCI95.0limit                    <dbl> 63.17508, 64.19453, 58.32998, 59.60690, 60.89201, 63.04214, 65.60628, 61.0…
$ LowerCI99.8limit                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ UpperCI99.8limit                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Count                               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Denominator                         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Valuenote                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ RecentTrend                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ComparedtoEnglandvalueorpercentiles <chr> "Not compared", "Not compared", "Worse", "Worse", "Worse", "Similar", "Sim…
$ Comparedtopercentiles               <chr> "Not compared", "Not compared", "Not compared", "Not compared", "Not compa…
$ TimeperiodSortable                  <int> 20090000, 20090000, 20090000, 20090000, 20090000, 20090000, 20090000, 2009…
$ Newdata                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Comparedtogoal                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Timeperiodrange                     <chr> "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3y", "3…

phof_raw |>
  dplyr::distinct(IndicatorName) |>
  head()

                                                  IndicatorName
1                       A01a - Healthy life expectancy at birth
2                               A01b - Life expectancy at birth
3 A02b - Inequality in healthy life expectancy at birth ENGLAND
4                                  A01b - Life expectancy at 65
5      A02c - Inequality in healthy life expectancy at birth LA
6                 A02a - Inequality in life expectancy at birth

With the raw data imported, we are now able to start the process of cleaning and wrangling the data into a format that is ready for analysis.

4.3 Data Wrangling

First, we will clean the column names using the janitor::clean_names() function. This will convert all column names to lower case, remove any spaces and replace them with underscores, and remove any special characters. This will make it easier to work with the data in the future.

phof_raw <-
  phof_raw |>
  janitor::clean_names()

We’ve got a total of 168 indicators, which means some of the indicators from the Public Health Outcomes Framework are not available for upper tier local authorities. We can filter the data to just the indicators we are interested in (as well as filtering any rows where the value is missing), using the dplyr::filter() function. We can also use the dplyr::group_by() and dplyr::summarise() functions to get the mean value of the indicators across all areas in the data. Below is an example:

phof_raw |>
  dplyr::filter(
    indicator_id %in% c(90366, 20101, 11601) &
    !is.na(value)) |>
    dplyr::group_by(indicator_name) |>
    dplyr::summarise(mean(value))

# A tibble: 3 × 2
  indicator_name                                                    `mean(value)`
  <chr>                                                                     <dbl>
1 A01b - Life expectancy at birth                                           80.3 
2 B16 - Utilisation of outdoor space for exercise or health reasons         16.2 
3 C04 - Low birth weight of term babies                                      2.87

The data pulled from the Fingertips API has a number of idiosyncracies that need addressing when wrangling the data. For example, the timeperiod_sortable column refers to the year that the data relates to, and it contains four superfluous zeros at the end of each year (i.e, 20230000). We can use the dplyr::mutate() function to create a year column which strips out the zeros from the timeperiod_sortable column (using the stringr::str_remove_all() function).

# filter for the indicators of interest and wrangle data into tidy structure
phof_raw |>
  dplyr::filter(indicator_id %in% c(90366, 20101, 11601) & area_code == "E92000001") |>
  dplyr::group_by(indicator_name, area_code, timeperiod_sortable) |>
  dplyr::summarise(value = mean(value)) |>
  dplyr::mutate(year = stringr::str_remove_all(timeperiod_sortable, "0000")) |>
  dplyr::select(indicator_name, area_code, year, value)

# A tibble: 43 × 4
# Groups:   indicator_name, area_code [3]
   indicator_name                  area_code year  value
   <chr>                           <chr>     <chr> <dbl>
 1 A01b - Life expectancy at birth E92000001 2001   78.4
 2 A01b - Life expectancy at birth E92000001 2002   78.7
 3 A01b - Life expectancy at birth E92000001 2003   79.0
 4 A01b - Life expectancy at birth E92000001 2004   79.4
 5 A01b - Life expectancy at birth E92000001 2005   79.6
 6 A01b - Life expectancy at birth E92000001 2006   79.8
 7 A01b - Life expectancy at birth E92000001 2007   80.1
 8 A01b - Life expectancy at birth E92000001 2008   80.4
 9 A01b - Life expectancy at birth E92000001 2009   80.7
10 A01b - Life expectancy at birth E92000001 2010   81.0
# ℹ 33 more rows

We can also remove the ID that appended to the beginning of each indicator name. We can use str_remove_all() again, using a regular expression to remove everything up to and including the dash in the indicator name (and the space immediately after).

phof_raw |>
  dplyr::filter(indicator_id %in% c(90366, 20101, 11601)) |>
  dplyr::mutate(indicator_name = stringr::str_remove(indicator_name, "^[^-]+- ")) |>
  dplyr::distinct(indicator_name)

                                               indicator_name
1                                    Life expectancy at birth
2 Utilisation of outdoor space for exercise or health reasons
3                             Low birth weight of term babies

Finally, the data is structured in a wide format, with each indicator having its own column. This is not a tidy structure, as each column should represent a variable, and each row should represent an observation. We can use the tidyr::pivot_wider() function to convert the data into a tidy structure, with each indicator having its own row. We can also use the dplyr::rename() function to rename the columns to something more meaningful, and the tidyr::drop_na() function to remove any rows where the value is missing.

phof_raw |>
  dplyr::filter(indicator_id %in% c(90366, 20101, 11601)) |>
  dplyr::select(indicator_id, indicator_name, area_code, timeperiod_sortable, sex, value) |>
  tidyr::pivot_wider(
    names_from = indicator_name,
    values_from = value
  ) |>
  dplyr::rename(
      life_expectancy = `A01b - Life expectancy at birth`,
      low_birth_weight = `C04 - Low birth weight of term babies`,
      outdoor_space = `B16 - Utilisation of outdoor space for exercise or health reasons`
  )

# A tibble: 9,791 × 7
   indicator_id area_code timeperiod_sortable sex    life_expectancy outdoor_space low_birth_weight
          <int> <chr>                   <int> <chr>  <list>          <list>        <list>          
 1        90366 E92000001            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 2        90366 E92000001            20010000 Female <dbl [1]>       <NULL>        <NULL>          
 3        90366 E06000001            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 4        90366 E06000002            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 5        90366 E06000003            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 6        90366 E06000004            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 7        90366 E06000005            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 8        90366 E06000006            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
 9        90366 E06000007            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
10        90366 E06000008            20010000 Male   <dbl [1]>       <NULL>        <NULL>          
# ℹ 9,781 more rows

We can combine all of the above steps into a single function, which we can then apply to the raw data to get the data into a tidy structure.

# function to wrangle data into tidy structure
wrangle_phof_data <- function(data) {
  data |>
    dplyr::filter(indicator_id %in% c(90366, 20101, 11601)) |>
    dplyr::group_by(indicator_name, area_code, timeperiod_sortable) |>
    dplyr::summarise(value = mean(value)) |>
    dplyr::mutate(year = stringr::str_remove_all(timeperiod_sortable, "0000")) |>
    dplyr::select(indicator_name, area_code, year, value) |>
    tidyr::pivot_wider(
      names_from = indicator_name,
      values_from = value
    ) |>
    dplyr::rename(
      life_expectancy = `A01b - Life expectancy at birth`,
      low_birth_weight = `C04 - Low birth weight of term babies`,
      outdoor_space = `B16 - Utilisation of outdoor space for exercise or health reasons`
    ) |>
    tidyr::drop_na()
}

phof <- wrangle_phof_data(phof_raw)

4.4 Next Steps

We have successfully imported, cleaned, and wrangled Fingertips data into a tidy structure. We can now move on to the next step of the data science process, which is to explore the data and perform some analysis.

We could save the data to a CSV file at this point, using the readr::write_csv() function (for example, readr::write_csv(phof, here::here("data/phof.csv"))), or we could perform the analysis on the dataframe object we have created.

4.1 R Packages for Data Wrangling

4.2 Fingertips Data

4.3 Data Wrangling

4.4 Next Steps

4.5 Resources