Lesson 8: Data Wrangling Part 2

Readings

Required:

Chapter 5.5-5.7 in R for Data Science by Hadley Wickham & Garrett Grolemund

Other resources:

The Introduction to dplyr vignette
Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr
Software Carpentry’s R for reproducible scientific analysis materials: Dataframe manipulation with dplyr

Learning objectives

Last class, we learned how to use dplyr functions

filter() for subsetting data with row logic
select() for subsetting data variable- or column-wise
Use piping (%>%) to implement function chains

Today, we’ll expand our data wrangling toolbox. By the end of today’s class, you should be able to:

Subset, rearrange, and summarize data with key dplyr functions:
- Create new variables with functions of existing variables with mutate()
- Reorder the rows with arrange()
- Collapse many values down to a single summary with summarize() and group_by()
Understand the basic differences between tidyverse and base R syntax

Acknowledgements: Today’s lecture is adapted (with permission) from the excellent Ocean Health Index Data Science Training with additional input from Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr and Grolemund and Wickham’s R for Data Science.

Getting set up a reloading the Coronavirus dataset

Let’s jump back in where we left on Monday. Let’s first clear out our workspace so we start with a fresh session by clicking “Session” -> “Restart R”. Then let’s open the R-script we were using to take notes, pull from GitHub to make sure we have the most recent version. You can use this script to type along as we’re working through demos today (if you want, it’s also fine to just watch).

Today we’ll also practice combining text and code in R Markdown files, so we’ll do our in-class exercises in an R Markdown file. Do you remember how to create a new RMarkdown file? Go File -> New File -> R Markdown. Then change the output to GitHub document either as you’re setting up the file or by manually editing the YAML header to say output: github_document. Now, delete the boilerplate text after the first setup code chunk and copy today’s exercise questions into your document from here. As we work through the exercises, you will want to add a code chunk under each question to complete your answer.

Finally, load the Coronavirus dataset back in directly from the GitHub URL and see whether it has been updated - what is the latest date included?

library(tidyverse)     ## install.packages("tidyverse")
library(skimr)        ## install.packages("skimr")

# read in corona .csv (don't worry for now about what the col_types parameter means, we'll discuss that next week)
coronavirus <- read_csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv')

Let’s remind ourselves of the data structure and content

skim(coronavirus)

Warm up - Exercise 1: Piping together `select()` and `filter()` commands

Subset the coronavirus dataset to only include the daily counts of confirmed cases in countries located above 60 degree latitude. What are those countries?

If you have time, pipe it into ggplot() to visualize the trends over time in these countries.

Answer

click to expand

# One way to do this:

coronavirus %>% 
  filter(lat > 60, type == "confirmed") %>% 
  select(country) %>% 
  table()


# If you try to plot the counts by date, you may see a very rugged pattern for Canada because case counts here are reported for different provinces. To get total daily counts by country we'll have to first group_by country and date and then summarize the counts, as outlined below.

`mutate()` adds new variables

Alright, let’s keep going.

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

Visually, we are doing this (thanks RStudio for your cheatsheet):

The current variables in the coronavirus dataset don’t lend themselves well to cross-computation, so to illustrate the power of the mutate() function, let’s reformat the dataset so that we get the counts of confirmed cases, deaths and recovered for each date and country in separate columns. The tidyverse has a very convenient function for making that kind of transformation. Don’t worry about how it works right now, we’ll get an opportunity to explore it in a few weeks.

For now, just copy the following code to summarize the total number of cases recorded by country and type (in the time period covered by this dataset: 2020-01-22 to 2021-04-10):

coronavirus_ttd <- coronavirus %>% 
  select(country, type, cases) %>%
  group_by(country, type) %>%
  summarize(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type,
              values_from = total_cases) %>%
  arrange(-confirmed)

# Let's have a look at the structure of that new rearranged dataset
coronavirus_ttd

Imagine we want to compare the total death count to total the number of confirmed cases in each country. We can divide the case counts of death by confirmed to create a new column named deathrate. We do this with mutate() that is a function that defines and inserts new variables into a tibble. You can refer to existing variables diretly by name (i.e. without the $ operator).

coronavirus_ttd %>%
  mutate(deathrate = death / confirmed) 

# We can modify the mutate equation in many ways. For example, if we want to adjust the number of significant digits printed, we can type
coronavirus_ttd %>%
  mutate(deathrate = round(death / confirmed, 2))

Note, however, that these estimated death rates may be misleading and should be interpreted with due caution as testing strategies have varied a lot between countries (e.g. do asymptomatic people get tested). Also:

The comparison of total counts of confirmed cases and deaths in different countries is not an apple to apple comparison, as the outbreak did not start at the same time in all the affected countries.
As age plays a critical role in the probability of survival from the virus, we cannot make a comparison between different cases without having more demographic information.

Your turn - Exercise 2

Add a new variable that shows the proportion of confirmed cases for which the outcome is still unknown (i.e. not counted as dead or recovered) for each country and show only countries with more than 3 million confirmed cases. Which country has the highest proportion of undetermined outcomes? Why might that be?

When you’re done, sync your RMarkdown file to Github.com (pull, stage, commit, push).

Answer

click to expand

coronavirus_ttd %>%
  mutate(undet = (confirmed - death - recovered) / confirmed) %>% 
  filter(confirmed > 3000000)

`arrange()` orders rows

For examining the output of our previous calculations, we may want to re-arrange the countries in ascending order for the proportion of confirmed cases for which the outcome remains unknown. The dplyr function for sorting rows is arrange().

coronavirus_ttd %>%
  mutate(undet = (confirmed - death - recovered)/confirmed) %>% 
  filter(confirmed > 3000000) %>% 
  arrange(undet)

I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.

Your turn - Exercise 3

How many countries have suffered more than 100,000 deaths so far and which five countries have recorded the highest death counts?

Answer

click to expand

coronavirus_ttd %>%
  filter(death > 100000) %>% 
  arrange(-death)

Your turn again - Exercise 4

Go back to our original dataset coronavirus and identify where and when the highest death count in a single day was observed. Hint: you can either use or base::max or dplyr::arrange().

The first case was confirmed in the US on January 20 2020, two days before the earliest day included in this dataset. When was the first confirmed case recorded in Canada?

Answer

click to expand

# Identifying the record with the highest death count
coronavirus %>% 
  filter(type == "death") %>% 
  arrange(-cases)

# We can also just identify the top hit 
coronavirus %>% 
  filter(type == "death") %>% 
  filter(cases == max(cases))

# The first recorded case in Canada
coronavirus %>% 
  filter(country == "Canada", cases > 0) %>% 
  arrange(date)

Knit your RMarkdown file, and sync it to GitHub (pull, stage, commit, push)

Grouped summaries with `summarize()` and `group_by`

The last key dplyr verb is summarize(). It collapses a data frame to a single row. Visually, we are doing this (thanks RStudio for your cheatsheet):

We can use it to calculate the total number of confirmed cases detected globally since 1-22-2020 (the beginning of this dataset)

coronavirus %>% 
  filter(type == "confirmed") %>% 
  summarize(sum = sum(cases))

This number could also easily have been computed with base-R functions. In general, summarize() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by country, we get the total number of confirmed cases for each country or region.

coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>% 
  summarize(total_cases = sum(cases))

Now that’s a lot more useful!

We can also use summarize() to check how many observations (dates) we have for each country

coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>% 
  summarize(n = n())

Why do some countries have much higher counts than others?

We can also do multi-level grouping. If we wanted to know how many of each type of case there were globally on Monday (March 1) we could chain these functions together:

coronavirus %>% 
  group_by(date, type) %>% 
  summarize(total = sum(cases)) %>%  # sums the count across countries
  filter(date == "2021-03-01")

Your turn - Exercise 5

Which day has had the highest total death count globally so far?

Pipe your global daily death counts into ggplot to visualize the trend over time.

Answer

click to expand

coronavirus %>% 
  filter(type == "death") %>% 
  group_by(date) %>% 
  summarize(total_deaths = sum(cases)) %>% 
  arrange(-total_deaths)

# Or

coronavirus %>% 
  filter(type == "death") %>% 
  group_by(date) %>% 
  summarize(total_deaths = sum(cases)) %>% 
  filter(total_deaths == max(total_deaths))

# With plotting

coronavirus %>% 
  filter(type == "death") %>% 
  group_by(date) %>% 
  summarize(total_deaths = sum(cases)) %>% 
  arrange(-total_deaths) %>% 
  ggplot() +
    geom_line(aes(x = date, y = total_deaths))

If you have more time, here is an optional question

The month() function from the package lubridate extracts the month from a date and year() extracts the year. How many countries already have more than 10,000 deaths in February of this year?

Answer

click to expand

library(lubridate) #install.packages('lubridate')

coronavirus %>% 
  mutate(month = month(date), year = year(date)) %>% 
  filter(type == "death", month == 2, year == 2021) %>% 
  group_by(country) %>% 
  summarize(total_death = sum(cases)) %>% 
  filter(total_death > 10000)

Extra in-class questions

Which country had the highest number of deaths on October 4 2020?

Answer

click to expand

coronavirus %>% 
  select(-lat, -long) %>% 
  filter(date == "2020-10-04", type == "death") %>% 
  arrange(-cases)

Which country had the highest count of confirmed cases in January of this year? [Hint: to address this question the functions `month()` and `year()` from the package lubridate might be helpful]. What about in March?

Answer

click to expand

library(lubridate) #install.packages('lubridate')

coronavirus %>% 
  mutate(month = month(date), year = year(date)) %>% 
  filter(type == "confirmed", month == 1, year == 2021) %>% 
  group_by(country) %>% 
  summarize(total_death = sum(cases)) %>% 
  arrange(-total_death)

If you’re used to working in base R, answer the same question with base R tools. Which coding approach do you like better or what are pros and cons of the two types of syntax?

Which countries have data for multiple states or provinces?

Answer

click to expand

coronavirus %>% 
  group_by(country, date) %>% 
  summarize(n = n()) %>% 
  group_by(country) %>% 
  summarize(maxcount = max(n)) %>% 
  filter(maxcount > 3)

Lesson 8: Data Wrangling Part 2

Readings

Required:

Other resources:

Learning objectives

Getting set up a reloading the Coronavirus dataset

Warm up - Exercise 1: Piping together select() and filter() commands

Answer

mutate() adds new variables

Your turn - Exercise 2

Answer

arrange() orders rows

Your turn - Exercise 3

Answer

Your turn again - Exercise 4

Answer

Grouped summaries with summarize() and group_by

Your turn - Exercise 5

Answer

If you have more time, here is an optional question

Answer

Extra in-class questions

Which country had the highest number of deaths on October 4 2020?

Which country had the highest count of confirmed cases in January of this year? [Hint: to address this question the functions month() and year() from the package lubridate might be helpful]. What about in March?

Which countries have data for multiple states or provinces?

Warm up - Exercise 1: Piping together `select()` and `filter()` commands

`mutate()` adds new variables

`arrange()` orders rows

Grouped summaries with `summarize()` and `group_by`

Which country had the highest count of confirmed cases in January of this year? [Hint: to address this question the functions `month()` and `year()` from the package lubridate might be helpful]. What about in March?