Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr
Software Carpentry’s R for reproducible scientific analysis materials: Dataframe manipulation with dplyr
Last class, we learned how to use dplyr
functions
filter()
for subsetting data with row logicselect()
for subsetting data variable- or column-wise%>%
) to implement function chainsToday, we’ll expand our data wrangling toolbox. By the end of today’s class, you should be able to:
dplyr
functions:
mutate()
arrange()
summarize()
and group_by()
Acknowledgements: Today’s lecture is adapted (with permission) from the excellent Ocean Health Index Data Science Training with additional input from Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr and Grolemund and Wickham’s R for Data Science.
select()
and filter()
commandsSubset the coronavirus dataset to only include the daily counts of confirmed cases in countries located above 60 degree latitude. What are those countries?
If you have time, pipe it into ggplot() to visualize the trends over time in these countries.
# One way to do this:
coronavirus %>%
filter(lat > 60, type == "confirmed") %>%
select(country) %>%
table()
# If you try to plot the counts by date, you may see a very rugged pattern for Canada because case counts here are reported for different provinces. To get total daily counts by country we'll have to first group_by country and date and then summarize the counts, as outlined below.
mutate()
adds new variablesAlright, let’s keep going.
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate()
.
Visually, we are doing this (thanks RStudio for your cheatsheet):
The current variables in the coronavirus dataset don’t lend themselves well to cross-computation, so to illustrate the power of the mutate()
function, let’s reformat the dataset so that we get the counts of confirmed cases, deaths and recovered for each date and country in separate columns. The tidyverse has a very convenient function for making that kind of transformation. Don’t worry about how it works right now, we’ll get an opportunity to explore it in a few weeks.
For now, just copy the following code to summarize the total number of cases recorded by country and type (in the time period covered by this dataset: 2020-01-22 to 2021-04-10):
coronavirus_ttd <- coronavirus %>%
select(country, type, cases) %>%
group_by(country, type) %>%
summarize(total_cases = sum(cases)) %>%
pivot_wider(names_from = type,
values_from = total_cases) %>%
arrange(-confirmed)
# Let's have a look at the structure of that new rearranged dataset
coronavirus_ttd
Imagine we want to compare the total death count to total the number of confirmed cases in each country. We can divide the case counts of death
by confirmed
to create a new column named deathrate
. We do this with mutate()
that is a function that defines and inserts new variables into a tibble. You can refer to existing variables diretly by name (i.e. without the $
operator).
coronavirus_ttd %>%
mutate(deathrate = death / confirmed)
# We can modify the mutate equation in many ways. For example, if we want to adjust the number of significant digits printed, we can type
coronavirus_ttd %>%
mutate(deathrate = round(death / confirmed, 2))
Note, however, that these estimated death rates may be misleading and should be interpreted with due caution as testing strategies have varied a lot between countries (e.g. do asymptomatic people get tested). Also:
Add a new variable that shows the proportion of confirmed cases for which the outcome is still unknown (i.e. not counted as dead or recovered) for each country and show only countries with more than 3 million confirmed cases. Which country has the highest proportion of undetermined outcomes? Why might that be?
When you’re done, sync your RMarkdown file to Github.com (pull, stage, commit, push).
coronavirus_ttd %>%
mutate(undet = (confirmed - death - recovered) / confirmed) %>%
filter(confirmed > 3000000)
arrange()
orders rowsFor examining the output of our previous calculations, we may want to re-arrange the countries in ascending order for the proportion of confirmed cases for which the outcome remains unknown. The dplyr
function for sorting rows is arrange()
.
coronavirus_ttd %>%
mutate(undet = (confirmed - death - recovered)/confirmed) %>%
filter(confirmed > 3000000) %>%
arrange(undet)
I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.
How many countries have suffered more than 100,000 deaths so far and which five countries have recorded the highest death counts?
coronavirus_ttd %>%
filter(death > 100000) %>%
arrange(-death)
- Go back to our original dataset
coronavirus
and identify where and when the highest death count in a single day was observed. Hint: you can either use orbase::max
ordplyr::arrange()
.
- The first case was confirmed in the US on January 20 2020, two days before the earliest day included in this dataset. When was the first confirmed case recorded in Canada?
# Identifying the record with the highest death count
coronavirus %>%
filter(type == "death") %>%
arrange(-cases)
# We can also just identify the top hit
coronavirus %>%
filter(type == "death") %>%
filter(cases == max(cases))
# The first recorded case in Canada
coronavirus %>%
filter(country == "Canada", cases > 0) %>%
arrange(date)
Knit your RMarkdown file, and sync it to GitHub (pull, stage, commit, push)
summarize()
and group_by
The last key dplyr
verb is summarize()
. It collapses a data frame to a single row. Visually, we are doing this (thanks RStudio for your cheatsheet):
We can use it to calculate the total number of confirmed cases detected globally since 1-22-2020 (the beginning of this dataset)
coronavirus %>%
filter(type == "confirmed") %>%
summarize(sum = sum(cases))
This number could also easily have been computed with base-R functions. In general, summarize()
is not terribly useful unless we pair it with group_by()
. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr
verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by country, we get the total number of confirmed cases for each country or region.
coronavirus %>%
filter(type == "confirmed") %>%
group_by(country) %>%
summarize(total_cases = sum(cases))
Now that’s a lot more useful!
We can also use summarize()
to check how many observations (dates) we have for each country
coronavirus %>%
filter(type == "confirmed") %>%
group_by(country) %>%
summarize(n = n())
Why do some countries have much higher counts than others?
We can also do multi-level grouping. If we wanted to know how many of each type of case there were globally on Monday (March 1) we could chain these functions together:
coronavirus %>%
group_by(date, type) %>%
summarize(total = sum(cases)) %>% # sums the count across countries
filter(date == "2021-03-01")
Which day has had the highest total death count globally so far?
Pipe your global daily death counts into ggplot to visualize the trend over time.
coronavirus %>%
filter(type == "death") %>%
group_by(date) %>%
summarize(total_deaths = sum(cases)) %>%
arrange(-total_deaths)
# Or
coronavirus %>%
filter(type == "death") %>%
group_by(date) %>%
summarize(total_deaths = sum(cases)) %>%
filter(total_deaths == max(total_deaths))
# With plotting
coronavirus %>%
filter(type == "death") %>%
group_by(date) %>%
summarize(total_deaths = sum(cases)) %>%
arrange(-total_deaths) %>%
ggplot() +
geom_line(aes(x = date, y = total_deaths))
The month()
function from the package lubridate
extracts the month from a date and year()
extracts the year. How many countries already have more than 10,000 deaths in February of this year?
library(lubridate) #install.packages('lubridate')
coronavirus %>%
mutate(month = month(date), year = year(date)) %>%
filter(type == "death", month == 2, year == 2021) %>%
group_by(country) %>%
summarize(total_death = sum(cases)) %>%
filter(total_death > 10000)
coronavirus %>%
select(-lat, -long) %>%
filter(date == "2020-10-04", type == "death") %>%
arrange(-cases)
month()
and year()
from the package lubridate might be helpful]. What about in March?library(lubridate) #install.packages('lubridate')
coronavirus %>%
mutate(month = month(date), year = year(date)) %>%
filter(type == "confirmed", month == 1, year == 2021) %>%
group_by(country) %>%
summarize(total_death = sum(cases)) %>%
arrange(-total_death)
If you’re used to working in base R, answer the same question with base R tools. Which coding approach do you like better or what are pros and cons of the two types of syntax?
coronavirus %>%
group_by(country, date) %>%
summarize(n = n()) %>%
group_by(country) %>%
summarize(maxcount = max(n)) %>%
filter(maxcount > 3)