Practice data tidying with tidyr
Continue to practice data visualization with ggplot2
Continue to practice data transformation with dplyr
Integrate 1), 2), and 3) to explore the whales
dataset* and the babynames
dataset
* Borrowed from Iain Carmichael’s STOR 390 course.
dplyr
, the data visualization tools in ggplot2
, and the data tidying tools in tidyr
to explore the patterns and trends in the whales
dataset and the babynames
dataset.github_document
, save it in your lab
folder as lab6.Rmd
, and work in this RMarkdown file for the rest of this lab.
Tidy up the messy whales
dataset in breakout rooms.
Instructions:
# Load required packages
library(tidyverse)
library(knitr)
# Read in the data
whales <- read_csv("https://raw.githubusercontent.com/nt246/NTRES-6100-data-science/main/datasets/whales.csv")
whales %>% head() %>% kable()
observer | blue | humpback | southern_right | sei | fin | killer_whale | bowhead | grey |
---|---|---|---|---|---|---|---|---|
1 | 1/20/15, death, , Indian | NA | NA | 8/9/11, injury, , indian | NA | NA | NA | NA |
2 | NA | 8/12/15, death, 50, atlantic | NA | NA | 8/2/13, death, 76, arctic | NA | 6/24/13, injury, 30, artic | NA |
3 | NA | NA | 7/14/13, injury, 47, pacific | NA | NA | NA | NA | NA |
4 | NA | 3/4/12, death, 56, pacific | NA | NA | NA | NA | NA | 5/24/16, death, , pacific |
5 | NA | NA | NA | 6/14/12, injury, 52, indian | NA | NA | NA | NA |
6 | 5/2/16, , 80, pacific | NA | NA | NA | NA | NA | NA | NA |
whales
dataset is a classic example of messy datasets. It was collected as follows: observers are asked for certain information about specific indicents they witnessed of ships striking whales and that information is compiled by whale type. The observers were asked to provide: type of whale, date of event (m/d/yr), outcome of event, approximate length of whale in feet, ocean in which event occurred.View()
, dim()
, colnames()
, and ?
.
whales_long
.whales_long <- whales %>%
pivot_longer(-1, names_to = "species", values_to = "info")
whales_long %>% head() %>% kable()
observer | species | info |
---|---|---|
1 | blue | 1/20/15, death, , Indian |
1 | humpback | NA |
1 | southern_right | NA |
1 | sei | 8/9/11, injury, , indian |
1 | fin | NA |
1 | killer_whale | NA |
whales_long
, create another data frame that includes only events for which there is information. Name this data frame whales_clean
.Hint: is.na()
might be helpful.
whales_clean <- whales_long %>%
filter(!is.na(info))
whales_clean %>% head() %>% kable()
observer | species | info |
---|---|---|
1 | blue | 1/20/15, death, , Indian |
1 | sei | 8/9/11, injury, , indian |
2 | humpback | 8/12/15, death, 50, atlantic |
2 | fin | 8/2/13, death, 76, arctic |
2 | bowhead | 6/24/13, injury, 30, artic |
3 | southern_right | 7/14/13, injury, 47, pacific |
whales_clean
, create another data frame with one variable per type of information, one piece of information per cell. Some cells might be empty. Name this data frame whales_split
.Your new data frame should have six variables: observer, species, date, outcome, size, ocean.
whales_split <- whales_clean %>%
separate(info, c("date", "outcome", "size", "ocean"), ",")
whales_split %>% head() %>% kable()
observer | species | date | outcome | size | ocean |
---|---|---|---|---|---|
1 | blue | 1/20/15 | death | Indian | |
1 | sei | 8/9/11 | injury | indian | |
2 | humpback | 8/12/15 | death | 50 | atlantic |
2 | fin | 8/2/13 | death | 76 | arctic |
2 | bowhead | 6/24/13 | injury | 30 | artic |
3 | southern_right | 7/14/13 | injury | 47 | pacific |
whales_split
, create another data frame in which all columns are parsed as instructed below. Name this data frame whales_parsed
.The columns should parsed to the following types
* observer
: double
* species
: character
* date
: date
* outcome
: character
* size
: integer
* ocean
: character
whales_parsed <- whales_split %>%
type_convert(
col_types = cols(
date = col_date(format = "%m/%d/%y"),
size = col_integer()
)
)
whales_parsed %>% head()
## # A tibble: 6 x 6
## observer species date outcome size ocean
## <dbl> <chr> <date> <chr> <int> <chr>
## 1 1 blue 2015-01-20 death NA Indian
## 2 1 sei 2011-08-09 injury NA indian
## 3 2 humpback 2015-08-12 death 50 atlantic
## 4 2 fin 2013-08-02 death 76 arctic
## 5 2 bowhead 2013-06-24 injury 30 artic
## 6 3 southern_right 2013-07-14 injury 47 pacific
whales_parsed
, print a summary table with: 1) number ship strikes by species, 2) average whale size by species, omitting NA values in the calculation.whales_parsed %>%
group_by(species) %>%
summarise(number_of_ship_strikes = n(), average_size = mean(size, na.rm = T)) %>%
kable()
species | number_of_ship_strikes | average_size |
---|---|---|
blue | 5 | 67.50000 |
bowhead | 5 | 43.75000 |
fin | 4 | 78.50000 |
grey | 7 | 36.83333 |
humpback | 7 | 44.33333 |
killer_whale | 2 | 15.00000 |
sei | 5 | 54.75000 |
southern_right | 7 | 47.00000 |
whales_parsed
as possible in one plot.What are some challenges in this?
whales_parsed %>%
mutate(ocean = ifelse(ocean == "artic", "arctic", ocean)) %>%
ggplot(aes(x=date, y = size, color=outcome)) +
geom_point() +
facet_grid(ocean~species)
## Warning: Removed 8 rows containing missing values (geom_point).
You can continue to work on Exercise 2 if you have finished before the break.
Share your findings, challenges, and questions with the class.
Use data tidying, transformation, and visualization to answer the following questions about baby names in breakout rooms
top boy names | top girl names |
---|---|
Instructions:
# Load required packages
library(babynames) # install.packages("babynames")
babynames %>% head() %>% kable()
year | sex | name | n | prop |
---|---|---|---|---|
1880 | F | Mary | 7065 | 0.0723836 |
1880 | F | Anna | 2604 | 0.0266790 |
1880 | F | Emma | 2003 | 0.0205215 |
1880 | F | Elizabeth | 1939 | 0.0198658 |
1880 | F | Minnie | 1746 | 0.0178884 |
1880 | F | Margaret | 1578 | 0.0161672 |
babynames
dataset provides the number of children of each sex given each name from 1880 to 2017 in the US. All names with more than 5 uses are included. This dataset is provided by the US Social Security Administration.View()
, dim()
, colnames()
, and ?
.
Hint: You can start by finding the 6 most popular names for each sex separately.
# number of passengers in the dataset
top_6_boy_names <- babynames %>%
filter(sex == "M") %>%
group_by(name) %>%
summarise(total_count=sum(n)) %>%
slice_max(order_by = total_count, n = 6)
top_6_girl_names <- babynames %>%
filter(sex == "F") %>%
group_by(name) %>%
summarise(total_count=sum(n)) %>%
slice_max(order_by = total_count, n = 6)
babynames %>%
filter(
(name %in% top_6_boy_names$name & sex == "M") | (name %in% top_6_girl_names$name & sex == "F")
) %>%
ggplot(aes(x=year, y=prop, group=name, color=sex)) +
geom_line() +
facet_wrap(~name)
Note:
slice_max(order_by = total_count, n = 6)
select 6 rows with the highest values in total_count; in this instance, using arrange()
and head()
is equivalent
There will be a more efficient solution after you’ve learned relational data.
Hint: You can create a new variable called decade
. The floor()
function may be helpful in this step.
Hint: To get the most popular names, group_by()
in combination with slice_max()
can be very efficient.
set.seed(42)
babynames %>%
mutate(decade = floor(year/10)*10) %>%
group_by(sex, decade, name) %>%
summarise(total_count = sum(n)) %>%
group_by(sex, decade) %>%
slice_max(order_by = total_count, n=1) %>%
ggplot(aes(x=decade, y=total_count, color=sex)) +
geom_line(size = 1.5) +
geom_point(size = 3)+
ggrepel::geom_label_repel(aes(label=name)) +
cowplot::theme_cowplot()
Note: In this case, slice_max()
cannot be replaced by arrange()
and head()
, because the latter does not work well with group_by()
babynames
dataset.Suggested activities:
Polish your plots in Exercise 2. Try to put more thought into editing the aesthetics of your figures and tables to make them easier to understand and nicer to look at (e.g. choose the most appropriate geometric object, aesthetic mapping, facetting, position adjustment; add meaningful axis labels, figure titles, legend titles; change the background; be creative; etc.).
Read the example code that we provided in Exercise 2. Make sure that you understand each line, and try to reproduce the output/computations on your own.
Think of other interesting questions you can answer with this dataset and explore different strategies for getting your answer.
Share your findings, challenges, and questions with the class.
END LAB 4