NTRES 6100
Collaborative and Reproducible Data Science in R
Cornell University, Spring 2021

Course Info

All class activities are online.

Lectures: Mondays and Wednesdays 2:45pm - 4:00pm (Feb 8 - Apr 16, 2021)

Optional lab sessions: Fridays 12:25-2:20pm

Instructor: Assistant Professor Nina Overgaard Therkildsen ()

TA: PhD Student Nicolas Lou ()

Office hours: Nina: by appointment; Nicolas: Tuesday 2:30-3:30 or by appointment

Grading: S/U (2 credits / 3 credits with lab)


Course description

As datasets grow larger and more complex across all areas of science, computational skills are increasingly in high demand. This course introduces a series of practical tools that enable researchers to spend less time wrestling with software or repeating error-prone manual data processing and more time getting research done in efficient and transparent ways that facilitate collaboration and reproducibility. We will work in R/RStudio, primarily with the tidyverse packages and with Git and GitHub integration. Topics covered include 1) tidy data formatting, 2) rearrangement, filtering, exploration, and visualization of complex datasets, 3) basic programming for building and automating custom tools, 4) tracking the history of file changes (version control) with Git and GitHub, 5) strategies for effective collaboration on data processing pipelines, and 6) using R Markdown to combine text, equations, code, tables, and figures into reports, websites, and presentations. The course emphasizes practical skill development and will be structured around hands-on (the keyboard) learning.


Learning outcomes

By the end of this course, students will be able to:

  • Describe strategies for ensuring that their data analysis is reproducible
  • Demonstrate best practices for coding and project-oriented workflows in RStudio
  • Import and clean messy data files using a variety of packages and functions in R
  • Subset, reorganize, and merge diverse datasets in R
  • Effectively explore and visualize patterns in complex datasets with ggplot in R
  • Write simple functions/programs and data analysis pipelines in R
  • Automate repeated analysis tasks in R
  • Track the history of file changes (version control) and collaborate effectively on scripts with others with Git and GitHub
  • Use R Markdown to combine text, equations, code, tables, and figures into reports, websites, and presentations

Prerequisites

A basic working knowledge of R will be helpful, but no prior experience with the tidyverse packages or with Git, GitHub, or R Markdown is assumed. If you have never worked in R before, we recommend working through one or more of the following tutorials before the course:


Course format

The two weekly lectures will introduce new concepts and provide opportunities to practice through hands-on exercises. To participate effectively, you must have completed the assigned readings prior to class. Each Wednesday, we will assign a problem set that applies the concepts covered in class in a new context to reinforce your learning. The problem sets are due the following Wednesday at 10pm. We offer an optional Friday afternoon lab session for more opportunities to practice in groups and with TA support.


Evaluation

It takes practice to acquire and internalize data science skills, and what you get out of this course will be proportional to the effort you put in. Practice as much as you can. To pass, students are expected to attend all lectures (and lab sessions if enrolled), participate actively during class, and submit at least 7 of the 9 problem sets with demonstrated effort to complete all questions. If you are unable to make a lecture or can not meet a problem set deadline, please email the instructor beforehand.


Course materials

All assigned readings are freely available online and will be linked to from the course website. We will draw from a variety of sources, primarily Grolemund and Wickham’s R For Data Science and the STAT545 course developed by Jenny Bryan.

All students will need to bring a laptop to each session. Students who do not have their own laptop can arrange to borrow one from the Mann Library.

Please follow the instructions here to install the software we will need prior to the first class.


Code of conduct

We are dedicated to providing a welcoming and supportive environment for everyone, regardless of background, identity and prior experience level. Everyone in this course will be coming from a different place with different experiences and expectations. We will not tolerate any form of language or behavior used to exclude, intimidate, or cause discomfort. This applies to all course participants (instructor, students, guests). In order to foster a positive and professional learning environment, we encourage the following kinds of behaviors:

  • Use welcoming and inclusive language
  • Be respectful of different viewpoints and experiences
  • Gracefully accept constructive criticism
  • Show courtesy and respect towards others
  • Help each other - you may well learn something or reinforce your own skills in the process

Student accommodations

In compliance with the Cornell University policy and equal access laws, we are available to discuss appropriate academic accommodations that may be required for student with disabilities. Requests for academic accommodations are to be made during the first two weeks of the course, except for unusual circumstances, so arrangements can be made. Students are encouraged to register with Student Disability Services to verify their eligibility for appropriate accommodations.


Tentative schedule

Subject to adjustment

Class# Day Date Topic Assignment due dates
1 Mon 2/8 Intro to the course and R/RStudio
2 Wed 2/10 Markdown and GitHub
Fri 2/12 Lab 1
3 Mon 2/15 The Git workflow (version control)
4 Wed 2/17 Collaborating with GitHub Part 1 Assignment 1
Fri 2/19 Lab 2
5 Mon 2/22 Collaborating with GitHub Part 2
6 Wed 2/24 Plotting with ggplot part 1 Assignment 2
Fri 2/26 Lab 3
7 Mon 3/1 Data wrangling part 1 (dplyr::filter, mutate, select, arrange)
8 Wed 3/3 Data wrangling part 2 (dplyr::summarize, group_by) Assignment 3
Fri 3/5 Lab 4
9 Mon 3/8 Plotting with ggplot part 2 + good coding practices
Wed 3/10 NO CLASS - Cornell Wellness Break
Fri 3/12 Lab 5 Assignment 4
10 Mon 3/15 Tidy data
11 Wed 3/17 Data import, export, and conversion between data types Assignment 5
Fri 3/19 Lab 6
12 Mon 3/22 Good coding practices, debugging strategies, and getting help
13 Wed 3/24 Relational data Assignment 6
Fri 3/26 Lab 7
14 Mon 3/29 Iteration (for loops) and conditional execution part 1
15 Wed 3/31 Iteration (for loops) and conditional execution part 2 Assignment 7
Fri 4/2 Lab 8
16 Mon 4/5 Functions
17 Wed 4/7 Factors in R Assignment 8
Fri 4/9 Lab 9
18 Mon 4/12 Wrapping up and resources for learning more
19 Wed 4/14 Student presentations, wrapping up and looking ahead Assignment 9
Fri 4/16 OPTIONAL: Lab 10 - Bring your own project