Day 1: Introduction to “Reproducible Data Analysis”

Instructor: Joel Nitta

Image of Joel Nitta in field

  • Born and raised in California

  • Fourth-generation Japanese-American

  • First came to Japan as a high school exchange student

Image of California map

Ice-breaker

  • Answer the question: “Why are you interested in data analysis?”

  • Introduce yourself and discuss with your neighbor

Data analysis image (source: https://www.odelama.com/data-analysis/)

What is data analysis?

  • Obtaining insight from data

  • Important for many careers (academic and industry)

According to the U.S. Bureau of Labor Statistics, employment of data scientists is projected to grow 35% from 2022 to 2032, much faster than the average for all occupations.

Why programming?

Who has used Excel? Who has used a programming language?

What are the advantages and disadvantages of each for data analysis?

  • Discuss with your neighbor

Why programming? (cont’d)

  • Programming allows you to record every step of data analysis
    • This means you can repeat your analysis!

Programming takes some time to get used to, but eventually you will feel more comfortable with it because you can retrace your steps and have confidence in your results.
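As a minimal sketch of what “recording every step” can look like, the short R script below loads a dataset, filters it, and summarizes it. It uses `mtcars`, a dataset built into R, purely for illustration; the variable names and the particular steps are assumptions, not part of the course material.

```r
# Step 1: load the data (mtcars ships with R, so nothing needs to be downloaded)
cars <- mtcars

# Step 2: filter the data, keeping only cars with 4 or 6 cylinders
cars_small <- cars[cars$cyl %in% c(4, 6), ]

# Step 3: summarize, computing mean fuel efficiency (miles per gallon)
# for each cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_small, FUN = mean)

# Step 4: report the result
print(mpg_by_cyl)
```

Because every step is written down, running the script again repeats exactly the same analysis. The same steps done by hand in a spreadsheet would leave no such record.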

Why reproducibility?

When might you want to repeat an analysis? Why?

  • Discuss with your neighbor

Why reproducibility? (cont’d)

  • If new data comes in and you need to update an analysis

  • If you want to double-check your own results

  • If you want to repeat somebody else’s analysis

  • If you switch between different projects and can’t remember exactly what you were doing

Goals of this class

The goal of this class is to learn the fundamentals of reproducible data analysis by doing it yourself.

By the end of the course, you will be able to:

  • load, clean, and visualize data using R
  • track changes to code using Git and GitHub
  • write a reproducible report using Quarto

Expectations of this class

  • I expect you to participate in discussions

  • I expect you to ask questions

Language of this class

  • This class is conducted in English

  • However, you may ask questions in Japanese, and I will explain in Japanese if needed

Homework assignments

  • There will be a homework assignment on GitHub for each class, starting next week.

  • You submit the assignment by making a commit in Git (more about this on Day 2)

Final project

  • You will need to analyze a dataset of your own choosing for your final project, due 2025-07-30 11:59 PM

  • The last homework assignment is due 2025-07-16 11:59 PM, so you have 2 weeks to work on the final project

Schedule

  • Day 1 (2025-06-12): Introduction: Why code? Why reproducibility?

  • Day 2 (2025-06-19): Git and GitHub

  • Day 3 (2025-06-26): Basic usage of R and RStudio

  • Day 4 (2025-07-03): Data loading and tidying with tidyverse

Schedule (cont’d)

  • Day 5 (Media Day): Joining data

  • Day 6 (2025-07-10): Data visualization with ggplot2

  • Day 7 (2025-07-17): Writing documents with Quarto

  • Day 8 (2025-07-24): Quarto, part II

Grades

  • Homework (4 assignments) 50%
  • Final report 50%

No late submissions allowed

Only your three highest-scoring homework assignments count toward your grade, so you can miss one assignment (for example, if you forget to submit it) without penalty.
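The grading scheme can be sketched as a short calculation in R. This is only an illustration: it assumes each assignment and the final report are scored from 0 to 100 and that the three counted assignments are weighted equally, neither of which is stated in the course policy. All scores are hypothetical.

```r
# Hypothetical scores (0-100) for the four homework assignments;
# 0 stands for an assignment that was not submitted
homework <- c(80, 0, 90, 70)

# Hypothetical final report score (0-100)
final_report <- 85

# Keep only the three highest homework scores
top_three <- sort(homework, decreasing = TRUE)[1:3]

# Homework and the final report each contribute 50% (assumed equal
# weighting of the three counted assignments)
course_grade <- 0.5 * mean(top_three) + 0.5 * final_report
print(course_grade)  # prints [1] 82.5
```

In this example the missed assignment (score 0) is dropped, so it does not lower the homework average.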

Course website and slides

Moodle

  • Assignments (GitHub classroom repos) will be posted on Moodle

  • Check Moodle every week

Office hours

By appointment: contact me at

Questions?

AI

  • Who has used AI (for example, ChatGPT) before?

  • You may use AI for your homework and final project

  • But first you need to know how to use it

AI (cont’d)

  • AI makes statistical predictions about words based on training data (it does not “think”)

  • AI is designed to produce sentences that sound as natural as possible

  • AI may lie to you or make up facts (called “hallucination”; this is especially common when it lacks adequate training data)

AI policies (DOs)

  • Do try by yourself first (without AI)

  • Do ask it detailed, specific questions (prompts)

  • Do double-check the results: does the AI’s code produce the expected result?

  • Do make sure you understand the code that the AI provides

AI policies (DON’Ts)

  • Don’t copy-paste directly from AI for your report.
    • Typing the code yourself will help you remember it and understand what you are doing. Copy-pasting text for a paper is plagiarism.
  • Don’t submit an answer from AI without trying/checking it yourself first
    • The AI could very well be wrong!

Setting up RStudio

Setting up Git

We will follow the instructions for Day 2 to set up Git.