Reproducible Research¶

This is a brief introduction to common practices for reproducible research.
For a more detailed treatment, please audit this free course by Harvard.

Informative variable names¶

Make sure that each variable name is easy to understand and directly reflects its content.
Avoid generic names like test, test1, ...

This will improve the readability of your code.
Bad example

# Analyzing patient data
test <- read.csv("data.csv")
stuff <- test[test$group == "treatment", ]
final <- mean(stuff$bloodpressure)
x <- sd(stuff$bloodpressure)
result2 <- final/x
y <- stuff[stuff$age > 40, ]
z <- mean(y$bloodpressure)

Good example

# Analyzing patient data
patient_data <- read.csv("data.csv")
treatment_group <- patient_data[patient_data$group == "treatment", ]
mean_blood_pressure <- mean(treatment_group$bloodpressure)
blood_pressure_sd <- sd(treatment_group$bloodpressure)
bp_mean_to_sd_ratio <- mean_blood_pressure / blood_pressure_sd
older_patients <- treatment_group[treatment_group$age > 40, ]
mean_bp_older_patients <- mean(older_patients$bloodpressure)

Overcommenting¶

Most of the time, your code makes perfect sense to you ... when you write it. Revisiting it after a year is another story. I cannot stress enough how much time I have wasted figuring out what my past self was thinking. Therefore, always overcomment things!

A tip: Put yourself in the position of someone who will read your code without knowing much about the project.

These comments should:
1. Provide context about the purpose of the analysis
2. Explain the data content and processing decisions
3. Describe what the diagnostics check and why

Bad example

# Read the CSV file
df <- read.csv("clinical_trial.csv")  # Reading the CSV file

# Remove rows with missing values
df <- df[complete.cases(df), ]  # This removes rows with NA

# Calculate BMI
df$bmi <- df$weight / (df$height/100)^2  # BMI formula

# Create linear model
model <- lm(response ~ treatment + age + bmi + gender, data = df)  # Building model

Good example

# Clinical trial analysis examining treatment effect on blood pressure
# Dataset contains: patient demographics, treatment assignment, and outcomes
# Date: 2023-03-15, Author: J. Smith

# Load the trial data
df <- read.csv("clinical_trial.csv")

# Remove subjects with missing data (10% of original sample)
# Decision made with PI to use complete case analysis instead of imputation
df <- df[complete.cases(df), ]

# Calculate BMI - using height in cm converting to meters
# Formula: weight(kg) / height(m)^2
df$bmi <- df$weight / (df$height/100)^2

# Primary analysis: Effect of treatment on blood pressure response
# Adjusting for pre-specified covariates: age, BMI, gender
model <- lm(response ~ treatment + age + bmi + gender, data = df)

# Diagnostics for linear model assumptions
# Checking normality of residuals (needed for exact p-values in small samples)
# Named model_residuals to avoid masking the residuals() function
model_residuals <- residuals(model)
hist(model_residuals, main = "Histogram of Residuals", xlab = "Residual Value")
qqnorm(model_residuals)
qqline(model_residuals)
shapiro_result <- shapiro.test(model_residuals)

# If p > 0.05, there is no evidence against the normality assumption
print(paste("Shapiro-Wilk p-value:", shapiro_result$p.value))
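
A number quoted in a comment, such as the 10% in the good example, can drift out of date when the data change. One option, sketched here under assumptions, is to compute the figure and print it alongside the comment. The data frame `trial_data` is a made-up stand-in for clinical_trial.csv, and all variable names are illustrative.

```r
# Made-up stand-in for the trial data (illustration only)
trial_data <- data.frame(
  weight = c(70, 82, NA, 65, 90),
  height = c(170, 180, 165, NA, 175)
)

# Count rows before and after dropping incomplete cases
n_original <- nrow(trial_data)
complete_data <- trial_data[complete.cases(trial_data), ]
n_removed <- n_original - nrow(complete_data)

# Report the share of removed subjects so the run log documents the decision
percent_removed <- 100 * n_removed / n_original
message(sprintf("Removed %d of %d subjects (%.1f%%) with missing data",
                n_removed, n_original, percent_removed))
```

The printed message then records the actual proportion for each run, while the comment only needs to explain why complete case analysis was chosen.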

Version control¶

If you use Git, make a habit of committing often:
- After each logical change:
  - Complete a function/analysis step
  - Fix a bug
  - Add a new feature
- At a minimum, commit at the end of each work session
- At any point that you think you may want to return to

Most of your version control can be done via Git, but it is still a good idea to add a time stamp to deliverables such as figures and tables.
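
As a minimal sketch of time-stamping deliverables, the base R snippet below puts the current date into each output file name. The helper `timestamped_path()`, the output folder (a temporary directory here), the file names, and the numbers in the table are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper: builds "<dir>/<name>_<YYYY-MM-DD>.<ext>"
timestamped_path <- function(dir, name, ext, date = Sys.Date()) {
  file.path(dir, paste0(name, "_", format(date, "%Y-%m-%d"), ".", ext))
}

# Assumption: deliverables go to a throwaway folder for this demo
output_dir <- file.path(tempdir(), "output")
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# Save a table with the date in its file name (made-up numbers)
summary_table <- data.frame(
  group = c("control", "treatment"),
  mean_bp = c(128.4, 121.9)
)
table_path <- timestamped_path(output_dir, "bp_summary", "csv")
write.csv(summary_table, table_path, row.names = FALSE)

# Save a figure the same way
figure_path <- timestamped_path(output_dir, "bp_barplot", "pdf")
pdf(figure_path)
barplot(summary_table$mean_bp, names.arg = summary_table$group,
        ylab = "Mean blood pressure")
dev.off()
```

Using the ISO date format (year, month, day) means the files sort chronologically by name, which makes it easy to find the latest version of a figure or table.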