Welcome

These slides are available at: https://arcus.github.io/demystifying_r_rstudio_skills_series/session_1.html

  • Use keyboard arrow keys to
    • advance ( → ) and
    • go back ( ← )
  • Type “s” to see speaker notes
  • Type “?” to see other keyboard shortcuts

About Arcus / Your Presenter

Arcus is an initiative by the Research Institute aimed at promoting data discovery and reuse and increasing research reproducibility.

Among the many teams in Arcus, I represent Arcus Education!

Arcus Education

Arcus Education provides data science training to researchers …

(and often this is useful to non-researchers too!).

https://arcus.chop.edu/i-want-to/arcus-education

Email us!

Demystifying R and RStudio

Arcus Education provides “Skills Series” for the entire CHOP community.

This Skills Series is a short, 2-session series aimed at Demystifying R and RStudio!

  • Session 1: Introduction to R/RStudio
  • Session 2: Introduction to Literate Statistical Programming

Session 1 Itinerary

Introduction to R/RStudio

  • R is a programming language created for statistical data analysis
  • Why scripts? Reproducibility and open source data science
  • RStudio is one way to work with R
  • Considerations for working with R and RStudio at CHOP
  • Posit.Cloud

Goals:

  • Be able to describe the difference between R and RStudio
  • Be able to give one advantage of using scripts written in R for data analysis
  • Have a concrete next step for obtaining access to R and RStudio at CHOP

R is a Programming Language

R is a programming language. This is what it looks like:

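# Load the packages this script relies on
library(dplyr)    # provides %>%, select(), mutate(), across()
library(ggplot2)  # provides ggplot(), geom_bar(), ggsave()

# Ingest data from REDCap
# (get_data() is not shown on this slide; it is assumed to be a helper defined elsewhere in the project)
arcus_101_feedback_token <- readr::read_file("secrets/quick_arcus_101_feedback_token.txt")
arcus_101_feedback <- get_data(arcus_101_feedback_token)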

# Get raw data and add the labels back in the correct order, show change over time
arcus_101_feedback_updated <- arcus_101_feedback %>%
  
  # We don't need the "completeness" value
  select(-arcus_101_effectiveness_complete) %>%
  
  # Transform all the "knowledge" questions
  mutate(across(starts_with("knowledge"),
                ~ factor(.x, levels = c("Very little knowledge", 
                                      "Some knowledge", 
                                      "Lots of knowledge", 
                                      "Expert"))),
         
         # Transform all the "opinion" questions (pre)
         opinion_pre = factor(opinion_pre, 
                              levels = c("Largely negative, I didn't think Arcus was useful or helpful to CHOP.", 
                                         "Somewhat negative, I had doubts about how useful or helpful Arcus was to CHOP.", 
                                         "Neutral, I didn't have a strong opinion.",
                                         "Somewhat positive, I believed that Arcus was useful or helpful to CHOP.", 
                                         "Largely positive, I was certain that Arcus was useful or helpful to CHOP.")),
         
         # Transform all the "opinion" questions (post)
         opinion_post = factor(opinion_post, 
                              levels = c("Largely negative, I don't think Arcus is useful or helpful to CHOP.", 
                                         "Somewhat negative, I have doubts about how useful or helpful Arcus is to CHOP.", 
                                         "Neutral, I don't have a strong opinion.",
                                         "Somewhat positive, I believe that Arcus is useful or helpful to CHOP.", 
                                         "Largely positive, I am certain that Arcus is useful or helpful to CHOP.")),
         
         # Measure change (pre to post)
         knowledge_change = as.numeric(knowledge_post) - as.numeric(knowledge_pre),
         opinion_change = as.numeric(opinion_post) - as.numeric(opinion_pre)
  )

# Make a bar chart showing pre-intervention knowledge

ggplot(arcus_101_feedback_updated) +
  geom_bar(aes(x = knowledge_pre)) +
  scale_x_discrete(drop = FALSE) +
  labs(title = "Knowledge of Arcus Before 101") +
  xlab("")

# Save this graph for later

ggsave("figures/pre_101_knowledge.png")

R is a Programming Language

R is a statistical programming language.

  • Like other programming languages (JavaScript, Python, C++):

    • R has specific syntax rules
    • R gives error messages that you might have to search online for
    • R has online communities that can help you learn (Stack Overflow, etc.)
  • Unlike other programming languages:

    • R was written specifically for statistical data analysis
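To make "written for statistical data analysis" a little more concrete, here is a minimal sketch using only what comes with a standard R installation (no extra packages). The variable names (heights_cm, heights_m, fit) are made up for illustration; mtcars is a small example dataset that ships with R.

```r
# Arithmetic works on a whole vector of numbers at once
heights_cm <- c(150, 160, 170, 180)
heights_m <- heights_cm / 100

# Common statistics are built in
average_height <- mean(heights_cm)
spread_of_heights <- sd(heights_cm)
print(average_height)
print(spread_of_heights)

# So is model fitting: a linear regression of fuel efficiency (mpg)
# on car weight (wt), using the built-in mtcars dataset
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

Nothing here needed to be installed or imported first, which is part of what makes R feel focused on statistics.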

Why Does This Matter?

Which is a better tool?

  • A multi-tool (like a Swiss Army knife)
  • A mostly mono-task tool (like a cherry pitter)

It depends! R is more focused / narrow… which can be good for beginners.

“Stainless 2CR Multi-tool”, Santeri Viinamäki, CC BY-SA 4.0, via Wikimedia Commons

Why Not Just Use Excel?

“Why even write code? Point and click is so much easier!”

These can be useful:

  • Excel
  • Point-and-click statistical analysis software (e.g., SPSS, SAS)

But they can also be:

  • Very manual / lots of steps you have to explain
  • Costly

One potential answer? Scripts!

Used with permission by Ed Himelblau. See his work or subscribe to his newsletter at https://www.himelblau.com/

Scripts

In data analysis, scripts are a series of computer code instructions that handle things like:

  • Ingesting data
  • Preparing data
  • Doing descriptive statistics
  • Conducting statistical tests
  • Creating models
  • Saving interim datasets
  • Creating data visualizations
  • Communicating information
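As a rough illustration of how several of the steps above can live in one short script, here is a sketch that uses only base R and the built-in ToothGrowth dataset. The object names and output file names are invented for this example, and the files are written to a temporary folder so nothing on your computer is overwritten.

```r
# Ingest: ToothGrowth ships with R, so no file is needed in this sketch
tooth <- ToothGrowth

# Prepare: treat dose as a category rather than a number
tooth$dose <- factor(tooth$dose)

# Describe: mean tooth length for each supplement type
mean_len <- aggregate(len ~ supp, data = tooth, FUN = mean)
print(mean_len)

# Test: compare tooth length between the two supplement types
test_result <- t.test(len ~ supp, data = tooth)
print(test_result)

# Save an interim dataset
out_csv <- file.path(tempdir(), "tooth_prepared.csv")
write.csv(tooth, out_csv, row.names = FALSE)

# Visualize: save a box plot as a PDF file
out_pdf <- file.path(tempdir(), "tooth_boxplot.pdf")
pdf(out_pdf)
boxplot(len ~ supp, data = tooth,
        main = "Tooth Length by Supplement",
        xlab = "Supplement", ylab = "Length")
dev.off()
```

Every step is written down, so anyone with the script can rerun the whole analysis from start to finish.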

Why Scripts?

In science, we’ve been hearing a lot about the “reproducibility crisis”.

It’s hard to re-do other people’s analyses… both for checking their work and for trying it in a new situation. This is bad for science!

One of the most important reasons to learn R is to improve the reproducibility of your work. One of the most powerful aspects of working in the R environment is that it makes it straightforward to produce reproducible data analyses, which will reduce risk and make “future you” much happier.
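One small, hedged illustration of what "reproducible" can mean in practice: when every step is in a script, running it again gives the same answer, even when random numbers are involved. The function name run_analysis and the seed value below are made up for this sketch.

```r
# A tiny "analysis" that involves random numbers
run_analysis <- function() {
  set.seed(2025)  # fixing the seed makes the random draws repeatable
  sample_values <- rnorm(20, mean = 50, sd = 10)
  mean(sample_values)
}

# Run the same analysis twice
first_run <- run_analysis()
second_run <- run_analysis()

# The two runs agree exactly
print(identical(first_run, second_run))
```

A colleague (or "future you") running the same script would get the same number, which is much harder to guarantee with a series of manual point-and-click steps.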

Used with permission by Ed Himelblau. See his work or subscribe to his newsletter at https://www.himelblau.com/

R vs RStudio

R: Programming language for data analysis

RStudio: Integrated development environment (IDE)

Using R Alone vs With RStudio

The R App

RStudio

RStudio: Runs Lots of Places

Posit.cloud

Hosted by Posit (in the cloud)

Posit Workbench

Hosted by a company, on prem or in the cloud

RStudio Desktop

Installed on your computer

Working with R at CHOP

  • We work with regulated data
  • IRB protocols and other regulations might override what I say here!
  • You can work with R and RStudio on a CHOP device
    • You will probably have to request an install via a service ticket
    • You’ll need a cost center / grant / project number (even though there’s no cost)
    • Yes, this software has been used at CHOP before
    • You’ll need to give a reason (“I need to analyze data for my job…”)
    • You’ll need to provide the MAC address of your computer

What To Get Installed

What I recommend you install / get installed on your own CHOP device:

  • R – the language we use to clean, analyze, and visualize data
  • RStudio Desktop – an IDE for writing R
  • Git – version control software that will allow you to easily get the latest version of our course materials and will also be helpful for tracking changes in your own projects
  • GitHub Desktop – a helper, or “client” software that makes working with Git easier
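Once things are installed, one possible way to check from inside R is sketched below. It only reports what it finds and changes nothing; the variable names are illustrative, and whether Git shows up can depend on how it was installed on your device.

```r
# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Sys.which() returns the path to a program, or an empty string if it is not found
git_path <- unname(Sys.which("git"))
git_found <- nzchar(git_path)

if (git_found) {
  message("Git found at: ", git_path)
} else {
  message("Git was not found on this computer's search path.")
}
```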

Researchers ONLY at CHOP

(You’ll need a research cost center to refer to for most of these)

Posit.Cloud (be careful!)

You can use R and RStudio through online services like https://posit.cloud.

Posit.cloud is a great place for learning or practice with public datasets, BUT it is not a safe or compliant place to put your regulated data.

Q&A / Was This Effective?

As you can tell from our data analysis, we like to measure our effectiveness.

Goals:

  • Be able to describe the difference between R and RStudio
  • Be able to give one advantage of using scripts written in R for data analysis
  • Have a concrete next step for obtaining access to R and RStudio at CHOP

Homework

Totally optional additional learning, if you want it:

Module: Learning to Learn Data Science
Description: Discover how learning data science is different than learning other subjects.
Duration: 20 min

Module: Reproducibility, Generalizability, and Reuse
Description: This module provides learners with an approachable introduction to the concepts and impact of research reproducibility, generalizability, and data reuse, and how technical approaches can help make these goals more attainable.
Duration: 60 min

Acknowledgements

  • R User Group leadership, especially Stephan Kadauke
  • Former learners at CHOP, Penn, Drexel, University of Botswana
  • DART study participants and pilots around the world
  • Ed Himelblau
  • You!

Next Session

Literate Statistical Programming

Friday, March 7, 2025 at 12 pm (sign-up link)

Tuesday, March 11, 2025 at 12 pm (sign-up link)

  • Review of R and RStudio
  • Literate programming is a programming paradigm
  • Research reproducibility reminders
  • Quarto documents
  • Next steps