Installing R & RStudio EARN Talk

Authors

Emma Cohn & Daniel Perez

Welcome to the Preparing for Analysis with R landing page!

This training and code workflow was originally delivered as an EARN Talk on July 29th, 2025.

Missed this talk? See the recording here: Preparing for Analysis with R: A guided tutorial for installing R and RStudio.

Passcode: Prep4R2025!

Presentation slides

Workflow example

This example demonstrates a simple data analysis workflow.

In short, we will load a dataset from a .csv file, describe and analyze it with help from the tidyverse collection of packages, and export a final analysis to a new .csv file.

Load a library

Recall that you can install R packages using the install.packages() command.

install.packages('tidyverse')

After installing a package, you can load it using the library() command.

# Loading a library 
library(tidyverse)
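If you are unsure whether a package is already installed, a quick check can save you from reinstalling it. The sketch below is one possible approach, not the only one: missing_packages() is a hypothetical helper name (it is not part of R or the tidyverse) built on base R's requireNamespace().

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Check for the tidyverse without installing anything
to_install <- missing_packages("tidyverse")

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "),
          ". Run install.packages() on these before calling library().")
} else {
  message("All packages are installed and ready to load.")
}
```

The check only reports what is missing, so you stay in control of when anything gets installed.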

Read data from a csv file

Download the dataset at this link: counties_per_capita_income.csv
Place the dataset in your project's root directory. In this case, our data lives in a folder named “data” within this root directory. You can name your folder whatever you’d like!
Use the read.csv() function to load the data into R.

counties_income <- read.csv("data/counties_per_capita_income.csv")
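A common stumbling block at this step is a "cannot open file" error, which usually means R is looking in a different folder than you expect. The sketch below, which assumes the same "data" folder name used above, prints the current working directory and only reads the file if it can be found.

```r
# Build the path in a way that works on Windows, macOS, and Linux
csv_path <- file.path("data", "counties_per_capita_income.csv")

# Show the folder R is currently working in; relative paths start here
print(getwd())

if (file.exists(csv_path)) {
  counties_income <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path,
          ". Check that it sits inside a 'data' folder in your working directory.")
}
```

If the message appears, compare the printed working directory with the location where you saved the file.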

Descriptive analysis

Some useful commands to describe your dataset:

# Print the number of rows and columns in your table 
dim(counties_income)

[1] 3231    8

# View the first 6 rows of your dataset (the default for head())
head(counties_income)

             county     states statefips   pci household_income family_income
1   New York County   New York        36 76592            69659         86553
2         Arlington   Virginia        51 62018           103208        139244
3 Falls Church City   Virginia        51 59088           120000        152857
4             Marin California         6 56791            90839        117357
5       Santa Clara California         6 56248           124055        124055
6   Alexandria City   Virginia        51 54608            85706        107511
  population num_of_households
1    1628706            759460
2     214861             94454
3      12731              5020
4     254643            102912
5    1927852            640215
6     143684             65369

# List the variable names of your dataset
names(counties_income)

[1] "county"            "states"            "statefips"        
[4] "pci"               "household_income"  "family_income"    
[7] "population"        "num_of_households"

# View a transposed preview of your data: one line per column, with its type and first values
glimpse(counties_income)

Rows: 3,231
Columns: 8
$ county            <chr> "New York County", "Arlington", "Falls Church City",…
$ states            <chr> "New York", "Virginia", "Virginia", "California", "C…
$ statefips         <chr> "36", "51", "51", "6", "6", "51", "8", "35", "51", "…
$ pci               <int> 76592, 62018, 59088, 56791, 56248, 54608, 51814, 510…
$ household_income  <int> 69659, 103208, 120000, 90839, 124055, 85706, 72745, …
$ family_income     <int> 86553, 139244, 152857, 117357, 124055, 107511, 93981…
$ population        <int> 1628706, 214861, 12731, 254643, 1927852, 143684, 171…
$ num_of_households <int> 759460, 94454, 5020, 102912, 640215, 65369, 7507, 75…

# Generate summaries for each of your variables
summary(counties_income)

    county             states           statefips              pci       
 Length:3231        Length:3231        Length:3231        Min.   : 5441  
 Class :character   Class :character   Class :character   1st Qu.:19674  
 Mode  :character   Mode  :character   Mode  :character   Median :22782  
                                                          Mean   :23268  
                                                          3rd Qu.:26136  
                                                          Max.   :76592  
 household_income family_income      population      num_of_households
 Min.   : 11680   Min.   : 13582   Min.   :     17   Min.   :      6  
 1st Qu.: 37622   1st Qu.: 46942   1st Qu.:  11232   1st Qu.:   4302  
 Median : 43853   Median : 54461   Median :  25975   Median :   9792  
 Mean   : 45220   Mean   : 55752   Mean   :  97552   Mean   :  36130  
 3rd Qu.: 50854   3rd Qu.: 62850   3rd Qu.:  65806   3rd Qu.:  25014  
 Max.   :124055   Max.   :152857   Max.   :9893481   Max.   :3230383

# Generate a summary of one variable in your dataset using the $ operator
summary(counties_income$household_income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11680   37622   43853   45220   50854  124055

Simple summary statistics

Now that we’ve loaded our dataset, let’s run a simple analysis.

The following code block uses the summarize() function to “collapse” the counties_income dataset into a smaller table of summary statistics. Specifically, we calculate the median household income and a count of counties across the entire dataset.

In this example:

median(household_income) calculates the median income.
n() is a function that counts the number of observations in our sample.

summarize(counties_income,
          med_hhinc = median(household_income),
          n = n())

  med_hhinc    n
1     43853 3231

We can also create summary statistics by specific groupings. For example, if we want to calculate the median household income for each state, we can use the .by argument inside summarize(). Note that .by requires dplyr version 1.1.0 or later.

summarize(counties_income, 
          med_hhinc = median(household_income), 
          n = n(), 
          .by = states)

                     states med_hhinc   n
1                  New York   50467.5  62
2                  Virginia   47549.0 134
3                California   53532.0  58
4                  Colorado   46793.5  64
5                New Mexico   37933.0  33
6                New Jersey   70912.0  21
7                     Texas   44150.5 254
8                  Maryland   66587.5  24
9               Connecticut   68960.5   8
10            Massachusetts   63225.0  14
11                   Alaska   62519.0  29
12                   Hawaii   62052.0   5
13                  Wyoming   55347.0  23
14                     Utah   49506.0  29
15             Pennsylvania   46044.0  67
16                Wisconsin   48709.5  72
17                Tennessee   37618.0  95
18             Rhode Island   71238.0   5
19                     Ohio   44654.5  88
20               Washington   47195.0  39
21                  Indiana   46746.0  92
22                   Kansas   45830.0 105
23             North Dakota   50274.0  53
24                 Illinois   47385.0 102
25            New Hampshire   56904.5  10
26                Minnesota   50030.0  87
27                  Florida   43413.0  67
28                  Georgia   37487.0 159
29                 Michigan   41759.0  83
30             South Dakota   47043.5  66
31                     Iowa   48601.0  99
32                    Idaho   42072.0  44
33                 Missouri   39519.0 115
34           North Carolina   41030.5 100
35                   Nevada   52101.0  17
36                 Kentucky   38776.0 120
37                  Alabama   36447.0  67
38                  Vermont   52470.0  14
39              Mississippi   33562.0  82
40                  Montana   43368.0  56
41                    Maine   44327.5  16
42                   Oregon   43524.5  36
43           South Carolina   39271.0  46
44                 Delaware   55149.0   3
45                 Nebraska   45610.0  93
46                Louisiana   40792.0  64
47            West Virginia   37895.0  55
48                 Oklahoma   42751.0  77
49                 Arkansas   35153.0  75
50                  Arizona   42987.0  15
51      U.S. Virgin Islands   38232.0   3
52              Puerto Rico   17434.0  78
53                     Guam   48274.0   1
54 Northern Mariana Islands   23125.0   3
55           American Samoa   24027.5   4

The .by = states argument tells R to calculate the summary statistics separately for each unique value in the states column.
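If you read other people's R code, you will often see group_by() used for the same purpose. The sketch below compares the two approaches on the six counties shown in the head() output earlier, so the numbers are a small sample rather than the full dataset. The object names counties_sample, with_by, and with_group_by are chosen here for illustration.

```r
library(dplyr)

# The first six counties from the head() output above
counties_sample <- data.frame(
  county = c("New York County", "Arlington", "Falls Church City",
             "Marin", "Santa Clara", "Alexandria City"),
  states = c("New York", "Virginia", "Virginia",
             "California", "California", "Virginia"),
  household_income = c(69659, 103208, 120000, 90839, 124055, 85706)
)

# Grouping for a single operation with .by
with_by <- summarize(counties_sample,
                     med_hhinc = median(household_income),
                     n = n(),
                     .by = states)

# The same calculation with group_by(), which also works in older dplyr versions
with_group_by <- counties_sample |>
  group_by(states) |>
  summarize(med_hhinc = median(household_income), n = n())

print(with_by)
print(with_group_by)
```

Both versions give the same statistics. One difference to be aware of: .by keeps the states in the order they first appear in the data, while group_by() sorts them.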

Creating new variables with mutate

The mutate() function adds new variables to a dataset without collapsing it, unlike summarize(), which reduces the dataset to summary rows.

In the example below, we use mutate() to create a new column called rank, which ranks counties from highest to lowest based on their per capita income (pci).

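rank(-pci) ranks the values of pci in descending order. By default, rank() gives rank 1 to the smallest value, so negating pci with the minus sign gives rank 1 to the county with the highest income.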
The result is a new dataset ranked_income where each county keeps its original data, now with an added rank column.

ranked_income <- mutate(counties_income, rank = rank(-pci))

# Print the first 6 rows of our new dataframe/table
head(ranked_income)

             county     states statefips   pci household_income family_income
1   New York County   New York        36 76592            69659         86553
2         Arlington   Virginia        51 62018           103208        139244
3 Falls Church City   Virginia        51 59088           120000        152857
4             Marin California         6 56791            90839        117357
5       Santa Clara California         6 56248           124055        124055
6   Alexandria City   Virginia        51 54608            85706        107511
  population num_of_households rank
1    1628706            759460    1
2     214861             94454    2
3      12731              5020    3
4     254643            102912    4
5    1927852            640215    5
6     143684             65369    6
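One detail worth knowing about rank(): when two values are tied, the default is to give each the average of the ranks they span, which can produce fractional ranks. The sketch below uses a small made-up vector (these are not values from the dataset) to show the default next to ties.method = "min", which gives tied values the same whole-number rank.

```r
# Made-up income values for illustration; two of them are tied
incomes <- c(10, 20, 20, 30)

# Default: tied values share the average of their ranks
rank_default <- rank(-incomes)
print(rank_default)

# ties.method = "min": tied values share the smallest rank in their group
rank_min <- rank(-incomes, ties.method = "min")
print(rank_min)
```

dplyr also offers min_rank() and desc() for this kind of ranking, for example min_rank(desc(pci)) inside mutate().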

How to look up a function’s arguments

Not sure what a function does or what arguments it accepts? You can use the help command in R to look it up!

Type a question mark (?) followed by the function name in your console or code chunk:

?mutate

?summarize

# alternatively, you can type
help(summarize)

This opens the help file for the function, which includes a short description of what the function does, a list of arguments you can use, and examples showing how to apply it. This is a great habit to adopt when trying new functions!
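When you only need a quick reminder of a function's arguments, you may not want to open the full help page. As a small sketch, base R's args() and formals() can list them directly in the console. The variable name arg_names is chosen here for illustration.

```r
# Print a function's arguments and their default values
print(args(read.csv))

# Collect just the argument names as a character vector
arg_names <- names(formals(read.csv))
print(arg_names)
```

This shows, for instance, that read.csv() expects the file path as its first argument and treats the first row as column names by default (header = TRUE).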

Using pipes to create multi-line commands

In R, the pipe operator |> (built into R version 4.1 and later) allows you to chain together multiple steps of a data transformation in a readable, top-to-bottom format. Each step passes its result to the next command as that command's first argument. This makes your code easier to read and avoids creating lots of intermediate objects.

Here’s an example using the |> pipe to filter and summarize data across multiple lines:

# Executing a multi-line command using pipes 
cleaned_data <- counties_income |>
    # Here we use the filter() command to restrict our sample to a few states
    filter(states %in% c('New York', 'California', 'North Carolina')) |>
    # Use summarise() to compute state-level statistics
    # (Note: summarize() and summarise() are interchangeable. British spelling fans, rejoice!)
    summarise(mean_hhinc = mean(household_income), 
              n = n(), 
              .by = states)

What this code does:

Starts with the counties_income dataset
Filters it to only include counties in New York, California, and North Carolina
Summarizes the data by state, calculating:
1. The mean household income
2. The number of counties
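To see what the pipe is doing behind the scenes, it can help to write the same steps without it. The sketch below runs a shortened version of the pipeline on the six counties from the head() output earlier (a small sample, so the numbers differ from the full dataset) and compares it with the equivalent nested call. The names counties_sample, piped, and nested are chosen here for illustration.

```r
library(dplyr)

# The first six counties from the head() output above
counties_sample <- data.frame(
  states = c("New York", "Virginia", "Virginia",
             "California", "California", "Virginia"),
  household_income = c(69659, 103208, 120000, 90839, 124055, 85706)
)

# Piped version: read top to bottom
piped <- counties_sample |>
  filter(states %in% c("New York", "California")) |>
  summarise(mean_hhinc = mean(household_income),
            n = n(),
            .by = states)

# Nested version: the same steps, read from the inside out
nested <- summarise(
  filter(counties_sample, states %in% c("New York", "California")),
  mean_hhinc = mean(household_income),
  n = n(),
  .by = states
)

# Both approaches produce the same table
print(identical(piped, nested))
```

The nested version works, but the order you read it in is the reverse of the order the steps run in, which is why the piped form tends to be easier to follow.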

Exporting your results

Once you’ve created a cleaned or summarized dataset, you may want to save it to a file. The write.csv() function lets you export your data to a .csv file that you can open in Excel or share with others.

# Writing a csv to a path
write.csv(cleaned_data, "output/state_incomes.csv")

This line:

Writes the cleaned_data dataset to a file called state_incomes.csv
Saves it in the output/ folder within your project directory

Note: Make sure the "output" folder already exists in your project. R will return an error if the folder doesn’t exist.
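If you would rather have R create the folder for you, the sketch below shows one way to do it. It makes a few assumptions for the sake of a self-contained example: a temporary folder stands in for your project directory, and the small state_medians table is built by hand from three rows of the state summary shown earlier on this page. It also sets row.names = FALSE, which stops write.csv() from adding an extra column of row numbers to the file.

```r
# A temporary folder stands in for your project directory in this sketch
project_dir <- tempdir()
output_dir <- file.path(project_dir, "output")

# Create the folder only if it does not exist yet
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

# Three state medians copied from the summary table earlier on this page
state_medians <- data.frame(
  states = c("New York", "California", "North Carolina"),
  med_hhinc = c(50467.5, 53532.0, 41030.5)
)

output_path <- file.path(output_dir, "state_incomes.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(state_medians, output_path, row.names = FALSE)

# Read the file back to confirm it was written as expected
read_back <- read.csv(output_path)
print(read_back)
```

In your own project, you would replace tempdir() with your project folder, or simply use the relative path "output" as in the example above.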

Additional resources

As you embark on your R journey, check out the Other Resources page for additional materials!

Happy coding!