install.packages('tidyverse')
Installing R & RStudio EARN Talk
Welcome to the Preparing for Analysis with R landing page!
This training and code workflow was originally delivered as an EARN Talk on July 29th, 2025.
Missed this talk? See the recording here: Preparing for Analysis with R: A guided tutorial for installing R and RStudio.
Passcode: Prep4R2025!
Workflow example
This example demonstrates a simple data analysis workflow.
In short, we will use the Tidyverse package to load a dataset from a .csv file, describe and analyze the dataset, and export a final analysis to a new .csv file.
Load a library
Recall that you can install R packages using the install.packages() command.
After installing a package, you can load it using the library() command.
# Loading a library
library(tidyverse)
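If you are not sure whether a package is already installed, one common pattern is to check first and install only when needed. This is a minimal sketch, not part of the original talk; it assumes you have an internet connection and a CRAN mirror configured (RStudio sets one for you by default).

```r
# Install tidyverse only if it is not already available.
# requireNamespace() returns FALSE (instead of stopping with an error)
# when the package cannot be found.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package for the current R session
library(tidyverse)
```

You only need to install a package once per computer, but you need to call library() in every new R session.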
Read data from a csv file
Download the dataset at this link: counties_per_capita_income.csv
Place the dataset in your root directory. In this case, our data lives in a folder named “data” within this root directory. You can name your folder whatever you’d like!
Use the read.csv() function to load the data into R.
counties_income <- read.csv("data/counties_per_capita_income.csv")
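By default, read.csv() guesses the type of each column. That matters for code-like columns such as FIPS codes, where a value like "06" would be read as the number 6. The sketch below uses a tiny made-up file (not the real dataset) written to a temporary location to show how the colClasses argument can keep such a column as text.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("county,statefips,pci",
             "Marin,06,56791",
             "Arlington,51,62018"), tmp)

# Default behavior: statefips is guessed to be an integer, so "06" becomes 6
default_read <- read.csv(tmp)
class(default_read$statefips)

# With colClasses, statefips stays a character string and keeps its leading zero.
# Columns not named in colClasses are still guessed as usual.
typed_read <- read.csv(tmp, colClasses = c(statefips = "character"))
typed_read$statefips
```

Whether you need this depends on how your own file stores its codes, so treat it as an option rather than a required step.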
Descriptive analysis
Some useful commands to describe your dataset
# Print the number of rows and columns in your table
dim(counties_income)
[1] 3231 8
# View the first 6 rows of your dataset (head() shows 6 rows by default)
head(counties_income)
county states statefips pci household_income family_income
1 New York County New York 36 76592 69659 86553
2 Arlington Virginia 51 62018 103208 139244
3 Falls Church City Virginia 51 59088 120000 152857
4 Marin California 6 56791 90839 117357
5 Santa Clara California 6 56248 124055 124055
6 Alexandria City Virginia 51 54608 85706 107511
population num_of_households
1 1628706 759460
2 214861 94454
3 12731 5020
4 254643 102912
5 1927852 640215
6 143684 65369
# List the variable names of your dataset
names(counties_income)
[1] "county" "states" "statefips"
[4] "pci" "household_income" "family_income"
[7] "population" "num_of_households"
# View a transposed table of your data
glimpse(counties_income)
Rows: 3,231
Columns: 8
$ county <chr> "New York County", "Arlington", "Falls Church City",…
$ states <chr> "New York", "Virginia", "Virginia", "California", "C…
$ statefips <chr> "36", "51", "51", "6", "6", "51", "8", "35", "51", "…
$ pci <int> 76592, 62018, 59088, 56791, 56248, 54608, 51814, 510…
$ household_income <int> 69659, 103208, 120000, 90839, 124055, 85706, 72745, …
$ family_income <int> 86553, 139244, 152857, 117357, 124055, 107511, 93981…
$ population <int> 1628706, 214861, 12731, 254643, 1927852, 143684, 171…
$ num_of_households <int> 759460, 94454, 5020, 102912, 640215, 65369, 7507, 75…
# Generate summaries for each of your variables
summary(counties_income)
county states statefips pci
Length:3231 Length:3231 Length:3231 Min. : 5441
Class :character Class :character Class :character 1st Qu.:19674
Mode :character Mode :character Mode :character Median :22782
Mean :23268
3rd Qu.:26136
Max. :76592
household_income family_income population num_of_households
Min. : 11680 Min. : 13582 Min. : 17 Min. : 6
1st Qu.: 37622 1st Qu.: 46942 1st Qu.: 11232 1st Qu.: 4302
Median : 43853 Median : 54461 Median : 25975 Median : 9792
Mean : 45220 Mean : 55752 Mean : 97552 Mean : 36130
3rd Qu.: 50854 3rd Qu.: 62850 3rd Qu.: 65806 3rd Qu.: 25014
Max. :124055 Max. :152857 Max. :9893481 Max. :3230383
# Generate a summary of one variable in your dataset using the $ operator
summary(counties_income$household_income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11680 37622 43853 45220 50854 124055
Simple summary statistics
Now that we’ve loaded our dataset, let’s run a simple analysis.
The following code block uses the summarize() function to “collapse” variables from the counties_income dataset into a smaller table with summary statistics. Specifically, we calculate the median household income and a count of counties across the whole dataset.
In this example, median(household_income) calculates the median household income, and n() is a function that counts the number of observations in our sample.
summarize(counties_income,
med_hhinc = median(household_income),
n=n())
med_hhinc n
1 43853 3231
We can also create summary statistics by specific groupings. For example, if we want to calculate the median household income for each state, we can use the .by = argument inside summarize():
summarize(counties_income,
med_hhinc = median(household_income),
n = n(),
.by = states)
states med_hhinc n
1 New York 50467.5 62
2 Virginia 47549.0 134
3 California 53532.0 58
4 Colorado 46793.5 64
5 New Mexico 37933.0 33
6 New Jersey 70912.0 21
7 Texas 44150.5 254
8 Maryland 66587.5 24
9 Connecticut 68960.5 8
10 Massachusetts 63225.0 14
11 Alaska 62519.0 29
12 Hawaii 62052.0 5
13 Wyoming 55347.0 23
14 Utah 49506.0 29
15 Pennsylvania 46044.0 67
16 Wisconsin 48709.5 72
17 Tennessee 37618.0 95
18 Rhode Island 71238.0 5
19 Ohio 44654.5 88
20 Washington 47195.0 39
21 Indiana 46746.0 92
22 Kansas 45830.0 105
23 North Dakota 50274.0 53
24 Illinois 47385.0 102
25 New Hampshire 56904.5 10
26 Minnesota 50030.0 87
27 Florida 43413.0 67
28 Georgia 37487.0 159
29 Michigan 41759.0 83
30 South Dakota 47043.5 66
31 Iowa 48601.0 99
32 Idaho 42072.0 44
33 Missouri 39519.0 115
34 North Carolina 41030.5 100
35 Nevada 52101.0 17
36 Kentucky 38776.0 120
37 Alabama 36447.0 67
38 Vermont 52470.0 14
39 Mississippi 33562.0 82
40 Montana 43368.0 56
41 Maine 44327.5 16
42 Oregon 43524.5 36
43 South Carolina 39271.0 46
44 Delaware 55149.0 3
45 Nebraska 45610.0 93
46 Louisiana 40792.0 64
47 West Virginia 37895.0 55
48 Oklahoma 42751.0 77
49 Arkansas 35153.0 75
50 Arizona 42987.0 15
51 U.S. Virgin Islands 38232.0 3
52 Puerto Rico 17434.0 78
53 Guam 48274.0 1
54 Northern Mariana Islands 23125.0 3
55 American Samoa 24027.5 4
This tells R to calculate the summary statistics separately for each unique value in the states column.
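The .by argument was added in dplyr 1.1.0, so it may not be available on older installations. In that case, group_by() followed by summarize() gives the same per-group statistics. The sketch below uses a small made-up data frame (toy, standing in for counties_income) to compare the two forms.

```r
library(dplyr)

# Made-up toy data standing in for counties_income
toy <- data.frame(
  states = c("A", "A", "B"),
  household_income = c(40000, 60000, 30000)
)

# Per-group summary using .by (requires dplyr 1.1.0 or later)
by_arg <- summarize(toy,
                    med_hhinc = median(household_income),
                    n = n(),
                    .by = states)

# The same summary using group_by(), which also works in older dplyr versions
by_group <- toy |>
  group_by(states) |>
  summarize(med_hhinc = median(household_income), n = n())

by_arg
by_group
```

One difference to keep in mind: group_by() sorts the output by the grouping column, while .by keeps groups in the order they first appear in the data.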
Creating new variables with mutate
The mutate() function adds new variables to a dataset without collapsing it, unlike summarize(), which reduces the dataset to summary rows.
In the example below, we use mutate() to create a new column called rank, which ranks counties from highest to lowest based on their per capita income (pci).
rank(-pci) ranks the values of pci in descending order: rank() gives rank 1 to the smallest value, so negating pci gives rank 1 to the county with the highest per capita income.
The result is a new dataset, ranked_income, where each county keeps its original data, now with an added rank column.
ranked_income <- mutate(counties_income, rank = rank(-pci))
# Print the first 6 rows of our new dataframe/table
head(ranked_income)
county states statefips pci household_income family_income
1 New York County New York 36 76592 69659 86553
2 Arlington Virginia 51 62018 103208 139244
3 Falls Church City Virginia 51 59088 120000 152857
4 Marin California 6 56791 90839 117357
5 Santa Clara California 6 56248 124055 124055
6 Alexandria City Virginia 51 54608 85706 107511
population num_of_households rank
1 1628706 759460 1
2 214861 94454 2
3 12731 5020 3
4 254643 102912 4
5 1927852 640215 5
6 143684 65369 6
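When two counties have the same pci, rank() gives them the average of the ranks they would have occupied, which can produce values like 1.5. If you would rather have whole-number ranks, dplyr's min_rank() combined with desc() is one alternative. The sketch below uses made-up values to show the difference.

```r
library(dplyr)

# Made-up toy data with a tie in pci (counties X and Y)
toy <- data.frame(
  county = c("W", "X", "Y", "Z"),
  pci = c(30000, 50000, 50000, 20000)
)

ranked <- mutate(toy,
                 rank_avg = rank(-pci),          # tied rows share the average rank
                 rank_min = min_rank(desc(pci))) # tied rows share the lowest rank

ranked
```

Here X and Y receive 1.5 under rank(-pci) and 1 under min_rank(desc(pci)); W and Z receive 3 and 4 under both.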
How to look up a function’s arguments
Not sure what a function does or what arguments it accepts? You can use the help command in R to look it up!
Type a question mark (?) followed by the function name in your console or code chunk:
?mutate
?summarize
# alternatively, you can type
help(summarize)
This opens the help file for the function, which includes a short description of what the function does, a list of arguments you can use, and examples showing how to apply it. This is a great habit to adopt when trying new functions!
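If you only want a quick reminder of a function's arguments without opening the full help page, base R offers a couple of shortcuts. This is a small sketch using read.csv() as the example function.

```r
# Print the argument list of a function, including default values
args(read.csv)

# Get just the argument names as a character vector
arg_names <- names(formals(read.csv))
arg_names
```

These show what a function accepts, but not what each argument means, so the help page is still the place to go for details.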
Using pipes to create multi-line commands
In R, the pipe operator |> allows you to chain together multiple steps of a data transformation in a readable, top-to-bottom format. Each step passes its result to the next command. This makes your code easier to read and avoids creating lots of intermediate objects.
Here’s an example using the |> pipe to filter and summarize data across multiple lines:
# Executing a multi-line command using pipes
cleaned_data <- counties_income |>
  # Here we use the filter() command to restrict our sample to a few states
  filter(states %in% c('New York', 'California', 'North Carolina')) |>
  # Use summarise (Note, summarize() and summarise() are interchangeable. British spelling fans, rejoice!)
  summarise(mean_hhinc = mean(household_income),
            n = n(),
            .by = states)
What this code does:
Starts with the counties_income dataset
Filters it to only include counties in New York, California, and North Carolina
Summarizes the data by state, calculating:
The mean household income
The number of counties
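It can help to see that a pipe is just another way of writing a nested function call: x |> f() means the same thing as f(x). The sketch below shows this with base R functions only. Note that the |> operator requires R 4.1 or later.

```r
# Nested form: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped form: read from left to right (or top to bottom)
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# Both forms give the same result
identical(nested, piped)
```

The piped version does the same work in the same order; it only changes how the steps are laid out on the page.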
Exporting your results
Once you’ve created a cleaned or summarized dataset, you may want to save it to a file. The write.csv() function lets you export your data to a .csv file that you can open in Excel or share with others.
# Writing a csv to a path
write.csv(cleaned_data, "output/state_incomes.csv")
This line:
Writes the cleaned_data dataset to a file called state_incomes.csv
Saves it in the output/ folder within your project directory
Note: Make sure the "output" folder already exists in your project. R will return an error if the folder doesn’t exist.
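If you prefer to have R create the folder for you, one option is to check for it with dir.exists() and create it with dir.create() before writing. The sketch below is an illustration, not part of the original workflow: it writes to a temporary location and uses a made-up data frame (toy_results) in place of cleaned_data.

```r
# Use a temporary location so this sketch does not touch your real project
out_dir <- file.path(tempdir(), "output")

# Create the folder only if it is missing
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Made-up stand-in for cleaned_data
toy_results <- data.frame(states = c("A", "B"),
                          mean_hhinc = c(50000, 30000))

# row.names = FALSE leaves out the column of row numbers
# that write.csv() includes by default
out_file <- file.path(out_dir, "state_incomes.csv")
write.csv(toy_results, out_file, row.names = FALSE)

# Read the file back to confirm what was written
read.csv(out_file)
```

In your own project you would replace out_dir with "output" and toy_results with cleaned_data.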
Additional resources
As you embark on your R journey, check out the Other Resources page for additional materials!
Happy coding!