Data cleaning
This page guides you through basic data cleaning methods in R, using clear and practical examples. As you follow along, you will see how each technique works and why it is useful for preparing your data for analysis. You will learn how to:
- Select variables from a dataset to work with a subsample of data.
- Create dummy variables from categorical responses.
- Generate and recode categorical variables, including grouping values and adding descriptive labels.
- Filter data based on conditions (similar to Stata’s
keep if).
- Create variables conditionally on other variables’ values.
- Handle missing values by removing rows with
NAs in key fields.
- Averaging variables by groups.
By the end of this tutorial, you will know how to clean raw data. You will do this using R’s dplyr library, which you can load into your R session by running the following command:
Select a subsample of data
Before starting our data cleaning, we select only a subsample of variables from our dataset. We keep only three variables:
vote: whether the respondent voted
agea: age of respondent
eisced: education level (ES-ISCED classification)
and save them into the dataframe df_cleaned, using the function select():
df_cleaned<-select(df, vote, agea, eisced)
Generate a dummy variable
A dummy variable is a binary indicator that takes a value of 1 or 0. Here, we create a dummy variable, vote_variable, which takes the value 1 if the respondent reports voting in the last elections and 0 otherwise. Before we begin, let us check the structure of the variable vote using the tabulate() command that we created in How to create a function in R, and why should you do it?:
tabulate(df_cleaned$vote)
[Output]
Response Frequency Percentage Cumulative
1 Yes 30815 69.42 69.42
2 No 9417 21.22 90.64
3 Not eligible to vote 3687 8.31 98.95
4 Refusal 250 0.56 99.51
5 Don't know 209 0.47 99.98
6 No answer 9 0.02 100.00
7 Total 44387 100.00 NA
We then check the values assigned to each label using the command attr(). Alternatively, one can use the command table(), which provides the value and the number of observations per value.
attr(df_cleaned$vote, "labels")
[Output]
Yes No Not eligible to vote Refusal
1 2 3 NA
Don't know No answer
NA NA
We create a dummy variable, vote_variable, which takes the value 1 if a respondent voted in the last election, and 0 if the respondent did not vote or was not eligible to vote. NA values remain as NA values even in the new dummy variable.
df_cleaned$vote_variable<-ifelse(df_cleaned$vote==1, 1, 0)
tabulate(df_cleaned$vote_variable)
[Output]
Response Frequency Percentage Cumulative
1 0 13104 29.52 29.52
2 1 30815 69.42 98.94
3 NA 468 1.05 99.99
4 Total 44387 99.99 NA
Generate a categorical variable
Another important coding practice is generating a new categorical variable from an existing one. This is useful when you want to group a few categories because they have only a few observations, or when you want to simplify the categorical variable by creating broader categories.
We choose the variable eisced, which gives us the ISCED educational levels of each respondent. As before, we check the structure of the variable using the commands tabulate() and attr().
tabulate(df_cleaned$eisced)
[Output]
Response Frequency Percentage Cumulative
1 ES-ISCED I , less than lower secondary 3861 8.70 8.70
2 ES-ISCED II, lower secondary 7388 16.64 25.34
3 ES-ISCED IIIb, lower tier upper secondary 7163 16.14 41.48
4 ES-ISCED IIIa, upper tier upper secondary 8720 19.65 61.13
5 ES-ISCED IV, advanced vocational, sub-degree 6265 14.11 75.24
6 ES-ISCED V1, lower tertiary education, BA level 4760 10.72 85.96
7 ES-ISCED V2, higher tertiary education, >= MA level 6013 13.55 99.51
8 Other 88 0.20 99.71
9 Refusal 82 0.18 99.89
10 Don't know 39 0.09 99.98
11 No answer 8 0.02 100.00
12 Total 44387 100.00 NA
attr(df_cleaned$eisced, "labels")
[Output]
Not possible to harmonise into ES-ISCED
0
ES-ISCED I , less than lower secondary
1
ES-ISCED II, lower secondary
2
ES-ISCED IIIb, lower tier upper secondary
3
ES-ISCED IIIa, upper tier upper secondary
4
ES-ISCED IV, advanced vocational, sub-degree
5
ES-ISCED V1, lower tertiary education, BA level
6
ES-ISCED V2, higher tertiary education, >= MA level
7
Other
55
Refusal
NA
Don't know
NA
No answer
NA
Notice that while the category “Not possible to harmonise into ES-ISCED” exists, it has 0 observations and therefore does not appear in the output of tabulate() or table(). We can verify the number of observations in each category using the sum() command, escluding the missing values (i.e., na.rm = TRUE):
sum(df_cleaned$eisced == 0, na.rm = TRUE)
[Output]
[1] 0
We simplify our eisced variable by grouping it into three categories: (1) Lower education (ES-ISCED I, ES-ISCED II), (2) Middle education (ES-ISCED IIIb, ES-ISCED IIIa), and (3) Higher education (ES-ISCED IV, ES-ISCED V1, ES-ISCED V2). To do this, we use a nested ifelse function. An ifelse statement is a function that assigns a specific value if a condition is true, and another value if it is false. It is called nested because multiple ifelse functions are combined within one another. Notice that %in% can be read as “belongs to the vector”. Essentially, we can read the first line of code as: “If the value in df_cleaned$eisced is 1 or 2, assign value 1 to df_cleaned$education”. The other lines follow a similar logic.
# Use nested ifelse
df_cleaned$education <- ifelse(df_cleaned$eisced %in% c(1, 2), 1, # If the value in df_cleaned$eisced takes value 1 or 2, assign value 1 to df_cleaned$education
ifelse(df_cleaned$eisced %in% c(3, 4), 2,
ifelse(df_cleaned$eisced %in% c(5, 6, 7), 3, NA)))
tabulate(df_cleaned$education)
[Output]
Response Frequency Percentage Cumulative
1 1 11249 25.34 25.34
2 2 15883 35.78 61.12
3 3 17038 38.39 99.51
4 NA 217 0.49 100.00
5 Total 44387 100.00 NA
We can now relabel the values associated with each response. We choose three simple labels: Education: lower, Education: middle and Education: higher.
df_cleaned <- df_cleaned %>%
mutate(education = case_when(
education == 1 ~ "Education: lower",
education == 2 ~ "Education: middle",
education == 3 ~ "Education: higher",
TRUE ~ NA_character_
))
tabulate(df_cleaned$education)
[Output]
Response Frequency Percentage Cumulative
1 Education: higher 17038 38.39 38.39
2 Education: middle 15883 35.78 74.17
3 Education: lower 11249 25.34 99.51
4 NA 217 0.49 100.00
5 Total 44387 100.00 NA
The above code applies the mutate() function to modify the values in education. Precisely, df_cleaned %>% mutate(education = ...) can be read as Take the dataset df_cleaned and then (%>%) mutate the variable education). Then, we use case_when(). This function modifies a variable (or creates a new variable) based on multiple conditions (e.g., If education is 1, it assigns the label “Education: lower”):
case_when(
condition1 ~ value1,
condition2 ~ value2,
...,
TRUE ~ default_value
)
Finally, TRUE ~ default_value is the equivalent of an else statement. Namely, if none of the above conditions are met, the code assigns NA of type character.
Keep a subset of the original data that meets certain conditions
Another useful function is filter(). This function is the Stata equivalent of keep if. It allows us to keep only observations satisfying specific conditions. For instance, assume we want to focus on the working-age population by keeping only respondents between 15 and 64 years old and excluding pensioners and people who are too young to work. First, let’s see the summary statistics for the variable agea:
[Output]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15.00 34.00 49.00 49.14 64.00 100.00 155
We can then filter out the values we are not interested in by using our filter() function:
df_cleaned <- filter(df_cleaned, agea>=15 & agea<=64)
summary(df_cleaned$agea)
[Output]
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.00 30.00 42.00 41.42 53.00 64.00
Create a variable conditional on the values from another variable
Sometimes we want to keep only the values of a variable var1 that depend on specific values of another variable var2. We can do this by combining the mutate() function with the ifelse() function. For example, in the example below, we want to see how many people under the age of 18 did not vote because they were classified as Not eligible to vote. To do so, we write:
df_cleaned<-mutate(df_cleaned, no_vote_18 =ifelse(vote==3 & agea<18, 1, 0))
tabulate(df_cleaned$no_vote_18)
This time, the tabulate() function rounds numbers to three decimal places:
[Output]
Response Frequency Percentage Cumulative
1 0 32493 96.665 96.665
2 1 1119 3.329 99.994
3 NA 2 0.006 100.000
4 Total 33614 100.000 NA
Remove all observations with missing values in one variable.
Sometimes, when cleaning our data, we need to drop all missing values. In Stata, this can be done with drop if missing(var). In R, we can use the filter() function together with !is.na() (meaning it’s not a missing value). In this example, we drop all the rows where vote is missing to ensure a clean dataset for analysis.
df_cleaned <- filter(df_cleaned, !is.na(vote))
tabulate(df_cleaned$vote)
The result is straightforward as the number of observations drop:
[Output]
Response Frequency Percentage Cumulative
1 Yes 22113 66.47 66.47
2 No 7666 23.04 89.51
3 Not eligible to vote 3490 10.49 100.00
4 Total 33269 100.00 NA
Remove all rows with missing values in at least one variable.
We can also drop all missing values in our dataset. I present three options here: Option 1 uses the function if_all(c(var1, var2, ..., varN)) combined with filter(), which is useful when creating a subsample of the data (for instance, including only a few specific variables). Option 2 uses na.omit(), which drops a row if there are NA values in any variable of the dataset. Option 3 does the same as Option 2.
# Option 1
df_cleaned <- df_cleaned %>%
filter(if_all(c(vote, agea, eisced), ~ !is.na(.)))
# Option 2
df_cleaned <- na.omit(df_cleaned)
# Option 3
df_cleaned <- df_cleaned[complete.cases(df_cleaned), ]
For options 2) and 3), if you want to obtain the same result as option 1), it is enough to select another subdataset from df_cleaned' (or from the original df’). For instance, you can select vote, agea, eisced from the original dataset df and obtain the same results as option 1:
df3 <-select(df, vote, agea, eisced)
df_cleaned <- na.omit(df3)
summary(df_cleaned)
[Output]
# Summary shows no NA values exist
vote agea eisced
Min. :1.000 Min. : 15.00 Min. : 1.000
1st Qu.:1.000 1st Qu.: 34.00 1st Qu.: 2.000
Median :1.000 Median : 49.00 Median : 4.000
Mean :1.382 Mean : 49.13 Mean : 4.111
3rd Qu.:2.000 3rd Qu.: 64.00 3rd Qu.: 5.000
Max. :3.000 Max. :100.00 Max. :55.000
Average a variable by group
Sometimes it is useful to average a variable by group. For instance, it can be interesting to see the average age in each country of our sample. To do so, we can use the command group_by(). First of all, we select the variables we need from our original dataset to simplify our original sample df and save them under the dataset named df_avg:
df_avg<-select(df, agea, cntry)
We can then create a variable with the average age by the country of residence (cntry) using the group_by() command. In doing so, we exclude the missing values for the age variable (agea), using the command na.rm = TRUE:
df_avg <- df_avg %>%
group_by(cntry) %>%
mutate(avg_age = mean(agea, na.rm = TRUE)) # exlcude missing values in the computation of the average
To the summary statistics by country for the variable agea, it is enough to combine group_by() with summarise():
#Create values for table
avg_age_tab <- df_avg %>%
group_by(cntry) %>%
summarise(
mean_age = mean(agea, na.rm = TRUE), #average by country
n = n() # Number of observations by country
)
#Print table
print(avg_age_tab, n = Inf) # n=Inf shows all the columns
[Output]
cntry mean_age n
1 AT 49.7 2010
2 BE 47.0 1766
3 CH 47.8 1525
4 CZ 46.1 2269
5 DE 48.6 2852
6 EE 49.6 2019
7 ES 49.6 1958
8 FI 50.1 1925
9 FR 52.4 2070
10 GB 51.4 1959
11 HU 50.8 1614
12 IE 50.2 2757
13 IL 46.9 2557
14 IS 48.7 880
15 IT 48.8 2626
16 LT 49.9 2122
17 NL 51.2 1681
18 NO 47.0 1545
19 PL 47.2 1694
20 PT 52.0 1270
21 RU 46.7 2430
22 SE 51.6 1551
23 SI 49.1 1307
Go back to the Introduction webpage ↩