Summary Statistics in R

How to Create a STATA-Style Summary Statistics Table and Export It to Excel or LaTeX

This page presents a quick way to replicate a STATA-style summary statistics table in R. STATA’s summarize command with the detail option provides key descriptive statistics such as the mean, standard deviation, minimum, maximum, number of observations and many others. Using R, we can achieve a similar output. Below, I share an example that summarises a set of variables in a format that closely resembles STATA’s output. Let’s start by importing the necessary libraries.

pacman::p_load(
  haven,     
  dplyr,
  summarytools,    
  writexl,  
  tibble,   
  xtable    
)

Import your dataset and select a subsample of variable using the select() function. In our example, we will use the variable vote, polintr and ipstrgv from a dataset that we named as df.

#Select a subsample of variables
df_cleaned<-select(df, vote, polintr, ipstrgv)

A basic way to generate summary statistics is using the function summary(), which provide summary statistics in a wide format if more than one variable is considered. This is a good starting point as summary() already provides a wide set of descriptive statistics such as: minimum value, 1st and 3rd quartiles, median, mean, maximum value, number of missing values.

summary_stats <- summary(df_cleaned)
print(summary_stats)

[Output]

     vote          polintr         ipstrgv     
   Min.   :1.000   Min.   :1.000   Min.   :1.000  
   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
   Median :1.000   Median :3.000   Median :2.000  
   Mean   :1.382   Mean   :2.587   Mean   :2.342  
   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:3.000  
   Max.   :3.000   Max.   :4.000   Max.   :6.000  
   NA's   :468     NA's   :97      NA's   :1002

An alternative way to display the table is in a long format and similar to the descriptive statistics of the command summarize in STATA. In this example, I share a code that generates a summary statistics table showing the total number of valid (i.e., non-missing) observations, the mean, standard deviation, minimum, and maximum values. For clarity, I also round all values to two decimal places if they have decimals.

# Extract variable labels
var_labels <- sapply(df_cleaned, function(x) {
              lbl <- attr(x, "label")
              if (is.null(lbl)) NA else lbl })
# Replace variable names with labels where available. Create a copy in which you change the names
df_cleaned_var <-as.data.frame(df_cleaned)
names(df_cleaned_var) <- ifelse(is.na(var_labels), names(df_cleaned_var), var_labels)
# Transpose and make it long
summary_df <- descr(df_cleaned_var, stats = c("n.valid", "mean", "sd", "min", "max"), transpose = TRUE)
# Convert to data frame
summary_df_df <- as.data.frame(summary_df)
# Convert all columns except 'N. Valid' to numeric and round to 2 decimals
numeric_cols <- setdiff(names(summary_df_df), "N. Valid")
summary_df_df[numeric_cols] <- lapply(summary_df_df[numeric_cols], function(x) {
  round(as.numeric(as.character(x)), 2)
})

[Output]


                                                       N.Valid Mean Std.Dev Min Max
How interested in politics                               44290 2.59    0.92   1   4
Important that government is strong and ensures safety   43385 2.34    1.20   1   6
Voted last national election                             43919 1.38    0.64   1   3

We can also change the names of the labels if they are too long. For instance, I change the label ‘‘Important that government is strong and ensures safety’’ into ‘‘Government ensures safety’’ and ‘‘Voted last national election’’ into ‘‘Voted last election’’:

# Generate new names for the variable we want to change
new_labels <- c("Important that government is strong and ensures safety" = "Government ensures safety",
               "Voted last national election" = "Voted last election")

# Generate new dataframe to not overlap the previous one
summary_new_labels <- as.data.frame(summary_df_df)

# Change row names using the new names assigned in new_names
rownames(summary_new_labels) <- ifelse(rownames(summary_df_df) %in% names(new_labels),
                                  new_labels[rownames(summary_df_df)],
                                  rownames(summary_df_df))
print(summary_new_labels)

[Output]


                           N.Valid Mean Std.Dev Min Max
How interested in politics   44290 2.59    0.92   1   4
Government ensures safety    43385 2.34    1.20   1   6
Voted last election          43919 1.38    0.64   1   3

Finally, we can export the results in LaTeX or Excel by simply converting the “row names” into a column. Henceforth, I will use the dataset summary_df_df:

# Move rownames into a column called "Variable"
summary_df_export <- rownames_to_column(summary_df_df, var = "Variable")

#Export in Excel
writexl::write_xlsx(summary_df_export, "summary_table.xlsx")

#Export in Latex
latex_code <- xtable(summary_df_export)
sink("summary_table.tex") 
print(latex_code, type = "latex", include.rownames = FALSE)  
sink()  

Go back to the Introduction webpage ↩