Home

Tabulate in R

How To Replicate STATA Tabulate Command

A particularly useful command in STATA is tabulate, which is commonly used to create frequency tables or cross-tabulations between two categorical variables. It provides a simple way to understand how values are distributed across categories, and is especially helpful in exploring relationships in survey or categorical data.

One-way tabulate

In R, we can replicate its functionality using a combination of different R functions. Below is an example of how to generate a STATA-style frequency table for the variable polintr (interest in politics) using R.

Let’s start by importing the necessary packages.

pacman::p_load(
  haven,     
  dplyr
)

To generate a Tabulate command that includes all the values:


# Generate Tabulate for Variable "polintr"

labels <- attr(df$polintr, "labels")
response_labels <- names(labels)
polintr_factor <- factor(as_factor(df$polintr), levels = response_labels)
freq_table <- table(polintr_factor, useNA = "ifany")
percent_table <- round(prop.table(freq_table) * 100, 2)

# Cumulative percentage
cum_percent <- round(cumsum(percent_table), 2)

# Combine into a data frame
stat_table <- data.frame(
  Response = names(freq_table),
  Frequency = as.vector(freq_table),
  Percentage = as.vector(percent_table),
  Cumulative = as.vector(cum_percent)
)

# Combine into a data frame
total_row <- data.frame(
  Response = "Total",
  Frequency = sum(stat_table$Frequency),
  Percentage = sum(stat_table$Percentage),
  Cumulative = NA_real_, 
  stringsAsFactors = FALSE
)

stat_table_final <- rbind(stat_table, total_row)

# Print tabel with variable label
attr(df$polintr, "label")
print(stat_table_final)
[Output]
[1] "How interested in politics"
    
               Response Frequency Percentage Cumulative
1       Very interested      5415      12.20      12.20
2      Quite interested     15539      35.01      47.21
3     Hardly interested     15248      34.35      81.56
4 Not at all interested      8088      18.22      99.78
5               Refusal        36       0.08      99.86
6            Don't know        57       0.13      99.99
7             No answer         4       0.01     100.00
8                 Total     44387     100.00         NA

To remove missing values and responses such as Refusal, Don’t know, and No answer, I filter a subsample of the data to include only the gndr and polintr variables. I name the resulting dataset df_oneway.


#Drop observations that are missings in both variables
df_oneway<-select(df, gndr, polintr)
df_oneway <- df_oneway[complete.cases(df_oneway), ]

# Consider labels
df_oneway$polintr_label<-as_factor(df_oneway$polintr)
df_oneway$gndr_label<-as_factor(df_oneway$gndr)

# Removes unused factor levels (levels with 0 observations)
df_oneway <- df_oneway %>%
  mutate(
    polintr_label = droplevels(polintr_label),
    gndr_label = droplevels(gndr_label)
  )

Notice that the command %>% must be read as “and then apply to df_oneway the function mutate()”. However, we can also write:

df_oneway <- mutate(df_oneway,
    polintr_label = droplevels(polintr_label),
    gndr_label = droplevels(gndr_label)
  )

Then, I replicate the tabulate code using the dataset df_oneway:

# Generate Tabulate for Variable "polintr"

labels <- attr(df_oneway$polintr, "labels")
response_labels <- names(labels)
polintr_factor <- factor(as_factor(df_oneway$polintr))
freq_table <- table(polintr_factor, useNA = "ifany")
percent_table <- round(prop.table(freq_table) * 100, 2)

# Cumulative percentage
cum_percent <- round(cumsum(percent_table), 2)

# Combine into a data frame
stat_table <- data.frame(
  Response = names(freq_table),
  Frequency = as.vector(freq_table),
  Percentage = as.vector(percent_table),
  Cumulative = as.vector(cum_percent)
)

# Add a final row for total
total_row <- data.frame(
  Response = "Total",
  Frequency = sum(stat_table$Frequency),
  Percentage = sum(stat_table$Percentage),
  Cumulative = NA_real_,  
  stringsAsFactors = FALSE
)

stat_table_final <- rbind(stat_table, total_row)

attr(df_oneway$polintr, "label")
print(stat_table_final)
rm(df_oneway)

[Output]
[1] "How interested in politics"
    
               Response Frequency Percentage Cumulative
1       Very interested      5414      12.23      12.23
2      Quite interested     15535      35.08      47.31
3     Hardly interested     15245      34.43      81.74
4 Not at all interested      8087      18.26     100.00
5                 Total     44281     100.00         NA

Two-way tabulate

A two-way tabulation is a statistical tool used to display the relationship between two categorical variables. Each cell in the table represents the number of observations (or frequency) that fall into the combination of a level from each variable. For example, if we are analysing the variables gender (gndr) and interest in politics (polintr), a two-way tabulation will show how many individuals of each gender report different levels of political interest. Before creating such a table, I remove or exclude missing values or responses such as Refusal, Don’t know, and No answer.


#Drop observations that are missings in both variables
df_twoway<-select(df, gndr, polintr)
df_twoway <- df_twoway[complete.cases(df_twoway), ]

# Consider labels
df_twoway$polintr_label<-as_factor(df_twoway$polintr)
df_twoway$gndr_label<-as_factor(df_twoway$gndr)

# Removes unused factor levels (levels with 0 observations)
df_twoway <- mutate(df_twoway,
    polintr_label = droplevels(polintr_label),
    gndr_label = droplevels(gndr_label)
  )

# Generate the 2-way tabulate
tab_2w <- table(df_twoway$polintr_label, df_twoway$gndr_label, useNA = c("no"))
addmargins(tab_2w)

[Output]
                       Male Female   Sum
  Very interested        3323   2091  5414
  Quite interested       7926   7609 15535
  Hardly interested      6557   8688 15245
  Not at all interested  3184   4903  8087
  Sum                   20990  23291 44281

Go back to the Introduction webpage ↩