1 Introduction

The previous guides showed you how to prepare your data, including working with messy data, and how to describe and visualize it in useful ways. In this guide, I go through the final step of putting data to use for researchers and other interested parties: working with models and model output.

This section does require some working knowledge of OLS. It does not, however, require working knowledge of modeling functions in R, which I'll cover in some detail.

After producing results from models, I'll show how to easily iterate a particular model specification over subsets of your data using purrr and then visualize those results. I'll also use the very useful dotwhisker package to produce coefficient plots.


2 Model data

To keep things simple and focus specifically on models, we'll work with the ready-to-go Volden and Wiseman Legislative Effectiveness Data. Go ahead and download the House data for the 93rd through 115th Congresses, save it to your ‘data’ directory, and read it in:

library(tidyverse)
library(readxl)

vw <- read_excel("data/CELHouse93to115.xlsx")

Now let’s rename some of the variables:

vw <- vw %>% 
  rename(icpsr = `ICPSR number, according to Poole and Rosenthal`, 
         cong = `Congress number`, 
         state = `Two-letter state code`, 
         district = `Congressional district number`,
         name = `Legislator name, as given in THOMAS`,
         les = `Legislative Effectiveness Score (1-5-10)`,
         female = `1 = female`,
         dem = `1 = Democrat`,
         ch_senior = `Seniority, number of terms served counting current`,
         votepct = `Percent vote received to enter this Congress`,
         power = `1 = Member of Appropriation, Rules, or Ways and Means`,
         chair = `1 = Chair, of standing committee or intellgence/homeland sec according to Stewar`,
         sub_chr = `1 = Chair of subcommittee, according to Almanac of American Politics and Adler/W`,
         dw_nom = `First-dimension DW-NOMINATE score`)
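With that many renames it's worth a quick sanity check; a minimal sketch using a few of the new names (pick whichever columns you like):

vw %>% 
  select(icpsr, cong, state, name, les) %>% 
  glimpse()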


2.1 Basic linear models

We can now run some basic linear models in line with what Volden and Wiseman run in their book:

summary(lm(les ~ chair + ch_senior + votepct, data=vw))
## 
## Call:
## lm(formula = les ~ chair + ch_senior + votepct, data = vw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8892 -0.5842 -0.3386  0.2929 13.6423 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4278273  0.0650288   6.579 4.97e-11 ***
## chair       3.2661451  0.0624438  52.305  < 2e-16 ***
## ch_senior   0.0668750  0.0033913  19.720  < 2e-16 ***
## votepct     0.0008457  0.0009613   0.880    0.379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.275 on 9896 degrees of freedom
##   (363 observations deleted due to missingness)
## Multiple R-squared:  0.3082, Adjusted R-squared:  0.308 
## F-statistic:  1469 on 3 and 9896 DF,  p-value: < 2.2e-16

This is a very simple model showing that being a committee chair, as well as being a more senior member, predicts a higher legislative effectiveness score. Let's do something more interesting:

summary(lm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) + as.factor(cong), data=vw))
## 
## Call:
## lm(formula = les ~ chair + ch_senior + power + votepct + female + 
##     dem + abs(dw_nom) + as.factor(cong), data = vw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7743 -0.5933 -0.3190  0.3028 13.8013 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.3455976  0.0912363   3.788 0.000153 ***
## chair               3.1959373  0.0629271  50.788  < 2e-16 ***
## ch_senior           0.0750028  0.0035469  21.146  < 2e-16 ***
## power              -0.2446507  0.0305499  -8.008  1.3e-15 ***
## votepct             0.0011702  0.0009946   1.177 0.239388    
## female             -0.0510063  0.0428547  -1.190 0.233991    
## dem                 0.0100592  0.0273120   0.368 0.712651    
## abs(dw_nom)         0.2680342  0.0769161   3.485 0.000495 ***
## as.factor(cong)94   0.0105389  0.0868961   0.121 0.903471    
## as.factor(cong)95   0.0238368  0.0877376   0.272 0.785871    
## as.factor(cong)96   0.0322048  0.0871850   0.369 0.711849    
## as.factor(cong)97   0.0518149  0.0874480   0.593 0.553514    
## as.factor(cong)98   0.0224198  0.0872752   0.257 0.797272    
## as.factor(cong)99   0.0044360  0.0870996   0.051 0.959382    
## as.factor(cong)100 -0.0282837  0.0870450  -0.325 0.745240    
## as.factor(cong)101 -0.0241841  0.0872723  -0.277 0.781701    
## as.factor(cong)102 -0.0577942  0.0871890  -0.663 0.507435    
## as.factor(cong)103 -0.0122956  0.0876458  -0.140 0.888436    
## as.factor(cong)104  0.0227211  0.0875135   0.260 0.795155    
## as.factor(cong)105  0.0555917  0.0881773   0.630 0.528412    
## as.factor(cong)106 -0.0005897  0.0873469  -0.007 0.994614    
## as.factor(cong)107 -0.0327318  0.0879493  -0.372 0.709777    
## as.factor(cong)108 -0.0520967  0.0881601  -0.591 0.554579    
## as.factor(cong)109 -0.1033902  0.0883869  -1.170 0.242132    
## as.factor(cong)110 -0.0890764  0.0884895  -1.007 0.314136    
## as.factor(cong)111 -0.0942078  0.0875937  -1.076 0.282174    
## as.factor(cong)112 -0.0968844  0.0889577  -1.089 0.276134    
## as.factor(cong)113 -0.0639433  0.0889902  -0.719 0.472440    
## as.factor(cong)114 -0.0357081  0.0875520  -0.408 0.683393    
## as.factor(cong)115 -0.0461652  0.0873452  -0.529 0.597138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.271 on 9821 degrees of freedom
##   (412 observations deleted due to missingness)
## Multiple R-squared:  0.3158, Adjusted R-squared:  0.3138 
## F-statistic: 156.3 on 29 and 9821 DF,  p-value: < 2.2e-16

I've added some additional covariates, as well as abs(dw_nom), the absolute value of the first-dimension DW-NOMINATE score, which is a way of measuring ideological extremity. I've also included as.factor(cong), which turns the Congress number into a set of fixed effects (dummy variables), with the first Congress in the data – the 93rd – serving as the omitted baseline.
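If you ever want a different baseline, relevel the factor before fitting. A minimal sketch (the shortened formula and the choice of the 100th Congress are purely illustrative):

vw %>% 
  mutate(cong_f = relevel(as.factor(cong), ref = "100")) %>% # 100th Congress as the omitted category
  lm(les ~ chair + ch_senior + cong_f, data = .) %>% 
  summary()

These cross-sectional comparisons are interesting, but what if we want to include some higher-dimensional fixed effects, such as a standard two-way fixed effects model?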


2.2 Two-way fixed effects

For this we'll turn to the lfe package, by far the best package for working with fixed effects in R. I'm also going to load the broom package, another incredibly useful tool when working with model objects. We'll also assign the model result to an object.

library(lfe)
library(broom)

Let’s run the same model as above with felm:

m <- felm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | cong | 0 | icpsr, data=vw)
m %>% tidy
## # A tibble: 7 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 chair        3.20      0.227      14.1   1.33e-44
## 2 ch_senior    0.0750    0.00761     9.86  8.24e-23
## 3 power       -0.245     0.0530     -4.62  3.92e- 6
## 4 votepct      0.00117   0.00136     0.859 3.90e- 1
## 5 female      -0.0510    0.0556     -0.918 3.59e- 1
## 6 dem          0.0101    0.0477      0.211 8.33e- 1
## 7 abs(dw_nom)  0.268     0.137       1.95  5.06e- 2

There are a couple of new things here. First, note that the coefficient estimates are the same as when we used the lm function; only the standard errors differ, because they are now clustered.

Second, note the syntax for felm:

  • felm(y ~ x | fixed_effect | instrument | cluster_standard_error, data=data)

This includes the same model specification, but we can now put the fixed effect in its own slot so that its coefficients are not displayed (as is typical in published work). You can also include instruments if you have them (we don't), and you can tell felm which variable to cluster standard errors on – in this case, the member.
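To make the mapping concrete, here's the same call annotated slot by slot (a sketch; 0 marks an empty slot):

m <- felm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | # covariates to report
            cong |  # fixed effects, absorbed rather than displayed
            0 |     # instrumental variables (none here)
            icpsr,  # cluster standard errors by member
          data = vw)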

Finally, I pipe the model object into broom::tidy, which turns our model output into an easy-to-use dataframe!
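tidy also takes some useful arguments; for instance, assuming your broom version supports conf.int for felm objects, you can append confidence intervals:

m %>% tidy(conf.int = TRUE) # adds conf.low and conf.high columns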

Let’s try something similar with two-way fixed effects – fixed effects for units (members) and time (congresses):

m <- felm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | cong + icpsr | 0 | icpsr, data=vw)
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
m %>% tidy
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
## # A tibble: 7 x 5
##   term         estimate std.error statistic    p.value
##   <chr>           <dbl>     <dbl>     <dbl>      <dbl>
## 1 chair         2.91      0.245      11.9     2.50e-32
## 2 ch_senior     0.0170    0.0218      0.780   4.35e- 1
## 3 power        -0.0355    0.0521     -0.682   4.96e- 1
## 4 votepct       0.00692   0.00142     4.87    1.14e- 6
## 5 female      NaN         0         NaN     NaN       
## 6 dem          -0.967     0.580      -1.67    9.52e- 2
## 7 abs(dw_nom)   0.616     0.406       1.52    1.29e- 1

Uh oh. We get a warning. Why? It's because the variable female is perfectly collinear with the member fixed effect – gender does not vary at all within member. Additionally, though dem doesn't trigger a warning, it doesn't make sense to include it here either: only a handful of members ever switch parties, and those few switches produce all of the within-member variation in the variable.
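You can verify the first point directly; a quick sketch counting the distinct values of female within each member:

vw %>% 
  group_by(icpsr) %>% 
  summarize(n_female = n_distinct(female)) %>% 
  count(n_female) # every member should show exactly one distinct value

Let's take those two variables out and run the model again: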

m <- felm(les ~ chair + ch_senior + power + votepct + abs(dw_nom) | cong + icpsr | 0 | icpsr, data=vw)
m %>% tidy
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 chair        2.91      0.245      11.9   2.53e-32
## 2 ch_senior    0.0170    0.0218      0.779 4.36e- 1
## 3 power       -0.0371    0.0521     -0.712 4.77e- 1
## 4 votepct      0.00687   0.00142     4.84  1.29e- 6
## 5 abs(dw_nom)  0.618     0.406       1.52  1.28e- 1

Very cool. Note, however, that these models have a different interpretation. Substantively, the member fixed effect holds fixed any time-invariant member trait, so the identifying variation comes from changes within a member's career, such as gaining a committee chair position, becoming more senior, or changes in vote share. According to this evidence, both gaining a committee chair position and winning a higher vote share have a strong positive relationship with a member's legislative effectiveness.

The felm function is incredibly flexible and you can include as many fixed effects as you want, but they should have a theoretical purpose.


3 Visualizing model output

Visualizing model output used to be a huge pain. Fortunately, we have a couple of great R packages that work with ggplot to make this easy. Let’s try out the dotwhisker package:

library(dotwhisker)

We can easily visualize the results from our previous model, m, as a coefficient plot:

dwplot(m) + theme_bw()

Incredibly simple. There's a lot you can customize with dwplot, some of which I'll go through later, and the documentation is really well done.
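As a taste, here's a sketch with two common tweaks: a reference line at zero and readable variable labels via relabel_predictors (the display names are made up for illustration):

dwplot(m, vline = geom_vline(xintercept = 0, linetype = "dashed")) %>% 
  relabel_predictors(c(chair = "Committee chair", 
                       ch_senior = "Seniority", 
                       power = "Power committee")) +
  theme_bw() +
  xlab("Coefficient estimate")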


3.1 Models with interactions

Plotting models with interaction terms is also incredibly easy. Let’s say we want to plot the predicted results of this specification:

m.int <- lm(les ~ chair + ch_senior + female + ch_senior * female + power + votepct + dem + dw_nom, data=vw)
summary(m.int)
## 
## Call:
## lm(formula = les ~ chair + ch_senior + female + ch_senior * female + 
##     power + votepct + dem + dw_nom, data = vw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7520 -0.5868 -0.3199  0.2968 13.7684 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.2942528  0.0734690   4.005 6.24e-05 ***
## chair             3.1984053  0.0628903  50.857  < 2e-16 ***
## ch_senior         0.0764100  0.0036453  20.961  < 2e-16 ***
## female            0.0417211  0.0690922   0.604 0.545960    
## power            -0.2372366  0.0305180  -7.774 8.39e-15 ***
## votepct           0.0016272  0.0009747   1.669 0.095078 .  
## dem               0.1750741  0.0566483   3.091 0.002003 ** 
## dw_nom            0.2379139  0.0651631   3.651 0.000263 ***
## ch_senior:female -0.0184722  0.0123468  -1.496 0.134656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.27 on 9842 degrees of freedom
##   (412 observations deleted due to missingness)
## Multiple R-squared:  0.3156, Adjusted R-squared:  0.315 
## F-statistic: 567.2 on 8 and 9842 DF,  p-value: < 2.2e-16

Essentially, we want to know how a member of Congress's gender interacts with their chamber seniority to predict legislative effectiveness. Since ch_senior is a continuous variable, the interaction is not obvious without plotting some values. To do this we'll use a couple of additional packages in combination with ggplot:

library(sjPlot)
library(sjmisc)
plot_model(m.int, type="int") + theme_minimal()

Easy enough. Based on this plot, there is evidence of a difference in the way chamber seniority interacts with gender, with more senior female members on average less effective than comparably senior male members. The literature suggests this pattern is driven by some form of discrimination, or by the need for female members of Congress to work harder to get elected relative to comparable male members.

There's a lot more you can do to plot interactions, and the documentation for this package is also excellent.
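For example, a sketch using sjPlot's predicted-values interface, which lets you name the interacting terms explicitly:

# predicted LES across seniority, separately by gender (a sketch)
plot_model(m.int, type = "pred", terms = c("ch_senior", "female")) +
  theme_minimal()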


4 Iteration and visualizing model output

Now for something slightly more complex.

A common task, especially in descriptive work using OLS, is subsetting your data and running the same model specification over each subset. You then want to extract a particular coefficient from each iteration and collect the results. I do this quite a bit in my own work, and it's pretty easy once you wrap your head around a few functions.

In this example, let's take a look at how the power of subcommittee chairs has varied over time, as measured by their legislative effectiveness. The basic model specification will be similar to the one above:

felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=vw)

However, we'll take out the Congress fixed effects and run this specification on slices of the data, one per Congress. We'll also include state fixed effects and keep standard errors clustered at the member level. What we want out of this model is the coefficient on sub_chr. Let's do a toy example:

m <- felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=vw)

m %>%
  tidy %>% 
  filter(term=="sub_chr")
## # A tibble: 1 x 5
##   term    estimate std.error statistic  p.value
##   <chr>      <dbl>     <dbl>     <dbl>    <dbl>
## 1 sub_chr     1.06    0.0575      18.4 3.99e-74

Now what we need to do is run this within each Congress. There are a few ways of doing this, but I prefer the following method since it makes it most intuitive to understand what's going on. It will also show you how easy it is to do something much more complicated, such as running regression discontinuities within data subsets.
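For reference, one common alternative is a nested-data pipeline; a sketch with a shortened formula, in case you prefer it:

by_cong <- vw %>% 
  group_by(cong) %>% 
  nest() %>% 
  mutate(results = map(data, ~ tidy(felm(les ~ sub_chr + chair | state | 0 | icpsr, data = .x))))

But the function-based approach is easier to reason about, so that's what we'll use.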

First, let’s create a basic function:

iterate.model <- function(cong_num){
  
  data <- vw %>% 
    filter(cong == cong_num)
  
  m <- felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=data)
  
  m <- m %>%
    tidy %>% 
    filter(term=="sub_chr")
  
  m$cong <- cong_num
  
  return(m)
}

This function has four steps:

  1. Filter the vw data based on the passed congress number, cong_num;
  2. Run the model;
  3. Tidy the model output and filter it to the coefficient of interest, sub_chr;
  4. Create a variable in the new model dataframe called cong and return the model object.

Let’s test it out:

iterate.model(111) # pass it the 111th Congress as a test
## # A tibble: 1 x 6
##   term    estimate std.error statistic p.value  cong
##   <chr>      <dbl>     <dbl>     <dbl>   <dbl> <dbl>
## 1 sub_chr    0.255     0.189      1.35   0.178   111

Perfect! Using purrr and the map_df function allows us to do this over the whole range of values in cong:

model.output <- map_df(unique(vw$cong), ~iterate.model(.))

head(model.output)
## # A tibble: 6 x 6
##   term    estimate std.error statistic       p.value  cong
##   <chr>      <dbl>     <dbl>     <dbl>         <dbl> <dbl>
## 1 sub_chr    0.974     0.251      3.88 0.000124         93
## 2 sub_chr    0.784     0.274      2.86 0.00443          94
## 3 sub_chr    1.10      0.180      6.13 0.00000000236    95
## 4 sub_chr    1.09      0.257      4.25 0.0000272        96
## 5 sub_chr    1.07      0.195      5.49 0.0000000744     97
## 6 sub_chr    0.962     0.220      4.37 0.0000164        98

The map_df function takes as its first argument the vector or list of values you want to iterate over, and as its second argument the function to apply to each value; it then row-binds the results into a single dataframe. In this case we're passing all of the unique values of cong from the vw dataset.
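Note that the ~iterate.model(.) formula shorthand is just an anonymous function; passing the function directly is equivalent:

# these two calls do the same thing
model.output <- map_df(unique(vw$cong), iterate.model)
model.output <- map_df(unique(vw$cong), ~iterate.model(.))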

Super easy. Now we can go back to the dwplot function for visualizing this as a coefficient plot:

model.output %>% 
  select(-term) %>% 
  rename(term = cong) %>% 
  dwplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  ylab("Congress") +
  xlab("Estimated Coefficient on Sub Cmte. Chair")

You see plots like this all the time in published work, and as you can see it's very easy to do. I had to make a few small edits to get the data to cooperate with the dwplot function, such as dropping the term variable and renaming cong to term so that each Congress number displays on the y-axis.


5 Maps and model results

Why not put everything together into one example?

What we’ll do here is take a set of models run on a different subset of data – similar to the congress by congress breakdown from above – extract a coefficient of interest, and plot those coefficients on a state map, shaded by value.

Specifically, let’s explore whether legislative effectiveness varies by state over time. I don’t have a particular prior about this, so it’ll be interesting to see.

What we want to do is first create a time-based grouping, and then iterate the model over that grouping. Then we’ll extract the state fixed effects for each time period. The state fixed effects are otherwise known as state-specific intercepts, so this will give us conditional averages by state.

vw$time.group <- NA
vw$time.group[vw$cong < 97] <- "1973-1980"
vw$time.group[vw$cong >= 97 & vw$cong < 102] <- "1981-1990"
vw$time.group[vw$cong >= 102 & vw$cong < 107] <- "1991-2000"
vw$time.group[vw$cong >= 107 & vw$cong < 112] <- "2001-2010"
vw$time.group[vw$cong >= 112] <- "2011-2018"

You can replicate this process using tidyverse functions, but I find this base-R approach easier and simpler to read.
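If you prefer the tidyverse version, here's a sketch using case_when (the conditions are evaluated in order, so each line implicitly excludes the ones above it):

vw <- vw %>% 
  mutate(time.group = case_when(
    cong < 97  ~ "1973-1980",
    cong < 102 ~ "1981-1990",
    cong < 107 ~ "1991-2000",
    cong < 112 ~ "2001-2010",
    TRUE       ~ "2011-2018"
  ))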

Now let’s create our model function:

iterate.model.state <- function(time){
  
  data <- vw %>% 
    filter(time.group == time)
  
  m <- felm(les ~ as.factor(state) | cong, data=data)
  
  m <- m %>%
    tidy %>% 
    filter(str_detect(term, "as.factor"))
  
  m$time.group <- time
  
  return(m)
}

The major difference between this example and the previous one is the line filter(str_detect(term, "as.factor")). All this does is keep the rows whose term column contains as.factor, since those are the state coefficients we want.

Now let’s iterate:

state.models <- map_df(unique(vw$time.group), ~iterate.model.state(.))

We need to do a bit of cleaning to the term column in order to get it into a merge-able form:

state.models <- state.models %>% 
  mutate(state = str_sub(term, -2)) # keep the last two characters: "as.factor(state)AL" becomes "AL"

Now recall how easy it was to create a US map in the previous guide. We're going to use the state map data built into ggplot (via map_data, which draws on the maps package) to create a nice dataframe:

states_map <- map_data("state")

This dataframe identifies states only by full (lower case) names in its region column, while our model results use two-letter abbreviations, so let's add abbreviations to states_map using R's built-in state.abb and state.name vectors:

states_map <- states_map %>% 
  left_join(data.frame(state = state.abb, region = str_to_lower(state.name)))
## Joining, by = "region"

Now we can join the model data to the state map data. Note that this is going to create a longer dataset, since we have five sets of observations per state (one per time period). This is OK! We're going to deal with it in ggplot shortly:

states_map.long <- states_map %>% left_join(state.models) %>% 
  filter(!is.na(time.group))
## Joining, by = "state"

And now to the fun part: visualization.

ggplot(states_map.long %>% filter(time.group == "1991-2000"), aes(x=long, y=lat, group=group, fill=estimate)) +
  geom_polygon(color="#cccccc", size=.5) +
  coord_map() 

Very nice! However, this is only one time period. How can we plot all five? Let's use purrr::map again!

make.map <- function(time){
  ggplot(states_map.long %>% filter(time.group == time), aes(x=long, y=lat, group=group, fill=estimate)) +
  geom_polygon(color="#cccccc", size=.5) +
  scale_fill_gradient(low = "white", high = "dodgerblue") +
  coord_map() +
  ggtitle(time) +
  theme_void() +
  theme(legend.position = "none")
}

maps <- map(unique(states_map.long$time.group), ~make.map(.))

The regular map function returns a list, which is what we want in this case since each list element is a ggplot object:

maps[[1]]

We now have five maps within the maps list object. Let's plot them together in a coherent way using the grid.arrange function from the gridExtra package:

library(gridExtra)

grid.arrange(maps[[1]], maps[[2]], maps[[3]], maps[[4]], maps[[5]], nrow = 3)
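Equivalently, a sketch that avoids typing each list element by hand:

# pass the whole list of plots at once
do.call(grid.arrange, c(maps, nrow = 3))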

And there you have it! There is some variation over time in average effectiveness by state – something I'm not sure existing work tells us much about.