The previous guides showed you how to prepare your data, including working with messy data, and how to describe and visualize it in useful ways. In this guide, I go through the final step of using data for researchers and other interested parties: working with models and model output.
This section does require some working knowledge of OLS. However, it does not require working knowledge of modeling functions in R, which I’ll cover in a little bit of detail.
After producing results from models, I’ll show how to easily iterate a particular model specification over subsets of your data using purrr
and take those results for visualization purposes. I’ll also use the very useful dotwhisker
package for producing coefficient plots.
To keep things simple and focus on models specifically, we’ll work with the ready-to-go Volden and Wiseman Legislative Effectiveness Data. Go ahead and download the House data from the 93rd to 115th Congress and save it to your ‘data’ directory, then read it in:
Now let’s rename some of the variables:
vw <- vw %>%
rename(icpsr = `ICPSR number, according to Poole and Rosenthal`,
cong = `Congress number`,
state = `Two-letter state code`,
district = `Congressional district number`,
name = `Legislator name, as given in THOMAS`,
les = `Legislative Effectiveness Score (1-5-10)`,
female = `1 = female`,
dem = `1 = Democrat`,
ch_senior = `Seniority, number of terms served counting current`,
votepct = `Percent vote received to enter this Congress`,
power = `1 = Member of Appropriation, Rules, or Ways and Means`,
chair = `1 = Chair, of standing committee or intellgence/homeland sec according to Stewar`,
sub_chr = `1 = Chair of subcommittee, according to Almanac of American Politics and Adler/W`,
dw_nom = `First-dimension DW-NOMINATE score`)
We can now run some basic linear models inline with what Volden and Wiseman run in their book:
##
## Call:
## lm(formula = les ~ chair + ch_senior + votepct, data = vw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8892 -0.5842 -0.3386 0.2929 13.6423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4278273 0.0650288 6.579 4.97e-11 ***
## chair 3.2661451 0.0624438 52.305 < 2e-16 ***
## ch_senior 0.0668750 0.0033913 19.720 < 2e-16 ***
## votepct 0.0008457 0.0009613 0.880 0.379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.275 on 9896 degrees of freedom
## (363 observations deleted due to missingness)
## Multiple R-squared: 0.3082, Adjusted R-squared: 0.308
## F-statistic: 1469 on 3 and 9896 DF, p-value: < 2.2e-16
This is a very simple model showing that being a committee chair as well as a more senior member predicts a higher legislative effectiveness score. Let’s do something more interesting:
summary(lm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) + as.factor(cong), data=vw))
##
## Call:
## lm(formula = les ~ chair + ch_senior + power + votepct + female +
## dem + abs(dw_nom) + as.factor(cong), data = vw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7743 -0.5933 -0.3190 0.3028 13.8013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3455976 0.0912363 3.788 0.000153 ***
## chair 3.1959373 0.0629271 50.788 < 2e-16 ***
## ch_senior 0.0750028 0.0035469 21.146 < 2e-16 ***
## power -0.2446507 0.0305499 -8.008 1.3e-15 ***
## votepct 0.0011702 0.0009946 1.177 0.239388
## female -0.0510063 0.0428547 -1.190 0.233991
## dem 0.0100592 0.0273120 0.368 0.712651
## abs(dw_nom) 0.2680342 0.0769161 3.485 0.000495 ***
## as.factor(cong)94 0.0105389 0.0868961 0.121 0.903471
## as.factor(cong)95 0.0238368 0.0877376 0.272 0.785871
## as.factor(cong)96 0.0322048 0.0871850 0.369 0.711849
## as.factor(cong)97 0.0518149 0.0874480 0.593 0.553514
## as.factor(cong)98 0.0224198 0.0872752 0.257 0.797272
## as.factor(cong)99 0.0044360 0.0870996 0.051 0.959382
## as.factor(cong)100 -0.0282837 0.0870450 -0.325 0.745240
## as.factor(cong)101 -0.0241841 0.0872723 -0.277 0.781701
## as.factor(cong)102 -0.0577942 0.0871890 -0.663 0.507435
## as.factor(cong)103 -0.0122956 0.0876458 -0.140 0.888436
## as.factor(cong)104 0.0227211 0.0875135 0.260 0.795155
## as.factor(cong)105 0.0555917 0.0881773 0.630 0.528412
## as.factor(cong)106 -0.0005897 0.0873469 -0.007 0.994614
## as.factor(cong)107 -0.0327318 0.0879493 -0.372 0.709777
## as.factor(cong)108 -0.0520967 0.0881601 -0.591 0.554579
## as.factor(cong)109 -0.1033902 0.0883869 -1.170 0.242132
## as.factor(cong)110 -0.0890764 0.0884895 -1.007 0.314136
## as.factor(cong)111 -0.0942078 0.0875937 -1.076 0.282174
## as.factor(cong)112 -0.0968844 0.0889577 -1.089 0.276134
## as.factor(cong)113 -0.0639433 0.0889902 -0.719 0.472440
## as.factor(cong)114 -0.0357081 0.0875520 -0.408 0.683393
## as.factor(cong)115 -0.0461652 0.0873452 -0.529 0.597138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.271 on 9821 degrees of freedom
## (412 observations deleted due to missingness)
## Multiple R-squared: 0.3158, Adjusted R-squared: 0.3138
## F-statistic: 156.3 on 29 and 9821 DF, p-value: < 2.2e-16
I’ve added in some additional covariates as well as abs(dw_nom)
which is a way of measuring ideological extremity. I’ve also included as.factor(cong)
which turns the congress number into a fixed effect (or dummy variable). These cross-sectional comparisons are interesting, but what if we want to include some higher dimensional fixed effects, such as a standard two-way fixed effects model?
For this we’ll turn to the lfe
package, by far the best package for working with fixed effects in R. I’m also going to load in the broom
package, another incredibly useful package when working with model objects. We’ll also assign the model result to an object.
Let’s run the same model as above with felm
:
m <- felm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | cong | 0 | icpsr, data=vw)
m %>% tidy
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chair 3.20 0.227 14.1 1.33e-44
## 2 ch_senior 0.0750 0.00761 9.86 8.24e-23
## 3 power -0.245 0.0530 -4.62 3.92e- 6
## 4 votepct 0.00117 0.00136 0.859 3.90e- 1
## 5 female -0.0510 0.0556 -0.918 3.59e- 1
## 6 dem 0.0101 0.0477 0.211 8.33e- 1
## 7 abs(dw_nom) 0.268 0.137 1.95 5.06e- 2
There are a couple new things here. First, note that the model results are the same as when we used the lm
function except for the standard errors.
Second, note the syntax for felm
:
felm(y ~ x | fixed_effect | instrument | cluster_standard_error, data=data)
This includes the same model specification, but we can now put the fixed effect separately into the model so the output is not displayed (as is typical in published work). You can also include instruments if you have them (which we don’t), and you can tell felm which variable to cluster on – in this case, the member.
Finally, I pipe the model object into broom::tidy
which turns our model output into an easy to use dataframe!
Let’s try something similar with two-way fixed effects – fixed effects for units (members) and time (congresses):
m <- felm(les ~ chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | cong + icpsr | 0 | icpsr, data=vw)
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either
## rank-deficient or indefinite
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chair 2.91 0.245 11.9 2.50e-32
## 2 ch_senior 0.0170 0.0218 0.780 4.35e- 1
## 3 power -0.0355 0.0521 -0.682 4.96e- 1
## 4 votepct 0.00692 0.00142 4.87 1.14e- 6
## 5 female NaN 0 NaN NaN
## 6 dem -0.967 0.580 -1.67 9.52e- 2
## 7 abs(dw_nom) 0.616 0.406 1.52 1.29e- 1
Uh oh. We get a warning. Why? It’s because the variable female
is perfectly correlated with the member fixed effect – gender does not vary at all within member. Additionally, though we don’t get an error from dem
, it doesn’t make sense to include here as there are only a hand full of members that change their party which is producing all of the variation in this variable. Let’s take those out and run this again:
m <- felm(les ~ chair + ch_senior + power + votepct + abs(dw_nom) | cong + icpsr | 0 | icpsr, data=vw)
m %>% tidy
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chair 2.91 0.245 11.9 2.53e-32
## 2 ch_senior 0.0170 0.0218 0.779 4.36e- 1
## 3 power -0.0371 0.0521 -0.712 4.77e- 1
## 4 votepct 0.00687 0.00142 4.84 1.29e- 6
## 5 abs(dw_nom) 0.618 0.406 1.52 1.28e- 1
Very cool. However, these models have a different interpretation. Substantively, the member fixed effect is holding fixed any time-invariant member trait, so the variation comes through changes to the member’s status such as gaining a committee chair position, becoming more senior, or changes in their voteshare. According to this evidence, there’s a strong positive relationship between gaining a committee chair position and getting a higher voteshare and the member’s legislative effectiveness.
The felm
function is incredibly flexible and you can include as many fixed effects as you want, but they should have a theoretical purpose.
Visualizing model output used to be a huge pain. Fortunately, we have a couple of great R packages that work with ggplot to make this easy. Let’s try out the dotwhisker
package:
We can easily visualize the results from our previous model, m
, as a coefficient plot:
Incredibly simple. There’s a lot you can customize with dwplot
and I’ll go through some later, but the documentation is really well done.
Plotting models with interaction terms is also incredibly easy. Let’s say we want to plot the predicted results of this specification:
m.int <- lm(les ~ chair + ch_senior + female + ch_senior * female + power + votepct + dem + dw_nom, data=vw)
summary(m.int)
##
## Call:
## lm(formula = les ~ chair + ch_senior + female + ch_senior * female +
## power + votepct + dem + dw_nom, data = vw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7520 -0.5868 -0.3199 0.2968 13.7684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2942528 0.0734690 4.005 6.24e-05 ***
## chair 3.1984053 0.0628903 50.857 < 2e-16 ***
## ch_senior 0.0764100 0.0036453 20.961 < 2e-16 ***
## female 0.0417211 0.0690922 0.604 0.545960
## power -0.2372366 0.0305180 -7.774 8.39e-15 ***
## votepct 0.0016272 0.0009747 1.669 0.095078 .
## dem 0.1750741 0.0566483 3.091 0.002003 **
## dw_nom 0.2379139 0.0651631 3.651 0.000263 ***
## ch_senior:female -0.0184722 0.0123468 -1.496 0.134656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.27 on 9842 degrees of freedom
## (412 observations deleted due to missingness)
## Multiple R-squared: 0.3156, Adjusted R-squared: 0.315
## F-statistic: 567.2 on 8 and 9842 DF, p-value: < 2.2e-16
Essentially, we want to know how the member of Congress’ gender interacts with their chamber seniority to predict changes in legislative effectiveness. Since ch_senior
is a continuous variable, this is not obvious without plotting some values. To do this we’ll use a couple of additional packages in combination with ggplot:
Easy enough. Based on this plot, there is evidence of a difference in the way chamber seniority interacts with gender, with more senior females on average less effective than more senior males. There’s a lot of evidence for this in the literature as being driven by some kind of discrimination or the need for female members of Congress to work harder to get elected relative to a comparable male member.
There’s a lot more you can to plot interactions and the documentation on this package is also excellent.
Now for something slightly more complex.
A common task especially in description using OLS is subsetting your data and running the same model specification over each subset. Then, you’ll want to extract a certain coefficient value from each iteration and output it. I do this quite a bit in my work, and it’s pretty easy once you wrap your head around a few functions.
In this example, let’s take a look at how the power of subcomittee chairs has varied over time, measured by their legislative effectiveness. The basic model specification will be similar to above:
felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=vw)
However, we’ll take out congress fixed effects and run this specification over slices of the data taken by each Congress. We’ll also include state fixed effects and keep standard errors clustered at the individual member. What we want out of this model is the coefficient on sub_chr
. Let’s do a toy example:
m <- felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=vw)
m %>%
tidy %>%
filter(term=="sub_chr")
## # A tibble: 1 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sub_chr 1.06 0.0575 18.4 3.99e-74
Now what we need to do is run this within each Congress. There are a few ways of doing this, but I prefer this method since it’s the most intuitive to understand what’s going on. This will also show you how easy it is to do something much more complicated, such as running regression discontinuities within data subsets.
First, let’s create a basic function:
iterate.model <- function(cong_num){
data <- vw %>%
filter(cong == cong_num)
m <- felm(les ~ sub_chr + chair + ch_senior + power + votepct + female + dem + abs(dw_nom) | state | 0 | icpsr, data=data)
m <- m %>%
tidy %>%
filter(term=="sub_chr")
m$cong <- cong_num
return(m)
}
This function has four steps:
vw
data based on the passed congress number, cong_num
;sub_chr
;cong
and return the model object.Let’s test it out:
## # A tibble: 1 x 6
## term estimate std.error statistic p.value cong
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sub_chr 0.255 0.189 1.35 0.178 111
Perfect! Using purrr
and the map_df
function allows us to do this over the whole range of values in cong
:
## # A tibble: 6 x 6
## term estimate std.error statistic p.value cong
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sub_chr 0.974 0.251 3.88 0.000124 93
## 2 sub_chr 0.784 0.274 2.86 0.00443 94
## 3 sub_chr 1.10 0.180 6.13 0.00000000236 95
## 4 sub_chr 1.09 0.257 4.25 0.0000272 96
## 5 sub_chr 1.07 0.195 5.49 0.0000000744 97
## 6 sub_chr 0.962 0.220 4.37 0.0000164 98
The map_df
function takes as its first parameter a vector or list of values you want to pass into your function, the second parameter. In this case we’re passing all of the unique values of cong
from the vw
dataset.
Super easy. Now we can go back to the dwplot
function for visualizing this as a coefficient plot:
model.output %>%
select(-term) %>%
rename(term = cong) %>%
dwplot() +
theme_minimal() +
theme(legend.position = "none") +
ylab("Congress") +
xlab("Estimated Coefficient on Sub Cmte. Chair")
You see plots like this all of the time in published work, and as you can see it’s very easy to do. I had to do a few small edits to make it cooperate with the dwplot
function, such as renaming the term
varaiable to the cong
variable to get it to display each Congress number on the y axis.
Why not put everything together into one example?
What we’ll do here is take a set of models run on a different subset of data – similar to the congress by congress breakdown from above – extract a coefficient of interest, and plot those coefficients on a state map, shaded by value.
Specifically, let’s explore whether legislative effectiveness varies by state over time. I don’t have a particular prior about this, so it’ll be interesting to see.
What we want to do is first create a time-based grouping, and then iterate the model over that grouping. Then we’ll extract the state fixed effects for each time period. The state fixed effects are otherwise known as state-specific intercepts, so this will give us conditional averages by state.
vw$time.group <- NA
vw$time.group[vw$cong < 97] <- "1973-1980"
vw$time.group[vw$cong >= 97 & vw$cong <102] <- "1981-1990"
vw$time.group[vw$cong >= 102 & vw$cong < 107] <- "1991-2000"
vw$time.group[vw$cong >= 107 & vw$cong < 112] <- "2001-2010"
vw$time.group[vw$cong >= 112] <- "2011-2018"
You can replicate this process using tidyverse
functions but I find this to be easier and more simple to read.
Now let’s create our model function:
iterate.model.state <- function(time){
data <- vw %>%
filter(time.group == time)
m <- felm(les ~ as.factor(state) | cong, data=data)
m <- m %>%
tidy %>%
filter(str_detect(term, "as.factor"))
m$time.group <- time
return(m)
}
The major difference in this example from the previous one is the line filter(str_detect(term, "as.factor"))
. All this does is it looks for as.factor
in the term
column, since those are the coefficients we want to keep.
Now let’s iterate:
We need to do a bit of cleaning to the term
column in order to get it into a merge-able form:
Now recall how easy it was to create a US map from the previous guide. We’re going to use the built in state map data in ggplot to create a nice dataframe:
This dataframe only has full (lower case) state names so we need to fix that in the states_map
data.
states_map <- states_map %>%
left_join(data.frame(state = state.abb, region = str_to_lower(state.name)))
## Joining, by = "region"
Now we can join in the model data to the states map data. Note that this is going to create a longer dataset since we have five sets of observations per state (we have 5 time periods). This is OK! We’re going to deal with this in ggplot shortly:
## Joining, by = "state"
And now to the fun part: visualization.
ggplot(states_map.long %>% filter(time.group == "1991-2000"), aes(x=long, y=lat, group=group, fill=estimate)) +
geom_polygon(color="#cccccc", size=.5) +
coord_map()
Very nice! However, this is only one time period. How can we plot all 5? Let’s use purrr:map
again!
make.map <- function(time){
ggplot(states_map.long %>% filter(time.group == time), aes(x=long, y=lat, group=group, fill=estimate)) +
geom_polygon(color="#cccccc", size=.5) +
scale_fill_gradient(low = "white", high = "dodgerblue") +
coord_map() +
ggtitle(time) +
theme_void() +
theme(legend.position = "none")
}
maps <- map(unique(states_map.long$time.group), ~make.map(.))
The regular map
function returns a list, which is what we want in this case since each list element is a ggplot object:
We now have 5 maps within the maps
list object. Let’s plot each of them in a coherent way using the grid.arrange
function from the gridExtra
package:
And there you have it! There is some variation over time in the average effectiveness per state, something I’m not sure existing work tells us much about.