Ever wanted to work with 1,000,000 row datasets without waiting on anything to load?
Ever wondered how to use replication data or publicly available data?
Want to know how to merge two datasets together that have no unique ID in common?
How about reading multiple CSVs, potentially hundreds, into one data frame?
Want to wrangle messy committee assignment data?
Have you wondered how to create cool maps where each district is shaded a different color based on some value?
Want to learn how to easily create coefficient plots from dozens of models all at once? (You know purrr is a thing but have no idea how to use it!)
You’ve heard of regression discontinuity but found the actual application in R too intimidating?
tidyverse framework in a practical setting.
tidycensus package to create new demographic measures for your dataset.
ggplot2 and other packages with built-in shapefiles.
lfe package. I then show examples of working with model output, extracting certain coefficient values, creating coefficient plots, and plotting predicted results.
purrr to easily run many models on the same data and extract the coefficient of interest.
ggplot2 and the rdrobust package.
This guide is meant to serve two purposes:
As someone who has spent hundreds of hours trying to find and compile the perfect congressional dataset, I’ve discovered there is a lack of good centralized information on what is out there and how to use it. I am not going to cover in great detail what is available within these datasets – just how you can use them for practical reasons.
This guide also serves as an introduction to working with “big data”, as many of the datasets in question approach 1,000,000 observations in their raw form. You couldn’t effectively do some of these operations in Excel even if you wanted to!
This series of guides assumes basic knowledge of R and the tidyverse syntax. However, I will walk through specific operations and provide links to other resources where available. You do not need to be an expert in R or the tidyverse to follow these guides. You should know the difference between a list and a vector, how to assign values to objects, and what pipes are.
I view this guide as comprehensive but not definitive. It’ll give you a broad set of intuitions and tools to work with and wrangle messy data with the ultimate goal of analysis.
Please visit my website for more of my research and tutorials. Do not hesitate to contact me via email or on Twitter with any questions, concerns, or comments!
tidyverse are particularly useful when working with messy congressional data;
From a substantive perspective, the datasets I briefly describe below include information on committee assignments, legislative effectiveness, legislator demographics, district characteristics, electoral returns, money in politics, congressional disbursements, ideological scalings, distributive spending, bill introductions, and much more.
If you want to learn more about some of the newer R packages, this guide will be an excellent resource. I use the tidyverse framework, including dplyr, readr, ggplot2, and purrr. I also use packages for modeling and visualization such as lfe and dotwhisker.
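To give a flavor of the purrr workflow mentioned above, here is a minimal sketch of fitting several models on the same data and pulling out one coefficient from each. It is only an illustration under stated assumptions: it uses base R's lm() and the built-in mtcars data rather than lfe or any congressional dataset, and the model names and specifications are hypothetical.

```r
library(purrr)

# Hypothetical model specifications; mtcars ships with base R.
formulas <- list(
  bivariate   = mpg ~ wt,
  with_hp     = mpg ~ wt + hp,
  with_hp_cyl = mpg ~ wt + hp + cyl
)

# Fit one linear model per formula on the same data.
models <- map(formulas, function(f) lm(f, data = mtcars))

# Extract the coefficient of interest (here, wt) from every model.
# map_dbl() returns a named numeric vector, one element per model.
wt_coefs <- map_dbl(models, function(m) coef(m)[["wt"]])

print(wt_coefs)
```

Because the input list is named, the output vector keeps those names, which makes it easy to line the estimates up with their specifications later on, for example when building a coefficient plot.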
Importantly, this guide focuses primarily on recent Congresses, mostly from the 1990s to today. Some of the data do go back much further in time, however.