1 Introduction

Ever wanted to work with 1,000,000 row datasets without waiting on anything to load?

Ever wondered how to use replication data or publicly available data?

Want to know how to merge two datasets together that have no unique ID in common?

How about reading multiple CSVs, potentially hundreds, into one data frame?

Need to wrangle messy committee assignment data?

Have you wondered how to create cool maps where each district is shaded a different color based on some value?

Want to learn how to easily create coefficient plots from dozens of models all at once? (You know purrr is a thing but have no idea how to use it!)

Have you heard of regression discontinuity but found the actual application in R too intimidating?


2 Outline of content

  1. Congressional Datasets - An overview of available datasets covering different aspects of Congress, from off-the-shelf data to publicly available government data to other niche sources.
  2. Working with Congressional Data - How to start working with existing data, including merging datasets together by common IDs, wrangling data into proper formats, and creating new measures such as bill introductions. This is a great introduction to working with R’s tidyverse framework in a practical setting.
  3. Messy Congressional Data - What do you do when you have two datasets that don’t share IDs? This guide first shows how to clean up these data and prepare them for merging, then how to merge them, and finally how to fix errors. It also shows how to work with publicly available data and replication files and put them into an existing dataset that is ready to use for regressions and visualization.
    • This guide also provides a crosswalk file for working with Bioguide IDs or ICPSR IDs, as well as information on how to use the tidycensus package to create new demographic measures for your dataset.
    • A download link is provided for House and Senate election results at the district and state levels, respectively.
  4. Descriptive Statistics and Visualizations - So you finally have your dataset ready to go, now what? Here I cover how to produce practical summary statistics, including tables that can be exported into LaTeX, as well as more complex examples of conditional summary statistics. I also show examples of simple and complex visualizations that are ready to be put into academic papers or blog posts.
    • Also covered in this guide is plotting data on maps using ggplot2 and other packages with built-in shapefiles.
  5. Working with Models - You’re ready to run some regressions. This guide shows the basics behind working with linear models including those with high-dimensional fixed effects using the lfe package. I then show examples of working with model output, extracting certain coefficient values, creating coefficient plots, and plotting predicted results.
    • There are also examples of easily creating interaction plots and using purrr to easily run many models on the same data and extract the coefficient of interest.
    • Ever wanted to plot model coefficients on a map?
  6. Regression Discontinuity - Getting started with regression discontinuity is a bit intimidating. Here, I show it’s actually quite easy using ggplot2 and the rdrobust package.
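
To give a flavor of the "no shared ID" problem from item 3, here is a minimal sketch, not the approach the guide itself necessarily takes. The two toy tables, their column names, and the name-plus-state key are all hypothetical and invented for illustration; real congressional data will need more careful cleaning.

```r
library(dplyr)

# Hypothetical toy data: the two tables share no ID column.
votes <- tibble(
  member  = c("SMITH, Jane", "Doe, John ", "LEE, Ana"),
  state   = c("TX", "OH", "CA"),
  yea_pct = c(0.91, 0.45, 0.78)
)

bills <- tibble(
  last_name = c("smith", "doe", "nguyen"),
  st        = c("TX", "OH", "WA"),
  n_bills   = c(12L, 7L, 3L)
)

# Build a comparable key: trimmed, lower-case last name (assumed to be
# everything before the comma in `member`).
votes_keyed <- votes %>%
  mutate(last_name = tolower(trimws(sub(",.*$", "", member))))

# Join on the constructed name key plus state.
merged <- votes_keyed %>%
  left_join(bills, by = c("last_name", "state" = "st"))

# Rows that found no partner are the ones to inspect and fix by hand.
unmatched <- votes_keyed %>%
  anti_join(bills, by = c("last_name", "state" = "st"))
```

Checking `unmatched` after every join is a cheap way to see which records still need manual correction before analysis.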

3 Overview

This guide is meant to serve two purposes:

  1. An overview of available data and R packages for researchers, instructors, practitioners, and policymakers interested in Congress; and
  2. A series of R tutorials centered on using these data, including joining/merging, cleaning, visualization, and modeling.

As someone who has spent hundreds of hours trying to find and compile the perfect congressional dataset, I’ve discovered there is a lack of good centralized information on what is out there and how to use it. I am not going to cover in great detail what is available within these datasets, only how you can use them in practice.

This guide also serves as an introduction to working with “big data”, as many of the datasets in question approach 1,000,000 observations in their raw form. You couldn’t effectively do some of these operations in Excel even if you wanted to!

This series of guides assumes basic knowledge of R and the tidyverse syntax. However, I will walk through specific operations and provide links to other resources where available. You do not need to be an expert in R or the tidyverse to follow these guides. You should know the difference between a list and a vector, how to assign values to objects, and what pipes are.
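
As a quick self-check on those prerequisites, the following sketch shows assignment, a vector, a list, and a pipe side by side. All object names and values are made up for illustration.

```r
library(dplyr)

# Assignment: store a value under a name.
congress <- 117

# A vector holds elements of a single type.
chambers <- c("House", "Senate")

# A list can mix types and hold other objects, including vectors.
member <- list(name = "Jane Smith", terms = c(115, 116, 117), senator = FALSE)

# The pipe passes the left-hand result as the first argument on the right,
# so this is the same as length(member$terms).
n_terms <- member$terms %>% length()
```

If each of these lines reads naturally to you, you should be able to follow the rest of the guides.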

I view this guide as comprehensive but not definitive. It’ll give you a broad set of intuitions and tools to work with and wrangle messy data with the ultimate goal of analysis.

Please visit my website for more of my research and tutorials. Do not hesitate to contact me via email or on Twitter with any questions, concerns, or comments!


4 What you will learn

  • Which datasets are available in easy-to-use formats;
  • Good replication datasets from published work;
  • Data available from non-profit sources such as OpenSecrets or ProPublica;
  • Unique IDs used in congressional data;
  • How to merge datasets together with or without common IDs;
  • How to create new bespoke measures using existing data;
  • How R and the tidyverse are particularly useful when working with messy congressional data;
  • Which R packages provide pre-built data and save you from reinventing the proverbial wheel (such as shapefiles and census fields).

From a substantive perspective, the datasets I briefly describe below include information on committee assignments, legislative effectiveness, legislator demographics, district characteristics, electoral returns, money in politics, congressional disbursements, ideological scalings, distributive spending, bill introductions, and much more.

If you want to learn more about some of the newer R packages, this guide will be an excellent resource. I use the tidyverse framework including dplyr, readr, ggplot2, and purrr. I also use packages for modeling and visualization such as lfe and dotwhisker.
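
For a taste of what purrr makes possible, here is a rough sketch of fitting several models in one pass and pulling out a single coefficient from each. It is an illustration under stated assumptions rather than the guide's own code: it uses R's built-in `mtcars` data as a stand-in for congressional data and base `lm()` as a stand-in for a fixed-effects model from lfe, and the choice of outcomes and predictor is arbitrary.

```r
library(purrr)

# Stand-in outcomes; with real data these might be different legislative measures.
outcomes <- set_names(c("mpg", "hp", "qsec"))

# Fit one model per outcome, each with the same predictor (wt).
models <- map(outcomes, ~ lm(reformulate("wt", response = .x), data = mtcars))

# Extract the coefficient of interest from every model as a named numeric vector.
wt_coefs <- map_dbl(models, ~ coef(.x)[["wt"]])
```

Because `outcomes` is a named vector, the names carry through to `wt_coefs`, which makes the results easy to label in a table or coefficient plot.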

Importantly, the focus of this guide will primarily be on recent Congresses, mostly from the 1990s to today. Some of the data do go back much further in time, however.