1 Introduction: Messy data

This guide is meant to provide insight into how to address a problem that is very common when working with congressional data, but certainly not unique to it. Specifically, you’ll learn how to approach merging two (or more) datasets that do not share unique IDs. In the last section of the guide I demonstrated the best-case scenario: what to do when your datasets share the same ID. Here, we’ll be working with one nicely prepared dataset (Volden and Wiseman’s legislative effectiveness data) and two slightly more complicated datasets.

The first comes from a public benefit organization that tracks government data, but not necessarily in the optimal format for practitioners and researchers. The second is a replication data file from a paper published in a top political science journal; working with replication data is another common task for researchers.

Beyond merging these datasets together, I spend a lot of time talking about the logic behind the process and the functions and data-wrangling approaches you should think about when faced with this problem. There are many ways of going about it, but what I present below is my preferred method – which may not necessarily be the best. By the end of this section, you’ll have a dataset ready for running regressions and producing interesting visualizations.


2 Merging with no shared IDs

2.1 Preparing the workspace:

First, set your working directory to the location of your R script and load some needed packages:

library(readstata13)
library(tidyverse)

This is a common challenge in real-world data work. You have two datasets: one potentially nicely formatted with a unique ID per unit, and one collected from, say, a government source or a website that is in no shape to work with. How do you put them together? The answer is typically not obvious, and the process is much more of an art form than a science.

Here, we’ll work with a very useful dataset for congressional researchers: cosponsorship data. It is often more useful to work with the raw cosponsorship data than a cleaned version because you want to extract some new source of information from it, such as networks of cosponsorship.

Our ultimate goal will be to merge in this cosponsorship data to the Volden and Wiseman data, which is at the member-Congress level. Let’s go ahead and read in this data (for the House only):

library(readxl)

vw <- read_excel("data/CELHouse93to115.xlsx")

The column names are a bit of a mess, but we’ll deal with that later.


2.2 Cleaning the data

To start, download a zip file of the raw CSVs of cosponsorship data from GovTrack here. Extract the CSVs into your data folder.

This is another frequent obstacle in working with real-world data: the data come in multiple files that you want to put together into one. There are a lot of ways to do this, but I’m going to show you the “state of the art” approach using the purrr package.

files <- list.files(path = "data", pattern = "cosponsor", full.names = TRUE)

cosp <- map_df(files, ~read_csv(.))

The first thing we did here is read the cosponsorship CSV file names into a vector using list.files. This function does exactly what its name suggests: you give it the path you want (in this case the “data” folder) and a pattern to match within the file names – a regular expression, here any file name containing “cosponsor”. Finally, full.names = TRUE tells it to keep the entire file path, including the directory.

The next step is a straightforward use of the purrr::map_df function. If you’re familiar with base R, the map family of functions works similarly to the apply functions. Here, we’re passing it our vector of file names and telling it to apply a function to each one: in this case, read_csv. As the name suggests, map_df returns a dataframe (instead of a list), so we now have one dataframe of all the cosponsorship data.
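If you’re more comfortable with base R, the same stacking can be done with lapply plus dplyr::bind_rows. This is just a sketch for comparison, not a step in the workflow:

# Base-R analogue of the map_df call above, for comparison only:
# lapply() reads each file into a list; bind_rows() stacks that list into one dataframe.
cosp_alt <- bind_rows(lapply(files, read_csv))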

head(cosp)
## # A tibble: 6 x 9
##   bill_number name  bioguide_id state district sponsor original_cospon~
##   <chr>       <chr> <chr>       <chr>    <dbl> <lgl>   <lgl>           
## 1 hconres1-1~ Arme~ <NA>        TX          26 TRUE    FALSE           
## 2 hconres10-~ Swee~ <NA>        NY          22 TRUE    FALSE           
## 3 hconres100~ Dunc~ <NA>        TN           2 TRUE    FALSE           
## 4 hconres100~ Hill~ <NA>        TN           4 FALSE   FALSE           
## 5 hconres100~ Osbo~ <NA>        NE           3 FALSE   FALSE           
## 6 hconres100~ Wamp~ <NA>        TN           3 FALSE   FALSE           
## # ... with 2 more variables: date_signed <date>, date_withdrawn <date>

Taking a look at the data, we see there’s a bioguide_id column that is completely useless here (it’s all missing), but there are columns for the member’s name, their state, and their district. This makes our job a bit easier.

When merging data like this, I like to break down the problem into chunks:

  1. First, we need to get the member’s name into a workable format. If this were purely district level data (instead of member data), then this wouldn’t be necessary since we could just merge on district and state.
  2. Next, we need to extract the Congress number of the bill.
  3. Then, we need to create our member-Congress level dataset (one observation per member per Congress).
  4. Finally, we’ll be able to merge to the Volden and Wiseman data based on member name, state, and district.

First, let’s extract the Congress number from the bill_number variable – this is pretty easy since it’s the last three characters of each bill ID. We can use the str_sub function to do this without complicated regular expressions.

str_sub("hconres1-107", -3) # easy
## [1] "107"
cosp <- cosp %>% 
  ungroup %>% 
  mutate(cong = str_sub(bill_number, -3) %>% as.numeric) # convert it into a number along the way

I also want to drop Senate bills since we’re only going to be merging this into House data from Volden and Wiseman:

cosp <- filter(cosp, str_sub(bill_number, 1, 1) != "s")

All this does is check the first letter of the bill_number variable: if it is an ‘s’ (a Senate bill or resolution), the row is dropped from the data.
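If you want to double-check what’s left after this filter, one optional sanity check is to tally the letter prefixes of the remaining bill IDs (they should all be House bill and resolution types):

# Optional check: count bills by their letter prefix (e.g., "hr", "hres", "hconres").
cosp %>% 
  count(bill_type = str_extract(bill_number, "^[a-z]+"))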

Next, let’s clean up the name. Specifically, we want a column for first and last names which will make our lives much easier. There are a lot of ways to do this, but why reinvent the wheel? Let’s use the humaniformat package:

#install.packages("humaniformat")
library(humaniformat)

format_reverse("McCrain, Joshua M.")
## [1] "Joshua M. McCrain"
parse_names("Joshua M. McCrain")
##   salutation first_name middle_name last_name suffix         full_name
## 1       <NA>     Joshua          M.   McCrain   <NA> Joshua M. McCrain

This package has two very useful functions: format_reverse and parse_names. The first is straightforward. The second takes a formatted name and turns it into a convenient dataframe. So, our next step is to do these things to our data, which we’ll perform in a specific order.

cosp <- cosp %>% 
  ungroup %>% 
  mutate(clean_name = format_reverse(name))

Easy enough. However, doing an operation that returns a dataframe is slightly more complicated. To provide some more insight into the purrr package, I’ll use that to create a separate dataframe of cleaned names, then join it back into the main dataframe.

name.df <- map_df(unique(cosp$clean_name), ~parse_names(.))

cosp <- left_join(cosp, name.df, by = c("clean_name" = "full_name"))

What’s going on here? First, we use map_df almost exactly as above, but this time we pass it a vector of the unique names from the cosp dataframe and get back a dataframe of parsed names. Next, we join this back into the cosponsorship dataframe. The by argument is very useful: it says to join the two dataframes by clean_name on the left-hand side and full_name on the right-hand side. We now have our cleaned names and Congress numbers, and we can get to joining them to the Volden and Wiseman data.
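Before moving on, it’s worth a quick (optional) check that every name parsed and joined back cleanly; a nonzero count here would flag names humaniformat couldn’t handle:

# Optional check: rows with no parsed last name after the join.
sum(is.na(cosp$last_name))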


2.3 Putting it all together:

What we ultimately want to do is match each of the members in the cosponsorship data to their appropriate icpsr ID. The next step is not strictly necessary, but I like to break the problem down this way because it makes error checking easier.

Let’s create a new dataframe of the relevant info from the Volden and Wiseman (from now on, VW) data that we’ll be merging on.

merge.df <- vw %>% 
  select(icpsr = `ICPSR number, according to Poole and Rosenthal`, 
         cong = `Congress number`, 
         state = `Two-letter state code`, 
         district = `Congressional district number`,
         name = `Legislator name, as given in THOMAS`)

This is easier to work with. Now what I want to do is extract just the last name from the merge.df dataframe since ultimately we’ll be joining the two dataframes together by last name, congress, state and district. I’m going to do this all in one step, but it’s the same idea as above:

merge.df <- merge.df %>% 
  ungroup %>% 
  mutate(clean_name = format_reverse(name)) %>% 
  left_join(map_df(unique(.$clean_name), ~parse_names(.)), by = c("clean_name" = "full_name")) %>%
  select(icpsr:district, first_name, last_name) # drop unnecessary columns

head(merge.df)
## # A tibble: 6 x 6
##   icpsr  cong state district first_name last_name
##   <dbl> <dbl> <chr>    <dbl> <chr>      <chr>    
## 1 14000    93 SD           2 James      Abdnor   
## 2 13001    93 NY          20 Bella      Abzug    
## 3 10700    93 WA           7 Brock      Adams    
## 4 10500    93 NY           7 Joseph     Addabbo  
## 5    NA    93 OK           3 Carl       Albert   
## 6 12000    93 AR           1 Bill       Alexander

Excellent. Now we have our dataframe from which to merge. Let’s try it out:

check <- left_join(cosp, merge.df) %>%
  filter(is.na(icpsr)) %>% 
  distinct(clean_name, .keep_all=T) %>% 
  select(clean_name, last_name, cong)
## Joining, by = c("state", "district", "cong", "first_name", "last_name")

Wouldn’t you know it, we’re missing 38 matches. Not too bad in the scheme of things. Let’s figure out what’s going on.
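To see what we’re dealing with, a quick (optional) look at the unmatched rows, and at the district codes in the cosponsorship data, points at the culprit:

# Optional diagnostics: peek at the unmatched names, then check for missing districts.
head(check)
sum(is.na(cosp$district))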

Ah-ha! The first thing I noticed is that joining by district is throwing things off. In the cosponsorship data, states with only one district have the district as missing, whereas the VW data has it as a 1. This is an easy fix.

cosp$district[is.na(cosp$district)] <- 1

That fixed 11 of the missing matches. What is going on with the remainder? Looking at the merge.df data, there are some members without icpsr IDs who appear to be representatives of non-state territories. Since these members don’t get to vote, they’re typically dropped from the data, which I’ll do now:

cosp <- filter(cosp, state %in% state.abb)

Here I’m using the built-in state.abb vector of state abbreviations and dropping observations whose state code is not in that list.
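If you’re curious what this drops, you could run a check like the following before applying the filter (a sketch; the dropped codes should be non-voting territories and delegations such as DC, PR, GU, or VI):

# Run this before the filter above to see which codes would be dropped.
setdiff(unique(cosp$state), state.abb)

Let’s see how the merge looks now: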

check <- merge.df %>% 
  select(-first_name) %>% 
  left_join(cosp, .) %>%
  filter(is.na(icpsr)) %>% 
  distinct(clean_name, .keep_all=T) %>% 
  select(clean_name, last_name, cong, state, district)
## Joining, by = c("state", "district", "cong", "last_name")

Alright, down to 16. Now what? This part is just manual searching. What I like to do is open up the check dataframe in a separate window and start searching for names in the merge.df data to see what I’m missing.
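For instance, something like the following, where “Lastname” stands in for whatever unmatched name you’re chasing down (purely illustrative):

# Open the unmatched names in RStudio's viewer, then search merge.df for a
# specific last name ("Lastname" is a placeholder).
View(check)
filter(merge.df, str_detect(last_name, "Lastname"))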

Ah, it looks like we have some more district-based issues. Some single-district states have a 0 instead of a 1. That’s an easy fix. What’s going on with the others? After searching for their names in merge.df, it seems they’re simply missing from the VW data. This happens from time to time, and there’s not much we can do about it but note it and move on. So, once we fix this last district issue, we’re good to go!

cosp$district[cosp$district==0] <- 1

cosp <- merge.df %>% 
  select(-first_name) %>% 
  left_join(cosp, .) %>% 
  filter(!is.na(icpsr))
## Joining, by = c("state", "district", "cong", "last_name")

Great! We now have an icpsr ID associated with each of the observations in the cosp dataframe. Now for the easy part: creating some member-Congress level measures.


2.4 Finalizing the data:

What’s useful to get from this data? One measure that might be interesting, especially as it relates to the Volden and Wiseman idea of legislative effectiveness, is how many original cosponsors members average on each of their bills. How do we calculate this?

What we need to do is identify the original sponsor(s) of each bill, and then count how many people cosponsor their bills.

cosp <- cosp %>% 
  filter(sponsor==FALSE) %>% 
  group_by(bill_number) %>% 
  summarize(total_cosponsors = length(unique(icpsr))) %>%
  left_join(cosp, .) %>% # put it into the full data
  mutate(total_cosponsors = replace_na(total_cosponsors, 0)) # replace NAs with 0s
## Joining, by = "bill_number"

What’s going on here? All I’m doing is taking the cosponsorship data and collapsing it to one observation per bill, creating a new variable that tells us how many cosponsors that bill has (how many icpsr IDs are associated with the bill that are NOT the original sponsor). I then pipe that into a left_join to add it back to our original dataframe, and replace the NAs with 0s.
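An optional sanity check at this point is to pull out a single bill and confirm the new count lines up with its rows:

# Optional check: look at one (arbitrary) bill and compare total_cosponsors to its rows.
cosp %>% 
  filter(bill_number == first(bill_number)) %>% 
  select(bill_number, name, sponsor, total_cosponsors)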

Ok, so now let’s keep only bill-level observations with sponsor set to TRUE, group by member and Congress to get how many bills they introduced in a given Congress, and then divide total cosponsors by total bills to get the average number of cosponsors:

cosp.panel <- cosp %>% 
  ungroup %>% 
  filter(sponsor == TRUE) %>% 
  group_by(icpsr, cong) %>% 
  summarize(total_bills = length(unique(bill_number)),
            total_cosponsor = sum(total_cosponsors)) %>% 
  mutate(avg_cosponsor = total_cosponsor/total_bills) 

And there you have it: a panel with one observation per member per Congress, with new variables for how many bills (and resolutions) they introduced, how many cosponsors they received in total, and how many cosponsors they averaged per bill.


3 Merging in replication data

The final step of this part of the guide will be merging in some replication data from a published paper. Specifically, we’ll be working with the replication data from Alexander, Berry and Howell’s 2015 Journal of Politics paper on distributive spending. These data are great because they provide us with a clean, easy-to-use district-level measure of federal spending. If you’ve ever tried to construct this yourself, you’ll know it’s a nightmare.

Let’s download the district data from Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VR12G4 – it’s the final file on this page. Download it in its original format and save it to your data folder. We’ll then read it in with the readstata13 package.

library(readstata13)

abh <- read.dta13("data/ABH_full_district.dta")

These are district-by-Congress level data (one observation per district per Congress), and what we want from them are the variables ln_grants_nf and ln_grants_all, which are log non-formula grants and log all grants, respectively. This is a common measure of district-level spending – specifically, the non-formula grants exclude spending determined by formula (for instance, Social Security), the idea being that the remaining spending can be affected by a member of Congress’s ability to deliver pork.

We also want the congress, state and district variables so we can merge it with the VW data:

abh <- select(abh, year, cong = congress, district = dist, fips_state, ln_grants_nf, ln_grants_all)

Next, let’s get per-Congress averages of the two grant variables, since we ultimately want to merge with a Congress-level dataset:

abh <- abh %>% 
  ungroup %>% 
  group_by(fips_state, cong, district) %>% 
  summarize_at(vars(ln_grants_nf, ln_grants_all), mean)

One other holdup is that we don’t have the state abbreviation that VW uses, just the FIPS code. This is pretty easy to solve by loading the maps package and using its state.fips object:

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
head(state.fips)
##   fips ssa region division abb    polyname
## 1    1   1      3        6  AL     alabama
## 2    4   3      4        8  AZ     arizona
## 3    5   4      3        7  AR    arkansas
## 4    6   5      4        9  CA  california
## 5    8   6      4        8  CO    colorado
## 6    9   7      1        1  CT connecticut

Ah, one more complication: the fips_state variable in abh is stored as text with leading 0s, while the fips codes in state.fips are not. That’s another easy fix using the readr::parse_number function:

abh <- abh %>% 
  ungroup %>% 
  mutate(fips_state = parse_number(fips_state))
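As a quick illustration (in the same spirit as the str_sub demo earlier), parse_number strips the leading zero and any non-numeric characters:

parse_number("08") # returns the number 8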

Let’s join this in, first by cleaning up the state.fips data a bit to get one fips code per state:

abh <- state.fips %>% 
  select(fips_state = fips, state = abb) %>% 
  group_by(fips_state) %>% 
  distinct(state) %>% 
  left_join(abh, .)
## Joining, by = "fips_state"

And we’re good to go. We can now merge our abh replication data and our cosp.panel cosponsorship data with the Volden and Wiseman data to create a new dataset!


4 The final merge:

Our first step is to prepare the VW data by fixing some names:

vw <- vw %>% 
  rename(icpsr = `ICPSR number, according to Poole and Rosenthal`, 
         cong = `Congress number`, 
         state = `Two-letter state code`, 
         district = `Congressional district number`,
         name = `Legislator name, as given in THOMAS`)

Then we can join in the abh data:

vw <- left_join(vw, abh)
## Joining, by = c("cong", "state", "district")

And finally the cosp.panel data:

vw <- left_join(vw, cosp.panel) 
## Joining, by = c("icpsr", "cong")

And there you have it. It’s important to note that our new additions don’t cover the same time period as the full dataset. That’s fine, but it will limit the range of analyses we can run.
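If you want to see exactly which Congresses each addition covers before modeling, a quick (optional) check is:

# Optional: the range of Congresses with non-missing values for each new variable.
range(vw$cong[!is.na(vw$avg_cosponsor)])
range(vw$cong[!is.na(vw$ln_grants_nf)])

Let’s save the new data so we can use it in the next section of the guide: modeling and visualization.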

save(abh, cosp.panel, file = "data/additional_covariates.RData")

5 Additional resources

Before moving on to the next section, here are a couple additional resources that you’ll find useful when working with messy data.

5.1 Bioguide ID

Another common unique identifier for members of Congress is the Bioguide ID. In my experience, this is the second most common after the icpsr ID, and I actually find it has fewer problems than the icpsr ID. An example of a dataset that uses it is ProPublica’s House Office Expenditure Data. I won’t cover how to use it specifically, but the ideas are exactly the same as what you’ve seen here and in the previous guide.

First, download this file: Bioguide ID - ICPSR ID crosswalk and save it to the appropriate folder. Then, read it into your workspace and merge it into your dataset using the same left_join operations used previously. That’s all there is to it!
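As a rough sketch of what that looks like (the file name and column names here are assumptions, so check them against the crosswalk and the expenditure data you actually download):

# File and column names are hypothetical -- adjust to match the files you downloaded.
crosswalk <- read_csv("data/bioguide_icpsr_crosswalk.csv")

expenditures <- read_csv("data/house_office_expenditures.csv") %>% 
  left_join(crosswalk, by = "bioguide_id")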


5.2 Election returns

Finding good data on election returns can be tricky. Both the Volden and Wiseman dataset and the historical congressional data used here provide some of this.

To make things easier, you can download a full dataset of House and Senate election results from 1972-2018 here (the link takes you to an .RData file). These data are based on Gary Jacobson’s elections data, updated and provided by Carlos Algara.

These datasets are very straightforward:

  • dv is the vote share received by the Democratic candidate (so 100 - dv gives the Republican vote share).
  • dpres is the two-party Democratic presidential vote share from the most recent presidential election (so 2014 uses 2012 numbers, and 2012 uses 2012 numbers).

The only tricky part in merging these in for the House is cleaning up the district variable, which takes a form like NC13. How can we clean this? Easy!

dist <- "NC13" %>% 
  parse_number

state <- "NC13" %>% 
  str_replace_all("[:digit:]", "")

data.frame(dist, state)
##   dist state
## 1   13    NC

That’s all there is to it. The rest of the merging process is identical to that outlined above.
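Applied to a whole dataframe of election returns, that’s just a mutate; here elections and stcd are placeholder names for the data object and the combined district column, so substitute whatever the downloaded .RData file actually contains:

# 'elections' and 'stcd' are placeholder names -- adjust to the actual object and column.
elections <- elections %>% 
  mutate(district = parse_number(stcd),
         state = str_replace_all(stcd, "[:digit:]", ""))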


5.3 Tidycensus

The tidycensus package is one of the more amazing packages in R. It saves you a ton of trouble if you have to work with census data, which can be a real pain to wrangle. However, the package does take a little getting used to and a bit of preparation. Here, I’ll briefly go over one example of how to use it, but I highly recommend this overview: Basic usage of tidycensus.

First, install the package and load it into your workspace:

#install.packages("tidycensus")
library(tidycensus)

Next, you’ll need to get a census API key. This is pretty easy to do and it’s free. Go to the census’ Key Signup page and enter the requested information. You should get one quickly. Once you have it, run the census_api_key function with your key filled in:

census_api_key("YOUR KEY HERE")

Now for the somewhat tricky part: finding the variable from the census you want. There are thousands, so this takes some doing.

One trick is to load the list of variables into a dataframe so you can search through it that way. Just to note, here I’ll be working with the 2018 ACS, but you can change that to other years depending on your purpose.

vars <- load_variables(2018, "acs5", cache = TRUE) %>%
  distinct(concept, .keep_all=T)
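From there you can search the label or concept columns directly; for example (the search string is just an illustration):

# Search the variable descriptions for a phrase of interest.
filter(vars, str_detect(concept, "MEDIAN HOUSEHOLD INCOME"))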

As an example, I’m going to take the median income variable and the total population variable and load them into a dataframe. These variables are called “B19013_001” and “B01003_001”, evidently.

acs <- get_acs(geography = "congressional district",
        variables = c(medianIncome = "B19013_001",
                      totalPop = "B01003_001"),
        #state = "",
        year = 2018,
        output="wide")
## Getting data from the 2014-2018 5-year ACS

get_acs is the main function when working with tidycensus. As you can see, it takes some straightforward parameters: what kind of geography we want (here, congressional districts); which variables to load; optionally, which states to limit the query to; what year; and how to output the data. Let’s look at the acs dataframe:

head(acs)
## # A tibble: 6 x 6
##   GEOID NAME                medianIncomeE medianIncomeM totalPopE totalPopM
##   <chr> <chr>                       <dbl>         <dbl>     <dbl>     <dbl>
## 1 2701  Congressional Dist~         61735           650    672649       322
## 2 2702  Congressional Dist~         83585           846    698522       399
## 3 2703  Congressional Dist~         88852          1166    710378       517
## 4 2704  Congressional Dist~         68928           764    706259       337
## 5 2705  Congressional Dist~         61851           765    707352       449
## 6 2706  Congressional Dist~         79860           989    702583       541

This is perfect. All that’s left is a little cleaning to get a state abbreviation and district number we can merge into our previous data. The GEOID variable makes this easy, since it’s a combination of the state FIPS code we worked with earlier (the first two digits) and the district number (the last two digits).

acs <- acs %>% 
  ungroup %>% 
  mutate(fips = str_sub(GEOID, 1, 2),
         cd = str_sub(GEOID, 3, 4))

All that’s left from here is exactly the same procedure as above: fixing the state FIPS variable and the district variable in order to merge into our existing data.
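To spell that out, here’s a sketch of those final steps, following the same pattern we used with the ABH data:

# A sketch of the final cleanup: turn the FIPS and district codes into numbers,
# then attach the two-letter state abbreviation via the maps package's state.fips.
# (Note: state.fips omits Alaska and Hawaii, so those abbreviations may need filling
# in by hand, and at-large districts may come back as 0, which you'd recode to 1 as before.)
acs <- acs %>% 
  mutate(fips = parse_number(fips),
         district = parse_number(cd)) %>% 
  left_join(distinct(select(state.fips, fips, state = abb)), by = "fips")

And there you have it: you now know how to load in your own census data!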

load("data/congressional_election_data.RData")