Automating census data pulls with r

I had a pretty cool little quantitative micro-targeting of policy issues model I wanted to show you guys today. But I had to get this Census Bureau API-R relationship smoothed out for a work thing and I’m kinda thinking it will have more universal appeal. So I’m posting it first.

Here is the gist: The Census Bureau collects lots of really cool data: demographic, socio-economic, social structure type stuff:

educational attainment by state, congressional district, city, census tract, ext
percent of population speaking a language other than English at home…by state, census tract, school district, whatever
number of homeowners versus renters by various geographies

I know everybody is all beside themselves about using Twitter/Facebook/whatever for some micro-analysis. But we shouldn’t forget that for a lot of applications (political issue tracking, environmental justice, economic development) we can get a lot of milage out of the stuff the Census Bureau already collects.

The rub is that the Census Bureau’s data organization is kinda poor and their web GUIs for data retrieval are really confusing and hard to use.

Most of the stuff in this post will rely on API calls rather than static data files…but there are a few .csv files you may want and the full annotated R code is also available here:

Aaron’s R Spatial GitHub Repo.

Pulling Census Data in R

Have you ever wanted to know how many people in your city speak Taglog at home? Or how many black people live in Vermont? Or what percent of the people in Jackson, Mississippi have a college degree? Then the American Fact Finder web GUI is for you.

But if you’ve ever wondered how the age structure of Franklin County, TN has changed over the last 7 years (perhaps as job opportunities in rural areas continue to decline and young people increasingly migrate to urban areas) AND you’d like to compare that change to changes in other rural geographies in the U.S. AND you’d like your analysis to be reproducible AND possibly applicable to other similar research questions…then you owe it to yourself to get friendly with the Census Bureau’s API.

Here’s a great resources that I borrowed from pretty heavily

Prerequisites

There are two crucial prerequisites for this section:

An API key from the Census Bureau…get one here
The R package RJSONIO.

A quick word on item 2: there is an R package called acs or something like that which is designed to work with data from the Census Bureau’s American Communities Survey. Try it out if you want. I monkeyed with it but found it was overkill. The fromJSON() function in the RJSONIO library is pretty much all I need…or want for that matter

An Example

The sample API call on the Census Bureau’s Developers page to get population by state from the 1 year estimates of the 2015 American Communities Survey is:

api.census.gov/data/2015/acs1?get=NAME,B01001_001E&for=state:*&key=…

In the code below I pass this API call to the Census Bureau, use the fromJSON() function to collect the JSON results in an R list, and clean up the list

library(RJSONIO)
key <- mykey
resURL <- paste('http://api.census.gov/data/2015/acs1?get=NAME,B01001_001E&for=state:*&key=',
+                 key,sep="")
ljson<-fromJSON(resURL)
head(ljson)
[[1]]
[1] "NAME"        "B01001_001E" "state"      

[[2]]
[1] "Alabama" "4858979" "01"     

[[3]]
[1] "Alaska" "738432" "02"    

[[4]]
[1] "Arizona" "6828065" "04"     

[[5]]
[1] "Arkansas" "2978204"  "05"      

[[6]]
[1] "California" "39144818"   "06"  

I can drop this into a dataframe pretty easily with:

ljson<-ljson[2:length(ljson)]
pop <- data.frame(statename=unlist(lapply(ljson,function(x){x[1]})),
                  totalpop=unlist(lapply(ljson,function(x){x[2]})),
                  state=unlist(lapply(ljson,function(x){x[3]})))
head(pop)
   statename totalpop state
1    Alabama  4858979    01
2     Alaska   738432    02
3    Arizona  6828065    04
4   Arkansas  2978204    05
5 California 39144818    06
6   Colorado  5456574    08

Another Example

I like examples. Especially when you are coding, I think it’s hyper-useful to have a stock of examples. It ups the probability that, for whatever specific thing you want to do, you’ll just have to alter a couple lines of existing code.

Here’s what’s going on in this example:

I used the API call to pull total Hispanic population by congressional district for the U.S.
used a separate API call to pull total population by congressional district for the U.S.
combined those 2 to create the % Hispanic population in each congressional district
merged that with results of the 2016 Congressional Elections

#set up the API call for Hispanic Population by Congressional District
resURL <- paste("http://api.census.gov/data/2015/acs1?get=NAME,B01001I_001E&for=congressional+district:*&key=",
                key,sep="")
latinos <- fromJSON(resURL)
latinos <- latinos[2:length(latinos)]

#Set up the API call for total population
resURL <- paste("http://api.census.gov/data/2015/acs1?get=NAME,B01001_001E&for=congressional+district:*&key=",
                key,sep="")
pop <- fromJSON(resURL)
pop <- pop[2:length(pop)]
pop.L <- data.frame(name=unlist(lapply(pop,function(x){unlist(x)[1]})),
                    pop=unlist(lapply(pop,function(x){unlist(x)[2]})),
                    state=unlist(lapply(pop,function(x){unlist(x)[3]})),
                    cd=unlist(lapply(pop,function(x){unlist(x)[4]})))

latinos <- data.frame(name=unlist(lapply(latinos,function(x){unlist(x)[1]})),
                      latino.pop=unlist(lapply(latinos,function(x){unlist(x)[2]})))

pop.L <- pop.L %>% inner_join(latinos,by=c('name')) %>%
            mutate(pct_latino=as.numeric(as.character(latino.pop))/
                     as.numeric(as.character(pop)))

#get the 2016 Congressional Election results
house_election_2016 <- read.csv('data/congressionalelections2016.csv')
#get the numeric codes for each state
state.codes <- read.csv('data/state_fips_codes.csv')
names(state.codes) <- c('statename','state.num','state')

votes <- house_election_2016 %>% 
          inner_join(state.codes,by=c('state'))

#fix pop.L congressional districts
pop.L$cd <- as.numeric(as.character(pop.L$cd))
pop.L$state <- as.numeric(as.character(pop.L$state))

votes <- votes %>% select(state.num,cd,winning_party)
names(votes) <- c('state','cd','party')

votes <- votes %>% inner_join(pop.L,by=c('state','cd'))

#order it by latino population and see if there are any notable 'R's
votes %>% arrange(-pct_latino)

   state cd party                                                     name    pop latino.pop pct_latino
1      6 21     R   Congressional District 21 (114th Congress), California 716371     535377 0.74734600
2     48 23     R        Congressional District 23 (114th Congress), Texas 747732     521424 0.69734076
3     12 26     R      Congressional District 26 (114th Congress), Florida 776959     541186 0.69654383
4      6 10     R   Congressional District 10 (114th Congress), California 739784     316229 0.42746126
5      6 25     R   Congressional District 25 (114th Congress), California 720316     268191 0.37232409
6      6 24     D   Congressional District 24 (114th Congress), California 736757     264994 0.35967626
7     32  4     D        Congressional District 4 (114th Congress), Nevada 722406     201492 0.27891795
8      6 49     R   Congressional District 49 (114th Congress), California 735828     189640 0.25772327
9      8  3     R      Congressional District 3 (114th Congress), Colorado 737812     181423 0.24589326
10     4  1     D       Congressional District 1 (114th Congress), Arizona 759663     173716 0.22867508
11    17 10     D     Congressional District 10 (114th Congress), Illinois 714251     160421 0.22460032
12    12  7     D       Congressional District 7 (114th Congress), Florida 738367     157720 0.21360651
13     8  6     R      Congressional District 6 (114th Congress), Colorado 796156     163880 0.20583906
14    32  3     D        Congressional District 3 (114th Congress), Nevada 758676     125431 0.16532881
15     6  7     D    Congressional District 7 (114th Congress), California 739069     117592 0.15910828
16     6 52     D   Congressional District 52 (114th Congress), California 755498     118641 0.15703682
17    12 18     R      Congressional District 18 (114th Congress), Florida 748028     111507 0.14906795
18    36  1     R      Congressional District 1 (114th Congress), New York 726589     107873 0.14846495
19    51 10     R     Congressional District 10 (114th Congress), Virginia 807670     105618 0.13076875
20    34  5     D    Congressional District 5 (114th Congress), New Jersey 743424      94463 0.12706477
21    20  3     R        Congressional District 3 (114th Congress), Kansas 755396      88598 0.11728683
22    31  2     R      Congressional District 2 (114th Congress), Nebraska 652870      71614 0.10969106
23    36  3     D      Congressional District 3 (114th Congress), New York 721764      74127 0.10270255
24    12 13     D      Congressional District 13 (114th Congress), Florida 716429      72211 0.10079296
25    36 19     R     Congressional District 19 (114th Congress), New York 705044      51710 0.07334294
26    19  3     R          Congressional District 3 (114th Congress), Iowa 812464      54484 0.06706020
27    27  2     R     Congressional District 2 (114th Congress), Minnesota 695029      40193 0.05782924
28    55  8     R     Congressional District 8 (114th Congress), Wisconsin 727491      36678 0.05041712
29    42  8     R  Congressional District 8 (114th Congress), Pennsylvania 707083      35083 0.04961652
30    26  8     R      Congressional District 8 (114th Congress), Michigan 728781      35125 0.04819692
31    36 24     R     Congressional District 24 (114th Congress), New York 708959      31211 0.04402370
32    27  3     R     Congressional District 3 (114th Congress), Minnesota 700121      30328 0.04331823
33    26  7     R      Congressional District 7 (114th Congress), Michigan 697627      29680 0.04254422
34    19  1     R          Congressional District 1 (114th Congress), Iowa 771888      28650 0.03711678
35    36 23     R     Congressional District 23 (114th Congress), New York 708372      26115 0.03686622
36    36 22     R     Congressional District 22 (114th Congress), New York 711391      26099 0.03668728
37    33  1     D Congressional District 1 (114th Congress), New Hampshire 671640      24624 0.03666250
38    18  9     R       Congressional District 9 (114th Congress), Indiana 741982      24680 0.03326226
39    36 21     R     Congressional District 21 (114th Congress), New York 710842      23524 0.03309315
40    17 12     R     Congressional District 12 (114th Congress), Illinois 699369      23119 0.03305694
41    26  1     R      Congressional District 1 (114th Congress), Michigan 700136      13902 0.01985614
42    27  8     D     Congressional District 8 (114th Congress), Minnesota 663670      11133 0.01677490
43    23  2     R         Congressional District 2 (114th Congress), Maine 657319       8938 0.01359766
> 

I didn’t really do this for any reason other than it sounded fun…but it does seem pretty interesting that the top congressional districts ranked by latino population are represented by Republicans…Although it deserves mention that I didn’t do any voting age population filters or anything like that here. It’s pretty off-the-cuff. But interesting nonetheless.

The Shitty Stuff

After toggling my way through the Census Bureau’s data GUI a few times to try and find basic shit like population, educational attainment, etc., I loved working with the API. However, the drawback to automating data retrieval with the API is that you have to know what series you are looking for. The Census Bureau does not make this easy.

Here is an example of where the API is still cool…but maybe not SUPER cool:

Suppose I want population by age and sex for each state in the U.S. and I start with the American Fact Finder. I can use either the ‘guided search’ or ‘advanced search’ to work my way through several filtering screens and arrive at an html table that will have a row/column (depending on how I decide to format the web query output) for each state and a row/column for each of like 100 variables (total population, total male population, males ages 5 and under, males ages 5-17, males ages 18-19,….total females, females ages 5 and under, etc., etc.). I can choose to download this table into a .csv format.

Alternatively, I can try to save myself several points and clicks and get the same data using the API. Here’s the rub: when I used the web tool I discovered that this particular table was Table B01001. THERE IS NO API CALL FOR TABLE B01001. There is an API call for:

B01001_002E - total male population
B01001_007E - total males ages 18 and 19
B01001_026E - total female population
B01001_040E - females age 50-54

Basically, using the API requires you to not only know the precise table you want but also the precise data series within that table.

Here is how I got 2012, 2014, and 2015 population by age and sex for each congressional district disregarding ages under 18 and over 55:

series.males <- c('001E','002E','007E','008E','009E','010E','011E','012E','013E','014E','015E','016E')
series.females <- c('026E','031E','032E','033E','034E','035E','036E','037E','038E','039E','040E')
series <- c(series.males,series.females)

series <- paste('B01001_',series,sep="")
series.names<- c('total pop','total male','m18_19','m20','m21','m22_24','m25_29','m30_34',
                 'm35_39','m40_44','m45_49','m50_54','total female','f18_19','f20','f21','f22_24',
                 'f25_29','f30_34',
                 'f35_39','f40_44','f45_49','f50_54')


pop.fn <- function(i,yr){
resURL <- paste('http://api.census.gov/data/',yr,
                '/acs1?get=NAME,',
                series[i],'&for=congressional+district:*&key=',key,sep="")
ljson <- fromJSON(resURL)
ljson <- ljson[2:length(ljson)]
tmp <- data.frame(unlist(lapply(ljson,function(x)x[1])),
                  unlist(lapply(ljson,function(x)x[2])),
                  unlist(lapply(ljson,function(x)x[3])),
                  unlist(lapply(ljson,function(x)x[4])),
                  series.names[i])
names(tmp) <- c('name','variable','state','congressional district','series_name')
return(tmp)
}

pop.df2015 <- data.frame(rbindlist(lapply(c(1:length(series)),pop.fn,yr=2015)))
pop.df2015$source <- 'ACS 2015 1 yr'
pop.df2014 <- data.frame(rbindlist(lapply(c(1:length(series)),pop.fn,yr=2014)))
pop.df2014$source <- 'ACS 2014 1 yr'
pop.df2012 <- data.frame(rbindlist(lapply(c(1:length(series)),pop.fn,yr=2012)))
pop.df2012$source <- 'ACS 2012 1 yr'

I don’t have much wisdom to offer on how to efficiently search for the precise series you are looking for (e.g. B01001_032E). I will tell you what I have found to be wildly inefficient: opening the variables list html table and doing a command+f search for some keyword. There are over 62,000 data series in the summary tables and another ~60,000 in the subject tables and my chrome browser hangs up for minutes trying to search the html table.

I haven’t totally flushed this out yet but I was able to use R’s XML package to read the variable list html table into R.

##################################################################
#Try to parse out the variables list so I can do some manner
# of informed search for searies I'm looking for:
library(XML)
theurl <- "http://api.census.gov/data/2015/acs1/variables.html"
tables <- readHTMLTable(theurl)

#################################################################

From here there is an object called Concept. This field has keywords that can help you narrow your search for the right series. At this point I’m not entirely sure how to access this object…but I do know that Concept has keywords like ‘Educational Attainment’ which can, at least in theory, help you locate the stuff you want.

A Mapping Example

Now, I want to combine some of the stuff I showed off in my last post. Specifically, I’m going to use the API to pull educational attainment for a chunk of western states at the geographic unit of congressional districts and crank out a map.

################################################################
################################################################
################################################################
# A quick mapping example

library(rgdal)
library(ggplot2)
library(ggmap)
library(scales)
library(maptools)
library(rgeos)
library(dplyr)

#use the function I wrote above to pull total population and population with a 
# bachelors degree by congressional district from Census Bureau

edu.fn <- function(yr){
  resURL <- paste('http://api.census.gov/data/',yr,
                  '/acs1/subject?get=NAME,S1501_C01_006E&for=congressional+district:*&',
                  'key=f5a32f694a14b28acf7301f4972eaab8551eafda',sep="")
  ljson <- fromJSON(resURL)
  name <- unlist(lapply(ljson[2:length(ljson)],function(x)x[1]))
  pop25 <- unlist(lapply(ljson[2:length(ljson)],function(x)x[2]))
  state <- unlist(lapply(ljson[2:length(ljson)],function(x)x[3]))
  congressional_district <- unlist(lapply(ljson[2:length(ljson)],function(x)x[4]))
  df <- data.frame(name=name,pop25=value,state=state,congressional_district=congressional_district,
                   source=paste('ACS_1yr_',yr,sep="")) 
  
  #add in the number of people 25 and over with a bachelor's degree
  resURL <- paste('http://api.census.gov/data/',yr,
                  '/acs1/subject?get=NAME,S1501_C01_012E&for=congressional+district:*&',
                  'key=f5a32f694a14b28acf7301f4972eaab8551eafda',sep="")
  ljson <- fromJSON(resURL)
  pop25_bachelors <- unlist(lapply(ljson[2:length(ljson)],function(x)x[2]))
  
  df$pop25_bachelors <- pop25_bachelors
  return(df)
}  

#get educational attainment from the 2015 ACS
edu2015 <- edu.fn(yr=2015)

#read the shapefile with cartographic boundaries for congressional districts
#let's see if we can map some shit by congressional district
congress <- readOGR(dsn = "data/cb_2015_us_cd114_500k", 
                    layer = "cb_2015_us_cd114_500k")
#pull out state, CD, and AFFGEOID so we can merge with the education stuff
df.tmp <- congress@data[,c('STATEFP','CD114FP','AFFGEOID')]
df.tmp$state_cd <- paste(df.tmp$STATEFP,df.tmp$CD114FP,sep="")

congress <- fortify(congress, region="AFFGEOID")

#keep edu from some western states (CA, AZ, NV, UT)
edu.tmp <- edu2015 %>% mutate(state_cd=paste(state,congressional_district,sep="")) %>% 
            inner_join(df.tmp,by=c('state_cd')) %>%
            filter(state %in% c('06','04','49','32')) %>%
            rename(id=AFFGEOID) %>%
            mutate(pct_college=as.numeric(as.character(pop25_bachelors))/
                     as.numeric(as.character(pop25))) 
         
plotData <- inner_join(congress,edu.tmp)

#trim the congress data frame
ids <- unique(edu.tmp$id)
congress <- congress %>% filter(congress$id %in% ids)

ggplot() +
  geom_polygon(data = plotData, aes(x = long, y = lat, group = group,
                                    fill = 1-pct_college)) +
  geom_polygon(data = congress, aes(x = long, y = lat, group = group),
               fill = NA, color = "black", size = 0.25) +
  coord_map() +
  scale_fill_distiller(palette = "YlOrRd", labels = percent,
                       breaks = pretty_breaks(n = 15)) +
  guides(fill = guide_legend(reverse = FALSE)) + theme_bw() 

education plot here

Ok, so the map is a little campy and maybe not super informative without some city-type context. It’s also probably worth noting that here I plotted 1 - percent of age 25 and over population wiht a Bachelor’s Degree. I wanted to display the education ‘hotspots’ in the darkest shades but didn’t want to spend all day monkeying with the palette brewer…so the darker reds are the congressional districts where relatively more of the population holds Bachelor’s Degrees.

I’m pretty sure there is some really cool quantitive policy I could do with this ability to quickly pull data Census data at the Congressional District level with the API and map it in R. Hopefully, I’ll be able to illustrate some of the analysis I have in mind in the next few blog posts.

I’ll try one more with a static google map basemap just to see what it looks like:

One last map…I promise

This one comes directly from Kevin Johnson’s blog post on clipping static Google maps to Census Bureau boundaries to create basemaps. Here I’m reverting back to the Census Tract boundaries and the percent of population with health insurance for California that I used in my last post. I’ll work on amending this to work with the Congressional District Boundaries later today.

Note that I’m using a pretty tight zoom around Fresno California. That’s because the larger the extent of the map, the less you will see of the features on the underlying basemap….and if the map is so big that you can’t see the word Fresno it doesn’t really make a ton of sense to add a context-rich basemap at all.

################################################################
################################################################
################################################################
#try it with a base map

#load the raster library
library(raster)

#go back to California Census Tract data because the map 
tract <- readOGR(dsn = "data/gz_2010_06_140_00_500k", 
                 layer = "gz_2010_06_140_00_500k")
#load back in the insurance data we pulled
ins <- read.csv("data/CA_insured_tract.csv")
ins$id <- as.character(ins$GEO.id)
ins$percent <- ins$Insured18_64/ins$Pop18_64

#get a basemap
map <- get_map("Fresno", zoom = 11, maptype = "roadmap")
p <- ggmap(map)
p

#establish a bounding box around our static google map
box <- as(extent(as.numeric(attr(map, 'bb'))[c(2,4,1,3)] + 
                   c(.001,-.001,.001,-.001)), "SpatialPolygons")
proj4string(box) <- CRS(summary(tract)[[4]])
tractSub <- gIntersection(tract, box, byid = TRUE, 
                          id = as.character(tract$GEO_ID))
tractSub <- fortify(tractSub, region = "GEO_ID")
plotData <- left_join(tractSub, ins, by = "id")

ggmap(map) +
  geom_polygon(data = plotData, aes(x = long, y = lat, group = group, 
                                    fill = percent), colour = NA, alpha = 0.5) +
  scale_fill_distiller(palette = "YlOrRd", breaks = pretty_breaks(n = 10), 
                       labels = percent) +
  labs(fill = "") +
  theme_nothing(legend = TRUE) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = 
                               list(alpha = 1)))

heat map

Written on February 17, 2017