Social vulnerability indicators with R
Here is something I’ve been doing at work that I’m kinda proud of. It’s a fully reproducible spatial visualization of social vulnerability indicators by Census Tract for the West Coast. The entire project is available for download from my svi-tools repository on GitHub here.
A pretty thorough walk-through exists here.
Background
Social Vulnerability refers to the resilience of communities when confronted by external stresses on human health, such as natural or human-caused disasters or disease outbreaks.
The Centers for Disease Control and Prevention, the Environmental Protection Agency, the Federal Emergency Management Agency, NOAA, and a ton of other state and federal government agencies use social vulnerability indicators for a variety of purposes…among them:
- The CDC and various state departments of health use social vulnerability indicators to screen for communities that may be at particularly high risk of disruption from disease outbreaks or other adverse public health events.
- The EPA uses social vulnerability indicators to construct its Environmental Justice Screener, a tool meant to identify and protect vulnerable communities from environmental hazards like water quality degradation.
- NOAA uses social vulnerability indicators to identify communities that might be particularly adversely affected by commercial fisheries management policies.
Social Vulnerability Indexes are generally constructed using data related to the following themes and metrics:
- Socioeconomic status: below poverty, unemployed, per capita income, no high school diploma, no health insurance
- Household composition and disability: persons aged 65 and over, persons aged 17 and under, persons with a disability, single-parent households
- Minority status and language: non-white population, limited English proficiency
- Housing type and transportation: multi-unit structures, mobile homes, crowding, no vehicle available, persons in group quarters
Problem
Problem 1
The first problem one normally encounters when trying to replicate or construct a Social Vulnerability Index is that the sheer amount of data required is onerous in itself. There are 16 key metrics that feed into most Social Vulnerability Indices, and many of these metrics are themselves aggregations of 10-20 primary data series.
For example, the indicator ‘No High School Diploma’ is generally constructed from the Census Bureau’s American Community Survey table B15001. Most applications I have encountered actually use “percent of the population 25 and over with some high school education but no diploma” for this metric. Constructing it requires the individual data series:
- B15001_013E, males age 25-34 with some high school but no diploma
- B15001_021E, males age 35-44 with some high school but no diploma
- etc, etc,
- B15001_078E, females age 65 and up with some high school but no diploma
…for a total of 8 individual data series.
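To make the aggregation concrete, here is a minimal sketch of the arithmetic, assuming a hypothetical data frame tract.df that already holds the eight series as columns along with the 25-and-over population in a pop25 column:
#illustrative only: sum the eight 'no diploma' series and divide by the
#25-and-over population (tract.df and pop25 are hypothetical names)
no_diploma_series <- c('B15001_013E','B15001_021E','B15001_029E','B15001_037E',
                       'B15001_054E','B15001_062E','B15001_070E','B15001_078E')
tract.df$no_diploma <- rowSums(tract.df[, no_diploma_series], na.rm = TRUE)
tract.df$pct_no_diploma <- tract.df$no_diploma / tract.df$pop25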
Problem 2
The second problem one generally encounters in the construction of a Social Vulnerability Index is that getting the necessary data series from the Census Bureau’s data web mart, American FactFinder, can be quite clunky. To get the data series mentioned above and calculate the metric, “percent of the population 25 and over with some high school but no diploma,” I follow these steps:
- select “Guided Search” from the landing page
- select “information about people”, then “educational attainment” from the Topics page
- select geography type: Census Tract, state: CA, county: all counties from the Geography page
- select “No, I’m not looking for race or ethnic data” from the Race/Ethnic Groups page
- select table B15001 from the results page
What I get is a table with the educational attainment of 5 age groups broken down by gender. If I also select “Modify Table” -> “Transpose”, I get a sensible-looking table with Census Tracts along the rows and educational attainment estimates in the columns. Then I can export the results to a .csv file.
All in, it took me somewhere in the neighborhood of 5-8 minutes to get these data using American FactFinder…not terrible for one or two data series, but really annoying if you want 20 or 30 of them.
Problem 3
The biggest problem I have with pulling all 50-odd primary data series needed to construct a Social Vulnerability Index through American FactFinder is that there are so many steps to follow that it’s REALLY hard to document them all in a way that makes your particular index reproducible.
Solution
I’ve written a library of R functions to automate all the data pulls necessary to construct the 16 social vulnerability indicators above. I’ve also written wrapper functions that sequentially call all of the data functions for any year, state, county, census tract, and/or census designated place.
This wasn’t particularly intellectually taxing, but I think it represents a pretty solid contribution to social science research, as it:
- makes the compilation of community social vulnerability indices completely (and easily) reproducible.
- allows these indicators to be easily constructed for any geography (state, county, census tract, etc.) and any year.
To illustrate the extent to which I’ve simplified the acquisition of social vulnerability indicators data from the Census Bureau, here is the minimum required to get a complete social vulnerability indicators data set for all Census Tracts in California using the American Community Survey 5-year estimates for 2015.
Assuming you’ve pulled the file SVI_censustract_functions.R from my GitHub repo, and assuming you have a key for the Census Bureau’s API:
source('R/SVI_censustract_functions.R')

#paste your Census API key between the quotes
key <- ''

#call the svi census tract function to build the data
svi.df <- svi_data_censustract.fn(yr=2015, key=key)

#save the results locally so you don't have to recompile every time
saveRDS(svi.df, file='svi_censustract_2015.RDA')
I’m not gonna lie to you, the initial data compilation isn’t quick. But if you’re willing to wait 20 minutes you get a fully formed SVI data set at the Census Tract level:
head(svi.df)
name total_households state county tract total_pop
1 Census Tract 4001, Alameda County, California 1286 06 001 400100 2952
2 Census Tract 4002, Alameda County, California 832 06 001 400200 1984
3 Census Tract 4003, Alameda County, California 2489 06 001 400300 5377
4 Census Tract 4004, Alameda County, California 1801 06 001 400400 4105
5 Census Tract 4005, Alameda County, California 1624 06 001 400500 3651
6 Census Tract 4006, Alameda County, California 701 06 001 400600 1665
pov.pct unemp.rate pop25 edu25 pct_no_diploma pci no_health_insurance pop_under_65
1 0.03489160 0.05656425 2420 40 0.01652893 104852 95 2273
2 0.05796371 0.02575897 1611 24 0.01489758 76930 104 1543
3 0.08097928 0.09389810 4338 216 0.04979253 62892 213 4635
4 0.07440839 0.03503309 3346 204 0.06096832 53779 144 3585
5 0.12654746 0.06431005 2913 93 0.03192585 38442 439 3297
6 0.06602059 0.14502370 1276 100 0.07836991 37911 213 1459
pct_no_healthins pop_under_18 pop_over_65 disable_pct single_mom_pct pct.nonwhite pct_limited_eng
1 0.04179498 440 579 0.05426028 0.03810264 0.2801491 0.03810264
2 0.06740117 318 356 0.02234994 0.02524038 0.1980847 0.00000000
3 0.04595469 899 582 0.04734485 0.06588992 0.3384787 0.08879068
4 0.04016736 654 428 0.09493671 0.05330372 0.2874543 0.02332038
5 0.13315135 443 269 0.06891408 0.09913793 0.3815393 0.02278325
6 0.14599040 204 171 0.06534653 0.13266762 0.6180180 0.01283880
housing_gt_10_pct mobile_pct pct_crowded pct_no_vehicle pct_group
1 0.00000000 0 0.006550218 0.01310044 0.000000000
2 0.09479769 0 0.000000000 0.05664740 0.044354839
3 0.25242718 0 0.000000000 0.14058252 0.006323229
4 0.10903260 0 0.008017103 0.06841261 0.014372716
5 0.18063754 0 0.023612751 0.13990555 0.000000000
6 0.02432432 0 0.028378378 0.21216216 0.005405405
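And once that .RDA file is on disk, subsequent sessions can skip the 20-minute compile entirely and just reload the local copy:
#reload the locally saved copy instead of recompiling
svi.df <- readRDS('svi_censustract_2015.RDA')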
Technical Detail
As a prerequisite, you will want to check out my previous post on automating Census Bureau data acquisition with R. That post covers the basics of setting up an API key and gives a little jump-start on pulling Census Bureau data with R.
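If you want to verify that your key works before kicking off a full pull, a one-off request like the following should come back as a small JSON array (the series and geography here are arbitrary choices for the test):
#quick sanity check of the API key: pull one series for one state
library(RJSONIO)
test <- fromJSON(paste('https://api.census.gov/data/2015/acs/acs5?get=NAME,B17001_002E',
                       '&for=state:06&key=', key, sep=""))
test[[1]] #should be the header row: "NAME" "B17001_002E" "state"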
Setting up the data function library was mostly grunt work, not super glamorous. Basically, I just identified all the individual data series one needs to compile the 16 aforementioned indicators and created a function to pull each one for a given year and state. An example of one such function, which gets the portion of the population with income below the poverty line in the last 12 months, is:
#first create a function to generate the appropriate API call
library(RJSONIO)
library(rgdal)
library(ggplot2)
library(ggmap)
library(scales)
library(maptools)
library(rgeos)
library(dplyr)
library(data.table)

#the Census API endpoint and argument order changed over the years,
#so the call is assembled differently depending on the year requested
api_call.fn <- function(year, state, series, key){
  if(year %in% c(2009, 2010)){
    call <- paste('https://api.census.gov/data/', year, '/acs5?key=', key, '&get=', series,
                  ',NAME&for=tract:*&in=state:', state, sep="")
  } else if(year %in% c(2011, 2012, 2013, 2014)){
    call <- paste('https://api.census.gov/data/', year, '/acs5?get=NAME,', series,
                  '&for=tract:*&in=state:', state, '&key=', key, sep="")
  } else {
    call <- paste('https://api.census.gov/data/', year, '/acs/acs5?get=NAME,', series,
                  '&for=tract:*&in=state:', state, '&key=', key, sep="")
  }
  return(call)
}
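#for instance, the URL generated for the 2015 poverty-count series in
#California looks like this ('MY_KEY' is just a placeholder):
api_call.fn(year=2015, state='06', series='B17001_002E', key='MY_KEY')
# [1] "https://api.census.gov/data/2015/acs/acs5?get=NAME,B17001_002E&for=tract:*&in=state:06&key=MY_KEY"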
# A function to get percent of households below the poverty line from the ACS 5 year
# estimates for a particular year
#------------------------------------------------------------------------------
# population for which poverty status is determined: B17001_001E
# population with income below the poverty line in the last 12 months: B17001_002E
below_poverty.fn <- function(state, year, key){
  series1 <- 'B17001_001E'
  series2 <- 'B17001_002E'
  call1 <- api_call.fn(year=year, state=state, key=key, series=series1)
  call2 <- api_call.fn(year=year, state=state, key=key, series=series2)
  # population for whom poverty status is determined; the first row of the
  # API response is a header, so drop it
  pov.pop <- fromJSON(call1)
  pov.pop <- data.frame(rbindlist(lapply(pov.pop, function(x){
    x <- unlist(x)
    return(data.frame(name=x[1], pov.pop=x[2], state=x[3], county=x[4], tract=x[5]))
  }))) %>% filter(row_number() > 1)
  # income in the past 12 months below poverty level
  pov <- fromJSON(call2)
  pov <- data.frame(rbindlist(lapply(pov, function(x){
    x <- unlist(x)
    return(data.frame(name=x[1], pov=x[2], state=x[3], county=x[4], tract=x[5]))
  }))) %>% filter(row_number() > 1)
  # join the two series and compute the poverty rate
  pov <- tbl_df(pov) %>%
    inner_join(pov.pop, by=c('name','state','county','tract')) %>%
    mutate(pov.pct = as.numeric(as.character(pov))/as.numeric(as.character(pov.pop)))
  return(pov)
}
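Each of these per-metric functions can also be run on its own; for example, pulling just the poverty metric for California tracts:
#poverty rates for all California tracts from the 2015 5-year estimates
ca.pov <- below_poverty.fn(state='06', year=2015, key=key)
head(ca.pov[, c('name','tract','pov','pov.pop','pov.pct')])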
A slightly more involved example is the metric we talked about previously, “population age 25 and over with some high school education but no diploma”:
#--------------------------------------------------------------------------------
# education - population 25 and over with less than a high school degree
edu.fn <- function(year, state, key){
  # numerator series: some high school but no diploma, by gender and age group
  series.df <- data.frame(series=c('B15001_013E','B15001_021E',
                                   'B15001_029E','B15001_037E',
                                   'B15001_054E','B15001_062E',
                                   'B15001_070E','B15001_078E'),
                          labels=c('m_25_34_nodiploma','m_35_44_nodiploma',
                                   'm_45_64_nodiploma','m_65_plus_nodiploma',
                                   'f_25_34_nodiploma','f_35_44_nodiploma',
                                   'f_45_64_nodiploma','f_65_plus_nodiploma'))
  # denominator series: total population in each gender/age group
  counts.df <- data.frame(series=c('B15001_011E','B15001_019E','B15001_027E','B15001_035E',
                                   'B15001_052E','B15001_060E','B15001_068E','B15001_076E'),
                          labels=c('m_25_34','m_35_44','m_45_64','m_65_plus',
                                   'f_25_34','f_35_44','f_45_64','f_65_plus'))
  tmp.df <- list()
  for(i in 1:nrow(series.df)){
    call1 <- api_call.fn(year=year, state=state, key=key, series=series.df$series[i])
    call2 <- api_call.fn(year=year, state=state, key=key, series=counts.df$series[i])
    # total population in the age/gender group (drop the API header row)
    total.age <- fromJSON(call2)
    total.age <- data.frame(rbindlist(lapply(total.age, function(x){
      x <- unlist(x)
      return(data.frame(name=x[1], pop.age=x[2], state=x[3], county=x[4], tract=x[5]))
    }))) %>% filter(row_number() > 1)
    # population in the group with some high school but no diploma
    edu.age <- fromJSON(call1)
    edu.age <- data.frame(rbindlist(lapply(edu.age, function(x){
      x <- unlist(x)
      return(data.frame(name=x[1], edu.age=x[2], state=x[3], county=x[4], tract=x[5]))
    }))) %>% filter(row_number() > 1)
    z <- edu.age %>%
      inner_join(total.age, by=c('name','state','county','tract')) %>%
      mutate(age.range=as.character(counts.df$labels[i]))
    tmp.df[[i]] <- z
  }
  # sum the 8 gender/age groups and compute the share with no diploma
  edu <- data.frame(rbindlist(tmp.df)) %>%
    group_by(name, state, county, tract) %>%
    summarise(no_diploma=sum(as.numeric(as.character(edu.age)), na.rm=T),
              pop=sum(as.numeric(as.character(pop.age)), na.rm=T)) %>%
    mutate(pct_no_diploma=no_diploma/pop)
  return(edu)
}
#--------------------------------------------------------------------------------------------
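Usage mirrors the poverty function; the result carries the summed numerator and denominator along with the final share:
#no-diploma share for all California tracts in 2015
ca.edu <- edu.fn(year=2015, state='06', key=key)
head(ca.edu) #name, state, county, tract, no_diploma, pop, pct_no_diploma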
Cool Extension
In a previous post I demoed some mapping with Census Bureau shapefiles. In my new social vulnerability tools GitHub repository I’ve included all the census tract boundary and county boundary shapefiles for California. If you grab these, you should be able to reproduce the following visualizations of Census Bureau social vulnerability data rather easily:
rm(list=ls())
library(rgdal)
library(ggplot2)
library(ggmap)
library(scales)
library(maptools)
library(rgeos)
library(dplyr)
# A regional mapping exercise
svi.df <- readRDS('data/svi_censustract_2015.RDA')
tract <- readOGR(dsn = "shapefiles/gz_2010_06_140_00_500k",
layer = "gz_2010_06_140_00_500k")
tract <- fortify(tract, region="GEO_ID")
# the GEO ID for a census tract is state + county + tract...
# state must have the leading 0 (e.g. '06' for California)
# county must be 3 digits
ca.poverty <- tbl_df(svi.df) %>%
select(name,state,county,tract,pov.pct) %>%
filter(state=='06') %>%
mutate(id=paste('1400000US',state,county,tract,sep=""))
plotData <- left_join(tract, ca.poverty, by='id')
county <- readOGR(dsn = "shapefiles/gz_2010_06_060_00_500k",
layer = "gz_2010_06_060_00_500k")
county <- fortify(county, region="COUNTY")
#now I'm going to use the scales package to bin the data a little
# so the contrasts in poverty rates are a little more visible
ggplot() +
geom_polygon(data = plotData, aes(x = long, y = lat, group = group,
fill = pov.pct)) +
geom_polygon(data = county, aes(x = long, y = lat, group = group),
fill = NA, color = "black", size = 0.25) +
coord_map() +
scale_fill_distiller(palette = "Greens", labels = percent,
breaks = pretty_breaks(n = 10)) +
guides(fill = guide_legend(reverse = TRUE)) + theme_bw()
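If you want to keep a copy of the figure, ggplot2's ggsave() will write the most recent plot to disk (the file name and dimensions here are just examples):
#save the last plot; tall dimensions suit a state like California
ggsave('ca_poverty_by_tract.png', width=8, height=10, dpi=300)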
And here is one where I zoomed in on California’s North Coast and displayed the percent of the population without health insurance.
ca.ins <- tbl_df(svi.df) %>%
select(name,state,county,tract,pct_no_healthins) %>%
filter(state=='06') %>%
mutate(id=paste('1400000US',state,county,tract,sep="")) %>%
mutate(uninsured=pct_no_healthins)
plotData <- left_join(tract, ca.ins, by='id')
#now try just a zoom on some North Coast areas
# a few reference points
point.df <- data.frame(name=c('Crescent City','Fort Bragg','Eureka'),
lat=c(41.7558,39.4457,40.8021),
long=c(-124.2026,-123.8053,-124.1637))
ggplot() +
geom_polygon(data = subset(plotData,lat>=38 & long<=-121), aes(x = long, y = lat, group = group,
fill = uninsured)) +
geom_polygon(data = subset(county,lat>=38 & long<=-121), aes(x = long, y = lat, group = group),
fill = NA, color = "black", size = 0.25) +
geom_point(data=point.df,aes(x=long,y=lat),shape=18,size=3) +
geom_text(data=point.df,
aes(x=long,y=lat,label=name),hjust=-0.1) +
coord_map() +
scale_fill_distiller(palette = "Spectral", labels = percent,
breaks = pretty_breaks(n = 10)) +
guides(fill = guide_legend(reverse = TRUE)) + theme_bw()