First thing I’m gonna do here is plug a GitHub repo I’ve been working on called R-HD-Causal-Inference. It’s got a lot of good content on Josh Angrist-flavored causal inference modeling in R.

This replication is based on Card and Krueger’s 1994 Minimum Wage Study. The full citation if you are interested is:

Card, David, and Alan B. Krueger. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review, Vol. 84, No. 4 (Sep. 1994), pp. 772-793.

It’s a really interesting natural experiment paper examining the impact of a change in the minimum wage on fast food employment in New Jersey and Pennsylvania. Here’s what happened: in 1992 New Jersey raised its minimum wage from $4.25 an hour to $5.05 an hour. Pennsylvania did not. David Card and Alan Krueger surveyed roughly 410 fast food restaurants in New Jersey and Eastern Pennsylvania in Feb. 1992 (before the minimum wage hike took effect) and again in Nov./Dec. 1992 (after the wage hike took effect).

Prereqs and Stuff

Here are the R packages I’m using. I know a lot of people like to load these in the code chunks that use them to make it clear to the reader where certain packages come into play. I prefer to give you the full list up front so you can install everything you need.

library(ggplot2)
library(tidyverse)
library(dplyr)
library(plm)
library(haven)
library(kableExtra)
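
ggplot2 and dplyr ship with the tidyverse, so if you’re missing anything, a one-time install along these lines should cover it:

install.packages(c("tidyverse", "plm", "haven", "kableExtra"))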

Causal Inference Primer

Soooper short blurb on causal inference here. If you’re already down with the whole causal inference party, skip this section. For more cool stuff on the general problem of establishing causal links with observational data, check out two books by Joshua Angrist and Jörn-Steffen Pischke:

  1. Mastering Metrics
  2. Mostly Harmless Econometrics

Naive Hypothesis (\(H^N\)): raising wages will cause employment to fall because the quantity of labor demanded is inversely related to the price of labor (the labor demand curve slopes downward).

Ok, so here’s the basic pickle with using observational data to try and estimate the impact of wages on employment:

  1. Observed employment and wages are the result of equilibration of labor supply and labor demand. If labor supply is exogenously constrained by non-price factors, it is totally plausible that we would see wages rise in order to attract labor. In this case, the causal link will be the exact opposite of \(H^N\).

We can potentially circumvent this problem by looking at discrete actions such as a change in the minimum wage. Changing the minimum wage is an exogenous shock that changes the price of labor. If we can be reasonably confident that this shock was not perfectly anticipated, we can be pretty sure that whatever employment effects we observe are influenced by the price change…and not the other way around.

So looking at data from a state like NJ that changed their minimum wage in 1992 could be a neat way to get around the problems of #1 and generate some insights about \(H^N\). But this leads to another problem:

  2. The problem with looking at data JUST in NJ before and after a minimum wage change is that we don’t really know what would have happened in NJ in the absence of the policy change. We don’t have a credible counterfactual.

So this is where Pennsylvania comes into play. In a medical study one would measure the efficacy of a pharmaceutical by taking a pool of affected people, giving some affected people the drug, and giving other affected people a placebo. The assumption here is that if the two groups (drug v. placebo) are randomly selected then the outcomes of the placebo group are what we might reasonably expect to observe in the “drug” group if we had not given them the drug. Since we either give each person the drug or we don’t, we can’t observe the same person in the post-treatment period under both conditions (drug AND placebo).

Card and Krueger include fast food restaurants in PA in their study in order to create the credible counterfactual. The assumption is that regional characteristics will be similar enough between NJ and neighboring Eastern PA that these fast food restaurants will be similar in all important respects to those in NJ. And since PA did not change its minimum wage in 1992, employment in fast food restaurants in Eastern PA is what we would have expected to observe in fast food restaurants in NJ in the absence of a change in the minimum wage.

In short, the Natural Experiment here is that we have fast food restaurants located very close to one another (close enough to believe they are relatively homogeneous units) and some of these (the ones in NJ) experienced an abrupt change in the price of labor while others (the ones in Eastern PA) did not.
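
In difference-in-differences terms (my notation, not the paper’s), the quantity we’re after is

\[
\hat{\delta}_{DD} = \left(\bar{E}^{NJ}_{after} - \bar{E}^{NJ}_{before}\right) - \left(\bar{E}^{PA}_{after} - \bar{E}^{PA}_{before}\right),
\]

where \(\bar{E}\) is mean full-time-equivalent (FTE) employment per store in the indicated state and period. The identifying assumption is that, absent the minimum wage change, employment in NJ and Eastern PA fast food restaurants would have moved in parallel.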

Card and Krueger Data

The data from this paper are available on David Card’s UC Berkeley faculty page. Here I download these data, unzip them, and save them to a .csv file inside my project.

tempfile_path <- tempfile()
download.file("http://davidcard.berkeley.edu/data_sets/njmin.zip", destfile = tempfile_path)
tempdir_path <- tempdir()
unzip(tempfile_path, exdir = tempdir_path)
codebook <- read_lines(file = paste0(tempdir_path, "/codebook"))

# the variable names live in lines 8-59 of the codebook; drop the rows that
# aren't variable definitions, keep the name field (first 13 characters),
# and standardize the names
variable_names <- codebook %>%
  `[`(8:59) %>%
  `[`(-c(5, 6, 13, 14, 32, 33)) %>%
  str_sub(1, 13) %>%
  str_squish() %>%
  str_to_lower()

dataset <- read_table2(paste0(tempdir_path, "/public.dat"),
                       col_names = FALSE)

# drop the stray 47th column created by the parser, attach the codebook
# names, coerce everything to numeric, and keep the store id (sheet) as character
dataset <- dataset %>%
  select(-X47) %>%
  `colnames<-`(., variable_names) %>%
  mutate_all(as.numeric) %>%
  mutate(sheet = as.character(sheet))


write.csv(dataset,file="fast-food-data.csv")

Card and Krueger Table 3

The first bite of red meat from the Card and Krueger paper is in Table 3. I have reprinted the PA, NJ, and NJ-PA columns of these results as they appear in the paper below. I don’t really want to fudge with trying to format an html table with a standard error in () below a mean, so I’m making separate columns for the mean values and standard errors. Forgive me.

table3 <- data.frame(Variable=c("FTE employment before, all observations","FTE employment after, all observations",
                      "Change in mean FTE employment","Change in mean FTE employment, balanced sample of stores",
                      "Change in mean FTE employment setting FTE at 0 for closed stores"),
           PA_Mean=c(23.33,21.17,-2.16,-2.28,-2.28),
           PA_SE=c(1.35,0.94,1.25,1.25,1.25),
           NJ_Mean=c(20.44,21.03,0.59,0.47,0.23),
           NJ_SE=c(0.51,0.52,0.54,0.48,0.49),
           NJ_PA_mean=c(-2.89,-0.14,2.76,2.75,2.51),
           NJ_PA_SE=c(1.44,1.07,1.36,1.34,1.35))
           
knitr::kable(table3) %>% kable_classic(full_width=F)
Variable PA_Mean PA_SE NJ_Mean NJ_SE NJ_PA_mean NJ_PA_SE
FTE employment before, all observations 23.33 1.35 20.44 0.51 -2.89 1.44
FTE employment after, all observations 21.17 0.94 21.03 0.52 -0.14 1.07
Change in mean FTE employment -2.16 1.25 0.59 0.54 2.76 1.36
Change in mean FTE employment, balanced sample of stores -2.28 1.25 0.47 0.48 2.75 1.34
Change in mean FTE employment setting FTE at 0 for closed stores -2.28 1.25 0.23 0.49 2.51 1.35

Now, I’ll try to reproduce these numbers from the data set:

df <- read.csv("fast-food-data.csv")
 

# Add FTE = full time + managers + (0.5*part time)
df <- df %>% group_by(state) %>% mutate(fte=empft+nmgrs+(0.5*emppt),fte_after=empft2+nmgrs2+(0.5*emppt2)) 
# NJ/PA Comparison

# State-by-State means and standard errors
summary <- df %>% group_by(state) %>% 
            summarise(mean_before=mean(fte,na.rm=T),
                      mean_after=mean(fte_after,na.rm=T),
                      var_before=var(fte,na.rm=T),
                      var_after=var(fte_after,na.rm=T),
                      count_before=sum(!is.na(fte)),
                      count_after=sum(!is.na(fte_after))) %>%
           ungroup() %>%
           mutate(se_before=sqrt(var_before/count_before),
                  se_after=sqrt(var_after/count_after)) %>%
          mutate(state=ifelse(state==0,"PA","NJ"))
## `summarise()` ungrouping output (override with `.groups` argument)
knitr::kable(summary,digits=c(2,2,2,2,0,0,2,2)) %>% kable_classic(full_width=F)
state mean_before mean_after var_before var_after count_before count_after se_before se_after
PA 23.33 21.17 140.57 69 77 77 1.35 0.94
NJ 20.44 21.03 82.92 86 321 319 0.51 0.52



This is a pretty good start. The means and standard errors match the first two rows of Card and Krueger’s Table 3.

# the difference in means
PA_diff <- (summary$mean_after[summary$state=='PA']-summary$mean_before[summary$state=='PA'])
PA_diff_se <- sqrt(summary$se_after[summary$state=='PA']+summary$se_before[summary$state=='PA'])
NJ_diff <- (summary$mean_after[summary$state=='NJ']-summary$mean_before[summary$state=='NJ'])
NJ_diff_se <- sqrt(summary$se_after[summary$state=='NJ']+summary$se_before[summary$state=='NJ'])

NJ_PA_before <- (summary$mean_before[summary$state=='NJ']-summary$mean_before[summary$state=='PA'])
NJ_PA_before_se <- sqrt(summary$se_before[summary$state=='NJ']+summary$se_before[summary$state=='PA'])

NJ_PA_after <- (summary$mean_after[summary$state=='NJ']-summary$mean_after[summary$state=='PA'])
NJ_PA_after_se <- sqrt(summary$se_after[summary$state=='NJ']+summary$se_after[summary$state=='PA'])


# the difference in differences  
did_mean <- (NJ_diff-PA_diff)

did_se <- sqrt(NJ_diff_se+PA_diff_se)


tmp <- data.frame(variable=c("fte employment before","se employment before","fte employment after"," se employment after", "change in mean fte","se of change in mean"),NJ_PA=c(NJ_PA_before,NJ_PA_before_se,NJ_PA_after,NJ_PA_after_se,did_mean,did_se))

knitr::kable(tmp,digits=2) %>% kable_classic(full_width=F)
variable NJ_PA
fte employment before -2.89
se employment before 1.36
fte employment after -0.14
se employment after 1.21
change in mean fte 2.75
se of change in mean 1.59



As you can see, my results are generally consistent with the numbers in the Card and Krueger paper. The standard errors are a little off, though, and part of that is mechanical: above I combined standard errors as \(\sqrt{se_1 + se_2}\), whereas the standard error of a difference of two independent means is \(\sqrt{se_1^2 + se_2^2}\) (which reproduces Card and Krueger’s 1.44 and 1.07 for the before and after comparisons). The standard errors they report for the change-in-mean rows are presumably computed from the store-level changes themselves, so they won’t fall out of a simple combination of the four state-by-period standard errors.
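
As a quick cross-check of my own (not part of the paper’s Table 3), the same difference-in-differences can be recovered from a single regression with a state-by-period interaction; pivot_longer ships with the tidyverse, so this only uses packages already loaded. The coefficient on state:post should equal the change-in-mean-FTE difference computed above, and the regression hands you a standard error for it directly.

# stack the before/after employment observations into long format
did_long <- df %>%
  ungroup() %>%
  select(sheet, state, fte, fte_after) %>%
  pivot_longer(cols = c(fte, fte_after), names_to = "period", values_to = "emp") %>%
  mutate(post = as.numeric(period == "fte_after"))

# saturated 2x2 regression: the state:post coefficient is the DiD estimate
did_reg <- lm(emp ~ state * post, data = did_long)
summary(did_reg)$coefficients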

Card and Krueger Table 4

In Part B of Section III Card and Krueger discuss some “Regression Adjusted” models which include the following specifications:

\(\Delta E_i = a + bX_i + cNJ_i + \epsilon_i\), and

\(\Delta E_i = a' + b'X_i + c'GAP_i + \epsilon_i'\)

Here, \(\Delta E_i\) is the change in FTE employment at store \(i\) between the two survey waves, \(X_i\) is a set of store characteristics (chain and ownership dummies), \(NJ_i\) is a dummy equal to 1 for stores in New Jersey, and \(GAP_i\) measures the proportional increase in the starting wage needed to reach the new $5.05 minimum.

To be more explicit, \(GAP_i\) is 0 for Pennsylvania stores and for New Jersey stores whose wave 1 starting wage was already at least $5.05, and \((5.05 - w_{1i})/w_{1i}\) otherwise, where \(w_{1i}\) is the store’s wave 1 starting wage.

So the first thing I need to do here is create the \(GAP_i\) variable:

est.df <- df %>% ungroup() %>% filter(is.na(fte)==F & is.na(fte_after)==F & is.na(wage_st)==F & is.na(wage_st2)==F) %>%
             mutate(delta_emp=fte_after-fte,
                    gap=ifelse(state==1 & wage_st<= 5.05,((5.05-wage_st)/wage_st),0)) %>%
               mutate(chain1=ifelse(chain==1,1,0),
                      chain2=ifelse(chain==2,1,0),
                      chain3=ifelse(chain==3,1,0),
                      chain4=ifelse(chain==4,1,0))

The Card and Krueger estimates are reprinted below:

table4 <- data.frame(Independent_Var=c("New Jersey Dummy","Initial Wage Gap","Controls for Chain and Ownership","Controls for Region","Standard Error of Regression","Probability Value for Controls","Number of Stores in Sample"),
                     Model1_Coeff=c(2.33,"-","no","no",8.79,"-",357),
                     Model1_SE=c(1.19,"-","no","no",8.79,"-",357),
                     Model2_Coeff=c(2.3,"-","yes","no",8.78,"0.34",357),
                     Model2_SE=c(1.2,"-","yes","no",8.78,0.34,357),
                     Model3_Coeff=c("-",15.65,"no","no",8.76,"-",357),
                     Model3_SE=c("-",6.08,"no","no",8.76,"-",357),
                     Model4_Coeff=c("-",14.92,"yes","no",8.76,0.44,357),
                     Model4_SE=c("-",6.21,"yes","no",8.76,0.44,357))
knitr::kable(table4) %>% kable_classic(full_width=F)
Independent_Var Model1_Coeff Model1_SE Model2_Coeff Model2_SE Model3_Coeff Model3_SE Model4_Coeff Model4_SE
New Jersey Dummy 2.33 1.19 2.3 1.2 - - - -
Initial Wage Gap - - - - 15.65 6.08 14.92 6.21
Controls for Chain and Ownership no no yes yes no no yes yes
Controls for Region no no no no no no no no
Standard Error of Regression 8.79 8.79 8.78 8.78 8.76 8.76 8.76 8.76
Probability Value for Controls - - 0.34 0.34 - - 0.44 0.44
Number of Stores in Sample 357 357 357 357 357 357 357 357

Now I’m going to run these same models using the data I have loaded into the workspace.

model1 <- lm(delta_emp~state, data=est.df)
model2 <- lm(delta_emp~state+co_owned+chain2+chain3+chain4,data=est.df)
model3 <- lm(delta_emp~gap,data=est.df)
model4 <- lm(delta_emp~gap+co_owned+chain2+chain3+chain4,data=est.df)
model5 <- lm(delta_emp~gap+co_owned+chain2+chain3+chain4+centralj+northj+pa1,data=est.df)

Next, I collect the estimates from these linear models and try to organize them in a table that looks similar to the one above. This is pretty cumbersome and not very code efficient…but I want to be really transparent about what I’m doing so you guys can see how I extract all these terms and where they come from.
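
(For what it’s worth, if you have the broom package installed there is a more compact route, sketched below, but I’ll stick with the verbose version so every number’s origin is obvious.)

# sketch only (assumes the broom package is installed): tidy() pulls coefficients
# and standard errors, glance() pulls fit statistics including sigma
library(broom)
models <- list(model1, model2, model3, model4)
lapply(models, tidy)    # coefficient tables
lapply(models, glance)  # regression standard errors and other fit statistics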

# coefficients and standard errors
mod1_coeffs <- summary(model1)$coefficients
mod2_coeffs <- summary(model2)$coeff
mod3_coeffs <- summary(model3)$coeff
mod4_coeffs <- summary(model4)$coeff

mod1_sigma <- summary(model1)$sigma
mod2_sigma <- summary(model2)$sigma
mod3_sigma <- summary(model3)$sigma
mod4_sigma <- summary(model4)$sigma

table4_rep <- data.frame(Independent_Var=c("New Jersey Dummy","Initial Wage Gap","Controls for Chain and Ownership","Controls for Region","Standard Error of Regression","Probability Value for Controls","Number of Stores in Sample"),
                     Model1_Coeff=c(round(mod1_coeffs[,1][2],2),"-","no","no",round(mod1_sigma,2),"-",nrow(est.df)),
                     Model1_SE=c(round(mod1_coeffs[,2][2],2),"-","no","no",round(mod1_sigma,2),"-",nrow(est.df)),
                     Model2_Coeff=c(round(mod2_coeffs[,1][2],2),"-","yes","no",round(mod2_sigma,2),NA,nrow(est.df)),
                     Model2_SE=c(round(mod2_coeffs[,2][2],2),"-","yes","no",round(mod2_sigma,2),NA,nrow(est.df)),
                     Model3_Coeff=c("-",round(mod3_coeffs[,1][2],2),"no","no",round(mod3_sigma,2),NA,nrow(est.df)),
                     Model3_SE=c("-",round(mod3_coeffs[,2][2],2),"no","no",round(mod3_sigma,2),NA,nrow(est.df)),
                     Model4_Coeff=c("-",round(mod4_coeffs[,1][2],2),"yes","no",round(mod4_sigma,2),NA,nrow(est.df)),
                     Model4_SE=c("-",round(mod4_coeffs[,2][2],2),"yes","no",round(mod4_sigma,2),NA,nrow(est.df)))


knitr::kable(table4_rep) %>% kable_classic(full_width=F)
Independent_Var Model1_Coeff Model1_SE Model2_Coeff Model2_SE Model3_Coeff Model3_SE Model4_Coeff Model4_SE
New Jersey Dummy 2.28 1.19 2.28 1.2 - - - -
Initial Wage Gap - - - - 17.05 6.09 16.36 6.24
Controls for Chain and Ownership no no yes yes no no yes yes
Controls for Region no no no no no no no no
Standard Error of Regression 8.71 8.71 8.72 8.72 8.66 8.66 8.68 8.68
Probability Value for Controls - - NA NA NA NA NA NA
Number of Stores in Sample 351 351 351 351 351 351 351 351
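
One note on the NA cells: Card and Krueger’s “Probability Value for Controls” row is, as I read their table notes, the p-value of a joint F-test for excluding the control variables. I left it as NA in my table, but it could be filled in with a nested-model comparison along these lines (assuming both models in each pair are fit on the same observations):

# joint F-test of the chain/ownership controls (sketch)
anova(model1, model2)   # controls in the NJ-dummy specification
anova(model3, model4)   # controls in the GAP specification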

It’s not hard to see that my regression results differ somewhat from those reported in the Card and Krueger Table 4. How qualitatively different are they?

From Card and Krueger (p. 781)

The specifications in columns (iii)-(v) use the GAP variable to measure the effect of the minimum wage. This variable gives a slightly better fit than the simple New Jersey dummy, although its implications for the New Jersey-Pennsylvania comparison are similar. The mean value of \(GAP_i\) among New Jersey stores is 0.11. Thus the estimate in column (iii) implies a 1.72 increase in FTE employment in New Jersey relative to Pennsylvania.

My data set has a slightly lower mean \(GAP_i\) value which results in an increase in FTE that seems reasonably similar to the 1.72 jobs reported by Card and Krueger:

# The GAP coefficient estimate from My Model 3 is 17.05

mean(est.df$gap[est.df$state==1])*17.05
## [1] 1.782764

Sensitivity Analysis

My regression results are qualitatively similar to Card and Krueger’s results reported in their Table 4…but not exactly the same. For starters, I’m not totally sure how they defined which stores had “complete” data and which did not. I suspect that if I downloaded the SAS file from David Card’s faculty webpage I could probably reverse engineer their estimation data frames exactly. But I’m not going to do that.

What I will do is try something that might make a small difference:

  1. Card and Krueger calculate FTE in wave 1 (the before period) as the sum of full time employees (empft), number of managers (nmgrs), and 0.5 times the number of part time employees (emppt). They do the same for FTE in wave 2 (the after period), where the relevant variables are empft2, nmgrs2, and emppt2.

  2. There are some stores in the list of 410 for which all of these values are NA. In order to run the regressions in Table 4, these stores need to be dropped because \(\Delta E_i\) can only be calculated for stores that had valid employment observations both before and after.

  3. There are some stores for which at least 1 but not all the employment variables are NA. For example,

print.data.frame(df %>% filter(is.na(empft)==T & is.na(nmgrs)==F))
##     X sheet chain co_owned state southj centralj northj pa1 pa2 shore ncalls
## 1  53   446     1        0     0      0        0      0   0   1     0      2
## 2  87   231     1        0     1      0        0      1   0   0     0      1
## 3 118   139     4        0     1      0        1      0   0   0     0      0
## 4 175     8     4        0     1      0        1      0   0   0     0      0
## 5 273   198     1        0     1      0        0      1   0   0     0      0
## 6 382   384     1        0     1      0        0      1   0   0     0      3
##   empft emppt nmgrs wage_st inctime firstinc bonus pctaff meals open hrsopen
## 1    NA    NA     3    4.50       4     0.25     0     50     2  6.0    18.0
## 2    NA    10     3      NA      26       NA     0     NA     2  7.0    16.0
## 3    NA    NA     2    5.50      NA       NA     0     NA     2 10.5    12.5
## 4    NA    25     4    4.80      NA     0.15     0     NA     1 10.5    12.0
## 5    NA    NA     4    4.25       4     0.12     1     65     3  7.0    16.0
## 6    NA    NA     3    4.65      26     0.25     0     NA     2  7.0    16.0
##   psoda pfry pentree nregs nregs11 type2 status2  date2 ncalls2 empft2 emppt2
## 1  1.03 0.84    0.94    NA      NA     1       1 110792      NA      7     24
## 2  1.02 1.02    0.95     2       2     1       1 111092      NA     16      6
## 3  0.95 0.95    1.06     2       1     1       1 111492      NA     11     25
## 4  1.02 0.91    2.28     2       1     1       1 110592      NA     15     25
## 5  1.06 0.95    0.95     3       3     1       1 111792       8     10     32
## 6  1.06 0.95    0.95     3       2     1       1 110792      NA      0     30
##   nmgrs2 wage_st2 inctime2 firstin2 special2 meals2 open2r hrsopen2 psoda2
## 1      4     5.00        4     0.25        1      2      6       18   0.97
## 2      4     5.05        4     0.25        0      1      7       16   0.90
## 3      5     5.50       19     0.25        1      2     11       12     NA
## 4      3     5.05       26     0.33        0      2     10       13   1.01
## 5      4     5.05       26     0.15        0      2      7       16   1.05
## 6      3     5.05       26     0.10        0      3      7       16   1.05
##   pfry2 pentree2 nregs2 nregs112 fte fte_after
## 1  0.84     0.91      4        3  NA      23.0
## 2  0.95     0.95      3        1  NA      23.0
## 3    NA       NA     NA       NA  NA      28.5
## 4  1.01     0.94      2        2  NA      30.5
## 5  0.94     0.94      3        3  NA      30.0
## 6  1.05     0.89      3        2  NA      18.0

In my replication in the previous section I dropped observations for which any of the employment variables were NA. This seemed reasonable to me. I interpreted NA as missing data, as in, “This store has full time employees but didn’t want to tell Card and Krueger how many.”

This may be part of the reason that I estimated regressions with 351 observations while Card and Krueger’s Table 4 shows results for regressions run with 357 observations.

Here, I’ll redo my regressions with a new version of the data that drops an observation only if:

  1. All of empft, emppt, and nmgrs (or all of empft2, emppt2, and nmgrs2) are missing or zero, so that FTE employment before (or after) works out to 0, or
  2. Either wage_st or wage_st2 is missing.

df <- read.csv("fast-food-data.csv")
 
# replace NA with 0
df <- df %>% replace(is.na(.),0)

# Add FTE = full time + managers + (0.5*part time)
df <- df %>% group_by(state) %>% mutate(fte=empft+nmgrs+(0.5*emppt),fte_after=empft2+nmgrs2+(0.5*emppt2)) 

#now remove any stores that don't have employment before and after and any stores that don't have wage before and after
df <- df %>% filter(fte>0 & fte_after>0 & wage_st>0 & wage_st2>0)

# compute the wage gap
est.df <- df %>% ungroup() %>% 
             mutate(delta_emp=fte_after-fte,
                    gap=ifelse(state==1 & wage_st<= 5.05,((5.05-wage_st)/wage_st),0)) %>%
               mutate(chain1=ifelse(chain==1,1,0),
                      chain2=ifelse(chain==2,1,0),
                      chain3=ifelse(chain==3,1,0),
                      chain4=ifelse(chain==4,1,0))

Run the regressions again:

model1 <- lm(delta_emp~state, data=est.df)
model2 <- lm(delta_emp~state+co_owned+chain2+chain3+chain4,data=est.df)
model3 <- lm(delta_emp~gap,data=est.df)
model4 <- lm(delta_emp~gap+co_owned+chain2+chain3+chain4,data=est.df)
model5 <- lm(delta_emp~gap+co_owned+chain2+chain3+chain4+centralj+northj+pa1,data=est.df)

Collect the terms and display them in a table:

# coefficients and standard errors
mod1_coeffs <- summary(model1)$coefficients
mod2_coeffs <- summary(model2)$coeff
mod3_coeffs <- summary(model3)$coeff
mod4_coeffs <- summary(model4)$coeff

mod1_sigma <- summary(model1)$sigma
mod2_sigma <- summary(model2)$sigma
mod3_sigma <- summary(model3)$sigma
mod4_sigma <- summary(model4)$sigma

table4_rep <- data.frame(Independent_Var=c("New Jersey Dummy","Initial Wage Gap","Controls for Chain and Ownership","Controls for Region","Standard Error of Regression","Probability Value for Controls","Number of Stores in Sample"),
                     Model1_Coeff=c(round(mod1_coeffs[,1][2],2),"-","no","no",round(mod1_sigma,2),"-",nrow(est.df)),
                     Model1_SE=c(round(mod1_coeffs[,2][2],2),"-","no","no",round(mod1_sigma,2),"-",nrow(est.df)),
                     Model2_Coeff=c(round(mod2_coeffs[,1][2],2),"-","yes","no",round(mod2_sigma,2),NA,nrow(est.df)),
                     Model2_SE=c(round(mod2_coeffs[,2][2],2),"-","yes","no",round(mod2_sigma,2),NA,nrow(est.df)),
                     Model3_Coeff=c("-",round(mod3_coeffs[,1][2],2),"no","no",round(mod3_sigma,2),NA,nrow(est.df)),
                     Model3_SE=c("-",round(mod3_coeffs[,2][2],2),"no","no",round(mod3_sigma,2),NA,nrow(est.df)),
                     Model4_Coeff=c("-",round(mod4_coeffs[,1][2],2),"yes","no",round(mod4_sigma,2),NA,nrow(est.df)),
                     Model4_SE=c("-",round(mod4_coeffs[,2][2],2),"yes","no",round(mod4_sigma,2),NA,nrow(est.df)))


knitr::kable(table4_rep) %>% kable_classic(full_width=F)
Independent_Var Model1_Coeff Model1_SE Model2_Coeff Model2_SE Model3_Coeff Model3_SE Model4_Coeff Model4_SE
New Jersey Dummy 2.16 1.22 2.18 1.22 - - - -
Initial Wage Gap - - - - 18.91 6.17 17.9 6.32
Controls for Chain and Ownership no no yes yes no no yes yes
Controls for Region no no no no no no no no
Standard Error of Regression 9.09 9.09 9.08 9.08 9.01 9.01 9.02 9.02
Probability Value for Controls - - NA NA NA NA NA NA
Number of Stores in Sample 370 370 370 370 370 370 370 370

So it appears that this little “correction” of excluding fewer stores based on missing data moved my coefficient estimates further away from, rather than closer to, Card and Krueger’s Table 4 results. As you can see in the calculation below, it leads to an estimate of roughly 2 additional FTE jobs as a result of the minimum wage hike, which is slightly higher than Card and Krueger’s estimate of an additional 1.72 FTE jobs.

mean(est.df$gap[est.df$state==1])*mod3_coeffs[,1][2]
##     gap 
## 1.96764

Some Commentary

Card and Krueger (1994) is a really cool paper and you should read it. There are many things to like about it; here is something that always impresses me about this paper:

A lot of empirical social scientists (myself totally included) have completely resigned themselves to the universal truth that observational data are imperfect. The data are almost always collected for some purpose other than the one you/we want to use them for…which means that, whatever effect we wish to test, we can almost always be sure the data were not generated from a careful experimental design set up to maximize the probability of detecting that effect, conditional on the effect existing.

As an empirical researcher I spend a crazy amount of time trying to figure out how to do cool stuff with messy, crappy data. Usually this takes the form of technical/econometric/machine learning firepower.

Lots of unobserved heterogeneity, bad sampling strategy, missing data?

No problem. I’m sure somebody has developed a cluster-robust standard error for that which probably involves simultaneously inverting and uninverting a covariance matrix while spinning some plates on your head. That should fix everything.

Reading David Card, Alan Krueger, and Josh Angrist has a way of reminding me that sometimes the answer to crappy data is to think a little harder about the problem and get some better data.

This particular paper does an awful lot with just some simple t-tests of means and a few OLS regressions. The reason it works is because of the elegance of the natural experiment.