This is going to be a quick look at two kind of cool data sets that seem really interesting and useful for evaluating the economic impacts of Covid. I haven’t quite figured out exactly how to best use them in this capacity yet…but I wanted to give you all a quick look at them. The data I’m talking about are:
California Taxable Sales from the California Department of Tax and Fee Administration
Paycheck Protection Program data from the U.S. Small Business Administration
I plan to keep this post pretty simple. Mainly, I just want to explore these data sets a little. I’m going to filter out a lot of the complexity and just focus in on what the two data sets have to say about the Retail and Food/Beverage Service Sector of the Santa Cruz County (CA) local economy. Here’s a quick outline for the rest of this post:
Here are the R libraries that this module depends on:
library(dplyr)
library(ggplot2)
library(DT)
library(ggthemes)
library(zoo)
library(scales)
library(lubridate)
library(formattable)
library(kableExtra)
There are also some data dependencies but I will discuss those in the next section.
I have two primary sources of data for this Vignette:
These data are pretty interesting and folks interested in using them for things more serious than recreational data mining should do their own due diligence. Here’s a quick summary:
The PPP program was created to help businesses retain and pay workers during the Covid-19 pandemic. It was a form of economic stimulus that was offered to businesses on the condition that at least 75% of the payment amount go directly to payroll. The PPP data available from the SBA contain around 50 different fields that provide information on the nature of these payments. Each row in each .csv is a payment. The columns contain information like name and address of the business, amount of PPP money received, and other helpful stuff (like NAICS Code for each recipient).
The California Taxable Sales Data that I am using are reported quarterly for each major city in California. The values of most interest are reported in two columns:
The PPP data are available from the SBA as a series of .csv files. I have downloaded these, saved them locally, and created some R functions to:
I’ll try to crank out a follow-up post to this one with more reproducible content on working with these PPP data.
# this part is a little cumbersome...in my PPP-EIDL Project I've created a linked set of functions to make this look nice...
# but it's hard to run these from within my GitHub Blog so we're going old-skool here:
public150plus <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_150k_plus.csv",
stringsAsFactors = F)
public_up_to_150k_1 <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_1.csv",
stringsAsFactors = F)
public_up_to_150k_2 <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_2.csv",
stringsAsFactors = F)
public_up_to_150k_3<- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_3.csv",
stringsAsFactors = F)
public_up_to_150k_4 <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_4.csv",
stringsAsFactors = F)
public_up_to_150k_5 <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_5.csv",
stringsAsFactors = F)
public_up_to_150k_6 <- read.csv("C:/Users/aaron.mamula/Desktop/R-Projects/PPP-EIDL-Analysis/source-data/02-01-21 Paycheck Protection Program Data/public_up_to_150k_6.csv",
stringsAsFactors = F)
ppp.df <- rbind(public150plus,
public_up_to_150k_1,
public_up_to_150k_2,
public_up_to_150k_3,
public_up_to_150k_4,
public_up_to_150k_5,
public_up_to_150k_6)
These data are quite large (mainly because they are very wide) so I’m going to filter them to include just Santa Cruz County for this illustration:
# the PPP data have a BorrowerCounty field and a ProjectCountyName field. These two values are generally the same but
# can be different. I'm going to argue that, for this application, the ProjectCountyName is the more important of the two
# because this should be a better indicator of where the money was distributed.
ppp.df <- ppp.df %>% filter(ProjectCountyName=="SANTA CRUZ" & BorrowerState=="CA")
The California Taxable Sales Data by City are accessible through an API maintained by the California Department of Tax and Fee Information. The GET url can be parameterized in a number of ways but for my current purposes I just want all available data:
It’s worth noting here that the object ca.taxable
is a list object. It is the JSON object that is returned from my API request, which contains some meta-data about the API request itself. I extract the values of interest from this list object by collecting the values
list element.
ca.taxable <- jsonlite::fromJSON("https://cdtfa.ca.gov/dataportal/api/odata/Taxable_Sales_by_City?%24count=true")
ca.taxable <- ca.taxable$value
I don’t want to munge around with either of these data sets too much here. They are pretty rich and one could probably do a lot of cool stuff with them. Here, I just want to give everyone a peek at the data with some emphasis on stuff that’s relevant for the illustration at hand:
Here’s a quick look at some of the more interesting fields in the PPP Data for Santa Cruz County, CA:
knitr::kable(ppp.df %>% ungroup() %>% filter(row_number()<=10) %>%
select(BorrowerName,BorrowerAddress,BorrowerCity,BorrowerZip,InitialApprovalAmount,JobsReported,FranchiseName,
ServicingLenderName, NAICSCode) %>% arrange(-JobsReported)) %>%
kable_styling(full_width=F)
BorrowerName | BorrowerAddress | BorrowerCity | BorrowerZip | InitialApprovalAmount | JobsReported | FranchiseName | ServicingLenderName | NAICSCode |
---|---|---|---|---|---|---|---|---|
THRESHOLD ENTERPRISES LTD | 23 Janis Way | Scotts Valley | 95066-3506 | 5439600 | 500 | PNC Bank, National Association | 325411 | |
AMERI KLEEN | 119 BEACH ST | WATSONVILLE | 95076-4557 | 3490200 | 500 | 1st Capital Bank | 561720 | |
ALTA VISTA FARMS LP | 136 MARSH LN | WATSONVILLE | 95076-2224 | 3680400 | 481 | 1st Capital Bank | 111334 | |
ENCOMPASS COMMUNITY SERVICES | 380 Encinal St. Suite 200 | SANTA CRUZ | 95060-2101 | 4786545 | 450 | Santa Cruz County Bank | 621112 | |
SALUD PARA LA GENTE | 195 Aviation Way, Suite 200 | WATSONVILLE | 95076-2053 | 5218785 | 366 | Santa Cruz County Bank | 624190 | |
BAY PHOTO, INC. | 920 DISC DR | SCOTTS VALLEY | 95066-4544 | 3599477 | 355 | Santa Cruz Community CU | 323111 | |
S. MARTINELLI & COMPANY | 735 Beach Street | WATSONVILLE | 95076 | 3894618 | 326 | Santa Cruz County Bank | 311411 | |
SANTA CRUZ SEASIDE COMPANY | 400 Beach Street | Santa Cruz | 95060-5416 | 6236226 | 243 | Santa Cruz County Bank | 713110 | |
UNIVERSAL AUDIO, INC | 4585 Scotts Valley Drive | SCOTTS VALLEY | 95066-4517 | 4088800 | 201 | Santa Cruz County Bank | 334310 | |
NEW TEACHER CENTER | 1205 Pacific Avenue, Suite 301 | SANTA CRUZ | 95060-3914 | 3071357 | 133 | Santa Cruz County Bank | 923130 |
Next, let’s have a quick look at the Taxable Sales Data.
ca.taxable$RetailandFoodServicesTaxableTransactions <- currency(ca.taxable$RetailandFoodServicesTaxableTransactions,digits=0L)
knitr::kable(ca.taxable %>% filter(County %in% c("Santa Cruz","SANTA CRUZ")) %>% filter(row_number() <= 10)) %>%
kable_styling(full_width=F)
CalendarYear | Quarter | QuarterMonthFrom | QuarterMonthTo | County | City | RetailandFoodServicesTaxableTransactions | TotalAllOutletsTaxableTransactions | DisclosureFlag | PerCapita |
---|---|---|---|---|---|---|---|---|---|
2009 | A | 1 | 12 | Santa Cruz | Capitola | $322,595,000 | 355427000 | NA | NA |
2009 | A | 1 | 12 | Santa Cruz | Santa Cruz | $589,761,000 | 718859000 | NA | NA |
2009 | A | 1 | 12 | Santa Cruz | Santa Cruz County | $1,956,754,000 | 2638469000 | NA | NA |
2009 | A | 1 | 12 | Santa Cruz | Scotts Valley | $117,995,000 | 147933000 | NA | NA |
2009 | A | 1 | 12 | Santa Cruz | Watsonville | $381,538,000 | 495137000 | NA | NA |
2009 | Q1 | 1 | 3 | Santa Cruz | Capitola | $69,397,000 | 76375000 | NA | NA |
2009 | Q1 | 1 | 3 | Santa Cruz | Santa Cruz | $128,250,000 | 156001000 | NA | NA |
2009 | Q1 | 1 | 3 | Santa Cruz | Santa Cruz County | $426,714,000 | 584519000 | NA | NA |
2009 | Q1 | 1 | 3 | Santa Cruz | Scotts Valley | $24,822,000 | 32275000 | NA | NA |
2009 | Q1 | 1 | 3 | Santa Cruz | Watsonville | $84,602,000 | 110666000 | NA | NA |
So there are two kind of weird things about the taxable sales data:
The data are mostly quarterly…but they have an annual summary row as well which is identified by Quarter = A
.
The geographic unit of resolution is a little funky. There are observations for each major city in Santa Cruz County but there are also observations where the City
field = “SANTA CRUZ COUNTY”.
These issues are pretty minor and easily dealt with.
For starters, I’m going to isolate a set of NAICS Codes with certian 4-digit prefixes.
I can’t say for sure if these are all the NAICS Codes that should match up with the receipts in the RetailandFoodServicesTaxableTransactions
column from the ca.taxable
data frame, but it seems like a good approximation at least.
# extract the first 4 digits for each NAICS Code
retail <- ppp.df %>% ungroup() %>% mutate(NAICS4dig=as.numeric(substring(as.character(NAICSCode),1,4))) %>%
filter(NAICS4dig %in% c(7223,7224,7225,4451,4452,4453,4471,4481,4482,4483,4511,4512,4522,4531,4539))
retail$InitialApprovalAmount <- currency(retail$InitialApprovalAmount,digits=0L)
knitr::kable(retail %>% select(BorrowerName,BorrowerCity,InitialApprovalAmount,JobsReported,ProjectCountyName) %>%
arrange(-InitialApprovalAmount) %>% filter(row_number() <= 20)) %>% kable_styling(full_width=F)
BorrowerName | BorrowerCity | InitialApprovalAmount | JobsReported | ProjectCountyName |
---|---|---|---|---|
RICHARD MICHAEL HARRISON | SOQUEL | $9,112,277 | 31 | SANTA CRUZ |
J.J.’S SALOON AND SOCIAL CLUB, LLC | SOQUEL | $2,165,312 | 7 | SANTA CRUZ |
SEA EAGLE LP. | Santa Cruz | $1,717,410 | 164 | SANTA CRUZ |
CAPITOLA GAYLE’S, INC. | CAPITOLA | $1,345,000 | 80 | SANTA CRUZ |
SEA EAGLE, LP | SANTA CRUZ | $1,160,382 | 202 | SANTA CRUZ |
DELUXE FOODS OF APTOS, INC. | APTOS | $903,187 | 79 | SANTA CRUZ |
CULINARY ENTERPRISES, INC. | CAPITOLA | $885,855 | 136 | SANTA CRUZ |
WHITING’S FOOD CONCESSIONS INC | Santa Cruz | $860,850 | 100 | SANTA CRUZ |
FLEW THE COOP INC | LA SELVA BEACH | $816,400 | 194 | SANTA CRUZ |
STAGNARO BROTHERS SEAFOOD INC | Santa Cruz | $738,467 | 103 | SANTA CRUZ |
WILD ROOTS MARKET, INC | BEN LOMOND | $731,370 | 110 | SANTA CRUZ |
MAS MAC INC | Watsonville | $710,540 | 210 | SANTA CRUZ |
SUN SHOPS | SANTA CRUZ | $645,470 | 45 | SANTA CRUZ |
LILLIAN’S ITALIAN KITCHEN INCORPORATED | Santa Cruz | $631,000 | 44 | SANTA CRUZ |
WEST END TAP & KITCHEN LP | Santa Cruz | $597,900 | 88 | SANTA CRUZ |
HULA’S SANTA CRUZ LLC | Santa Cruz | $574,651 | 53 | SANTA CRUZ |
SHOPPER’S CORNER INC | SANTA CRUZ | $549,434 | 81 | SANTA CRUZ |
WHITING’S FOOD CONCESSIONS, INC | SANTA CRUZ | $545,000 | 148 | SANTA CRUZ |
EL PALOMAR, INC | SANTA CRUZ | $526,565 | 82 | SANTA CRUZ |
J & J CORP. | Santa Cruz | $501,900 | 70 | SANTA CRUZ |
Next, I’m going create a really simple visual of the time-series of taxable sales in Santa Cruz County for just the Retail and Food Services Transactions:
# first I need to create a date from the quarterly observations...
# I'm also going to rename these really long columns
sc.taxable <- ca.taxable %>% filter(Quarter!="A") %>%
mutate(County=toupper(County),City=toupper(City)) %>%
filter(County=="SANTA CRUZ") %>%
mutate(date=as.Date(paste(CalendarYear,"-",QuarterMonthFrom,"-","1",sep=""),format="%Y-%m-%d")) %>%
mutate(Q=as.yearqtr(paste(CalendarYear,Quarter,sep="")),
RetailFood=RetailandFoodServicesTaxableTransactions) %>%
select(-RetailandFoodServicesTaxableTransactions)
ggplot(subset(sc.taxable,City=="SANTA CRUZ COUNTY"),aes(x=Q,y=RetailFood/1000000)) +
geom_point() + geom_line() + theme_bw() + ylab("$s (millions)") + xlab("") + scale_x_yearqtr(format = "%YQ%q") +
theme(axis.text.x=element_text(angle=45)) +
ggtitle("Santa Cruz County Retail & Food Service Transactions")
Even if all you know about Santa Cruz County, California is that it’s pretty close to the ocean, it’s probably not shocking to you to see pronounced seasonality in the quarterly Retail and Food Service Transaction Data series. I’ll admit that I was somewhat surprised to see a pretty linear upward trend in Taxable Sales in Santa Cruz that’s been at play for the last 10 years.
If one just “eye-balls” the data, it looks like there is often a pretty big jump in Retail and Food Service transactions between Q1 and Q2, then progressively smaller increases between the remaining quarters. I’m not certain on the timeline of events but I kinda think the most restrictive elements of the County-wide shelter in place order (non-essential businesses needing to close) went into effect around March of 2020. So it’s not super surprising to see the Q2 2020 value pretty far below the Q2 values for other recent years
In fact, since 2009, Q2 Retail and Food Service Transactions in Santa Cruz County have been growing by around 3-4% per year (compared with same quarter previous year). 2020 is the first year since 2009 that Q2 Retail and Food Service Transactions declined (again compared to same quarter previous year).
# filter for Q2 values and clean up labels for display
q3 <- sc.taxable %>% filter(Quarter=="Q2" & City=="SANTA CRUZ COUNTY") %>% select(date,Q,County,City,RetailFood) %>%
arrange(date) %>%
mutate(PCT_CNG=(RetailFood-lag(RetailFood))/lag(RetailFood),
year=year(date),
quarter="Q2") %>% select(-date,-Q,-City)
q3$RetailFood <- currency(q3$RetailFood,digits=0L)
q3$PCT_CNG <- percent(q3$PCT_CNG,digits=2)
knitr::kable(q3, caption="Quarter 2 Taxable Sales and Percent Change from Previous Quarter in the Retail and Food/Beverage Sector") %>% kable_styling(full_width=F)
County | RetailFood | PCT_CNG | year | quarter |
---|---|---|---|---|
SANTA CRUZ | $489,492,000 | NA | 2009 | Q2 |
SANTA CRUZ | $521,519,000 | 6.54% | 2010 | Q2 |
SANTA CRUZ | $567,292,000 | 8.78% | 2011 | Q2 |
SANTA CRUZ | $594,107,000 | 4.73% | 2012 | Q2 |
SANTA CRUZ | $639,874,000 | 7.70% | 2013 | Q2 |
SANTA CRUZ | $669,644,368 | 4.65% | 2014 | Q2 |
SANTA CRUZ | $681,535,151 | 1.78% | 2015 | Q2 |
SANTA CRUZ | $697,889,702 | 2.40% | 2016 | Q2 |
SANTA CRUZ | $735,575,775 | 5.40% | 2017 | Q2 |
SANTA CRUZ | $755,090,015 | 2.65% | 2018 | Q2 |
SANTA CRUZ | $782,691,045 | 3.66% | 2019 | Q2 |
SANTA CRUZ | $672,220,452 | -14.11% | 2020 | Q2 |
The next step in this very simple “data story” is to aggregate the PPP funding received by Retail and Food Service Entities in Santa Cruz County.
retail.agg <- retail %>% group_by(NAICS4dig) %>% summarise(Jobs=sum(JobsReported,na.rm=T),
Dollars=sum(InitialApprovalAmount,na.rm=T)) %>%
arrange(-Dollars)
## `summarise()` ungrouping output (override with `.groups` argument)
# This next part is pretty cumbersome...but it's gonna make the plot look way nicer
retail.agg <- retail.agg %>%
mutate(BusType=ifelse(NAICS4dig %in% c(7225,7224,7223),"Food Service/Bars",
ifelse(NAICS4dig %in% c(4451,4452,4453),"Grocery/Food/Liquor Store",
ifelse(NAICS4dig %in% c(4481,4482,4483,4512,4531),"Retail (Clothes,Books,Shoes,etc)","Other"))))
ggplot(retail.agg,aes(x=BusType,y=Dollars/1000000)) + geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45)) + ylab("$ (millions)") + xlab("") + theme_tufte()
So here are some related observations that we can make from a quick look at these two data set:
These observations are not the result of careful analysis but rather quick and dirty data mining. They are not made for the purposes of supporting any conclusions about the “economic impacts of Covid on Santa Cruz County’s Retail and Food Services Sector.” They are offered as illustrations of the type of information that can be extracted from joining PPP data with California Taxable Sales Data. I think it’s not too hard to see how, with a little additional rigor, these data sets could generate some important insights about the economic/fiscal impacts of Covid-19 on local economies.
The California Taxable Sales Data are available at the city level. In the prior sections I just used the county aggregate but it might be fun for some people to separate out the city level contributions. Here’s a quick picture of how county level taxable sales for Santa Cruz County are distributed among the various cities:
#aggretate taxable sales by year and city within Santa Cruz County...also reorder factor levels
# so that the plot looks nicer
cities.annual <- sc.taxable %>% filter(City!="SANTA CRUZ COUNTY") %>% group_by(City,CalendarYear) %>%
summarise(RetailFood=sum(RetailFood)) %>% arrange(CalendarYear,-RetailFood) %>%
mutate(City=factor(City, levels=c("SCOTTS VALLEY","CAPITOLA","WATSONVILLE","SANTA CRUZ")))
## `summarise()` regrouping output by 'City' (override with `.groups` argument)
ggplot(cities.annual,aes(x=CalendarYear,y=RetailFood/1000000,fill=City)) + geom_bar(stat="identity") +
theme_bw() + scale_fill_viridis_d() + xlab("") + ylab("$s (millions)") +
ggtitle("Retail and Food Service Taxable Sales in Santa Cruz County, CA")
We can then aggregate the PPP data by the ProjectCity
field to get a city-level picture of PPP money in the Retail and Food Service Sector:
# use the "retail" data frame (which was derived from the source PPP data) and aggregate by city
retail.city.agg <- retail %>% mutate(ProjectCity=toupper(ProjectCity)) %>%
group_by(ProjectCity) %>% summarise(InitialApprovalAmount=sum(InitialApprovalAmount,na.rm=T)) %>%
arrange(-InitialApprovalAmount)
## `summarise()` ungrouping output (override with `.groups` argument)
#order factor levels
retail.city.agg <- retail.city.agg %>% mutate(ProjectCity=factor(ProjectCity,unique(retail.city.agg$ProjectCity)))
ggplot(retail.city.agg,aes(x=ProjectCity,y=InitialApprovalAmount/1000000)) + geom_bar(stat="identity") +
theme_bw() +
theme(axis.text.x=element_text(angle=90)) + ylab("$s (millions)") + xlab("")
The first thing to notice here is that there are bunch of “cities” in the PPP data that don’t exist in the California Taxable Sales Data. I’m not going to do anything about that right now…but, again, if one were interested in using these data for more serious purposes, one would need to decide how to map cities consistently between the two data sets.
My preliminary hypothesis is that taxable sales receipts activity in places like Aptos, Ben Lomond, and Corralitos (places that are not specifically represented in the CA Taxable Sales Data) are probably embedded in the County Aggregate. That is, I’m guessing that the Santa Cruz County Aggregate in the Taxable Sales Data is larger than the sum of Santa Cruz, Watsonville, Capitola, and Scotts Valley..and the delta between those two things is probably the activity in these smaller geographies that are not listed as cities in the taxable sales data set.
That’s all I got for right now. Hit me up if you want these data and don’t want to mess with the SBA’s FTP site…I can drop them in a google drive or something.