I’ve been really enjoying reading David Robinson’s ‘Introduction to Empirical Bayes’ book. While I’ve taken multiple stats classes as a graduate student, it is refreshing to have a practical guide to applying Bayesian methods to the analysis of real-world data. I’m especially enjoying the book because, compared to purely theoretical, formula-heavy courses, it presents first and foremost the intuition behind the madness, then steps through the additional complications that really give these statistical methods their power.
And, as a bonus, I can now clearly imagine my datasets consisting of genes as tiny little baseball players vying to get more at-bats so that they can become the next Mike Piazza of the genome.
Now, I’ll admit, I’m not a huge baseball fan (even though I am a huge admirer of sabermetrics). But I do enjoy watching the NFL on occasion (less so since the Chargers have been jerks to my beloved hometown and taken Philip Rivers with them to Los Angeles). And so that brings me to applying these methods to a QB-focused NFL dataset. With a quick search, I found this Kaggle dataset of QB stats from 1996 to 2016.
In this first post, I want to show some of the things I do when I first get a dataset - in particular, cleaning the data and doing some preliminary inspection. Let’s dive in!
The Right Tools for the Job: Tidyverse
Without the right tools, it becomes much harder to do a job correctly and, more importantly, efficiently. My programming language of choice is R, primarily because it has my new favorite Swiss Army knife - the tidyverse. It is not only a toolset but also a philosophy of data organization that I don’t have the space to get into here (and that, frankly, is much more eloquently detailed elsewhere), but suffice it to say that most of the code herein uses tidyverse tools as opposed to base R. I’ll try to show the specific package a function comes from using the <package>::<function> annotation as much as possible, but you can also search for a function’s origin using the ??<function> command in R.
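For example (a quick aside), here are two ways to trace where glimpse lives:
## Fuzzy search across installed packages' help pages
??glimpse
## Pull up the help page for a specific package's function
?tibble::glimpse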
First Pass Inspection
Let’s start with just reading in the data, and giving it a quick look using tibble::glimpse.
library(tidyverse) ## loads dplyr, tidyr, ggplot2, etc.
library(stringr) ## great for strings
library(forcats) ## great for dealing with factors
## File path - split up to reduce line width
path <- paste0("../../static/post/",
"2017-04-17-analyzing-qbs-empirical-bayes/",
"QBStats_all.csv.gz")
## Read in the (compressed) CSV
raw <- readr::read_csv(path)
## View each variable and first observations
glimpse(raw, width = 43)
## Observations: 12,556
## Variables: 13
## $ qb <chr> "Matt RyanM. Ryan",...
## $ att <int> 34, 37, 11, 36, 40,...
## $ cmp <int> 25, 23, 10, 23, 31,...
## $ yds <dbl> 344, 261, 75, 219, ...
## $ ypa <dbl> 10.1, 7.1, 6.8, 6.1...
## $ td <int> 4, 3, 1, 2, 1, 0, 4...
## $ int <int> 0, 0, 0, 1, 0, 2, 2...
## $ lg <chr> "32t", "28", "13", ...
## $ sack <dbl> 2, 2, 0, 1, 2, 2, 1...
## $ loss <dbl> 19, 10, 0, 5, 14, 1...
## $ rate <dbl> 144.7, 110.3, 125.4...
## $ GamePoints <int> 43, 28, 28, 22, 16,...
## $ year <int> 2016, 2016, 2016, 2...
I won’t delve into the columns and what they mean in this post, but suffice it to say that each row pertains to a given quarterback’s performance metrics over the course of a year.
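Before moving on, a quick sanity check I often run (output omitted here) is counting rows per season, to confirm the coverage matches the advertised 1996 to 2016 range:
## How many rows does each season contribute?
raw %>% count(year)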
Fixing Up Names Using stringr and map
Now, we see one immediate problem: the qb column values look atrocious. Specifically, it seems each name is repeated: with a space between first and last name in the first iteration, and the first initial separated from the last name by an underscore in the second.
Here’s where the stringr library comes in - we need to parse this qb column, so I’m going to create a .parse_qb function (I use the . prefix when a function is very specific to the dataset/use case). Also, one interesting thing to note: the “underscore” is not really an underscore, but some other non-standard character that took me a bit to debug. Always watch those encodings!
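As an aside, one way to unmask such lookalike characters is to inspect the underlying bytes (a quick sketch using base R; a plain ASCII space is byte 20, so anything else is an impostor):
## Reveal the bytes behind the first qb value
charToRaw(raw$qb[1])
## Or view the Unicode code points instead
utf8ToInt(raw$qb[1])
With the culprit character identified, here’s a first pass at the parser: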
## Parse QB column to "<last>, <first>" format
.parse_qb <- function(x) {
y <- stringr::str_split(x, " ") %>% unlist
last <- stringr::str_split(y[2], " ") %>%
unlist %>% .[2] # Note encoding of pseudo underscore
first <- y[1]
name <- paste0(last, ", ", first)
return(name)
}
raw %>%
mutate(qb = map_chr(qb, .parse_qb)) %>% ## See note below
select(qb) %>% head(3)
## # A tibble: 3 x 1
## qb
## <chr>
## 1 Ryan, Matt
## 2 Winston, Jameis
## 3 Glennon, Mike
Much better (at least for this first group of names)! One of my new favorite tools here is the map family from the purrr package. This family works in a similar spirit to lapply, but is much neater in its application and also has variants that specify the returned output’s type - in this case, I specify that the output will be a character vector (hence the _chr suffix). And when paired with the dplyr::mutate function to change/add columns, it really shines for munging data and mise en place.
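To make the lapply comparison concrete, here’s a toy sketch (not from our dataset):
## lapply always returns a list
lapply(c("matt", "odell"), toupper)
## map_chr returns a character vector - and errors loudly if it can't
purrr::map_chr(c("matt", "odell"), toupper)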
But okay, so we parsed the qb column... but did it happen correctly? In writing this post, I was working on filtering out NA values, and found that some of the names indeed had NA values for the first-name group! Sometimes that’s exactly how you find errors - by chance. I could have avoided this by building a more exhaustive list of test cases up front. Anyways, I’ll show one example that was broken: Odell Beckham Jr.
odell <- stringr::str_subset(raw$qb, "Odell")[1]
ryan <- raw$qb[1]
## Raw
ryan
## [1] "Matt RyanM. Ryan"
odell
## [1] "Odell Beckham Jr.O. Beckham"
## Parsed v1
.parse_qb(ryan)
## [1] "Ryan, Matt"
.parse_qb(odell)
## [1] "NA, Odell"
Yikes! While Matt Ryan’s name is fine, Odell’s first name is evaluating to an NA. The issue comes from using a space to separate the first name. So let’s take a different tack: since the entire name sits before the weird underscore, let’s split on that character and the dot to get a first field, then remove the last letter if it is uppercase (i.e., a stray first initial).
## Parse QB column to "<last>, <first>" format
.parse_qb_v2 <- function(x) {
## Initial parsing by pseudo-underscore and period
## Extract "<first> <last><first initial>."
y <- str_split(x, " ") %>%
unlist %>% .[1] %>%
str_split("\\.") %>%
unlist %>% .[1]
## Split by character
split <- str_split(y, "") %>% unlist
## Check if last letter is uppercase
last_letter_upper <- split %>%
.[length(.)] %>%
str_detect(., toupper(letters)) %>%
sum > 0
## Drop last letter if it's uppercase (i.e., a first initial)
if (last_letter_upper) {
name <- split[-length(split)] %>% paste(collapse = "")
} else {
name <- split %>% paste(collapse = "")
}
return(name)
}
Much more code here than the first pass, but now let’s check our test cases:
.parse_qb_v2(odell)
## [1] "Odell Beckham Jr"
.parse_qb_v2(ryan)
## [1] "Matt Ryan"
And now it finally works for both our test cases - we could even automate this kind of testing using a package like testthat. Phew! Just goes to show, a little error checking goes a long way when munging! And now we have a function that cleans our qb column.
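Here’s a minimal sketch of what those automated checks could look like (assuming testthat is installed; the expectations simply mirror the manual checks above):
library(testthat)
test_that(".parse_qb_v2 handles plain and suffixed names", {
  expect_equal(.parse_qb_v2(ryan), "Matt Ryan")
  expect_equal(.parse_qb_v2(odell), "Odell Beckham Jr")
})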
Checking for NA Values
Another common task is to identify rows with NA values. Sometimes it’s appropriate to keep them in place, or even re-encode them, but first let’s inspect them to make sure we won’t miss anything important and to assess how many we have. We’ll use the base function complete.cases, passing in the dataset using the . (dot) notation from the %>% (pipe), to dplyr::filter down to only rows with NA values, count how many such rows there are, and inspect some examples.
## Identify QB's with incomplete records
incomplete <- raw %>%
filter(!complete.cases(.))
nrow(incomplete) ## number of rows with >0 NA vals
## [1] 17
glimpse(incomplete, width = 43)
## Observations: 17
## Variables: 13
## $ qb <chr> "Michael PittmanM. ...
## $ att <int> 18, 9, 2, 3, 19, 5,...
## $ cmp <int> 58, 43, 1, -6, 45, ...
## $ yds <dbl> 3.22, 4.78, 0.50, -...
## $ ypa <dbl> 9, 18, 1, 0, 8, 7, ...
## $ td <int> 2, 0, 1, 0, 0, 0, 0...
## $ int <int> NA, NA, NA, NA, NA,...
## $ lg <chr> NA, NA, NA, NA, NA,...
## $ sack <dbl> NA, NA, NA, NA, NA,...
## $ loss <dbl> NA, NA, NA, NA, NA,...
## $ rate <dbl> NA, NA, NA, NA, NA,...
## $ GamePoints <int> 34, 34, 34, 34, 27,...
## $ year <int> 2009, 2009, 2009, 2...
So it looks like we have 17 rows with NA values, occurring in columns like interceptions (int), sacks (sack), and longest pass (lg), for Michael Pittman and others.
Here is where some sleuthing comes in: is there something in common between where all these NA values occur?
incomplete$year %>% table
## .
## 2009
## 17
incomplete$att
## [1] 18 9 2 3 19 5 4 16 8 7 2 1 12 2 3 3 3
Looking at the years in which we have the NA values, we see that only 2009 seems to be affected. Then, if we look at attempts, we see that most of these are very low-attempt years for these QBs - most likely backups or wildcat plays (where, say, a wide receiver throws the ball in a trick play).
So we can either leave the NA values in place and hope no errors are introduced downstream, or simply remove those rows. For now, we’ll keep them for completeness, but we probably won’t lose sleep if we remove them later on.
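If we do decide to drop them later, here’s a sketch of two equivalent options:
## Keep only rows with no NA values
complete_only <- raw %>% filter(complete.cases(.))
## Or the same idea via tidyr
complete_only <- raw %>% tidyr::drop_na()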
Odell’s Short QB Career
For any NFL-savvy readers: you might notice that Odell Beckham Jr is not a QB, but rather a wide receiver! He most likely shows up here because he was part of a trick play where he threw the ball. We’ll have to keep in mind that there are pretender “QBs” in our dataset, as well as backups who get almost no playing time, whom we may want to remove from consideration downstream.
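As a rough sketch of what that downstream filter might look like (the 100-attempt cutoff is an arbitrary placeholder of mine, applied after parsing the qb column):
## Keep only players with a meaningful number of career attempts
regulars <- raw %>%
  mutate(qb = map_chr(qb, .parse_qb_v2)) %>%
  group_by(qb) %>%
  filter(sum(att, na.rm = TRUE) >= 100) %>%
  ungroup()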
Wrap-up
These are just some of the things I do to first clean a dataset, but we haven’t yet explored the data thoroughly! We’ll get to that in a subsequent post, after we tidy it for ease of use with the tidyverse.