Robert A. Amezquita


I’ve been really enjoying reading David Robinson’s ‘Introduction to Empirical Bayes’ book. While I’ve taken multiple stats classes as a graduate student, it is refreshing to have a practical guide to applying Bayesian methods to the analysis of real-world data. I especially am enjoying the book because, compared to purely theoretical, formula-heavy courses, it presents first and foremost the intuition behind the madness, and then steps through all the additional complications that really give these statistical methods their power.

And, as a bonus, I can now clearly imagine my datasets consisting of genes as tiny little baseball players vying to get more at-bats so that they can become the next Mike Piazza of the genome.

Now, I’ll admit, I’m not a huge baseball fan (even though I am a huge admirer of sabermetrics). But I do enjoy watching the NFL on occasion (less so since the Chargers have been jerks to my beloved hometown and taken Philip Rivers with them to Los Angeles). And so that brings me to applying these methods to a QB-focused NFL dataset. With a quick search, I found this Kaggle dataset of QB stats from 1996 to 2016.

In this first post, I want to show some of the things I do when I first get a dataset - in particular, cleaning the data and doing some preliminary inspection. Let’s dive in!

The Right Tools for the Job: Tidyverse

Without the right tools, it becomes much harder to do a job correctly and, even more so, efficiently. My programming language of choice is R, primarily because it has my new favorite Swiss Army knife - the tidyverse. It is not only a toolset but also a philosophy of data organization that I don’t have the space to get into here (and that, frankly, is much more eloquently detailed elsewhere), but suffice it to say that most of the code herein uses tidyverse tools as opposed to base R. I’ll try to show the specific package a function comes from using the <package>::<function> annotation as much as possible, but you can also search for a function’s origin using the ??<function> command in R.
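For example, here are both habits in action, using glimpse (which we’ll lean on in a moment) as the function in question:

## Search the installed help pages to find where a function lives
??glimpse

## Or call it with the package spelled out explicitly
tibble::glimpse(mtcars)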

First Pass Inspection

Let’s start with just reading in the data, and giving it a quick look using tibble::glimpse.

library(tidyverse) ## loads dplyr, tidyr, ggplot2, etc.
library(stringr)   ## great for strings
library(forcats)   ## great for dealing with factors

## File path - split up to reduce line width
path <- paste0("../../static/post/",
               "2017-04-17-analyzing-qbs-empirical-bayes/",
               "QBStats_all.csv.gz")

## Read in the (compressed) CSV
raw <- readr::read_csv(path)

## View each variable and first observations
glimpse(raw, width = 43)
## Observations: 12,556
## Variables: 13
## $ qb         <chr> "Matt RyanM. Ryan",...
## $ att        <int> 34, 37, 11, 36, 40,...
## $ cmp        <int> 25, 23, 10, 23, 31,...
## $ yds        <dbl> 344, 261, 75, 219, ...
## $ ypa        <dbl> 10.1, 7.1, 6.8, 6.1...
## $ td         <int> 4, 3, 1, 2, 1, 0, 4...
## $ int        <int> 0, 0, 0, 1, 0, 2, 2...
## $ lg         <chr> "32t", "28", "13", ...
## $ sack       <dbl> 2, 2, 0, 1, 2, 2, 1...
## $ loss       <dbl> 19, 10, 0, 5, 14, 1...
## $ rate       <dbl> 144.7, 110.3, 125.4...
## $ GamePoints <int> 43, 28, 28, 22, 16,...
## $ year       <int> 2016, 2016, 2016, 2...

I won’t delve into the columns and what they mean in this post, but suffice it to say that each row pertains to a given quarterback’s performance metrics for a single game, along with the year it was played.

Fixing Up Names Using stringr and map

Now, we see one immediate problem: the qb column values look atrocious. Specifically, it seems the name is repeated, with a space in the first iteration and the first initial separated from the last name by an underscore in the second iteration.

Here’s where the stringr library comes in - we need to parse this qb column. I’m going to create a “<last>, <first>” format for this column using a function I’m calling .parse_qb (I prepend a . to functions that are very specific to the dataset/use-case). Also, one interesting thing to note is that the “underscore” is not really an underscore, but some other non-standard character that took me a bit to debug. Always watch those encodings!

## Parse QB column to "<last>, <first>" format
.parse_qb <- function(x) {
    ## Split on a regular space to peel off the first name
    y <- stringr::str_split(x, " ") %>% unlist
    ## The pattern below is the pseudo-underscore, the non-standard
    ## character that merely renders like a space here
    last <- stringr::str_split(y[2], " ") %>%
        unlist %>% .[2]
    first <- y[1]
    name <- paste0(last, ", ", first)
    return(name)
}

raw %>%
    mutate(qb = map_chr(qb, .parse_qb)) %>% ## See note below
    select(qb) %>% head(3)
## # A tibble: 3 x 1
##                qb
##             <chr>
## 1      Ryan, Matt
## 2 Winston, Jameis
## 3   Glennon, Mike

Much better (at least for this first group of names)! One of my new favorite tools here is the map family of functions from the purrr package. This family works in a similar spirit to lapply, but is much neater in its application and also has typed variants that specify the returned output’s type - in this case, I specify that the output will be a character vector (hence the _chr suffix). And when paired with the dplyr::mutate function to change/add columns, it really shines for munging data and mise en place.
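To make that concrete, here is a tiny toy comparison (the vector x below is just an illustration, not part of our QB data):

x <- c("a", "b", "c")

## lapply always returns a list, regardless of what the function returns
lapply(x, toupper) %>% class
## [1] "list"

## map_chr guarantees a character vector (and errors if it cannot)
purrr::map_chr(x, toupper)
## [1] "A" "B" "C"

## Other suffixes pin down other types, e.g. integers
purrr::map_int(x, nchar)
## [1] 1 1 1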

But okay, so we parsed the qb column... but did it happen correctly? Now, in writing this post, I was working on filtering out NA values, and found that some of the parsed names had NA for the last-name field! And sometimes that’s exactly how you find errors - by chance. I could’ve avoided this by having a more exhaustive list of test cases. Anyways, I’ll show one example that was broken: Odell Beckham Jr.

odell <- stringr::str_subset(raw$qb, "Odell")[1]
ryan <- raw$qb[1]

## Raw
ryan
## [1] "Matt RyanM. Ryan"
odell
## [1] "Odell Beckham Jr.O. Beckham"
## Parsed v1
.parse_qb(ryan)
## [1] "Ryan, Matt"
.parse_qb(odell) 
## [1] "NA, Odell"

Yikes! While Matt Ryan’s name is fine, Odell’s last name evaluates to NA. The issue comes from splitting on spaces to find the last name: a name with more than two words, like “Odell Beckham Jr.”, shifts the fields around. So let’s take a different tack: since the entire name sits before the weird underscore, let’s split on that and on the dot to get a first field, then remove the last letter if it is uppercase (aka, a first initial).

## Parse QB column, v2: keep the full "<first> <last>" name
.parse_qb_v2 <- function(x) {
    ## Initial parsing: split on the pseudo-underscore (again, the
    ## non-standard character that renders like a space) and then on
    ## periods, keeping the leading piece each time
    y <- str_split(x, " ") %>%
        unlist %>% .[1] %>%
        str_split("\\.") %>%
        unlist %>% .[1]

    ## Split the name into individual characters
    split <- str_split(y, "") %>% unlist

    ## Check whether the last letter is uppercase
    last_letter_upper <- split %>%
        .[length(.)] %>%
        str_detect(., toupper(letters)) %>%
        sum > 0

    ## Drop the last letter if it is uppercase (i.e. a stray first initial)
    if (last_letter_upper == TRUE) {
        name <- split[-length(split)] %>% paste(collapse = "")
    } else {
        name <- split %>% paste(collapse = "")
    }

    return(name)
}

Much more code here than the first pass, but now let’s check our test cases:

.parse_qb_v2(odell)
## [1] "Odell Beckham Jr"
.parse_qb_v2(ryan)
## [1] "Matt Ryan"

And now it finally works for both of our test cases. Phew! Just goes to show, a little error checking goes a long way when munging! We now have a function that cleans our qb column - and we could even automate this kind of checking with a package like testthat.
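Here’s a minimal sketch of what that might look like, reusing the two names we just checked by hand:

library(testthat)

test_that(".parse_qb_v2 handles plain and suffixed names", {
    expect_equal(.parse_qb_v2(ryan), "Matt Ryan")
    expect_equal(.parse_qb_v2(odell), "Odell Beckham Jr")
})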

Checking for NA Values

Another common task is to identify rows with NA values. Sometimes it’s appropriate to keep them in place, or even re-encode them, but first let’s inspect them to make sure we won’t miss anything important and to assess how many we have. We’ll use the base function complete.cases, passing in the dataset via the . (dot) notation from the %>% (pipe), dplyr::filter down to only the rows with NA values, count how many such rows there are, and inspect some examples.

## Identify QB's with incomplete records
incomplete <- raw %>%
    filter(!complete.cases(.)) 

nrow(incomplete) ## number of rows with >0 NA vals
## [1] 17
glimpse(incomplete, width = 43)
## Observations: 17
## Variables: 13
## $ qb         <chr> "Michael PittmanM. ...
## $ att        <int> 18, 9, 2, 3, 19, 5,...
## $ cmp        <int> 58, 43, 1, -6, 45, ...
## $ yds        <dbl> 3.22, 4.78, 0.50, -...
## $ ypa        <dbl> 9, 18, 1, 0, 8, 7, ...
## $ td         <int> 2, 0, 1, 0, 0, 0, 0...
## $ int        <int> NA, NA, NA, NA, NA,...
## $ lg         <chr> NA, NA, NA, NA, NA,...
## $ sack       <dbl> NA, NA, NA, NA, NA,...
## $ loss       <dbl> NA, NA, NA, NA, NA,...
## $ rate       <dbl> NA, NA, NA, NA, NA,...
## $ GamePoints <int> 34, 34, 34, 34, 27,...
## $ year       <int> 2009, 2009, 2009, 2...

So it looks like we have 17 rows with NA values, occurring in columns like interceptions (int), sacks, and longest pass (lg), across Michael Pittman and others.

Here is where some sleuthing comes in: do all of these NA values have something in common?

incomplete$year %>% table
## .
## 2009 
##   17
incomplete$att 
##  [1] 18  9  2  3 19  5  4 16  8  7  2  1 12  2  3  3  3

Looking at the years in which the NA values occur, we see that only 2009 seems to be affected. Then, if we look at attempts, we see that most of these rows have very low attempt counts - most likely backups or wildcat plays (where, say, a wide receiver throws the ball in a trick play).

So we can either ignore the NA values and hope no errors are introduced downstream, or simply remove them. For now, we’ll keep them for completeness, but probably won’t lose sleep if we remove them later on.
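If we do change our minds later, dropping them is a one-liner; either form of this quick sketch would do:

## Keep only rows with no NA values
complete_rows <- raw %>% filter(complete.cases(.))
## ...or, equivalently, with tidyr
complete_rows <- raw %>% tidyr::drop_na()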

Odell’s Short QB Career

For any NFL-savvy readers, you might notice that Odell Beckham Jr is not a QB, but rather a wide receiver! That’s probably because he was part of a trick play where he threw the ball. We’ll have to keep in mind as we go that there are pretender “QBs” in our dataset, as well as backups who get almost no play time, whom we may want to remove from consideration downstream.
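One simple way to weed them out downstream would be a minimum-attempts cutoff per player; here is a rough sketch, where the 100-career-attempt threshold is purely an illustrative assumption:

## Keep only players with a meaningful number of total attempts
## (100 career attempts is an arbitrary, illustrative cutoff)
regular_qbs <- raw %>%
    mutate(qb = map_chr(qb, .parse_qb_v2)) %>%
    group_by(qb) %>%
    filter(sum(att, na.rm = TRUE) >= 100) %>%
    ungroup()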

Wrap-up

These are just some of the things I do to first clean the dataset, but we haven’t yet explored the data thoroughly! We’ll get to that in a subsequent post, after we tidy it for ease of use with the tidyverse.
