Voyage Through Univariate Analysis: Charting the Solo Attributes of Roughing the Passer Penalties in the NFL

Univariate analysis is a crucial step in data analysis that focuses on examining and summarizing the characteristics of a single variable or attribute from the dataset. Univariate analysis provides a foundation for understanding the characteristics of individual variables, which is essential for more advanced multivariate analyses and modeling. It helps identify patterns, outliers, and potential data quality issues, making it a crucial step in the data analysis pipeline.

Python
import pandas as pd
# import and store the dataset
qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
print(qbs.head())

#Output
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069   
1     0     94      0.18      3620         0.47     186     0.091   
2     0    228      0.18      7840         0.52     542     0.076   
3     0    149      0.15      4648         0.49     367     0.063   
4     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  qboc  
0         0.065         40.00     0  
1         0.051         29.41     0  
2         0.069         39.02     0  
3         0.079         26.09     0  
4         0.074         33.33     0  

First, we should look to see what the variable datatypes are and if there are any null or missing values. 

Python
qbs.info(verbose=True)

#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Player        66 non-null     object 
 1   Total         66 non-null     int64  
 2   2009          66 non-null     int64  
 3   2010          66 non-null     int64  
 4   2011          66 non-null     int64  
 5   2012          66 non-null     int64  
 6   2013          66 non-null     int64  
 7   2014          66 non-null     int64  
 8   2015          66 non-null     int64  
 9   2016          66 non-null     int64  
 10  2017          66 non-null     int64  
 11  2018          66 non-null     int64  
 12  2019          66 non-null     int64  
 13  2020          66 non-null     int64  
 14  2021          66 non-null     int64  
 15  2022          66 non-null     int64  
 16  2023          66 non-null     int64  
 17  Games         66 non-null     int64  
 18  Per Game      66 non-null     float64
 19  Attempts      66 non-null     int64  
 20  Per 100 Att   66 non-null     float64
 21  Sacked        66 non-null     int64  
 22  Per Sack      66 non-null     float64
 23  Sack Per Att  66 non-null     float64
 24  Third Down %  66 non-null     float64
 25  qboc          66 non-null     int64  
dtypes: float64(5), int64(20), object(1)

The variables I will examine more closely are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per games played for their distribution, central tendency, and variance. I will be using Python and Jupyter Notebooks.

Python
table_stats = qbs.describe()
print(table_stats)

#Output
           Total       2009       2010       2011       2012       2013  \
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000   
mean   18.090909   0.696970   0.833333   1.121212   1.257576   1.075758   
std    12.505663   1.176301   1.452672   1.767355   1.774266   1.825550   
min     2.000000   0.000000   0.000000   0.000000   0.000000   0.000000   
25%     8.250000   0.000000   0.000000   0.000000   0.000000   0.000000   
50%    15.500000   0.000000   0.000000   0.000000   0.000000   0.000000   
75%    24.750000   1.000000   1.000000   2.000000   2.000000   1.000000   
max    57.000000   5.000000   6.000000   8.000000   6.000000   8.000000   

            2014       2015       2016      2017  ...       2023       Games  \
count  66.000000  66.000000  66.000000  66.00000  ...  66.000000   66.000000   
mean    1.333333   1.454545   1.287879   1.30303  ...   0.030303   99.530303   
std     1.842518   1.832878   1.698553   1.75385  ...   0.172733   53.915952   
min     0.000000   0.000000   0.000000   0.00000  ...   0.000000   42.000000   
25%     0.000000   0.000000   0.000000   0.00000  ...   0.000000   61.250000   
50%     0.500000   1.000000   1.000000   0.00000  ...   0.000000   77.500000   
75%     2.000000   2.000000   2.000000   2.00000  ...   0.000000  138.000000   
max     7.000000   7.000000   6.000000   7.00000  ...   1.000000  253.000000   

        Per Game     Attempts  Per 100 Att      Sacked   Per Sack  \
count  66.000000    66.000000    66.000000   66.000000  66.000000   
mean    0.176818  3250.303030     0.573636  211.212121   0.083424   
std     0.073801  2085.250348     0.241951  117.910594   0.034212   
min     0.050000   289.000000     0.130000   26.000000   0.024000   
25%     0.112500  1794.500000     0.390000  131.500000   0.058500   
50%     0.165000  2315.500000     0.550000  170.000000   0.078000   
75%     0.220000  4476.000000     0.740000  264.500000   0.107250   
max     0.350000  9725.000000     1.240000  542.000000   0.202000   

       Sack Per Att  Third Down %       qboc  
count     66.000000     66.000000  66.000000  
mean       0.070091     31.143788   0.272727  
std        0.016449     13.872813   0.448775  
min        0.030000      0.000000   0.000000  
25%        0.058000     25.000000   0.000000  
50%        0.069000     30.770000   0.000000  
75%        0.083500     39.755000   1.000000  
max        0.103000     77.780000   1.000000  

[8 rows x 25 columns]

Well, that was easy. Using the .describe() function allows for the examination of the data set’s values for central tendency (mean in this case) and the spread (standard deviation) and dispersion (technically). Although all the information is present to observe the dispersion of the information, it may hard to conceptualize the shape without using a visualization help. Histograms can aid in our observations.

Python
qbs.hist(figsize=(10, 15))

Adjust the figure size to be able to view all the histogram outputs more clearly. The distributions for Per Game and Per 100 Attempts look nearly normal, so that will allow us to use some parametric tests for analysis.

Another visual that is helpful to see the distribution is a boxplot (box-and whisker plot). This time let us just look at just the Per Game and Per 100 Attempts.

Python
qbs.boxplot(column=['Per Game', 'Per 100 Att'], figsize=(10,10))

The wonderful thing about boxplots is that they make it easy to identify outliers – observations that are outside the whiskers (either top or bottom). You can also easily see the centrality and spread of the data.  

One additional variable that we should look at is the ‘qboc’. Did you notice that there was an interesting distribution when the histograms plotted above? Let’s take a closer look.  

Python
qbs['qboc'].hist(figsize=(10, 15))

As you can see there are only two values for this variable. And this makes sense since we are categorizing the quarterbacks based on if they are (1) or are not (0) a quarterback of color. I am going to do a bit of foreshadowing, but this means that if we wanted to do any sort of predictive analysis, we need to think about some additional models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis:

  • Per Game
  • Per 100 Attempts
  • qboc

In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.

The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them suitable for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks based on their ethnicity, showed a binary distribution, indicating potential utility in predictive modeling.

This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls in NFL quarterbacks.

First the Idea, then the Data: Navigating the Depths of Information

When working on a project that requires data, it is essential to consider the various sources and formats in which the information may be available. Often, the desired data cannot be found in one single location, requiring careful compilation from multiple sources. This process can be time-consuming and challenging, as each source may present its own set of complications and intricacies.

While some data may be readily accessible through online databases, APIs, or official government reports, it may not always be presented in the most convenient or usable form. For instance, a dataset could be in a raw, unstructured format, making it difficult to analyze and extract meaningful insights. In such cases, data manipulation techniques are necessary to transform the data into a suitable format.

Data manipulation involves various tasks such as cleaning, merging, and reshaping the data. Cleaning the data involves removing any inconsistencies, errors, or missing values. This step ensures that the data is accurate and reliable for analysis. Merging data from multiple sources requires aligning and combining different datasets based on common attributes or keys. This consolidation allows for a comprehensive and holistic view of the data.

Reshaping the data involves rearranging its structure to fit the desired analysis or visualization needs. This can include tasks such as pivoting, aggregating, or disaggregating the data. By reshaping the data, researchers and analysts can uncover patterns, trends, and relationships that might otherwise go unnoticed.

In the given scenario, the QB Penalty Stats table lacked information on the racial background of the players. While this data may not be readily available in the original source, alternative sources can be explored. In this case, a reliable and well-sourced list of black quarterbacks could be found on Wikipedia. Although caution is advisable when using information from Wikipedia, the aforementioned table demonstrates reliability in this particular case.

Working with data often goes beyond simply retrieving information from a single source. It requires diligent gathering, cleaning, merging, and reshaping of data to ensure accuracy and usability. With the right techniques and approaches, researchers can unlock the full potential of data and derive valuable insights.

Embarking on an Epic Journey

I now have a list of all QBs from 2009 until present (that’s 2023 for those of you that stumble upon my ramblings in the future) in the qb_rtp table. I also have a list of black QBs in the qbs_oc table. Sometimes I find it can be faster to do some of the data cleaning work in the native application. In this case, I needed to create a column in the black_qbs table that I could use to compare the qb_rtp[‘Player’] column in the other table. Luckily Excel’s autofill feature made this particularly easy. I inserted a blank column next to the [‘Quarterback’] called [‘short_name’]. After typing a few examples, I selected the “Autofill” option in the Data drop down menu, and voilà, I had the new column that I can use for comparison.

Starting to Use Notebooks

I haven’t used a Jupyter Notebook to code in a very long time. I usually use the Spyder IDE as I can see all my variables’ values at the same time. But I’m going to give it a shot. You can get the entire notebook here. The first step is to import the different libraries that I think I’ll need.

Python
import pandas as pd
import openpyxl

Next, I will import the data sets.

qbs = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-1.xlsx', engine='openpyxl')
qbs_oc = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-2.xlsx', engine='openpyxl')

Make sure to verify the data has been loaded by viewing a few rows from each dataframe. If you are wondering what the ‘\n’ means, it is the regular expression for a new line. It creates some extra space between the table headings and the table data to make them more readable. Dataquest has a great quick guide and cheat sheet available here.

print('Quarterback Stats table:\n', qbs.head(), '\n')
print('Quarterback of Color table:\n', qbs_oc.head())

#Output
Quarterback Stats table:
       Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2022  2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     2     0    170      0.15      5557         0.45     361     0.069   
1     0     0     94      0.18      3620         0.47     186     0.091   
2     3     0    228      0.18      7840         0.52     542     0.076   
3     0     0    149      0.15      4648         0.49     367     0.063   
4     0     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  
0         0.065         40.00  
1         0.051         29.41  
2         0.069         39.02  
3         0.079         26.09  
4         0.074         33.33  

[5 rows x 25 columns] 

Quarterback of Color table:
          Quarterback    short_name Years active  \
0      Fritz Pollard     F.Pollard    19201926   
1        Joe Lillard     J.Lillard    19321933   
2  George Taliaferro  G.Taliaferro    19501955   
3     Marlin Briscoe     M.Briscoe        1968*   
4       James Harris      J.Harris    19691981   

                                                Team  
0  Akron Pros, Milwaukee Badgers, Hammond Pros, P...  
1                                  Chicago Cardinals  
2  New York Yanks, Dallas Texans, Baltimore Colts...  
3                                     Denver Broncos  
4  Buffalo Bills, Los Angeles Rams, San Diego Cha...  

Smushing the Data Together

Now that the two initial datasets have been read into Pandas dataframes, it’s time to smush them together in a way that makes them useful for further analysis. The first step I take is adding a new variable to the Quarterback Stats Table (qbs) called ‘qboc’. I set all the values to 0 for right now.

qbs['qboc'] = 0

Once again check to make sure the variable was added to the dataframe as expected.

print(qbs.head())

#Output
Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069   
1     0     94      0.18      3620         0.47     186     0.091   
2     0    228      0.18      7840         0.52     542     0.076   
3     0    149      0.15      4648         0.49     367     0.063   
4     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  qboc  
0         0.065         40.00     0  
1         0.051         29.41     0  
2         0.069         39.02     0  
3         0.079         26.09     0  
4         0.074         33.33     0  

[5 rows x 26 columns]

The ‘qboc’ column was added to the end of the dataframe. The next step I take is to make the qbs[‘Player’] and the qbs_oc[‘short_name’] columns the indexes of the dataframes. This will speed up the comparison process. Although it wouldn’t be noticeable with these datasets, if you have a dataset will millions of records, it will bog down the system… ask me how I know.

qbs = qbs.set_index('Player')
qbs_oc = qbs_oc.set_index('short_name')

Now this is the really fun part. All I need to do now is compare the indexes of the two dataframes. If the indexes match, then I can assign a ‘1’ to the qbs[‘qboc’] variable to signify that that quarterback is indeed a quarterback of color.

qbs.loc[(qbs.index.isin(qbs_oc.index)), 'qboc'] = 1

This one line of code is pretty cool, I think. I want to dig into what this line actually does. I know this isn’t a Python coding post, but explaining will only take a minute and it unpacks some interesting concepts.

  1. qbs.index and qbs_oc.index specify that we are going to be doing something with the indexes of the two dataframes.
  2. The .isin is a function that returns a Boolean mask. So what (qbs.index.isin(qbs_oc.index) does is takes the index values of the qbs_oc dataframe and look to see if those values are in the qbs dataframe index. If they are, then the Boolean value = True, if not, the value = False. You can also save that part of the code to a variable and use over and over in other parts of your project, but in this instance, we don’t need to.
  3. The qbs.loc is locating all the the indexes in the qbs dataframe where the mask value = True.
  4. ‘qboc’ specifies the column we want to assign our value to. In this case a 1 to signify that the quarterback is a person of color.

The qbs.loc() designates the dataframe we are intending to modify; (qbs.index.isin(qbs_oc.index)) designates the row we want to work with; and ‘qboc’ designates the column.

print(qbs.head())

#Output
           Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                  ...   
A.Dalton      25     0     0     4     1     1     5     0     1     1  ...   
A.Luck        17     0     0     0     5     3     4     1     1     0  ...   
A.Rodgers     41     3     3     3     5     2     2     5     4     0  ...   
A.Smith       23     1     1     1     1     1     6     4     1     5  ...   
B.Bortles     12     0     0     0     0     0     1     2     1     7  ...   

           2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player                                                                      
A.Dalton      0    170      0.15      5557         0.45     361     0.069   
A.Luck        0     94      0.18      3620         0.47     186     0.091   
A.Rodgers     0    228      0.18      7840         0.52     542     0.076   
A.Smith       0    149      0.15      4648         0.49     367     0.063   
B.Bortles     0     79      0.15      2719         0.44     201     0.060   

           Sack Per Att  Third Down %  qboc  
Player                                       
A.Dalton          0.065         40.00     0  
A.Luck            0.051         29.41     0  
A.Rodgers         0.069         39.02     0  
A.Smith           0.079         26.09     1  
B.Bortles         0.074         33.33     0 

As you can see, A.Smith now has a 1 in the qboc column. However, the A.Smith in the qbs data set is actually Alex Smith not Akili Smith who is the actual quarterback referenced in the qbs_oc dataset. This illustrates the importance of going through your data to ensure accuracy. Luckily these datasets are small enough to go through to spot any errors.

double_check = qbs.loc[(qbs['qboc'] == 1)]
print(double_check.index)

#Output
Index(['A.Smith', 'C.Kaepernick', 'C.Newton', 'D.Prescott', 'D.Watson',
       'G.Smith', 'J.Brissett', 'J.Campbell', 'J.Freeman', 'J.Hurts',
       'J.Winston', 'K.Murray', 'L.Jackson', 'M.Vick', 'P.Mahomes',
       'R.Griffin III', 'R.Wilson', 'T.Bridgewater', 'T.Taylor'],
      dtype='object', name='Player')

qbs['qboc']['A.Smith'] = 0
print(qbs)

#Output
         Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                ...   
A.Smith     23     1     1     1     1     1     6     4     1     5  ...   

         2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player                                                                    
A.Smith     0    149      0.15      4648         0.49     367     0.063   

         Sack Per Att  Third Down %  qboc  
Player                                     
A.Smith         0.079         26.09     0  

[1 rows x 25 columns]

Next Steps

Get ready for some quarterback madness in the upcoming post! We’ll be diving into our fancy new quarterbacks table and unleashing some wild analysis on the gridiron.

The First Detour

Once upon a time, in the realm of weather-related nerdiness, I embarked on a quest to decipher the secrets of changing weather patterns. Armed with my mighty keyboard and a burning hatred for sweltering summers, I planned to uncover the truth about my local area’s climate evolution. You see, summer and I have never been the best of friends. The scorching heat and suffocating humidity make me cringe harder than a cat stuck in a cucumber maze. So, I figured, why not dive into the delightful world of data and investigate if there’s any hope for an early arrival of autumn? I dubbed it my “Project Meteorological Marvel.”

My cunning plan involved sifting through decades of weather records, gathering juicy tidbits on how temperatures have tortured us poor mortals over the years. I wanted to spot trends, make dazzling graphs, and perhaps even predict when autumn would grace us with its blessed presence. Oh, how I yearned for a reliable sign that summer’s reign of terror would soon be over! Of course, this was no ordinary undertaking. I needed a trustworthy data source, and what better place to turn to than the National Oceanic and Atmospheric Administration (NOAA)? If you can’t trust the NOAA to provide accurate historical weather data, well, I guess we’re all doomed!

Now, I must confess, I had no intention of becoming a weather forecaster during this escapade. That’s a whole different level of sorcery reserved for the truly brave and slightly crazy souls. No, my friends, my mission was solely to unravel the mysteries of the past, not predict the future. So, off I went, armed with my web-scraping skills and a fervent desire to put an end to endless summers. And thus, my epic journey into the realm of weather data began… but did it?

Well, it seems that once people discover your supreme data-crunching powers, they start throwing project ideas at you like confetti at a parade. Take my poor, football-obsessed husband for example. He came up with the brilliant notion of analyzing if there’s any connection between a quarterback’s race and the number of times they get a sweet, sweet roughing the passer call in their favor. And as if that wasn’t enough, I thought, why not spice it up even more and explore if the defender’s race also plays a role in how many roughing the passer flags rain down upon them? Heck, let’s even toss in the officials’ race for good measure. Who knew the football field could have so many hidden layers of sociology and statistics? But hey, I’ll play along and start with quarterbacks for now. Let the mind-boggling journey begin! Just don’t blame me if we end up in a statistical black hole of absurdity.

At first, I was going to look at NCAA football statistics given that the sample size would be much larger than for the NFL. However, I didn’t really find a good source for the data to either download or extract. It just doesn’t seem like the NCAA collects that data down to the player level. As luck would have it I was able to find a source for NFL penalty data. the aptly named NFL Penalties is a sited dedicated to capturing penalty data so that users can basically settle disputes “over a player and their frequent ability to get away with murder, or not.” The site’s author does a pretty good job at articulating problems with the data and any mitigation actions taken. Ultimately the data on the site is provided by nflfastR. 1

Now that I’ve talked about the general concept and the search for a data source, here my next steps.:

  1. Collect Roughing the Passer (RTP) data for the quarterback from NFL Penalties.
  2. Collect the relevant biographical data on each of the quarterbacks.
  3. Use Python and relevant libraries such as Pandas2 and matplotlib3 to perform data cleaning, exploration, univariate and bivariate analysis, and visualizations.
  4. Publish the findings along with the documented methodology.

I’m not sure how long this will take me, could be a day, could be weeks. Either way, check back often for updates on the project’s progress.


  1. Carl S, Baldwin B (2023). nflfastR: Functions to Efficiently Access NFL Play by Play Data. https://www.nflfastr.com/, https://github.com/nflverse/nflfastR. ↩︎
  2. The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.1.0). Zenodo. https://doi.org/10.5281/zenodo.8301632 ↩︎
  3. J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007. ↩︎