Voyage Through Univariate Analysis: Charting the Solo Attributes of Roughing the Passer Penalties in the NFL

Univariate analysis is a crucial step in data analysis that focuses on examining and summarizing the characteristics of a single variable or attribute from the dataset. It provides a foundation for understanding individual variables, which is essential for more advanced multivariate analyses and modeling, and it helps identify patterns, outliers, and potential data quality issues early in the pipeline.

Python
import pandas as pd
# import and store the dataset
qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
print(qbs.head())

#Output
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069   
1     0     94      0.18      3620         0.47     186     0.091   
2     0    228      0.18      7840         0.52     542     0.076   
3     0    149      0.15      4648         0.49     367     0.063   
4     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  qboc  
0         0.065         40.00     0  
1         0.051         29.41     0  
2         0.069         39.02     0  
3         0.079         26.09     0  
4         0.074         33.33     0  

First, we should look at the variable datatypes and check whether there are any null or missing values.

Python
qbs.info(verbose=True)

#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Player        66 non-null     object 
 1   Total         66 non-null     int64  
 2   2009          66 non-null     int64  
 3   2010          66 non-null     int64  
 4   2011          66 non-null     int64  
 5   2012          66 non-null     int64  
 6   2013          66 non-null     int64  
 7   2014          66 non-null     int64  
 8   2015          66 non-null     int64  
 9   2016          66 non-null     int64  
 10  2017          66 non-null     int64  
 11  2018          66 non-null     int64  
 12  2019          66 non-null     int64  
 13  2020          66 non-null     int64  
 14  2021          66 non-null     int64  
 15  2022          66 non-null     int64  
 16  2023          66 non-null     int64  
 17  Games         66 non-null     int64  
 18  Per Game      66 non-null     float64
 19  Attempts      66 non-null     int64  
 20  Per 100 Att   66 non-null     float64
 21  Sacked        66 non-null     int64  
 22  Per Sack      66 non-null     float64
 23  Sack Per Att  66 non-null     float64
 24  Third Down %  66 non-null     float64
 25  qboc          66 non-null     int64  
dtypes: float64(5), int64(20), object(1)
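The info() output shows 66 non-null entries in every column, so there are no missing values to deal with. If you want an explicit tally anyway, pandas can count the nulls per column; a minimal check:

Python
# count missing values in each column (should be all zeros here)
print(qbs.isnull().sum())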

The variables I will examine more closely, looking at their distribution, central tendency, and variance, are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per game played. I will be using Python and Jupyter Notebooks.

Python
table_stats = qbs.describe()
print(table_stats)

#Output
           Total       2009       2010       2011       2012       2013  \
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000   
mean   18.090909   0.696970   0.833333   1.121212   1.257576   1.075758   
std    12.505663   1.176301   1.452672   1.767355   1.774266   1.825550   
min     2.000000   0.000000   0.000000   0.000000   0.000000   0.000000   
25%     8.250000   0.000000   0.000000   0.000000   0.000000   0.000000   
50%    15.500000   0.000000   0.000000   0.000000   0.000000   0.000000   
75%    24.750000   1.000000   1.000000   2.000000   2.000000   1.000000   
max    57.000000   5.000000   6.000000   8.000000   6.000000   8.000000   

            2014       2015       2016      2017  ...       2023       Games  \
count  66.000000  66.000000  66.000000  66.00000  ...  66.000000   66.000000   
mean    1.333333   1.454545   1.287879   1.30303  ...   0.030303   99.530303   
std     1.842518   1.832878   1.698553   1.75385  ...   0.172733   53.915952   
min     0.000000   0.000000   0.000000   0.00000  ...   0.000000   42.000000   
25%     0.000000   0.000000   0.000000   0.00000  ...   0.000000   61.250000   
50%     0.500000   1.000000   1.000000   0.00000  ...   0.000000   77.500000   
75%     2.000000   2.000000   2.000000   2.00000  ...   0.000000  138.000000   
max     7.000000   7.000000   6.000000   7.00000  ...   1.000000  253.000000   

        Per Game     Attempts  Per 100 Att      Sacked   Per Sack  \
count  66.000000    66.000000    66.000000   66.000000  66.000000   
mean    0.176818  3250.303030     0.573636  211.212121   0.083424   
std     0.073801  2085.250348     0.241951  117.910594   0.034212   
min     0.050000   289.000000     0.130000   26.000000   0.024000   
25%     0.112500  1794.500000     0.390000  131.500000   0.058500   
50%     0.165000  2315.500000     0.550000  170.000000   0.078000   
75%     0.220000  4476.000000     0.740000  264.500000   0.107250   
max     0.350000  9725.000000     1.240000  542.000000   0.202000   

       Sack Per Att  Third Down %       qboc  
count     66.000000     66.000000  66.000000  
mean       0.070091     31.143788   0.272727  
std        0.016449     13.872813   0.448775  
min        0.030000      0.000000   0.000000  
25%        0.058000     25.000000   0.000000  
50%        0.069000     30.770000   0.000000  
75%        0.083500     39.755000   1.000000  
max        0.103000     77.780000   1.000000  

[8 rows x 25 columns]

Well, that was easy. The .describe() function lets us examine the dataset's measures of central tendency (the mean, in this case) and spread (the standard deviation), along with the quartiles that sketch out each variable's dispersion.
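If you want individual measures beyond the .describe() summary, pandas exposes them directly on a Series; here is a minimal sketch for the Per Game column:

Python
# central tendency and spread for a single column
per_game = qbs['Per Game']
print(per_game.mean())    # mean
print(per_game.median())  # median (the 50% row from describe())
print(per_game.std())     # standard deviation
print(per_game.var())     # variance
print(per_game.skew())    # skewness; values near 0 suggest symmetry

Even with all of these numbers in hand, it can be hard to conceptualize the shape of a distribution without a visual aid. Histograms can aid in our observations.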

Python
import matplotlib.pyplot as plt  # pandas' .hist() draws with matplotlib
qbs.hist(figsize=(10, 15))  # histogram for every numeric column
plt.show()

Adjusting the figure size makes all of the histogram outputs easier to view. The distributions for Per Game and Per 100 Attempts look nearly normal, which will allow us to use some parametric tests in the analysis.
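Eyeballing a histogram is a reasonable first pass, but a formal normality test such as Shapiro-Wilk (available in SciPy) can back up the visual impression; a minimal sketch:

Python
from scipy import stats

# Shapiro-Wilk tests the null hypothesis that the sample was drawn from a
# normal distribution, so a large p-value is consistent with normality
stat, p = stats.shapiro(qbs['Per Game'])
print(f"statistic={stat:.3f}, p-value={p:.3f}")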

Another visual that is helpful for seeing the distribution is a boxplot (box-and-whisker plot). This time let us look at just Per Game and Per 100 Attempts.

Python
# side-by-side boxplots of the two near-normal rate variables
qbs.boxplot(column=['Per Game', 'Per 100 Att'], figsize=(10,10))

The wonderful thing about boxplots is that they make it easy to identify outliers: observations that fall beyond the whiskers (either top or bottom), which by default extend 1.5 times the interquartile range (IQR) past the box. You can also easily see the centrality and spread of the data.
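To flag those same outliers numerically, you can compute the fences yourself; here is a minimal sketch for the Per Game column using the same 1.5 × IQR rule the whiskers use by default:

Python
# 1.5 * IQR fences, matching the boxplot's default whisker rule
q1 = qbs['Per Game'].quantile(0.25)
q3 = qbs['Per Game'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# rows falling outside the fences are the candidate outliers
outliers = qbs[(qbs['Per Game'] < lower) | (qbs['Per Game'] > upper)]
print(outliers[['Player', 'Per Game']])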

One additional variable that we should look at is 'qboc'. Did you notice its interesting distribution when the histograms were plotted above? Let's take a closer look.

Python
# distribution of the binary qboc indicator
qbs['qboc'].hist(figsize=(10, 15))

As you can see, there are only two values for this variable. That makes sense, since we are categorizing the quarterbacks based on whether they are (1) or are not (0) a quarterback of color. To do a bit of foreshadowing: this means that if we want to do any sort of predictive analysis, we need to think about models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis (a quick sanity check on qboc follows the list):

  • Per Game
  • Per 100 Attempts
  • qboc
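As promised, a quick frequency count confirms that qboc is strictly binary; a minimal check:

Python
# tally how many quarterbacks fall into each qboc category (0 or 1)
print(qbs['qboc'].value_counts())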

In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.

The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them suitable for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks based on their ethnicity, showed a binary distribution, indicating potential utility in predictive modeling.

This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls in NFL quarterbacks.

Cruising the Data Landscape: Exploring the Fundamentals of Data Exploration

Recap

Well, well, well! Looks like we’ve embarked on quite the adventure here! Our first task was to get to the bottom of the burning question: “Do quarterbacks of color get the short end of the stick when it comes to roughing the passer penalties compared to their fair-skinned counterparts?” Exciting stuff, huh?

The next step in our journey involved gathering the all-important data. We needed to make sure we were dealing with top-notch, premium-quality data. None of that shady, questionable stuff for us! We demanded data that was reliable, traceable, and oh-so-accurate. Because in the world of data science, it’s all about that golden rule: “garbage in, garbage out!” Can’t have our analysis going awry now, can we?

For this particular project, we scoured the depths of NFL Penalties and even tapped into the vast knowledge reserves of the mighty Wikipedia. We left no stone unturned, my friend!

Now, onto the thrilling next stage of our expedition – data cleaning and reshaping! Brace yourself, because this is where things get spicy. We had to create a nifty variable to indicate the race of those quarterbacks. Easy-peasy, right? Well, almost. Turns out, we stumbled upon a little naming convention conundrum that resulted in a sneaky duplicate value. But fear not! With our eagle-eyed attention to detail, we swiftly detected and corrected that pesky error. Crisis averted!

And now, dear explorer, we venture forth to the exhilarating realm of Data Exploration. Excited? You should be! There’s so much more to discover and uncover on this grand data-driven expedition! So, buckle up and get ready for the thrill ride ahead! Let’s dive into the vast ocean of information and see what fascinating insights await us!

Data exploration is an essential step in the data analysis process. It allows us to gain an initial understanding of the dataset by examining its general properties and characteristics. By delving into the data, we can uncover valuable insights.

One of the key aspects of data exploration involves assessing the size of the dataset. This includes understanding the number of observations or records and the number of variables or features present in the dataset. Knowing the scope of the dataset helps us gauge its comprehensiveness and potential usefulness for our analysis.

Furthermore, examining the data types is crucial in data exploration. By identifying the types of data contained in the dataset, such as numerical, categorical, or textual, we can determine the appropriate statistical techniques or visualization methods to apply during analysis. This enables us to make accurate interpretations and draw meaningful conclusions from the data.

Another crucial aspect is the identification of initial patterns or trends within the dataset. Exploring the data can reveal inherent relationships, correlations, or anomalies that may exist. By uncovering these patterns, we can develop hypotheses, generate insights, and pose relevant research questions for further investigation.

You may have noticed that I talk about using Python a lot in my work. I actually prefer to use it in my day job since it seems to get the most use and support from other data professionals. However, for this portion of my project, I'm going to be using R instead. It has been some time since I've used R (maybe before I even finished my master's program), so full disclosure: I'm pretty sure I had to look up most of what I'm doing to refresh my memory. But since it is a language, if you don't use it, you lose it, so I'm going to shake things up a bit and revisit R.

Data Exploration

The first thing I need to do is load any packages I need along with my dataset. I am loading the 'cleaned_qbs.xlsx' Excel file that I created during the data cleaning process.

R
# import and store the dataset
library(openxlsx)

qbs = read.xlsx("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")

# Check to make sure the data loaded correctly
head(qbs, n=5)

#Output
A data.frame: 5 × 26
Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	A.Dalton	25	0	0	4	1	1	5	0	1	...	0	170	0.15	5557	0.45	361	0.069	0.065	40.00	0
2	A.Luck	17	0	0	0	5	3	4	1	1	...	0	94	0.18	3620	0.47	186	0.091	0.051	29.41	0
3	A.Rodgers	41	3	3	3	5	2	2	5	4	...	0	228	0.18	7840	0.52	542	0.076	0.069	39.02	0
4	A.Smith	23	1	1	1	1	1	6	4	1	...	0	149	0.15	4648	0.49	367	0.063	0.079	26.09	0
5	B.Bortles	12	0	0	0	0	0	1	2	1	...	0	79	0.15	2719	0.44	201	0.060	0.074	33.33	0

Looking over the raw data, it is hard to see any particular patterns. However, a few data points do jump out because they are either larger or smaller than the values around them. For example, the number of RTP penalties called each year is in the single digits, with the exception of 2020 for the quarterback Josh Allen; that year he drew 11 RTP calls in his favor. Another example is the number of sacks recorded against Taysom Hill: most of the QBs have sack counts in the triple digits, while only a few have double-digit counts, and Hill has the lowest of any QB in the dataset. The next thing I want to do is check how many records are in my table and the total number of variables. It is helpful to know your sample size so you can select the correct analytic approach later on.

R
# number of rows (observations)
nrow(qbs)

#Output
66

# number of columns; for a data.frame, length() counts the variables
length(qbs)

#Output
26

The next step is to understand the data types of each of the variables. Luckily, when the head of the table was printed, it also included the data type of each variable. In our cleaned_qbs table, we have one variable with a character data type (Player) and 25 variables with the data type 'dbl', otherwise known as a double: a double-precision floating-point number.

R
#Output
A data.frame: 5 × 26
Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>

Next Steps

Now that we understand how much and what kind of data we have, we can develop a game plan for the analytic techniques to apply in the next step. With Univariate Analysis, Bivariate Analysis, and Multivariate Analysis we will begin to explore the data and any relationships that might exist between the variables. This will get us closer to a possible answer to our original question.