Voyage Through Univariate Analysis: Charting the Solo Attributes of Roughing the Passer Penalties in the NFL

Univariate analysis is a crucial step in data analysis that focuses on examining and summarizing the characteristics of a single variable or attribute from the dataset. It provides a foundation for the more advanced multivariate analyses and modeling that follow, and it helps identify patterns, outliers, and potential data quality issues early in the data analysis pipeline.

Python
import pandas as pd
# import and store the dataset (reading .xlsx files with pandas requires the openpyxl package)
qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
print(qbs.head())

#Output
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069   
1     0     94      0.18      3620         0.47     186     0.091   
2     0    228      0.18      7840         0.52     542     0.076   
3     0    149      0.15      4648         0.49     367     0.063   
4     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  qboc  
0         0.065         40.00     0  
1         0.051         29.41     0  
2         0.069         39.02     0  
3         0.079         26.09     0  
4         0.074         33.33     0  

First, we should check the data type of each variable and whether there are any null or missing values.

Python
qbs.info(verbose=True)

#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Player        66 non-null     object 
 1   Total         66 non-null     int64  
 2   2009          66 non-null     int64  
 3   2010          66 non-null     int64  
 4   2011          66 non-null     int64  
 5   2012          66 non-null     int64  
 6   2013          66 non-null     int64  
 7   2014          66 non-null     int64  
 8   2015          66 non-null     int64  
 9   2016          66 non-null     int64  
 10  2017          66 non-null     int64  
 11  2018          66 non-null     int64  
 12  2019          66 non-null     int64  
 13  2020          66 non-null     int64  
 14  2021          66 non-null     int64  
 15  2022          66 non-null     int64  
 16  2023          66 non-null     int64  
 17  Games         66 non-null     int64  
 18  Per Game      66 non-null     float64
 19  Attempts      66 non-null     int64  
 20  Per 100 Att   66 non-null     float64
 21  Sacked        66 non-null     int64  
 22  Per Sack      66 non-null     float64
 23  Sack Per Att  66 non-null     float64
 24  Third Down %  66 non-null     float64
 25  qboc          66 non-null     int64  
dtypes: float64(5), int64(20), object(1)
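
The non-null counts above can also be checked directly. The following is a minimal sketch, assuming a small stand-in DataFrame (`qbs_sample`, a hypothetical name) built from a few columns of the five rows printed earlier; with the full dataset, the same calls would be made on `qbs`.

```python
import pandas as pd

# Stand-in for qbs: a few columns from the five rows printed by qbs.head() above.
qbs_sample = pd.DataFrame({
    "Player": ["A.Dalton", "A.Luck", "A.Rodgers", "A.Smith", "B.Bortles"],
    "Total": [25, 17, 41, 23, 12],
    "Per Game": [0.15, 0.18, 0.18, 0.15, 0.15],
    "Per 100 Att": [0.45, 0.47, 0.52, 0.49, 0.44],
})

# Count missing values per column; every entry is 0 when nothing is missing.
missing_per_column = qbs_sample.isna().sum()
print(missing_per_column)

# Single yes/no answer: does any cell hold a missing value?
has_missing = bool(qbs_sample.isna().any().any())
print(has_missing)
```

A non-zero count for any column would be a cue to investigate that column before computing summary statistics.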

The variables I will examine more closely are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per game played. For each, I will look at the distribution, central tendency, and variance. I will be using Python and Jupyter Notebooks.

Python
table_stats = qbs.describe()
print(table_stats)

#Output
           Total       2009       2010       2011       2012       2013  \
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000   
mean   18.090909   0.696970   0.833333   1.121212   1.257576   1.075758   
std    12.505663   1.176301   1.452672   1.767355   1.774266   1.825550   
min     2.000000   0.000000   0.000000   0.000000   0.000000   0.000000   
25%     8.250000   0.000000   0.000000   0.000000   0.000000   0.000000   
50%    15.500000   0.000000   0.000000   0.000000   0.000000   0.000000   
75%    24.750000   1.000000   1.000000   2.000000   2.000000   1.000000   
max    57.000000   5.000000   6.000000   8.000000   6.000000   8.000000   

            2014       2015       2016      2017  ...       2023       Games  \
count  66.000000  66.000000  66.000000  66.00000  ...  66.000000   66.000000   
mean    1.333333   1.454545   1.287879   1.30303  ...   0.030303   99.530303   
std     1.842518   1.832878   1.698553   1.75385  ...   0.172733   53.915952   
min     0.000000   0.000000   0.000000   0.00000  ...   0.000000   42.000000   
25%     0.000000   0.000000   0.000000   0.00000  ...   0.000000   61.250000   
50%     0.500000   1.000000   1.000000   0.00000  ...   0.000000   77.500000   
75%     2.000000   2.000000   2.000000   2.00000  ...   0.000000  138.000000   
max     7.000000   7.000000   6.000000   7.00000  ...   1.000000  253.000000   

        Per Game     Attempts  Per 100 Att      Sacked   Per Sack  \
count  66.000000    66.000000    66.000000   66.000000  66.000000   
mean    0.176818  3250.303030     0.573636  211.212121   0.083424   
std     0.073801  2085.250348     0.241951  117.910594   0.034212   
min     0.050000   289.000000     0.130000   26.000000   0.024000   
25%     0.112500  1794.500000     0.390000  131.500000   0.058500   
50%     0.165000  2315.500000     0.550000  170.000000   0.078000   
75%     0.220000  4476.000000     0.740000  264.500000   0.107250   
max     0.350000  9725.000000     1.240000  542.000000   0.202000   

       Sack Per Att  Third Down %       qboc  
count     66.000000     66.000000  66.000000  
mean       0.070091     31.143788   0.272727  
std        0.016449     13.872813   0.448775  
min        0.030000      0.000000   0.000000  
25%        0.058000     25.000000   0.000000  
50%        0.069000     30.770000   0.000000  
75%        0.083500     39.755000   1.000000  
max        0.103000     77.780000   1.000000  

[8 rows x 25 columns]

Well, that was easy. The .describe() method summarizes each numeric column’s central tendency (the mean, plus the median in the 50% row) and its spread (the standard deviation, the minimum and maximum, and the quartiles). Although all the information needed to judge the dispersion is present, it may be hard to picture the shape of each distribution without a visualization. Histograms can aid in our observations.
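
To make the central tendency and spread measures concrete for a single variable, here is a minimal sketch that computes them by hand. It assumes only the five "Total" values shown in the earlier head() output, and the helper name `summarize` is hypothetical; on the full dataset the same function could be applied to `qbs['Total']`.

```python
import pandas as pd

# "Total" values for the five quarterbacks shown in the qbs.head() output.
total = pd.Series([25, 17, 41, 23, 12], name="Total")

def summarize(series):
    """Return central tendency and spread measures for one numeric variable."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),   # sample standard deviation (ddof=1), as in describe()
        "iqr": q3 - q1,        # interquartile range: width of the middle 50%
        "range": series.max() - series.min(),
    }

summary = summarize(total)
print(summary)
```

The interquartile range is included because, unlike the standard deviation, it is not pulled around by a few extreme values.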

Python
qbs.hist(figsize=(10, 15))

Adjusting the figure size makes all of the histograms easier to read. The distributions for Per Game and Per 100 Attempts look nearly normal, which suggests that parametric tests may be appropriate for later analysis.
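
A visual impression of normality can be supplemented with a quick numeric check. The sketch below is one possible approach, not the method used in this post: it reports skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The helper name `shape_summary` and the two example series are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def shape_summary(series):
    """Skewness and excess kurtosis; both are close to 0 for normal data."""
    return {"skew": series.skew(), "excess_kurtosis": series.kurt()}

# Hypothetical values, for illustration only (not taken from the dataset).
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])

print(shape_summary(symmetric))     # skew of 0: perfectly symmetric
print(shape_summary(right_skewed))  # positive skew: long right tail
```

With the real data, this could be called as `shape_summary(qbs['Per Game'])`. Values near zero are consistent with a normal shape but do not prove it, so they are best read alongside the histogram.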

Another visual that helps show the distribution is a boxplot (box-and-whisker plot). This time, let us look at just Per Game and Per 100 Attempts.

Python
qbs.boxplot(column=['Per Game', 'Per 100 Att'], figsize=(10,10))

The wonderful thing about boxplots is that they make it easy to identify outliers – observations that are outside the whiskers (either top or bottom). You can also easily see the centrality and spread of the data.  
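
The points a boxplot draws beyond its whiskers can also be listed numerically. By default, matplotlib (which pandas uses for plotting) extends the whiskers to the most extreme observations within 1.5 times the interquartile range of the box. The sketch below applies that same rule; the helper name `iqr_outliers` and the example rates are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical per-game rates, for illustration only (not taken from the dataset).
rates = pd.Series([0.10, 0.12, 0.15, 0.16, 0.18, 0.20, 0.60])
outliers = iqr_outliers(rates)
print(outliers.tolist())
```

On the real data, `iqr_outliers(qbs['Per Game'])` would return the rows plotted as individual points, which makes it easy to see which quarterbacks they belong to.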

One additional variable that we should look at is ‘qboc’. Did you notice that it had an interesting distribution when the histograms were plotted above? Let’s take a closer look.

Python
qbs['qboc'].hist(figsize=(10, 15))

As you can see, there are only two values for this variable. This makes sense, since we are categorizing the quarterbacks based on whether they are (1) or are not (0) a quarterback of color. I am going to do a bit of foreshadowing: if we wanted to do any sort of predictive analysis with this variable, we would need to think about models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis:

  • Per Game
  • Per 100 Attempts
  • qboc
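
Because qboc takes only two values, a frequency table describes it more directly than a histogram. The sketch below assumes a reconstructed stand-in for the column: the describe() output reports 66 rows with a mean of 0.272727, which corresponds to 18 ones and 48 zeros. With the dataset loaded, the same calls would be made on `qbs['qboc']`.

```python
import pandas as pd

# Reconstructed stand-in for qbs['qboc']: describe() reports 66 rows with a mean
# of 0.272727, which corresponds to 18 ones and 48 zeros (18 / 66 = 0.2727...).
qboc = pd.Series([0] * 48 + [1] * 18, name="qboc")

counts = qboc.value_counts()                 # number of quarterbacks in each group
shares = qboc.value_counts(normalize=True)   # proportion in each group
print(counts)
print(shares.round(3))
```

For a 0/1 variable the mean is simply the proportion of ones, which is why the mean reported by describe() is enough to recover the group sizes.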

In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.

The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them candidates for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks by whether or not they are a quarterback of color, takes only two values, so any predictive modeling that uses it will need a method suited to binary data.

This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls for NFL quarterbacks.