Voyage Through Univariate Analysis: Charting the Solo Attributes of Roughing the Passer Penalties in the NFL
Univariate analysis is a foundational step in data analysis that focuses on examining and summarizing a single variable or attribute from a dataset. Understanding individual variables lays the groundwork for more advanced multivariate analyses and modeling, and it helps identify patterns, outliers, and potential data quality issues early in the pipeline.
import pandas as pd
# import and store the dataset (reading an .xlsx file requires an Excel engine such as openpyxl to be installed)
qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
print(qbs.head())
#Output
Player Total 2009 2010 2011 2012 2013 2014 2015 2016 ... \
0 A.Dalton 25 0 0 4 1 1 5 0 1 ...
1 A.Luck 17 0 0 0 5 3 4 1 1 ...
2 A.Rodgers 41 3 3 3 5 2 2 5 4 ...
3 A.Smith 23 1 1 1 1 1 6 4 1 ...
4 B.Bortles 12 0 0 0 0 0 1 2 1 ...
2023 Games Per Game Attempts Per 100 Att Sacked Per Sack \
0 0 170 0.15 5557 0.45 361 0.069
1 0 94 0.18 3620 0.47 186 0.091
2 0 228 0.18 7840 0.52 542 0.076
3 0 149 0.15 4648 0.49 367 0.063
4 0 79 0.15 2719 0.44 201 0.060
Sack Per Att Third Down % qboc
0 0.065 40.00 0
1 0.051 29.41 0
2 0.069 39.02 0
3 0.079 26.09 0
4 0.074 33.33 0
First, we should check the variable datatypes and whether there are any null or missing values.
qbs.info(verbose=True)
#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Player 66 non-null object
1 Total 66 non-null int64
2 2009 66 non-null int64
3 2010 66 non-null int64
4 2011 66 non-null int64
5 2012 66 non-null int64
6 2013 66 non-null int64
7 2014 66 non-null int64
8 2015 66 non-null int64
9 2016 66 non-null int64
10 2017 66 non-null int64
11 2018 66 non-null int64
12 2019 66 non-null int64
13 2020 66 non-null int64
14 2021 66 non-null int64
15 2022 66 non-null int64
16 2023 66 non-null int64
17 Games 66 non-null int64
18 Per Game 66 non-null float64
19 Attempts 66 non-null int64
20 Per 100 Att 66 non-null float64
21 Sacked 66 non-null int64
22 Per Sack 66 non-null float64
23 Sack Per Att 66 non-null float64
24 Third Down % 66 non-null float64
25 qboc 66 non-null int64
dtypes: float64(5), int64(20), object(1)
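The info() output already shows 66 non-null entries for every column, so nothing appears to be missing. If you want an explicit double-check, a minimal sketch like the following sums the nulls in each column (it should print zeros across the board):
# explicit missing-value check: count nulls in every column
print(qbs.isnull().sum())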
The variables I will examine more closely are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per game played, looking at their distribution, central tendency, and variance. I will be using Python and Jupyter Notebooks.
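Before running the full summary table, here is a small sketch that pulls the central tendency and spread for just those focus columns (assuming the column names shown in the info() output above):
# central tendency (mean, median) and spread (std, variance) for the focus columns
focus_cols = ['Total', 'Per Game', 'Per 100 Att']
print(qbs[focus_cols].agg(['mean', 'median', 'std', 'var']))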
table_stats = qbs.describe()
print(table_stats)
#Output
Total 2009 2010 2011 2012 2013 \
count 66.000000 66.000000 66.000000 66.000000 66.000000 66.000000
mean 18.090909 0.696970 0.833333 1.121212 1.257576 1.075758
std 12.505663 1.176301 1.452672 1.767355 1.774266 1.825550
min 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 8.250000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 15.500000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 24.750000 1.000000 1.000000 2.000000 2.000000 1.000000
max 57.000000 5.000000 6.000000 8.000000 6.000000 8.000000
2014 2015 2016 2017 ... 2023 Games \
count 66.000000 66.000000 66.000000 66.00000 ... 66.000000 66.000000
mean 1.333333 1.454545 1.287879 1.30303 ... 0.030303 99.530303
std 1.842518 1.832878 1.698553 1.75385 ... 0.172733 53.915952
min 0.000000 0.000000 0.000000 0.00000 ... 0.000000 42.000000
25% 0.000000 0.000000 0.000000 0.00000 ... 0.000000 61.250000
50% 0.500000 1.000000 1.000000 0.00000 ... 0.000000 77.500000
75% 2.000000 2.000000 2.000000 2.00000 ... 0.000000 138.000000
max 7.000000 7.000000 6.000000 7.00000 ... 1.000000 253.000000
Per Game Attempts Per 100 Att Sacked Per Sack \
count 66.000000 66.000000 66.000000 66.000000 66.000000
mean 0.176818 3250.303030 0.573636 211.212121 0.083424
std 0.073801 2085.250348 0.241951 117.910594 0.034212
min 0.050000 289.000000 0.130000 26.000000 0.024000
25% 0.112500 1794.500000 0.390000 131.500000 0.058500
50% 0.165000 2315.500000 0.550000 170.000000 0.078000
75% 0.220000 4476.000000 0.740000 264.500000 0.107250
max 0.350000 9725.000000 1.240000 542.000000 0.202000
Sack Per Att Third Down % qboc
count 66.000000 66.000000 66.000000
mean 0.070091 31.143788 0.272727
std 0.016449 13.872813 0.448775
min 0.030000 0.000000 0.000000
25% 0.058000 25.000000 0.000000
50% 0.069000 30.770000 0.000000
75% 0.083500 39.755000 1.000000
max 0.103000 77.780000 1.000000
[8 rows x 25 columns]
Well, that was easy. The .describe() function lets us examine the dataset's central tendency (the mean, in this case) and its spread (the standard deviation, along with the quartiles that describe dispersion). Although all of the information needed to judge dispersion is present, it can be hard to conceptualize the shape of a distribution without a visualization. Histograms can aid in our observations.
qbs.hist(figsize=(10, 15))
Adjusting the figure size makes all of the histogram outputs easier to view. The distributions for Per Game and Per 100 Attempts look nearly normal, which will allow us to use some parametric tests in the analysis.
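To back up that visual impression with a number, a quick sketch of a Shapiro-Wilk normality test looks like this (it assumes SciPy is installed; a p-value above 0.05 is consistent with normality):
from scipy import stats

# Shapiro-Wilk test for the two near-normal-looking columns
for col in ['Per Game', 'Per 100 Att']:
    stat, p = stats.shapiro(qbs[col])
    print(f"{col}: W = {stat:.3f}, p = {p:.3f}")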
Another visual that is helpful for seeing the distribution is a boxplot (box-and-whisker plot). This time, let us look at just Per Game and Per 100 Attempts.
qbs.boxplot(column=['Per Game', 'Per 100 Att'], figsize=(10,10))
The wonderful thing about boxplots is that they make it easy to identify outliers – observations that are outside the whiskers (either top or bottom). You can also easily see the centrality and spread of the data.
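If you want the actual rows behind those outlier points, here is a small sketch using the same 1.5 × IQR rule the default whiskers are based on (column names taken from this dataset):
# flag observations more than 1.5 * IQR beyond the quartiles for Per Game
q1 = qbs['Per Game'].quantile(0.25)
q3 = qbs['Per Game'].quantile(0.75)
iqr = q3 - q1
outliers = qbs[(qbs['Per Game'] < q1 - 1.5 * iqr) | (qbs['Per Game'] > q3 + 1.5 * iqr)]
print(outliers[['Player', 'Per Game']])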
One additional variable that we should look at is 'qboc'. Did you notice the interesting distribution when the histograms were plotted above? Let's take a closer look.
qbs['qboc'].hist(figsize=(10, 15))
As you can see, there are only two values for this variable. This makes sense, since we are categorizing the quarterbacks based on whether they are (1) or are not (0) a quarterback of color. To do a bit of foreshadowing: this means that if we wanted to do any sort of predictive analysis, we would need to think about some additional models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis (a quick sanity check follows the list):
- Per Game
- Per 100 Attempts
- qboc
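Here is that sanity check: a short sketch that confirms the binary split in qboc and summarizes the candidate variables together (column names taken from the dataset above):
# how many quarterbacks fall into each qboc category
print(qbs['qboc'].value_counts())

# summary statistics for the variables carried forward to the multivariate analysis
print(qbs[['Per Game', 'Per 100 Att', 'qboc']].describe())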
In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.
The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them suitable for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks by whether they are a quarterback of color, showed a binary distribution, indicating potential utility in predictive modeling.
This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls in NFL quarterbacks.