Exploring the Bivariate Landscape: Navigating Relationships in EDA

If you are still with me, we are now entering the phase of analysis where the rubber begins to meet the road to eventually get to the answer to the original research question: Is there a connection between a quarterback’s race and the number of times they get a Roughing the Passer (RtP) call in their favor? So far, we have completed the following steps in our EDA journey:

Now, we are diving into Bivariate Analysis. In this step, we will examine relationships between pairs of variables. Utilizing scatter plots, correlation matrices, and other tools, we will investigate how variables interact with each other, aiming to identify correlations, dependencies, or trends of interest.
For this phase, I will once again employ R to gain additional practice. The initial step involves importing the necessary libraries for this part of the analysis, as well as the QBs dataset we’ve been working with.

R
install.packages("ggcorrplot")
library(ggcorrplot)
library(ggplot2)

url <- "https://raw.githubusercontent.com/lareynolds/QB-EDA/main/cleaned_qbs.csv"

qbs = read.csv(file = url)

Now it may seem pretty easy and straight forward to follow the steps and just do what I just did. But what you don’t see is the extra tabs I have open to help through the process. For example, At first I tried to import the dataset as an Excel file, but that made the system throw some errors. So then, I thought it would be easier to import the file as a .csv file, and it sort of was, but I still got a bunch of errors. Apparently Google Colab doesn’t like to import files straight from GitHub as is, so off to Stack Overflow or Toward Data Science I go. In case you haven’t read anything on TDS, you should. This is a gem of quick tips and tricks for everything in data science.

Thanks to A Apte’s TDS article “Get Started: 3 Ways to Load CSV files into Colab”, the answer is:

“Click on the dataset in your repository, then click on View Raw. Copy the link to the raw dataset and store it as a string variable called url in Colab.”

“Get Started: 3 Ways to Load CSV files into Colab”
R
# Double Checking to make sure the data is loaded into the notebook
str(qbs)

#Output
'data.frame':	66 obs. of  26 variables:
 $ Player      : chr  "A.Dalton" "A.Luck" "A.Rodgers" "A.Smith" ...
 $ Total       : int  25 17 41 23 12 9 6 18 2 30 ...
 $ X2009       : int  0 0 3 1 0 0 1 0 0 3 ...
 $ X2010       : int  0 0 3 1 0 0 1 0 0 0 ...
 $ X2011       : int  4 0 3 1 0 2 0 0 0 3 ...
 $ X2012       : int  1 5 5 1 0 1 0 0 0 1 ...
 $ X2013       : int  1 3 2 1 0 0 0 0 0 1 ...
 $ X2014       : int  5 4 2 6 1 0 1 0 0 4 ...
 $ X2015       : int  0 1 5 4 2 2 1 0 1 2 ...
 $ X2016       : int  1 1 4 1 1 1 0 0 0 1 ...
 $ X2017       : int  1 0 0 5 7 1 1 0 0 3 ...
 $ X2018       : int  1 3 1 2 1 0 0 3 1 4 ...
 $ X2019       : int  6 0 3 0 0 0 1 4 0 0 ...
 $ X2020       : int  0 0 2 0 0 1 0 4 0 2 ...
 $ X2021       : int  3 0 5 0 0 1 0 5 0 6 ...
 $ X2022       : int  2 0 3 0 0 0 0 2 0 0 ...
 $ X2023       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Games       : int  170 94 228 149 79 62 60 74 44 190 ...
 $ Per.Game    : num  0.15 0.18 0.18 0.15 0.15 0.15 0.1 0.24 0.05 0.16 ...
 $ Attempts    : int  5557 3620 7840 4648 2719 1534 1554 2330 1230 7035 ...
 $ Per.100.Att : num  0.45 0.47 0.52 0.49 0.44 0.59 0.39 0.77 0.16 0.43 ...
 $ Sacked      : int  361 186 542 367 201 139 92 171 84 395 ...
 $ Per.Sack    : num  0.069 0.091 0.076 0.063 0.06 0.065 0.065 0.105 0.024 0.076 ...
 $ Sack.Per.Att: num  0.065 0.051 0.069 0.079 0.074 0.091 0.059 0.073 0.068 0.056 ...
 $ Third.Down..: num  40 29.4 39 26.1 33.3 ...
 $ qboc        : int  0 0 0 0 0 0 0 0 0 0 ...
 
 # Checking the summary statistics of each of the variables in the dataset
summary(qbs)

#Output
    Player              Total           X2009           X2010       
 Length:66          Min.   : 2.00   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.: 8.25   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :15.50   Median :0.000   Median :0.0000  
                    Mean   :18.09   Mean   :0.697   Mean   :0.8333  
                    3rd Qu.:24.75   3rd Qu.:1.000   3rd Qu.:1.0000  
                    Max.   :57.00   Max.   :5.000   Max.   :6.0000  
     X2011           X2012           X2013           X2014      
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
 Median :0.000   Median :0.000   Median :0.000   Median :0.500  
 Mean   :1.121   Mean   :1.258   Mean   :1.076   Mean   :1.333  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
 Max.   :8.000   Max.   :6.000   Max.   :8.000   Max.   :7.000  
     X2015           X2016           X2017           X2018      
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
 Median :1.000   Median :1.000   Median :0.000   Median :1.000  
 Mean   :1.455   Mean   :1.288   Mean   :1.303   Mean   :1.561  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000  
 Max.   :7.000   Max.   :6.000   Max.   :7.000   Max.   :7.000  
     X2019            X2020            X2021            X2022      
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.000  
 Median : 0.000   Median : 0.000   Median : 0.000   Median :0.000  
 Mean   : 1.773   Mean   : 1.545   Mean   : 1.742   Mean   :1.076  
 3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.:2.000  
 Max.   :10.000   Max.   :11.000   Max.   :10.000   Max.   :6.000  
     X2023            Games           Per.Game         Attempts   
 Min.   :0.0000   Min.   : 42.00   Min.   :0.0500   Min.   : 289  
 1st Qu.:0.0000   1st Qu.: 61.25   1st Qu.:0.1125   1st Qu.:1794  
 Median :0.0000   Median : 77.50   Median :0.1650   Median :2316  
 Mean   :0.0303   Mean   : 99.53   Mean   :0.1768   Mean   :3250  
 3rd Qu.:0.0000   3rd Qu.:138.00   3rd Qu.:0.2200   3rd Qu.:4476  
 Max.   :1.0000   Max.   :253.00   Max.   :0.3500   Max.   :9725  
  Per.100.Att         Sacked         Per.Sack        Sack.Per.Att    
 Min.   :0.1300   Min.   : 26.0   Min.   :0.02400   Min.   :0.03000  
 1st Qu.:0.3900   1st Qu.:131.5   1st Qu.:0.05850   1st Qu.:0.05800  
 Median :0.5500   Median :170.0   Median :0.07800   Median :0.06900  
 Mean   :0.5736   Mean   :211.2   Mean   :0.08342   Mean   :0.07009  
 3rd Qu.:0.7400   3rd Qu.:264.5   3rd Qu.:0.10725   3rd Qu.:0.08350  
 Max.   :1.2400   Max.   :542.0   Max.   :0.20200   Max.   :0.10300  
  Third.Down..        qboc       
 Min.   : 0.00   Min.   :0.0000  
 1st Qu.:25.00   1st Qu.:0.0000  
 Median :30.77   Median :0.0000  
 Mean   :31.14   Mean   :0.2727  
 3rd Qu.:39.76   3rd Qu.:1.0000  
 Max.   :77.78   Max.   :1.0000  
 
 # Creates a subset dataframe with the numeric variables
df = subset(qbs, select = c(Per.Game, Per.100.Att, Per.Sack, Sack.Per.Att, qboc)) # To drop columns use a '-' before the c and list the columns to drop

# Create the scatter plot
x <- qbs$Per.Game # x is the column with the RtP values per game for each quarterback
y <- qbs$qboc # y is either the 1 or 0 value to indicate if the quarterback is black or not

plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "QB of Color", col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red") # the 'lm' in this line is calling a liner model to fit a line that best fits the data by reducing the error

Notice that the qboc is either a 1 or a 0. This is a binary variable. This might cause some implications later on. But for right now, it does appear that there is a negative association between the quaterbacks’ race and the number of RtP calls made in the quarterback’s favor. This means that the number of RtP calls decrease for quarterbacks of color.

In the Univariate Analysis post, I decided that I would examine the Per.Game and qboc variables in this analysis, but after thinking about it, I decided it might be worth taking a look to see if maybe there was another reason for the number of RtP calls. Could a quarterback that gets sacked a lot also draw more RtP calls? Could this be an indication that the offensive line is not as good and can’t protect the quarterback as well? So I’m also going to do a scatter plot for the percentage of sacks by the number of RtP calls per game.

R
# Create the scatter plot
x <- qbs$Per.Game
y <- qbs$Per.Sack
plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "Sacks per Game", col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red")
# This is the same code as above, but I switched out the variable for y.

There does indeed seem to be a strong positive linear connection between the number of sacks and the number of RtP calls. One other thing to notice is that there may be some outliers for these variables. I would like to see if this is true so I am going to use the Interquartile Range (IQR) Method to determine if this is the case. This method involves calculating the IQR for your data and identifying data points outside the “whiskers” of the box plot (usually 1.5 times the IQR).

R
# Calculate the IQR for 'y' variable
q <- quantile(y, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Set a threshold for outliers
threshold <- 1.5

# Identify outliers
outliers <- y < (q[1] - threshold * iqr) | y > (q[2] + threshold * iqr)

# Print the indices of outlier data points
which(outliers)

# Output
56

value = qbs[56, 1]

value

# Output
'R.Fitzpatrick'

Well that is interesting, apparently Ryan Fitzpatrick took enough sacks to literally send his stats off the charts. In this case identifying the outlier was a “good to know” exercise. However, in other datasets outliers may indicate errors in the data and should be examined further. Now let’s get back to our task at hand.

Let’s see if there is a correlation between our numeric variables. It does look like there is some level of negative correlation between the number of RtP calls and the quarterback’s race. And the scatter plot of the Sacks and RtP calls seem to be very correlated. We can create a correlation matrix to visualize the correlation coefficient between each of the variables.

R
corr_matrix <- cor(df)
corr_plot<- ggcorrplot(corr_matrix,
           type = "lower",  # "lower" displays the lower triangular part of the matrix
           hc.order = TRUE, # reorder variables based on hierarchical clustering
           lab = TRUE,      # display variable names
           title = "Correlation Matrix Plot",
           ggtheme = theme_minimal()) # choose a theme for the plot (optional)


# Change the size of the display window
options(repr.plot.width = 10, repr.plot.height = 10)

# Display the correlation matrix plot
print(corr_plot)
R
RtP_correlation <- cor(qbs$Per.Game, qbs$qboc)
RtP_correlation

# Output
-0.0523628105182807

RtP_Sack_correlation <- cor(qbs$Per.Game, qbs$Per.Sack)
RtP_Sack_correlation

# Output
0.89240005509169

We can draw some conclusions from the correlation matrix, such as it appears as though quarterbacks of color get sacked more, but benefit less from RtP calls than their white counterparts. But don’t forget, the qboc variable is binary, so we need to make sure that we aren’t drawing some erroneous conclusions by using statistical methods meant to be used on continuous data.

For this reason, I am also going to also look at the point-biserial correlation coefficient. This is a correlation coefficient used to measure the strength and direction of the linear relationship between a binary (dichotomous) variable (0 or 1) and a continuous variable.. It is an adapted form of the Pearson correlation coefficient, which is used for two continuous variables. The point-biserial correlation is appropriate when one of the variables is binary, and the other is continuous.

Here are the key points about the point-biserial correlation coefficient:

Purpose: The point-biserial correlation is used to determine if there is a significant linear relationship between a binary predictor variable (e.g., yes/no, pass/fail) and a continuous outcome variable (e.g., test scores, income).

Range: The point-biserial correlation coefficient ranges from -1 to 1, similar to the Pearson correlation coefficient. A positive value indicates a positive relationship (as the binary variable increases, the continuous variable tends to increase), while a negative value indicates a negative relationship (as the binary variable increases, the continuous variable tends to decrease).

Interpretation:

  • A coefficient close to 1 or -1 suggests a strong linear relationship.
  • A coefficient close to 0 suggests a weak or no linear relationship.
  • The sign (+ or -) indicates the direction of the relationship.

Assumptions: Like the Pearson correlation, the point-biserial correlation assumes linearity and normality. It is also sensitive to outliers.

Use Cases:

  • In research, it is used to examine associations between a binary predictor variable (e.g., gender, treatment group) and a continuous outcome variable (e.g., test scores, response time).
  • Commonly used in educational research (e.g., comparing the performance of two groups).
  • Often used in psychology to assess the relationship between a binary variable (e.g., presence/absence of a condition) and a continuous measure (e.g., anxiety levels).
R
# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])

sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt(((sd_0^2 + sd_1^2) / 2))

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)

# Output
-0.2702969

The biserial_correlation does indicate that there is an association between the number of RtP called in favor of the quarterback is inversely related to their race. However, the strength of the correlation may be a little week given how far -0.27 is from -1. Biserial correlations are sensitive to outliers, so let’s see what happens if we remove Ryan Fitzpatrick from the dataset… just for giggles… and re-run the correlation before moving on to testing to see if the correlation is statistically significant.

R
row_index_to_remove <- 56
new_qbs <- qbs[-row_index_to_remove, ]

# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- new_qbs$qboc
continuous_variable <- new_qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])

sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt(((sd_0^2 + sd_1^2) / 2))

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)

# Output
-0.1595576

So that is interesting that the strength of the correlation decreased by removing the outlier. Since I want as much data as possible, and there was no error in the data for the outlier, I am going to use the entire dataset when testing to see if the correlation is statistically significant.

The next step in this process is hypothesis testing. In this study:

  • Ho = There is no correlation between a quarterback’s race and the number of RtP calls made in his favor
  • Ha = There is a correlation between race and RtP calls

The alpha for this test will be 0.05.

R
categorical_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Compute the Point-Biserial Correlation and perform the test
correlation_test <- cor.test(categorical_variable, continuous_variable)

# Check the p-value
p_value <- correlation_test$p.value

# Set your significance level (e.g., 0.05)
alpha <- 0.05

# Determine if the correlation is statistically significant
if (p_value < alpha) {
  cat("The correlation is statistically significant (p-value:", p_value, ")\n")
} else {
  cat("The correlation is not statistically significant (p-value:", p_value, ")\n")
}

# Output
The correlation is not statistically significant (p-value: 0.6762716)

There you have it, folks. According to the findings of our analysis, the NFL is doing a good job of minimizing bias in penalty calls.

Summary

In this phase of the EDA, I examined the relationship between a quarterback’s race and the number of RtP calls made in favor of the QB. I utilized scatter plots with trend lines to visualize potential relationships, employed a correlation matrix to identify the strength and direction of these relationships, checked for outliers to assess their impact on the relationship’s strength, ensured the use of the appropriate correlation technique to account for a categorical independent variable, and conducted hypothesis testing to determine the statistical significance of the correlation.

Cruising the Data Landscape: Exploring the Fundamentals of Data Exploration

Recap

Well, well, well! Looks like we’ve embarked on quite the adventure here! Our first task was to get to the bottom of the burning question: “Do quarterbacks of color get the short end of the stick when it comes to roughing the passer penalties compared to their fair-skinned counterparts?” Exciting stuff, huh?

The next step in our journey involved gathering the all-important data. We needed to make sure we were dealing with top-notch, premium-quality data. None of that shady, questionable stuff for us! We demanded data that was reliable, traceable, and oh-so-accurate. Because in the world of data science, it’s all about that golden rule: “garbage in, garbage out!” Can’t have our analysis going awry now, can we?

For this particular project, we scoured the depths of NFL Penalties and even tapped into the vast knowledge reserves of the mighty Wikipedia. We left no stone unturned, my friend!

Now, onto the thrilling next stage of our expedition – data cleaning and reshaping! Brace yourself, because this is where things get spicy. We had to create a nifty variable to indicate the race of those quarterbacks. Easy-peasy, right? Well, almost. Turns out, we stumbled upon a little naming convention conundrum that resulted in a sneaky duplicate value. But fear not! With our eagle-eyed attention to detail, we swiftly detected and corrected that pesky error. Crisis averted!

And now, dear explorer, we venture forth to the exhilarating realm of Data Exploration. Excited? You should be! There’s so much more to discover and uncover on this grand data-driven expedition! So, buckle up and get ready for the thrill ride ahead! Let’s dive into the vast ocean of information and see what fascinating insights await us!

Data exploration is an essential step in the data analysis process. It allows us to gain an initial understanding of the dataset by examining its general properties and characteristics. By delving into the data, we can uncover valuable insights.

One of the key aspects of data exploration involves assessing the size of the dataset. This includes understanding the number of observations or records and the number of variables or features present in the dataset. Knowing the scope of the dataset helps us gauge its comprehensiveness and potential usefulness for our analysis.

Furthermore, examining the data types is crucial in data exploration. By identifying the types of data contained in the dataset, such as numerical, categorical, or textual, we can determine the appropriate statistical techniques or visualization methods to apply during analysis. This enables us to make accurate interpretations and draw meaningful conclusions from the data.

Another crucial aspect is the identification of initial patterns or trends within the dataset. Exploring the data can reveal inherent relationships, correlations, or anomalies that may exist. By uncovering these patterns, we can develop hypotheses, generate insights, and pose relevant research questions for further investigation.

You may have noticed that I talk about using Python a lot in my work. I actually prefer to use in my day job since it seems to get the most use and support from other data professionals. However, for this portion of my project, I’m going to be using R instead. It has been some time since I’ve used R (maybe before I even finished my master’s program) so full discloser, I’m pretty sure I had to look up most of that I’m doing to refresh my memory. But since it is a language, if you don’t use it, you lose it, so I’m going to shake things up a bit and revisit R.

Data Exploration

The first things I need to do is load in any packages I need and my dataset. I am loading the ‘cleaned_qbs.xlsx’ Excel file that I created during the data cleaning process.

R
# import and store the dataset
library(openxlsx)

qbs = read.xlsx("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")

# Check to make sure the data loaded correctly
head(qbs, n=5)

#Output
A data.frame: 5 × 26
Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	A.Dalton	25	0	0	4	1	1	5	0	1	...	0	170	0.15	5557	0.45	361	0.069	0.065	40.00	0
2	A.Luck	17	0	0	0	5	3	4	1	1	...	0	94	0.18	3620	0.47	186	0.091	0.051	29.41	0
3	A.Rodgers	41	3	3	3	5	2	2	5	4	...	0	228	0.18	7840	0.52	542	0.076	0.069	39.02	0
4	A.Smith	23	1	1	1	1	1	6	4	1	...	0	149	0.15	4648	0.49	367	0.063	0.079	26.09	0
5	B.Bortles	12	0	0	0	0	0	1	2	1	...	0	79	0.15	2719	0.44	201	0.060	0.074	33.33	0

Looking over the raw data, it is hard to see any particular patterns in the data. However, a few data points do jump out as they are either larger or smaller than the data around the point. For example, the number of RTP penalties called each year numbers in the single digits with the exception of 2020 for the quarterback Josh Allen. That year he received 11 RTP called in his favor. Another example is the number of sacks recorded against Taysom Hill. Most of the QBs have a sack count in the triple digits, while only a few have double digit sack counts and Hill has the lowest of any QB in the data set. The next thing I want to do is check how many records are in my table and the total number of variables. It is helpful to know what your sample size is so you can select the correct analytic approach later on.

R
nrow(qbs)

# Output
66

length(qbs)

#Output
26

The next step is to understand the data types of each of the variables. Luckily, when the head of the table was printed, it also included that data types for each of the variables. In our cleaned_qbs table, we have one variable with a character data type (Player) and 25 variables with the data type of ‘dbl’, otherwise known float.

R
#Output
A data.frame: 5 × 26
Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>

Next Steps

Now that we understand how much and what kind of data we have, we can develop a game plan on what analytic techniques we can take in the next step. With Univariate Analysis, Bivariate Analysis, and Multivariate Analysis we will begin to explore the data and any relationships between the variables that might exist. This will get us closer to a possible answer to our original question.