If you are still with me, we are now entering the phase of analysis where the rubber meets the road on the way to answering the original research question: Is there a connection between a quarterback’s race and the number of times he gets a Roughing the Passer (RtP) call in his favor? So far, we have completed the earlier steps of our EDA journey, most recently univariate analysis.

Now, we are diving into Bivariate Analysis. In this step, we will examine relationships between pairs of variables. Utilizing scatter plots, correlation matrices, and other tools, we will investigate how variables interact with each other, aiming to identify correlations, dependencies, or trends of interest.
For this phase, I will once again employ R to gain additional practice. The initial step involves importing the necessary libraries for this part of the analysis, as well as the QBs dataset we’ve been working with.

Now, it may seem pretty easy and straightforward to follow the steps and do what I just did. But what you don’t see is the extra tabs I have open to help me through the process. For example, at first I tried to import the dataset as an Excel file, but that made the system throw some errors. Then I thought it would be easier to import the file as a .csv, and it sort of was, but I still got a bunch of errors. Apparently Google Colab doesn’t like to import files straight from GitHub as is, so off to Stack Overflow or Towards Data Science (TDS) I went. If you haven’t read anything on TDS, you should. It is a gem of quick tips and tricks for everything in data science.

“Click on the dataset in your repository, then click on View Raw. Copy the link to the raw dataset and store it as a string variable called url in Colab.”
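Putting that tip into practice, the import step looks roughly like this. This is a sketch, not my exact notebook cell: the URL below is a placeholder for the raw link to the dataset in your own repository, and I’m assuming base R’s read.csv() for the import.

```r
# Library used later in this analysis for the correlation matrix plot
library(ggcorrplot)

# Placeholder: replace with the "View Raw" link to the dataset in your repo
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/qbs.csv"

# Read the CSV straight from GitHub into a data frame
qbs <- read.csv(url)

# Quick check that the data loaded
dim(qbs)
```

read.csv() works the same whether the path is a local file or a raw GitHub URL, which is why copying the “View Raw” link is the trick that makes Colab happy.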

R

# Double checking to make sure the data is loaded into the notebook
str(qbs)

# Output
'data.frame': 66 obs. of 26 variables:
 $ Player      : chr "A.Dalton" "A.Luck" "A.Rodgers" "A.Smith" ...
 $ Total       : int 25 17 41 23 12 9 6 18 2 30 ...
 $ X2009       : int 0 0 3 1 0 0 1 0 0 3 ...
 $ X2010       : int 0 0 3 1 0 0 1 0 0 0 ...
 $ X2011       : int 4 0 3 1 0 2 0 0 0 3 ...
 $ X2012       : int 1 5 5 1 0 1 0 0 0 1 ...
 $ X2013       : int 1 3 2 1 0 0 0 0 0 1 ...
 $ X2014       : int 5 4 2 6 1 0 1 0 0 4 ...
 $ X2015       : int 0 1 5 4 2 2 1 0 1 2 ...
 $ X2016       : int 1 1 4 1 1 1 0 0 0 1 ...
 $ X2017       : int 1 0 0 5 7 1 1 0 0 3 ...
 $ X2018       : int 1 3 1 2 1 0 0 3 1 4 ...
 $ X2019       : int 6 0 3 0 0 0 1 4 0 0 ...
 $ X2020       : int 0 0 2 0 0 1 0 4 0 2 ...
 $ X2021       : int 3 0 5 0 0 1 0 5 0 6 ...
 $ X2022       : int 2 0 3 0 0 0 0 2 0 0 ...
 $ X2023       : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Games       : int 170 94 228 149 79 62 60 74 44 190 ...
 $ Per.Game    : num 0.15 0.18 0.18 0.15 0.15 0.15 0.1 0.24 0.05 0.16 ...
 $ Attempts    : int 5557 3620 7840 4648 2719 1534 1554 2330 1230 7035 ...
 $ Per.100.Att : num 0.45 0.47 0.52 0.49 0.44 0.59 0.39 0.77 0.16 0.43 ...
 $ Sacked      : int 361 186 542 367 201 139 92 171 84 395 ...
 $ Per.Sack    : num 0.069 0.091 0.076 0.063 0.06 0.065 0.065 0.105 0.024 0.076 ...
 $ Sack.Per.Att: num 0.065 0.051 0.069 0.079 0.074 0.091 0.059 0.073 0.068 0.056 ...
 $ Third.Down..: num 40 29.4 39 26.1 33.3 ...
 $ qboc        : int 0 0 0 0 0 0 0 0 0 0 ...

# Checking the summary statistics of each of the variables in the dataset
summary(qbs)

# Output
    Player              Total           X2009           X2010
 Length:66          Min.   : 2.00   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.: 8.25   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :15.50   Median :0.000   Median :0.0000
                    Mean   :18.09   Mean   :0.697   Mean   :0.8333
                    3rd Qu.:24.75   3rd Qu.:1.000   3rd Qu.:1.0000
                    Max.   :57.00   Max.   :5.000   Max.   :6.0000

     X2011           X2012           X2013           X2014
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
 Median :0.000   Median :0.000   Median :0.000   Median :0.500
 Mean   :1.121   Mean   :1.258   Mean   :1.076   Mean   :1.333
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000
 Max.   :8.000   Max.   :6.000   Max.   :8.000   Max.   :7.000

     X2015           X2016           X2017           X2018
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
 Median :1.000   Median :1.000   Median :0.000   Median :1.000
 Mean   :1.455   Mean   :1.288   Mean   :1.303   Mean   :1.561
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000
 Max.   :7.000   Max.   :6.000   Max.   :7.000   Max.   :7.000

     X2019            X2020            X2021            X2022
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.000
 Median : 0.000   Median : 0.000   Median : 0.000   Median :0.000
 Mean   : 1.773   Mean   : 1.545   Mean   : 1.742   Mean   :1.076
 3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.:2.000
 Max.   :10.000   Max.   :11.000   Max.   :10.000   Max.   :6.000

     X2023             Games           Per.Game         Attempts
 Min.   :0.0000   Min.   : 42.00   Min.   :0.0500   Min.   : 289
 1st Qu.:0.0000   1st Qu.: 61.25   1st Qu.:0.1125   1st Qu.:1794
 Median :0.0000   Median : 77.50   Median :0.1650   Median :2316
 Mean   :0.0303   Mean   : 99.53   Mean   :0.1768   Mean   :3250
 3rd Qu.:0.0000   3rd Qu.:138.00   3rd Qu.:0.2200   3rd Qu.:4476
 Max.   :1.0000   Max.   :253.00   Max.   :0.3500   Max.   :9725

  Per.100.Att         Sacked         Per.Sack        Sack.Per.Att
 Min.   :0.1300   Min.   : 26.0   Min.   :0.02400   Min.   :0.03000
 1st Qu.:0.3900   1st Qu.:131.5   1st Qu.:0.05850   1st Qu.:0.05800
 Median :0.5500   Median :170.0   Median :0.07800   Median :0.06900
 Mean   :0.5736   Mean   :211.2   Mean   :0.08342   Mean   :0.07009
 3rd Qu.:0.7400   3rd Qu.:264.5   3rd Qu.:0.10725   3rd Qu.:0.08350
 Max.   :1.2400   Max.   :542.0   Max.   :0.20200   Max.   :0.10300

  Third.Down..        qboc
 Min.   : 0.00   Min.   :0.0000
 1st Qu.:25.00   1st Qu.:0.0000
 Median :30.77   Median :0.0000
 Mean   :31.14   Mean   :0.2727
 3rd Qu.:39.76   3rd Qu.:1.0000
 Max.   :77.78   Max.   :1.0000

# Creates a subset dataframe with the numeric variables
df = subset(qbs, select = c(Per.Game, Per.100.Att, Per.Sack, Sack.Per.Att, qboc))
# To drop columns instead, put a '-' before the c() and list the columns to drop

# Create the scatter plot
x <- qbs$Per.Game  # x is the column with the RtP calls per game for each quarterback
y <- qbs$qboc      # y is the 1 or 0 value indicating whether the quarterback is a QB of color
plot(x, y, main = "Scatter Plot with Trend Line",
     xlab = "RtP Calls per Game", ylab = "QB of Color",
     col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red")  # 'lm' fits a linear model that minimizes the error

Notice that qboc is either a 1 or a 0; it is a binary variable. This will have implications later on. But for now, it does appear that there is a negative association between a quarterback’s race and the number of RtP calls made in the quarterback’s favor. In other words, the number of RtP calls decreases for quarterbacks of color.

In the Univariate Analysis post, I decided that I would examine the Per.Game and qboc variables in this analysis, but after thinking about it, I decided it might be worth checking whether there was another explanation for the number of RtP calls. Could a quarterback who gets sacked a lot also draw more RtP calls? Could that indicate an offensive line that can’t protect the quarterback as well? So I’m also going to do a scatter plot of the sack rate against the number of RtP calls per game.

R

# Create the scatter plot
x <- qbs$Per.Game
y <- qbs$Per.Sack
plot(x, y, main = "Scatter Plot with Trend Line",
     xlab = "RtP Calls per Game", ylab = "Sacks per Game",
     col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red")
# This is the same code as above, but with the variable for y switched out.

There does indeed seem to be a strong positive linear relationship between the number of sacks and the number of RtP calls. One other thing to notice is that there may be some outliers in these variables. To check whether this is the case, I am going to use the Interquartile Range (IQR) Method. This method involves calculating the IQR for your data and identifying data points outside the “whiskers” of the box plot (usually 1.5 times the IQR).

R

# Calculate the IQR for the 'y' variable
q <- quantile(y, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Set a threshold for outliers
threshold <- 1.5

# Identify outliers
outliers <- y < (q[1] - threshold * iqr) | y > (q[2] + threshold * iqr)

# Print the indices of outlier data points
which(outliers)
# Output
# 56

value = qbs[56, 1]
value
# Output
# 'R.Fitzpatrick'

Well, that is interesting: apparently Ryan Fitzpatrick took enough sacks to send his stats off the charts. In this case, identifying the outlier was a “good to know” exercise. In other datasets, however, outliers may indicate errors in the data and should be examined further. Now let’s get back to the task at hand.

Let’s see if there is a correlation between our numeric variables. It does look like there is some level of negative correlation between the number of RtP calls and the quarterback’s race, and the scatter plot of sacks against RtP calls suggests those two variables are strongly correlated. We can create a correlation matrix to visualize the correlation coefficient between each pair of variables.

R

# ggcorrplot (along with ggplot2) is needed for the correlation matrix plot
library(ggplot2)
library(ggcorrplot)

corr_matrix <- cor(df)
corr_plot <- ggcorrplot(corr_matrix,
                        type = "lower",   # "lower" displays the lower triangular part of the matrix
                        hc.order = TRUE,  # reorder variables based on hierarchical clustering
                        lab = TRUE,       # display the correlation coefficients on the plot
                        title = "Correlation Matrix Plot",
                        ggtheme = theme_minimal())  # choose a theme for the plot (optional)

# Change the size of the display window
options(repr.plot.width = 10, repr.plot.height = 10)

# Display the correlation matrix plot
print(corr_plot)

We can draw some tentative conclusions from the correlation matrix; for example, it appears that quarterbacks of color get sacked more, but benefit less from RtP calls, than their white counterparts. But don’t forget that the qboc variable is binary, so we need to make sure we aren’t drawing erroneous conclusions by applying statistical methods meant for continuous data.

The appropriate statistic for measuring the association between a binary variable and a continuous one is the point-biserial correlation coefficient. Here are the key points about it:

Purpose: The point-biserial correlation is used to determine if there is a significant linear relationship between a binary predictor variable (e.g., yes/no, pass/fail) and a continuous outcome variable (e.g., test scores, income).

Range: The point-biserial correlation coefficient ranges from -1 to 1, like the Pearson correlation coefficient. A positive value indicates that the group coded 1 tends to have higher values of the continuous variable, while a negative value indicates that the group coded 1 tends to have lower values.

Interpretation:

A coefficient close to 1 or -1 suggests a strong linear relationship.

A coefficient close to 0 suggests a weak or no linear relationship.

The sign (+ or -) indicates the direction of the relationship.

Assumptions: Like the Pearson correlation, the point-biserial correlation assumes linearity and normality. It is also sensitive to outliers.

Use Cases:

In research, it is used to examine associations between a binary predictor variable (e.g., gender, treatment group) and a continuous outcome variable (e.g., test scores, response time).

Commonly used in educational research (e.g., comparing the performance of two groups).

Often used in psychology to assess the relationship between a binary variable (e.g., presence/absence of a condition) and a continuous measure (e.g., anxiety levels).
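One convenient fact: when the binary variable is coded 0/1, the point-biserial correlation is numerically identical to the ordinary Pearson correlation, so R’s built-in cor() computes it directly. Here is a minimal sketch with made-up toy data (not the QB dataset) showing that the textbook formula and cor() agree:

```r
# Made-up toy data: a 0/1 group indicator and a continuous measure
group <- c(0, 0, 0, 1, 1, 1)
score <- c(2, 3, 4, 5, 6, 7)

# With 0/1 coding, Pearson's r *is* the point-biserial correlation
r_pb <- cor(group, score)

# Textbook point-biserial formula: (M1 - M0) / s_n * sqrt(p * (1 - p)),
# where s_n is the population (n-denominator) standard deviation
m1  <- mean(score[group == 1])
m0  <- mean(score[group == 0])
s_n <- sqrt(mean((score - mean(score))^2))
p   <- mean(group)
r_manual <- (m1 - m0) / s_n * sqrt(p * (1 - p))

print(c(r_pb, r_manual))  # the two values match
```

Because of this equivalence, running cor.test() on the 0/1 variable (as we do in the hypothesis test below) also yields the point-biserial coefficient, along with a p-value.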

R

# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])
sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt((sd_0^2 + sd_1^2) / 2)

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)
# Output
# -0.2702969

The biserial correlation does indicate that the number of RtP calls made in a quarterback’s favor is inversely associated with his race. However, the correlation may be a little weak, given how far -0.27 is from -1. Point-biserial correlations are sensitive to outliers, so let’s see what happens if we remove Ryan Fitzpatrick from the dataset… just for giggles… and re-run the correlation before moving on to testing whether the correlation is statistically significant.

R

# Remove the outlier (row 56, Ryan Fitzpatrick) from the dataset
row_index_to_remove <- 56
new_qbs <- qbs[-row_index_to_remove, ]

# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- new_qbs$qboc
continuous_variable <- new_qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])
sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt((sd_0^2 + sd_1^2) / 2)

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)
# Output
# -0.1595576

Interestingly, the strength of the correlation decreased when the outlier was removed. Since I want as much data as possible, and the outlier was not an error in the data, I am going to use the entire dataset when testing whether the correlation is statistically significant.

The next step in this process is hypothesis testing. In this study:

Ho: There is no correlation between a quarterback’s race and the number of RtP calls made in his favor.

Ha: There is a correlation between a quarterback’s race and the number of RtP calls made in his favor.

The alpha for this test will be 0.05.

R

categorical_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Compute the Point-Biserial Correlation and perform the test
correlation_test <- cor.test(categorical_variable, continuous_variable)

# Check the p-value
p_value <- correlation_test$p.value

# Set your significance level (e.g., 0.05)
alpha <- 0.05

# Determine if the correlation is statistically significant
if (p_value < alpha) {
  cat("The correlation is statistically significant (p-value:", p_value, ")\n")
} else {
  cat("The correlation is not statistically significant (p-value:", p_value, ")\n")
}

# Output
# The correlation is not statistically significant (p-value: 0.6762716)

There you have it, folks. Since the p-value of 0.68 is well above our alpha of 0.05, we fail to reject the null hypothesis: this analysis found no statistically significant relationship between a quarterback’s race and the number of RtP calls made in his favor. By this measure, the NFL appears to be doing a good job of minimizing bias in penalty calls.

Summary

In this phase of the EDA, I examined the relationship between a quarterback’s race and the number of RtP calls made in his favor. I used scatter plots with trend lines to visualize potential relationships, and a correlation matrix to gauge the strength and direction of those relationships. I checked for outliers to assess their impact on the correlation’s strength, used the point-biserial correlation to account for a categorical independent variable, and conducted hypothesis testing to determine whether the correlation was statistically significant.