Oh boy, time to dive into another crazy data science adventure! This time, we’re tackling the chaotic realm of traffic in good ol’ Washington, D.C. Brace yourself for the daily battle against bumper-to-bumper madness and the heart-stopping dance of merging lanes. As a brave commuter, I’ve had enough of this madness and I refuse to succumb to its soul-sucking ways. My mission: to outsmart the traffic gods and find that sweet spot of minimum congestion and maximum savings. Picture the infamous interchange of northbound I-95 Express Lanes and the 495 Inner Loop Express Lane as our arch-nemesis, and we, the fearless data scientists, are here to give it a taste of its own medicine. Buckle up, my friend, because this is going to be one wild ride!
Although time series analysis is a major component of this project, there are several opportunities to bring in other analytic techniques, including:
Spatial Analysis: This involves analyzing the spatial distribution of traffic congestion. Geographic Information Systems (GIS) can be used to visualize traffic patterns on maps and identify congestion hotspots.
Machine Learning: Beyond time series analysis, various machine learning techniques can be applied for traffic prediction and congestion analysis. These include regression models, clustering algorithms, and neural networks.
Network Analysis: This method focuses on the structure of transportation networks. It can be used to analyze road connectivity, identify bottlenecks, and optimize traffic flow.
Simulation Modeling: Traffic simulation models like microsimulation or agent-based modeling can be used to simulate and analyze traffic behavior under different scenarios. This is particularly useful for studying the impact of infrastructure changes.
Statistical Analysis: Traditional statistical methods can be employed to analyze relationships between traffic congestion and various factors such as weather, time of day, or road type.
Deep Learning: Deep learning techniques, such as Convolutional Neural Networks (CNNs), can be applied to analyze traffic camera images or video feeds for real-time congestion detection.
Optimization Models: Mathematical optimization models can be used to optimize traffic signal timings, route planning, and congestion mitigation strategies.
Behavioral Analysis: Understanding driver behavior and decision-making processes can be crucial for predicting and managing congestion. Behavioral analysis methods, such as choice modeling, can be applied.
If you are still with me, we are now entering the phase of analysis where the rubber begins to meet the road and we can start working toward an answer to the original research question: Is there a connection between a quarterback’s race and the number of times they get a Roughing the Passer (RtP) call in their favor? So far on this EDA journey, we have defined the research question, collected the data, cleaned and reshaped it, done an initial data exploration, and completed the univariate analysis.
Now, we are diving into Bivariate Analysis. In this step, we will examine relationships between pairs of variables. Utilizing scatter plots, correlation matrices, and other tools, we will investigate how variables interact with each other, aiming to identify correlations, dependencies, or trends of interest.
For this phase, I will once again employ R to gain additional practice. The initial step involves importing the necessary libraries for this part of the analysis, as well as the QBs dataset we’ve been working with.
Now, it may seem pretty easy and straightforward to follow the steps and just do what I did. But what you don’t see are the extra tabs I have open to help me through the process. For example, at first I tried to import the dataset as an Excel file, but that made the system throw some errors. I then thought it would be easier to import the file as a .csv, and it sort of was, but I still got a bunch of errors. Apparently Google Colab doesn’t like to import files straight from GitHub as-is, so off to Stack Overflow or Towards Data Science I go. In case you haven’t read anything on TDS, you should. It is a gem of quick tips and tricks for everything in data science.
“Click on the dataset in your repository, then click on View Raw. Copy the link to the raw dataset and store it as a string variable called url in Colab.”
# Double checking to make sure the data is loaded into the notebook
str(qbs)

# Output
'data.frame':	66 obs. of 26 variables:
 $ Player      : chr "A.Dalton" "A.Luck" "A.Rodgers" "A.Smith" ...
 $ Total       : int 25 17 41 23 12 9 6 18 2 30 ...
 $ X2009       : int 0 0 3 1 0 0 1 0 0 3 ...
 $ X2010       : int 0 0 3 1 0 0 1 0 0 0 ...
 $ X2011       : int 4 0 3 1 0 2 0 0 0 3 ...
 $ X2012       : int 1 5 5 1 0 1 0 0 0 1 ...
 $ X2013       : int 1 3 2 1 0 0 0 0 0 1 ...
 $ X2014       : int 5 4 2 6 1 0 1 0 0 4 ...
 $ X2015       : int 0 1 5 4 2 2 1 0 1 2 ...
 $ X2016       : int 1 1 4 1 1 1 0 0 0 1 ...
 $ X2017       : int 1 0 0 5 7 1 1 0 0 3 ...
 $ X2018       : int 1 3 1 2 1 0 0 3 1 4 ...
 $ X2019       : int 6 0 3 0 0 0 1 4 0 0 ...
 $ X2020       : int 0 0 2 0 0 1 0 4 0 2 ...
 $ X2021       : int 3 0 5 0 0 1 0 5 0 6 ...
 $ X2022       : int 2 0 3 0 0 0 0 2 0 0 ...
 $ X2023       : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Games       : int 170 94 228 149 79 62 60 74 44 190 ...
 $ Per.Game    : num 0.15 0.18 0.18 0.15 0.15 0.15 0.1 0.24 0.05 0.16 ...
 $ Attempts    : int 5557 3620 7840 4648 2719 1534 1554 2330 1230 7035 ...
 $ Per.100.Att : num 0.45 0.47 0.52 0.49 0.44 0.59 0.39 0.77 0.16 0.43 ...
 $ Sacked      : int 361 186 542 367 201 139 92 171 84 395 ...
 $ Per.Sack    : num 0.069 0.091 0.076 0.063 0.06 0.065 0.065 0.105 0.024 0.076 ...
 $ Sack.Per.Att: num 0.065 0.051 0.069 0.079 0.074 0.091 0.059 0.073 0.068 0.056 ...
 $ Third.Down..: num 40 29.4 39 26.1 33.3 ...
 $ qboc        : int 0 0 0 0 0 0 0 0 0 0 ...

# Checking the summary statistics of each of the variables in the dataset
summary(qbs)

# Output
    Player              Total           X2009           X2010
 Length:66          Min.   : 2.00   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.: 8.25   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :15.50   Median :0.000   Median :0.0000
                    Mean   :18.09   Mean   :0.697   Mean   :0.8333
                    3rd Qu.:24.75   3rd Qu.:1.000   3rd Qu.:1.0000
                    Max.   :57.00   Max.   :5.000   Max.   :6.0000

     X2011           X2012           X2013           X2014
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
 Median :0.000   Median :0.000   Median :0.000   Median :0.500
 Mean   :1.121   Mean   :1.258   Mean   :1.076   Mean   :1.333
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000
 Max.   :8.000   Max.   :6.000   Max.   :8.000   Max.   :7.000

     X2015           X2016           X2017           X2018
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
 Median :1.000   Median :1.000   Median :0.000   Median :1.000
 Mean   :1.455   Mean   :1.288   Mean   :1.303   Mean   :1.561
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000
 Max.   :7.000   Max.   :6.000   Max.   :7.000   Max.   :7.000

     X2019            X2020            X2021            X2022
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.000
 Median : 0.000   Median : 0.000   Median : 0.000   Median :0.000
 Mean   : 1.773   Mean   : 1.545   Mean   : 1.742   Mean   :1.076
 3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.:2.000
 Max.   :10.000   Max.   :11.000   Max.   :10.000   Max.   :6.000

     X2023             Games           Per.Game         Attempts
 Min.   :0.0000   Min.   : 42.00   Min.   :0.0500   Min.   : 289
 1st Qu.:0.0000   1st Qu.: 61.25   1st Qu.:0.1125   1st Qu.:1794
 Median :0.0000   Median : 77.50   Median :0.1650   Median :2316
 Mean   :0.0303   Mean   : 99.53   Mean   :0.1768   Mean   :3250
 3rd Qu.:0.0000   3rd Qu.:138.00   3rd Qu.:0.2200   3rd Qu.:4476
 Max.   :1.0000   Max.   :253.00   Max.   :0.3500   Max.   :9725

  Per.100.Att         Sacked         Per.Sack        Sack.Per.Att
 Min.   :0.1300   Min.   : 26.0   Min.   :0.02400   Min.   :0.03000
 1st Qu.:0.3900   1st Qu.:131.5   1st Qu.:0.05850   1st Qu.:0.05800
 Median :0.5500   Median :170.0   Median :0.07800   Median :0.06900
 Mean   :0.5736   Mean   :211.2   Mean   :0.08342   Mean   :0.07009
 3rd Qu.:0.7400   3rd Qu.:264.5   3rd Qu.:0.10725   3rd Qu.:0.08350
 Max.   :1.2400   Max.   :542.0   Max.   :0.20200   Max.   :0.10300

  Third.Down..        qboc
 Min.   : 0.00   Min.   :0.0000
 1st Qu.:25.00   1st Qu.:0.0000
 Median :30.77   Median :0.0000
 Mean   :31.14   Mean   :0.2727
 3rd Qu.:39.76   3rd Qu.:1.0000
 Max.   :77.78   Max.   :1.0000

# Creates a subset dataframe with the numeric variables
df = subset(qbs, select = c(Per.Game, Per.100.Att, Per.Sack, Sack.Per.Att, qboc))  # To drop columns instead, put a '-' before the c() and list the columns to drop

# Create the scatter plot
x <- qbs$Per.Game  # x is the column with the RtP calls per game for each quarterback
y <- qbs$qboc      # y is the 1 or 0 value indicating whether the quarterback is a quarterback of color
plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "QB of Color", col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red")  # 'lm' calls a linear model to fit the line that best fits the data by reducing the error
Notice that qboc is either a 1 or a 0; it is a binary variable. That will have implications later on. For now, though, there does appear to be a negative association between a quarterback’s race and the number of RtP calls made in the quarterback’s favor: the number of RtP calls tends to decrease for quarterbacks of color.
In the Univariate Analysis post, I decided that I would examine the Per.Game and qboc variables in this analysis, but after thinking about it, I decided it might be worth checking whether there is another possible explanation for the number of RtP calls. Could a quarterback who gets sacked a lot also draw more RtP calls? Could this be an indication that the offensive line is not very good and can’t protect the quarterback? So I’m also going to do a scatter plot of the sack-related rate against the number of RtP calls per game.
R
# Create the scatter plot
x <- qbs$Per.Game
y <- qbs$Per.Sack
plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "Sacks per Game", col = "blue", pch = 19)

# Add the trend line
abline(lm(y ~ x), col = "red")
# This is the same code as above, but I switched out the variable for y.
There does indeed seem to be a strong positive linear connection between the number of sacks and the number of RtP calls. One other thing to notice is that there may be some outliers for these variables. I would like to see if this is true so I am going to use the Interquartile Range (IQR) Method to determine if this is the case. This method involves calculating the IQR for your data and identifying data points outside the “whiskers” of the box plot (usually 1.5 times the IQR).
R
# Calculate the IQR for the 'y' variable
q <- quantile(y, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Set a threshold for outliers
threshold <- 1.5

# Identify outliers
outliers <- y < (q[1] - threshold * iqr) | y > (q[2] + threshold * iqr)

# Print the indices of outlier data points
which(outliers)
# Output
56

value <- qbs[56, 1]
value
# Output
'R.Fitzpatrick'
Well, that is interesting: apparently Ryan Fitzpatrick took enough sacks to send his stats off the charts. In this case, identifying the outlier was a “good to know” exercise. In other datasets, however, outliers may indicate errors in the data and should be examined further. Now let’s get back to the task at hand.
Let’s see if there is a correlation between our numeric variables. It does look like there is some level of negative correlation between the number of RtP calls and the quarterback’s race, and the scatter plot of sacks and RtP calls suggests those two variables are strongly correlated. We can create a correlation matrix to visualize the correlation coefficient between each pair of variables.
R
# Libraries needed for the correlation plot
library(ggplot2)
library(ggcorrplot)

corr_matrix <- cor(df)

corr_plot <- ggcorrplot(corr_matrix,
                        type = "lower",            # "lower" displays the lower triangular part of the matrix
                        hc.order = TRUE,           # reorder variables based on hierarchical clustering
                        lab = TRUE,                # display the correlation coefficients on the plot
                        title = "Correlation Matrix Plot",
                        ggtheme = theme_minimal()) # choose a theme for the plot (optional)

# Change the size of the display window
options(repr.plot.width = 10, repr.plot.height = 10)

# Display the correlation matrix plot
print(corr_plot)
We can draw some tentative conclusions from the correlation matrix, such as that quarterbacks of color appear to get sacked more but benefit less from RtP calls than their white counterparts. But don’t forget that the qboc variable is binary, so we need to make sure we aren’t drawing erroneous conclusions by using statistical methods meant for continuous data.
The appropriate measure for a binary variable paired with a continuous one is the point-biserial correlation coefficient. Here are the key points about it:
Purpose: The point-biserial correlation is used to determine if there is a significant linear relationship between a binary predictor variable (e.g., yes/no, pass/fail) and a continuous outcome variable (e.g., test scores, income).
Range: The point-biserial correlation coefficient ranges from -1 to 1, just like the Pearson correlation coefficient. A positive value indicates that the group coded 1 tends to have higher values of the continuous variable, while a negative value indicates that the group coded 1 tends to have lower values.
Interpretation:
A coefficient close to 1 or -1 suggests a strong linear relationship.
A coefficient close to 0 suggests a weak or no linear relationship.
The sign (+ or -) indicates the direction of the relationship.
Assumptions: Like the Pearson correlation, the point-biserial correlation assumes linearity and normality. It is also sensitive to outliers.
Use Cases:
In research, it is used to examine associations between a binary predictor variable (e.g., gender, treatment group) and a continuous outcome variable (e.g., test scores, response time).
Commonly used in educational research (e.g., comparing the performance of two groups).
Often used in psychology to assess the relationship between a binary variable (e.g., presence/absence of a condition) and a continuous measure (e.g., anxiety levels).
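As an aside for anyone following along in Python rather than R, SciPy computes this statistic directly with scipy.stats.pointbiserialr. Here is a minimal sketch, reusing the cleaned_qbs file loaded in the Python posts of this series; the column names are as they appear there, and this is only an illustration of the function, not the analysis below.

Python
import pandas as pd
from scipy import stats

# Load the same cleaned QB dataset used elsewhere in this series
qbs_py = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")

# Point-biserial correlation between the binary qboc flag and RtP calls per game
r, p = stats.pointbiserialr(qbs_py["qboc"], qbs_py["Per Game"])
print(f"point-biserial r = {r:.3f}, p-value = {p:.3f}")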
R
# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])
sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt((sd_0^2 + sd_1^2) / 2)

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)
# Output
-0.2702969
The biserial correlation does indicate that the number of RtP calls made in a quarterback’s favor is inversely associated with his race. However, the strength of the correlation may be a little weak given how far -0.27 is from -1. Biserial correlations are sensitive to outliers, so let’s see what happens if we remove Ryan Fitzpatrick from the dataset… just for giggles… and re-run the correlation before moving on to testing whether the correlation is statistically significant.
R
row_index_to_remove <- 56
new_qbs <- qbs[-row_index_to_remove, ]

# Sample data: binary variable (0s and 1s) and continuous variable
binary_variable <- new_qbs$qboc
continuous_variable <- new_qbs$Per.Game

# Calculate the mean and standard deviation for both groups
mean_0 <- mean(continuous_variable[binary_variable == 0])
mean_1 <- mean(continuous_variable[binary_variable == 1])
sd_0 <- sd(continuous_variable[binary_variable == 0])
sd_1 <- sd(continuous_variable[binary_variable == 1])

# Calculate the difference in means
mean_diff <- mean_1 - mean_0

# Calculate the proportion of 1s in the binary variable
p <- mean(binary_variable)

# Calculate the pooled standard deviation
pooled_sd <- sqrt((sd_0^2 + sd_1^2) / 2)

# Calculate the biserial correlation
biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))

# Print the result
print(biserial_correlation)
# Output
-0.1595576
Interestingly, the strength of the correlation decreased when the outlier was removed. Since I want as much data as possible, and the outlier was not an error in the data, I am going to use the entire dataset when testing whether the correlation is statistically significant.
The next step in this process is hypothesis testing. In this study:
H0: There is no correlation between a quarterback’s race and the number of RtP calls made in his favor.
Ha: There is a correlation between a quarterback’s race and the number of RtP calls made in his favor.
The alpha for this test will be 0.05.
R
categorical_variable <- qbs$qboc
continuous_variable <- qbs$Per.Game

# Compute the Point-Biserial Correlation and perform the test
correlation_test <- cor.test(categorical_variable, continuous_variable)

# Check the p-value
p_value <- correlation_test$p.value

# Set your significance level (e.g., 0.05)
alpha <- 0.05

# Determine if the correlation is statistically significant
if (p_value < alpha) {
  cat("The correlation is statistically significant (p-value:", p_value, ")\n")
} else {
  cat("The correlation is not statistically significant (p-value:", p_value, ")\n")
}

# Output
The correlation is not statistically significant (p-value: 0.6762716)
There you have it, folks. We fail to reject the null hypothesis: this analysis found no statistically significant relationship between a quarterback’s race and the RtP calls made in his favor, which suggests the NFL is doing a reasonable job of minimizing bias in this particular penalty call.
Summary
In this phase of the EDA, I examined the relationship between a quarterback’s race and the number of RtP calls made in favor of the QB. I utilized scatter plots with trend lines to visualize potential relationships, employed a correlation matrix to identify the strength and direction of these relationships, checked for outliers to assess their impact on the relationship’s strength, ensured the use of the appropriate correlation technique to account for a categorical independent variable, and conducted hypothesis testing to determine the statistical significance of the correlation.
Univariate analysis is a crucial step in data analysis that focuses on examining and summarizing the characteristics of a single variable or attribute from the dataset. It provides a foundation for understanding individual variables, which is essential for more advanced multivariate analyses and modeling, and it helps identify patterns, outliers, and potential data quality issues early in the data analysis pipeline.
Python
import pandas as pd

# import and store the dataset
qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
print(qbs.head())

# Output
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...
1     A.Luck     17     0     0     0     5     3     4     1     1  ...
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...
3    A.Smith     23     1     1     1     1     1     6     4     1  ...
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069
1     0     94      0.18      3620         0.47     186     0.091
2     0    228      0.18      7840         0.52     542     0.076
3     0    149      0.15      4648         0.49     367     0.063
4     0     79      0.15      2719         0.44     201     0.060

   Sack Per Att  Third Down %  qboc
0         0.065         40.00     0
1         0.051         29.41     0
2         0.069         39.02     0
3         0.079         26.09     0
4         0.074         33.33     0
First, we should look to see what the variable datatypes are and if there are any null or missing values.
Python
qbs.info(verbose=True)

# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Player        66 non-null     object
 1   Total         66 non-null     int64
 2   2009          66 non-null     int64
 3   2010          66 non-null     int64
 4   2011          66 non-null     int64
 5   2012          66 non-null     int64
 6   2013          66 non-null     int64
 7   2014          66 non-null     int64
 8   2015          66 non-null     int64
 9   2016          66 non-null     int64
 10  2017          66 non-null     int64
 11  2018          66 non-null     int64
 12  2019          66 non-null     int64
 13  2020          66 non-null     int64
 14  2021          66 non-null     int64
 15  2022          66 non-null     int64
 16  2023          66 non-null     int64
 17  Games         66 non-null     int64
 18  Per Game      66 non-null     float64
 19  Attempts      66 non-null     int64
 20  Per 100 Att   66 non-null     float64
 21  Sacked        66 non-null     int64
 22  Per Sack      66 non-null     float64
 23  Sack Per Att  66 non-null     float64
 24  Third Down %  66 non-null     float64
 25  qboc          66 non-null     int64
dtypes: float64(5), int64(20), object(1)
The variables I will examine more closely are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per games played for their distribution, central tendency, and variance. I will be using Python and Jupyter Notebooks.
Python
table_stats = qbs.describe()
print(table_stats)

# Output
           Total       2009       2010       2011       2012       2013  \
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000
mean   18.090909   0.696970   0.833333   1.121212   1.257576   1.075758
std    12.505663   1.176301   1.452672   1.767355   1.774266   1.825550
min     2.000000   0.000000   0.000000   0.000000   0.000000   0.000000
25%     8.250000   0.000000   0.000000   0.000000   0.000000   0.000000
50%    15.500000   0.000000   0.000000   0.000000   0.000000   0.000000
75%    24.750000   1.000000   1.000000   2.000000   2.000000   1.000000
max    57.000000   5.000000   6.000000   8.000000   6.000000   8.000000

            2014       2015       2016      2017  ...       2023       Games  \
count  66.000000  66.000000  66.000000  66.00000  ...  66.000000   66.000000
mean    1.333333   1.454545   1.287879   1.30303  ...   0.030303   99.530303
std     1.842518   1.832878   1.698553   1.75385  ...   0.172733   53.915952
min     0.000000   0.000000   0.000000   0.00000  ...   0.000000   42.000000
25%     0.000000   0.000000   0.000000   0.00000  ...   0.000000   61.250000
50%     0.500000   1.000000   1.000000   0.00000  ...   0.000000   77.500000
75%     2.000000   2.000000   2.000000   2.00000  ...   0.000000  138.000000
max     7.000000   7.000000   6.000000   7.00000  ...   1.000000  253.000000

        Per Game     Attempts  Per 100 Att      Sacked   Per Sack  \
count  66.000000    66.000000    66.000000   66.000000  66.000000
mean    0.176818  3250.303030     0.573636  211.212121   0.083424
std     0.073801  2085.250348     0.241951  117.910594   0.034212
min     0.050000   289.000000     0.130000   26.000000   0.024000
25%     0.112500  1794.500000     0.390000  131.500000   0.058500
50%     0.165000  2315.500000     0.550000  170.000000   0.078000
75%     0.220000  4476.000000     0.740000  264.500000   0.107250
max     0.350000  9725.000000     1.240000  542.000000   0.202000

       Sack Per Att  Third Down %       qboc
count     66.000000     66.000000  66.000000
mean       0.070091     31.143788   0.272727
std        0.016449     13.872813   0.448775
min        0.030000      0.000000   0.000000
25%        0.058000     25.000000   0.000000
50%        0.069000     30.770000   0.000000
75%        0.083500     39.755000   1.000000
max        0.103000     77.780000   1.000000

[8 rows x 25 columns]
Well, that was easy. Using the .describe() function allows for the examination of the dataset’s central tendency (the mean in this case) and spread (the standard deviation). Although all the information needed to judge the dispersion is present, it may be hard to conceptualize the shape of each distribution without a visualization. Histograms can aid in our observations.
Python
qbs.hist(figsize=(10,15))
Adjusting the figure size makes it easier to view all of the histogram outputs clearly. The distributions for Per Game and Per 100 Att look nearly normal, which will allow us to use some parametric tests in the analysis.
Another visual that is helpful for seeing the distribution is a boxplot (box-and-whisker plot). This time, let’s look at just Per Game and Per 100 Att.
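A boxplot of those two columns only takes a couple of lines; here is a minimal sketch of the kind of code that produces it, assuming the qbs dataframe loaded above and the column names shown in the earlier output.

Python
import matplotlib.pyplot as plt

# Side-by-side boxplots for the two roughly normal rate variables
qbs[['Per Game', 'Per 100 Att']].plot(kind='box', figsize=(8, 6))
plt.show()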
The wonderful thing about boxplots is that they make it easy to identify outliers – observations that are outside the whiskers (either top or bottom). You can also easily see the centrality and spread of the data.
One additional variable that we should look at is qboc. Did you notice the interesting distribution when the histograms were plotted above? Let’s take a closer look.
Python
qbs['qboc'].hist(figsize=(10,15))
As you can see, there are only two values for this variable. That makes sense, since we are categorizing the quarterbacks based on whether they are (1) or are not (0) a quarterback of color. To do a bit of foreshadowing, this means that if we want to do any sort of predictive analysis, we need to think about models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis:
Per Game
Per 100 Attempts
qboc
In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.
The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them suitable for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks based on their ethnicity, showed a binary distribution, indicating potential utility in predictive modeling.
This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls in NFL quarterbacks.
Well, well, well! Looks like we’ve embarked on quite the adventure here! Our first task was to get to the bottom of the burning question: “Do quarterbacks of color get the short end of the stick when it comes to roughing the passer penalties compared to their fair-skinned counterparts?” Exciting stuff, huh?
The next step in our journey involved gathering the all-important data. We needed to make sure we were dealing with top-notch, premium-quality data. None of that shady, questionable stuff for us! We demanded data that was reliable, traceable, and oh-so-accurate. Because in the world of data science, it’s all about that golden rule: “garbage in, garbage out!” Can’t have our analysis going awry now, can we?
For this particular project, we scoured the depths of NFL Penalties and even tapped into the vast knowledge reserves of the mighty Wikipedia. We left no stone unturned, my friend!
Now, onto the thrilling next stage of our expedition – data cleaning and reshaping! Brace yourself, because this is where things get spicy. We had to create a nifty variable to indicate the race of those quarterbacks. Easy-peasy, right? Well, almost. Turns out, we stumbled upon a little naming convention conundrum that resulted in a sneaky duplicate value. But fear not! With our eagle-eyed attention to detail, we swiftly detected and corrected that pesky error. Crisis averted!
And now, dear explorer, we venture forth to the exhilarating realm of Data Exploration. Excited? You should be! There’s so much more to discover and uncover on this grand data-driven expedition! So, buckle up and get ready for the thrill ride ahead! Let’s dive into the vast ocean of information and see what fascinating insights await us!
Data exploration is an essential step in the data analysis process. It allows us to gain an initial understanding of the dataset by examining its general properties and characteristics. By delving into the data, we can uncover valuable insights.
One of the key aspects of data exploration involves assessing the size of the dataset. This includes understanding the number of observations or records and the number of variables or features present in the dataset. Knowing the scope of the dataset helps us gauge its comprehensiveness and potential usefulness for our analysis.
Furthermore, examining the data types is crucial in data exploration. By identifying the types of data contained in the dataset, such as numerical, categorical, or textual, we can determine the appropriate statistical techniques or visualization methods to apply during analysis. This enables us to make accurate interpretations and draw meaningful conclusions from the data.
Another crucial aspect is the identification of initial patterns or trends within the dataset. Exploring the data can reveal inherent relationships, correlations, or anomalies that may exist. By uncovering these patterns, we can develop hypotheses, generate insights, and pose relevant research questions for further investigation.
You may have noticed that I talk about using Python a lot in my work. I actually prefer to use it in my day job since it seems to get the most use and support from other data professionals. However, for this portion of my project, I’m going to be using R instead. It has been some time since I’ve used R (maybe before I even finished my master’s program), so full disclosure, I’m pretty sure I had to look up most of what I’m doing to refresh my memory. But since it is a language, if you don’t use it, you lose it, so I’m going to shake things up a bit and revisit R.
Data Exploration
The first thing I need to do is load any packages I need and my dataset. I am loading the ‘cleaned_qbs.xlsx’ Excel file that I created during the data cleaning process.
R
# import and store the dataset
library(openxlsx)
qbs = read.xlsx("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")

# Check to make sure the data loaded correctly
head(qbs, n = 5)

# Output
A data.frame: 5 × 26
  Player     Total 2009 2010 2011 2012 2013 2014 2015 2016 ... 2023 Games Per.Game Attempts Per.100.Att Sacked Per.Sack Sack.Per.Att Third.Down.% qboc
  <chr>      <dbl> <dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>... <dbl><dbl> <dbl>    <dbl>    <dbl>       <dbl>  <dbl>    <dbl>        <dbl>        <dbl>
1 A.Dalton      25    0    0    4    1    1    5    0    1 ...    0   170     0.15     5557        0.45    361    0.069        0.065        40.00    0
2 A.Luck        17    0    0    0    5    3    4    1    1 ...    0    94     0.18     3620        0.47    186    0.091        0.051        29.41    0
3 A.Rodgers     41    3    3    3    5    2    2    5    4 ...    0   228     0.18     7840        0.52    542    0.076        0.069        39.02    0
4 A.Smith       23    1    1    1    1    1    6    4    1 ...    0   149     0.15     4648        0.49    367    0.063        0.079        26.09    0
5 B.Bortles     12    0    0    0    0    0    1    2    1 ...    0    79     0.15     2719        0.44    201    0.060        0.074        33.33    0
Looking over the raw data, it is hard to see any particular patterns. However, a few data points do jump out because they are either larger or smaller than the data around them. For example, the number of RTP penalties called each year is in the single digits, with the exception of 2020 for the quarterback Josh Allen; that year he received 11 RTP calls in his favor. Another example is the number of sacks recorded against Taysom Hill. Most of the QBs have a sack count in the triple digits, while only a few have double-digit sack counts, and Hill has the lowest of any QB in the data set. The next thing I want to do is check how many records are in my table and the total number of variables. It is helpful to know your sample size so you can select the correct analytic approach later on.
R
nrow(qbs)
# Output
66

length(qbs)
# Output
26
The next step is to understand the data types of each of the variables. Luckily, when the head of the table was printed, it also included the data types for each of the variables. In our cleaned_qbs table, we have one variable with a character data type (Player) and 25 variables with the data type ‘dbl’, otherwise known as a double-precision float.
Now that we understand how much and what kind of data we have, we can develop a game plan on what analytic techniques we can take in the next step. With Univariate Analysis, Bivariate Analysis, and Multivariate Analysis we will begin to explore the data and any relationships between the variables that might exist. This will get us closer to a possible answer to our original question.
When working on a project that requires data, it is essential to consider the various sources and formats in which the information may be available. Often, the desired data cannot be found in one single location, requiring careful compilation from multiple sources. This process can be time-consuming and challenging, as each source may present its own set of complications and intricacies.
While some data may be readily accessible through online databases, APIs, or official government reports, it may not always be presented in the most convenient or usable form. For instance, a dataset could be in a raw, unstructured format, making it difficult to analyze and extract meaningful insights. In such cases, data manipulation techniques are necessary to transform the data into a suitable format.
Data manipulation involves various tasks such as cleaning, merging, and reshaping the data. Cleaning the data involves removing any inconsistencies, errors, or missing values. This step ensures that the data is accurate and reliable for analysis. Merging data from multiple sources requires aligning and combining different datasets based on common attributes or keys. This consolidation allows for a comprehensive and holistic view of the data.
Reshaping the data involves rearranging its structure to fit the desired analysis or visualization needs. This can include tasks such as pivoting, aggregating, or disaggregating the data. By reshaping the data, researchers and analysts can uncover patterns, trends, and relationships that might otherwise go unnoticed.
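To make the merging and reshaping ideas above concrete, here is a minimal pandas sketch. The dataframes, column names, and values are made up purely for illustration; the actual project tables are built later in this series.

Python
import pandas as pd

# Two hypothetical sources that share a common key ("player")
stats = pd.DataFrame({"player": ["A.Dalton", "L.Jackson"], "rtp_calls": [25, 30]})
bios = pd.DataFrame({"player": ["A.Dalton", "L.Jackson"], "race": ["white", "black"]})

# Merging aligns and combines the two sources on the shared key
combined = pd.merge(stats, bios, on="player", how="left")

# Reshaping example: melt yearly columns from wide to long form
yearly = pd.DataFrame({"player": ["A.Dalton"], "2009": [0], "2010": [0]})
long_form = yearly.melt(id_vars="player", var_name="year", value_name="rtp_calls")

print(combined, long_form, sep="\n")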
In the given scenario, the QB penalty stats table lacked information on the racial background of the players. While this data may not be readily available in the original source, alternative sources can be explored. In this case, a well-sourced list of black quarterbacks could be found on Wikipedia. Caution is advisable when using information from Wikipedia, but the list in question is well cited and reliable enough for this particular use.
Working with data often goes beyond simply retrieving information from a single source. It requires diligent gathering, cleaning, merging, and reshaping of data to ensure accuracy and usability. With the right techniques and approaches, researchers can unlock the full potential of data and derive valuable insights.
Embarking on an Epic Journey
I now have a list of all QBs from 2009 until the present (that’s 2023 for those of you who stumble upon my ramblings in the future) in the qb_rtp table. I also have a list of black QBs in the qbs_oc table. Sometimes I find it can be faster to do some of the data cleaning work in the native application. In this case, I needed to create a column in the qbs_oc table that I could compare against the qb_rtp[‘Player’] column in the other table. Luckily, Excel’s autofill feature made this particularly easy. I inserted a blank column next to [‘Quarterback’] called [‘short_name’]. After typing a few examples, I selected the “Autofill” option in the Data drop-down menu, and voilà, I had the new column that I can use for comparison.
Starting to Use Notebooks
I haven’t used a Jupyter Notebook to code in a very long time. I usually use the Spyder IDE as I can see all my variables’ values at the same time. But I’m going to give it a shot. You can get the entire notebook here. The first step is to import the different libraries that I think I’ll need.
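At minimum that means pandas, plus reading in the two tables. Here is a sketch of what that first cell might look like; the file names are assumptions, so use whatever you saved the tables as.

Python
import pandas as pd

# Read the two source tables; the file names here are assumptions for illustration
qbs = pd.read_excel('qb_rtp.xlsx')        # Roughing the Passer stats per quarterback
qbs_oc = pd.read_excel('black_qbs.xlsx')  # list of quarterbacks of color from Wikipedia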
Make sure to verify the data has been loaded by viewing a few rows from each dataframe. If you are wondering what the ‘\n’ means, it is the escape sequence for a new line. It creates some extra space between the table headings and the table data to make them more readable. Dataquest has a great quick guide and cheat sheet available here.
print('Quarterback Stats table:\n', qbs.head(), '\n')
print('Quarterback of Color table:\n', qbs_oc.head())

# Output
Quarterback Stats table:
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...
1     A.Luck     17     0     0     0     5     3     4     1     1  ...
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...
3    A.Smith     23     1     1     1     1     1     6     4     1  ...
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...

   2022  2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     2     0    170      0.15      5557         0.45     361     0.069
1     0     0     94      0.18      3620         0.47     186     0.091
2     3     0    228      0.18      7840         0.52     542     0.076
3     0     0    149      0.15      4648         0.49     367     0.063
4     0     0     79      0.15      2719         0.44     201     0.060

   Sack Per Att  Third Down %
0         0.065         40.00
1         0.051         29.41
2         0.069         39.02
3         0.079         26.09
4         0.074         33.33

[5 rows x 25 columns]

Quarterback of Color table:
         Quarterback    short_name  Years active  \
0      Fritz Pollard     F.Pollard     1920–1926
1        Joe Lillard     J.Lillard     1932–1933
2  George Taliaferro  G.Taliaferro     1950–1955
3     Marlin Briscoe     M.Briscoe         1968*
4       James Harris      J.Harris     1969–1981

                                                Team
0  Akron Pros, Milwaukee Badgers, Hammond Pros, P...
1                                  Chicago Cardinals
2  New York Yanks, Dallas Texans, Baltimore Colts...
3                                     Denver Broncos
4  Buffalo Bills, Los Angeles Rams, San Diego Cha...
Smushing the Data Together
Now that the two initial datasets have been read into Pandas dataframes, it’s time to smush them together in a way that makes them useful for further analysis. The first step I take is adding a new variable to the Quarterback Stats Table (qbs) called ‘qboc’. I set all the values to 0 for right now.
qbs['qboc']=0
Once again check to make sure the variable was added to the dataframe as expected.
print(qbs.head())

# Output
      Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...
1     A.Luck     17     0     0     0     5     3     4     1     1  ...
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...
3    A.Smith     23     1     1     1     1     1     6     4     1  ...
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069
1     0     94      0.18      3620         0.47     186     0.091
2     0    228      0.18      7840         0.52     542     0.076
3     0    149      0.15      4648         0.49     367     0.063
4     0     79      0.15      2719         0.44     201     0.060

   Sack Per Att  Third Down %  qboc
0         0.065         40.00     0
1         0.051         29.41     0
2         0.069         39.02     0
3         0.079         26.09     0
4         0.074         33.33     0

[5 rows x 26 columns]
The ‘qboc’ column was added to the end of the dataframe. The next step is to make the qbs[‘Player’] and the qbs_oc[‘short_name’] columns the indexes of their dataframes. This will speed up the comparison process. It wouldn’t be noticeable with datasets this small, but if you have a dataset with millions of records, skipping this step will bog down the system… ask me how I know.
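Setting the indexes is a one-liner per dataframe; it presumably looked something like this, using the column names described above.

Python
# Use the player name columns as the index of each dataframe so the
# membership check in the next step is an index-to-index comparison
qbs = qbs.set_index('Player')
qbs_oc = qbs_oc.set_index('short_name')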
Now this is the really fun part. All I need to do now is compare the indexes of the two dataframes. If the indexes match, then I can assign a ‘1’ to the qbs[‘qboc’] variable to signify that that quarterback is indeed a quarterback of color.
qbs.loc[(qbs.index.isin(qbs_oc.index)),'qboc']=1
This one line of code is pretty cool, I think, and I want to dig into what it actually does. I know this isn’t a Python coding post, but explaining it will only take a minute and it unpacks some interesting concepts.
qbs.index and qbs_oc.index specify that we are going to be doing something with the indexes of the two dataframes.
The .isin() function returns a Boolean mask. What qbs.index.isin(qbs_oc.index) does is take the index values of the qbs_oc dataframe and check whether those values appear in the qbs dataframe’s index. If they do, the Boolean value is True; if not, it is False. You can also save that part of the code to a variable and reuse it in other parts of your project, but in this instance we don’t need to.
The qbs.loc is locating all the indexes in the qbs dataframe where the mask value is True.
‘qboc’ specifies the column we want to assign our value to. In this case, a 1 to signify that the quarterback is a person of color.
In other words, qbs.loc[] designates the dataframe we are modifying, (qbs.index.isin(qbs_oc.index)) designates the rows we want to work with, and ‘qboc’ designates the column.
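Written out in the two-step form mentioned above, with the Boolean mask saved to a variable first, the same assignment looks like this:

Python
# Equivalent two-step version: build the Boolean mask, then use it with .loc
is_qboc = qbs.index.isin(qbs_oc.index)   # True where a qbs player also appears in qbs_oc
qbs.loc[is_qboc, 'qboc'] = 1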
print(qbs.head())

# Output
           Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                  ...
A.Dalton      25     0     0     4     1     1     5     0     1     1  ...
A.Luck        17     0     0     0     5     3     4     1     1     0  ...
A.Rodgers     41     3     3     3     5     2     2     5     4     0  ...
A.Smith       23     1     1     1     1     1     6     4     1     5  ...
B.Bortles     12     0     0     0     0     0     1     2     1     7  ...

           2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player
A.Dalton      0    170      0.15      5557         0.45     361     0.069
A.Luck        0     94      0.18      3620         0.47     186     0.091
A.Rodgers     0    228      0.18      7840         0.52     542     0.076
A.Smith       0    149      0.15      4648         0.49     367     0.063
B.Bortles     0     79      0.15      2719         0.44     201     0.060

           Sack Per Att  Third Down %  qboc
Player
A.Dalton          0.065         40.00     0
A.Luck            0.051         29.41     0
A.Rodgers         0.069         39.02     0
A.Smith           0.079         26.09     1
B.Bortles         0.074         33.33     0
As you can see, A.Smith now has a 1 in the qboc column. However, the A.Smith in the qbs data set is actually Alex Smith, not Akili Smith, who is the quarterback actually referenced in the qbs_oc dataset. This illustrates the importance of going through your data to ensure accuracy. Luckily these datasets are small enough to go through to spot any errors.
double_check = qbs.loc[(qbs['qboc'] == 1)]
print(double_check.index)

# Output
Index(['A.Smith', 'C.Kaepernick', 'C.Newton', 'D.Prescott', 'D.Watson',
       'G.Smith', 'J.Brissett', 'J.Campbell', 'J.Freeman', 'J.Hurts',
       'J.Winston', 'K.Murray', 'L.Jackson', 'M.Vick', 'P.Mahomes',
       'R.Griffin III', 'R.Wilson', 'T.Bridgewater', 'T.Taylor'],
      dtype='object', name='Player')

# Correct the mislabeled row (use .loc to avoid chained-assignment issues)
qbs.loc['A.Smith', 'qboc'] = 0
print(qbs.loc[['A.Smith']])

# Output
          Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                 ...
A.Smith      23     1     1     1     1     1     6     4     1     5  ...

          2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player
A.Smith      0    149      0.15      4648         0.49     367     0.063

          Sack Per Att  Third Down %  qboc
Player
A.Smith          0.079         26.09     0

[1 rows x 25 columns]
Next Steps
Get ready for some quarterback madness in the upcoming post! We’ll be diving into our fancy new quarterbacks table and unleashing some wild analysis on the gridiron.
Once upon a time, in the realm of weather-related nerdiness, I embarked on a quest to decipher the secrets of changing weather patterns. Armed with my mighty keyboard and a burning hatred for sweltering summers, I planned to uncover the truth about my local area’s climate evolution. You see, summer and I have never been the best of friends. The scorching heat and suffocating humidity make me cringe harder than a cat stuck in a cucumber maze. So, I figured, why not dive into the delightful world of data and investigate if there’s any hope for an early arrival of autumn? I dubbed it my “Project Meteorological Marvel.”
My cunning plan involved sifting through decades of weather records, gathering juicy tidbits on how temperatures have tortured us poor mortals over the years. I wanted to spot trends, make dazzling graphs, and perhaps even predict when autumn would grace us with its blessed presence. Oh, how I yearned for a reliable sign that summer’s reign of terror would soon be over! Of course, this was no ordinary undertaking. I needed a trustworthy data source, and what better place to turn to than the National Oceanic and Atmospheric Administration (NOAA)? If you can’t trust the NOAA to provide accurate historical weather data, well, I guess we’re all doomed!
Now, I must confess, I had no intention of becoming a weather forecaster during this escapade. That’s a whole different level of sorcery reserved for the truly brave and slightly crazy souls. No, my friends, my mission was solely to unravel the mysteries of the past, not predict the future. So, off I went, armed with my web-scraping skills and a fervent desire to put an end to endless summers. And thus, my epic journey into the realm of weather data began… but did it?
Well, it seems that once people discover your supreme data-crunching powers, they start throwing project ideas at you like confetti at a parade. Take my poor, football-obsessed husband for example. He came up with the brilliant notion of analyzing if there’s any connection between a quarterback’s race and the number of times they get a sweet, sweet roughing the passer call in their favor. And as if that wasn’t enough, I thought, why not spice it up even more and explore if the defender’s race also plays a role in how many roughing the passer flags rain down upon them? Heck, let’s even toss in the officials’ race for good measure. Who knew the football field could have so many hidden layers of sociology and statistics? But hey, I’ll play along and start with quarterbacks for now. Let the mind-boggling journey begin! Just don’t blame me if we end up in a statistical black hole of absurdity.
At first, I was going to look at NCAA football statistics, given that the sample size would be much larger than for the NFL. However, I didn’t really find a good source for the data to either download or extract; it just doesn’t seem like the NCAA collects that data down to the player level. As luck would have it, I was able to find a source for NFL penalty data. The aptly named NFL Penalties is a site dedicated to capturing penalty data so that users can basically settle disputes “over a player and their frequent ability to get away with murder, or not.” The site’s author does a pretty good job of articulating problems with the data and any mitigation actions taken. Ultimately, the data on the site is provided by nflfastR. 1
Now that I’ve talked about the general concept and the search for a data source, here are my next steps:
Collect Roughing the Passer (RTP) data for the quarterback from NFL Penalties.
Collect the relevant biographical data on each of the quarterbacks.
Use Python and relevant libraries such as Pandas2 and matplotlib3 to perform data cleaning, exploration, univariate and bivariate analysis, and visualizations.
Publish the findings along with the documented methodology.
I’m not sure how long this will take me, could be a day, could be weeks. Either way, check back often for updates on the project’s progress.
Carl S, Baldwin B (2023). nflfastR: Functions to Efficiently Access NFL Play by Play Data. https://www.nflfastr.com/, https://github.com/nflverse/nflfastR. ↩︎
The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.1.0). Zenodo. https://doi.org/10.5281/zenodo.8301632 ↩︎
So, the first step in building my data science portfolio was to “Select Relevant Projects,” at least according to ChatGPT. After thinking about this for some time, I decided I was wasting my time trying to come up with projects on my own, so back to ChatGPT I went with the following prompt:
Can you suggest some relevant data science projects to start with?
The next set of answers really started to give me some direction. If you are not an industry expert in a particular field, it can be difficult to see the possibilities in all the data that is available out there in the world. At work, I know what data I have available to me, and I can visualize in my mind’s eye the questions my clients haven’t even thought to ask. This way, I can proactively dig into my data and tease out details; presenting the answers to the client before they even know to ask the question tends to make me look like a rockstar. However, coming up with practice projects and problems without being a subject matter expert tends to give me a bit of “writer’s block,” so to speak. Asking ChatGPT for help refined my focus. From the list provided below, I can now take each topic and develop it into an individual project to showcase my skills. These are the suggestions and explanations it provided:
Exploratory Data Analysis (EDA) on a Dataset: Conduct a thorough analysis of a dataset, including data cleaning, visualization, and deriving meaningful insights. You can choose a dataset related to a topic of interest, such as finance, healthcare, or sports.
Predictive Modeling with a Classic Dataset: Work with well-known datasets like the Iris dataset, Boston Housing dataset, or Titanic dataset. Build predictive models using techniques like linear regression, decision trees, or logistic regression.
Natural Language Processing (NLP) Project: Analyze and extract insights from text data. You can start with sentiment analysis on Twitter data or build a simple chatbot using NLP libraries like NLTK or spaCy.
Image Classification: Implement an image classification project using deep learning. The MNIST dataset for digit recognition or the CIFAR-10 dataset for object recognition are good starting points.
Time Series Analysis: Work with time-series data to forecast trends or anomalies. You can use financial market data or weather data to build forecasting models.
Customer Churn Prediction: Analyze customer data to predict customer churn for a business. This is a common problem in industries like telecommunications and subscription-based services.
Recommendation System: Build a recommendation system, like the ones used by e-commerce platforms or streaming services. Collaborative filtering and content-based methods are popular approaches.
Healthcare Data Analysis: Analyze healthcare data to identify trends, patient outcomes, or factors affecting disease prevalence. You can use datasets related to COVID-19 or electronic health records.
A/B Testing: Design and analyze an A/B test for a hypothetical or real-world scenario. Determine the impact of changes in a website or app on user behavior.
Fraud Detection: Build a fraud detection model using transaction data. This is valuable in the finance and e-commerce sectors.
Social Media Analysis: Analyze social media data to gain insights into user behavior, sentiment, or trends. Use APIs to collect data from platforms like Twitter or Reddit.
Geospatial Data Analysis: Work with geospatial data to solve location-based problems. You can analyze transportation data, plot maps, or predict real estate prices.
Market Basket Analysis: Analyze transaction data to discover associations between products and make product recommendations.
Text Generation with GANs: Experiment with Generative Adversarial Networks (GANs) to generate realistic text data, such as generating paragraphs in the style of famous authors.
Time Series Anomaly Detection: Implement anomaly detection techniques on time series data to identify unusual patterns or outliers.
Are you a new data professional trying to break into the data science space, or are you already working in the field? Although I currently work as a data professional and have a master’s degree in data analytics, sharing my work with anyone outside my current employer or clients is an industry no-no: intellectual property and non-disclosure agreements make the lawyers frown upon such actions. The solution: “Build a data science portfolio,” they say. I wish it were that easy. No one teaches you how to do that or what a data science portfolio even means. Questions like “Where do I start?” or “What do I include?” may nag at you and keep you from even getting started. Just take a look at the /datascience subreddit and you will find threads like What makes a good personal project – from the perspective of a hiring manager, People who make hiring decisions: what do you want to see in a portfolio?, and How to Build a Data Science Portfolio. As you can see, I was obviously wondering the same things, since I did the searches too. However, I tend to learn better by doing, so I’m planning on “winging it,” sort of.
What do I mean by winging it, you may wonder? I’m all in favor of using the tools that are already available, so I took to ChatGPT for some guidance. My first prompt was simple: “how to build a data scientist portfolio.” True to form, ChatGPT did not disappoint, and its advice was simple and concise:
Select Relevant Projects
Clean and Document Your Code
Create a Portfolio Website
Project Descriptions
Include Jupyter Notebooks
Visualize Data Effectively
Highlight Your Skills
Include a Blog Section
Add a Resume or CV
Engage in Open-Source Contributions
Seek Feedback
Update Regularly
Network and Share
Consulting ChatGPT will continue throughout the process.
To help manage the moving parts in this process, I’m relying on Atlassian’s Jira software to build a roadmap that will not only manage the process of standing up my portfolio but also keep track of the progress of my individual projects.
Jira Software is the #1 agile project management tool used by teams to plan, track, release and support world-class software with confidence.
As for sharing my work with the world, you’re here so it must be working. As I work through the points laid out by ChatGPT above, I will document my journey and share my thoughts, successes, and frustrations here. Follow along to see the portfolio grow.