Cruising the Data Landscape: Exploring the Fundamentals of Data Exploration
Recap
Well, well, well! Looks like we’ve embarked on quite the adventure here! Our first task was to get to the bottom of the burning question: “Do quarterbacks of color get the short end of the stick when it comes to roughing the passer penalties compared to their fair-skinned counterparts?” Exciting stuff, huh?
The next step in our journey involved gathering the all-important data. We needed to make sure we were dealing with top-notch, premium-quality data. None of that shady, questionable stuff for us! We demanded data that was reliable, traceable, and oh-so-accurate. Because in the world of data science, it’s all about that golden rule: “garbage in, garbage out!” Can’t have our analysis going awry now, can we?
For this particular project, we scoured the depths of NFL Penalties and even tapped into the vast knowledge reserves of the mighty Wikipedia. We left no stone unturned, my friend!
Now, onto the thrilling next stage of our expedition – data cleaning and reshaping! Brace yourself, because this is where things get spicy. We had to create a nifty variable to indicate the race of those quarterbacks. Easy-peasy, right? Well, almost. Turns out, we stumbled upon a little naming convention conundrum that resulted in a sneaky duplicate value. But fear not! With our eagle-eyed attention to detail, we swiftly detected and corrected that pesky error. Crisis averted!
And now, dear explorer, we venture forth to the exhilarating realm of Data Exploration. Excited? You should be! There’s so much more to discover and uncover on this grand data-driven expedition! So, buckle up and get ready for the thrill ride ahead! Let’s dive into the vast ocean of information and see what fascinating insights await us!
Data exploration is an essential step in the data analysis process. It allows us to gain an initial understanding of the dataset by examining its general properties and characteristics. By delving into the data, we can uncover valuable insights.
One of the key aspects of data exploration involves assessing the size of the dataset. This includes understanding the number of observations or records and the number of variables or features present in the dataset. Knowing the scope of the dataset helps us gauge its comprehensiveness and potential usefulness for our analysis.
Furthermore, examining the data types is crucial in data exploration. By identifying the types of data contained in the dataset, such as numerical, categorical, or textual, we can determine the appropriate statistical techniques or visualization methods to apply during analysis. This enables us to make accurate interpretations and draw meaningful conclusions from the data.
Another crucial aspect is the identification of initial patterns or trends within the dataset. Exploring the data can reveal inherent relationships, correlations, or anomalies that may exist. By uncovering these patterns, we can develop hypotheses, generate insights, and pose relevant research questions for further investigation.
You may have noticed that I talk about using Python a lot in my work. I actually prefer to use it in my day job since it seems to get the most use and support from other data professionals. However, for this portion of my project, I’m going to be using R instead. It has been some time since I’ve used R (maybe before I even finished my master’s program), so full disclosure: I’m pretty sure I had to look up most of what I’m doing to refresh my memory. But since it is a language, if you don’t use it, you lose it, so I’m going to shake things up a bit and revisit R.
Data Exploration
The first thing I need to do is load any packages I need and my dataset. I am loading the ‘cleaned_qbs.xlsx’ Excel file that I created during the data cleaning process.
# import and store the dataset
library(openxlsx)
qbs <- read.xlsx("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
# Check to make sure the data loaded correctly
head(qbs, n=5)
#Output
A data.frame: 5 × 26
Player Total 2009 2010 2011 2012 2013 2014 2015 2016 ... 2023 Games Per.Game Attempts Per.100.Att Sacked Per.Sack Sack.Per.Att Third.Down.% qboc
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A.Dalton 25 0 0 4 1 1 5 0 1 ... 0 170 0.15 5557 0.45 361 0.069 0.065 40.00 0
2 A.Luck 17 0 0 0 5 3 4 1 1 ... 0 94 0.18 3620 0.47 186 0.091 0.051 29.41 0
3 A.Rodgers 41 3 3 3 5 2 2 5 4 ... 0 228 0.18 7840 0.52 542 0.076 0.069 39.02 0
4 A.Smith 23 1 1 1 1 1 6 4 1 ... 0 149 0.15 4648 0.49 367 0.063 0.079 26.09 0
5 B.Bortles 12 0 0 0 0 0 1 2 1 ... 0 79 0.15 2719 0.44 201 0.060 0.074 33.33 0
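Five rows are only a preview, and scanning a wide table by eye for unusually large or small values gets tedious fast. Here is a minimal sketch of one way to ask R for the extremes directly. It assumes a small stand-in data frame (qbs_head, a name made up for this example) built from three columns of the five rows printed above, so the answers apply only to those rows and not to the full qbs table.

```r
# Stand-in for qbs: three columns copied from the five rows printed by head()
qbs_head <- data.frame(
  Player = c("A.Dalton", "A.Luck", "A.Rodgers", "A.Smith", "B.Bortles"),
  Total  = c(25, 17, 41, 23, 12),
  Sacked = c(361, 186, 542, 367, 201),
  stringsAsFactors = FALSE
)

# Player in the row with the largest Total
top_total <- qbs_head$Player[which.max(qbs_head$Total)]
print(top_total)

# Player in the row with the smallest Sacked value
fewest_sacks <- qbs_head$Player[which.min(qbs_head$Sacked)]
print(fewest_sacks)

# Sort by Sacked so the smallest and largest values sit at the ends
print(qbs_head[order(qbs_head$Sacked), ])
```

Swapping qbs_head for qbs would run the same checks across every quarterback in the file, which is a quicker way to find standouts than reading the table row by row.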
Looking over the raw data, it is hard to see any particular patterns. However, a few data points do jump out because they are noticeably larger or smaller than the values around them. For example, the number of roughing the passer (RTP) penalties called each year stays in the single digits, with one exception: Josh Allen in 2020, when 11 RTP penalties were called in his favor. Another example is the number of sacks recorded against Taysom Hill. Most of the QBs have a sack count in the triple digits, only a few have double-digit sack counts, and Hill has the lowest of any QB in the data set.
The next thing I want to do is check how many records are in my table and the total number of variables. It is helpful to know what your sample size is so you can select the correct analytic approach later on.
nrow(qbs)
# Output
66
length(qbs)
#Output
26
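A small aside on those two calls: length() on a data frame counts its columns, because a data frame is stored as a list of columns. ncol() says the same thing more explicitly, and dim() returns rows and columns together. Here is a minimal sketch, assuming a stand-in data frame (qbs_head, a made-up name) built from a few columns of the five rows printed earlier rather than the full file.

```r
# Stand-in for qbs: a few columns copied from the five rows printed by head()
qbs_head <- data.frame(
  Player   = c("A.Dalton", "A.Luck", "A.Rodgers", "A.Smith", "B.Bortles"),
  Total    = c(25, 17, 41, 23, 12),
  Games    = c(170, 94, 228, 149, 79),
  Attempts = c(5557, 3620, 7840, 4648, 2719),
  Sacked   = c(361, 186, 542, 367, 201),
  qboc     = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Rows and columns in one call
print(dim(qbs_head))

# Number of records
print(nrow(qbs_head))

# Number of variables; gives the same value as length(qbs_head)
print(ncol(qbs_head))
```

On the full table, dim(qbs) should report the same 66 and 26 shown in the two outputs above.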
The next step is to understand the data types of each of the variables. Luckily, when the head of the table was printed, it also included the data type of each variable. In our cleaned_qbs table, we have one variable with a character data type (Player) and 25 variables with the data type ‘dbl’, short for double, which is R’s double-precision floating-point number.
#Output
A data.frame: 5 × 26
Player Total 2009 2010 2011 2012 2013 2014 2015 2016 ... 2023 Games Per.Game Attempts Per.100.Att Sacked Per.Sack Sack.Per.Att Third.Down.% qboc
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
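The printed header hides the columns between 2016 and 2023 behind the "...", so it can be worth confirming the types programmatically rather than reading them off the display. Here is a minimal sketch, assuming a made-up stand-in (qbs_head) built from a few columns of the five rows shown by head(). On the real table you would pass qbs instead.

```r
# Stand-in for qbs: a few columns copied from the five rows printed by head()
qbs_head <- data.frame(
  Player   = c("A.Dalton", "A.Luck", "A.Rodgers", "A.Smith", "B.Bortles"),
  Total    = c(25, 17, 41, 23, 12),
  Attempts = c(5557, 3620, 7840, 4648, 2719),
  Sacked   = c(361, 186, 542, 367, 201),
  qboc     = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Class of every column as a named character vector.
# Columns displayed as <dbl> report the class "numeric" here.
col_types <- sapply(qbs_head, class)
print(col_types)

# How many columns there are of each type
print(table(col_types))

# Names of any columns that are not numeric
non_numeric <- names(qbs_head)[!sapply(qbs_head, is.numeric)]
print(non_numeric)
```

If the description above holds, running the same lines on the full table should show one character column (Player) and 25 numeric ones.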
Next Steps
Now that we understand how much and what kind of data we have, we can develop a game plan for which analytic techniques to use in the next step. With univariate, bivariate, and multivariate analysis, we will begin to explore the data and any relationships that might exist between the variables. This will get us closer to a possible answer to our original question.