First the Idea, then the Data: Navigating the Depths of Information
When working on a project that requires data, it is essential to consider the various sources and formats in which the information may be available. Often, the desired data cannot be found in one single location, requiring careful compilation from multiple sources. This process can be time-consuming and challenging, as each source may present its own set of complications and intricacies.
While some data may be readily accessible through online databases, APIs, or official government reports, it may not always be presented in the most convenient or usable form. For instance, a dataset could be in a raw, unstructured format, making it difficult to analyze and extract meaningful insights. In such cases, data manipulation techniques are necessary to transform the data into a suitable format.
Data manipulation involves various tasks such as cleaning, merging, and reshaping the data. Cleaning the data involves removing any inconsistencies, errors, or missing values. This step ensures that the data is accurate and reliable for analysis. Merging data from multiple sources requires aligning and combining different datasets based on common attributes or keys. This consolidation allows for a comprehensive and holistic view of the data.
Reshaping the data involves rearranging its structure to fit the desired analysis or visualization needs. This can include tasks such as pivoting, aggregating, or disaggregating the data. By reshaping the data, researchers and analysts can uncover patterns, trends, and relationships that might otherwise go unnoticed.
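To make those three tasks concrete, here is a minimal pandas sketch using made-up data (the column names and values are purely illustrative, not from the actual datasets below):

import pandas as pd

# Made-up data for illustration only
stats = pd.DataFrame({'player': ['A.Jones', 'B.Smith', 'B.Smith'],
                      'year': [2009, 2009, 2010],
                      'penalties': [3, None, 5]})

# Cleaning: drop rows with missing values
stats = stats.dropna(subset=['penalties'])

# Merging: combine with a second source on a shared key
teams = pd.DataFrame({'player': ['A.Jones', 'B.Smith'], 'team': ['X', 'Y']})
combined = stats.merge(teams, on='player')

# Reshaping: pivot so each year becomes its own column
wide = combined.pivot_table(index='player', columns='year', values='penalties')
print(wide)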
In the given scenario, the QB Penalty Stats table lacked information on the racial background of the players. While this data may not be readily available in the original source, alternative sources can be explored. In this case, a well-sourced list of black quarterbacks could be found on Wikipedia. Although caution is advisable when using information from Wikipedia, this particular list is thoroughly cited and reliable enough for our purposes.
Working with data often goes beyond simply retrieving information from a single source. It requires diligent gathering, cleaning, merging, and reshaping of data to ensure accuracy and usability. With the right techniques and approaches, researchers can unlock the full potential of data and derive valuable insights.
Embarking on an Epic Journey
I now have a list of all QBs from 2009 until present (that's 2023 for those of you who stumble upon my ramblings in the future) in the qb_rtp table. I also have a list of black QBs in the qbs_oc table. Sometimes I find it can be faster to do some of the data cleaning work in the native application. In this case, I needed to create a column in the qbs_oc table that I could use to compare against the qb_rtp['Player'] column in the other table. Luckily, Excel's Flash Fill feature made this particularly easy. I inserted a blank column next to ['Quarterback'] called ['short_name']. After typing a few examples, I selected the "Flash Fill" option in the Data menu, and voilà, I had the new column that I can use for comparison.
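If you would rather stay in Python, a rough pandas equivalent might look like the sketch below. It assumes every name follows a simple 'First Last' pattern, so names with suffixes (Robert Griffin III, for example) would need special handling:

import pandas as pd

# Illustrative only: derive 'F.Lastname' short names from full names
names = pd.DataFrame({'Quarterback': ['Fritz Pollard', 'Joe Lillard']})
parts = names['Quarterback'].str.split()
names['short_name'] = parts.str[0].str[0] + '.' + parts.str[-1]
print(names)   # F.Pollard, J.Lillard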
Starting to Use Notebooks
I haven’t used a Jupyter Notebook to code in a very long time. I usually use the Spyder IDE as I can see all my variables’ values at the same time. But I’m going to give it a shot. You can get the entire notebook here. The first step is to import the different libraries that I think I’ll need.
import pandas as pd  # dataframe handling and Excel ingestion
import openpyxl      # the engine pandas uses to read .xlsx files
Next, I will import the data sets.
qbs = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-1.xlsx', engine='openpyxl')
qbs_oc = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-2.xlsx', engine='openpyxl')
Make sure to verify the data has been loaded by viewing a few rows from each dataframe. If you are wondering what the '\n' means, it is the escape sequence for a newline character. It creates some extra space between the table headings and the table data to make them more readable. Dataquest has a great quick guide and cheat sheet available here.
print('Quarterback Stats table:\n', qbs.head(), '\n')
print('Quarterback of Color table:\n', qbs_oc.head())
#Output
Quarterback Stats table:
Player Total 2009 2010 2011 2012 2013 2014 2015 2016 ... \
0 A.Dalton 25 0 0 4 1 1 5 0 1 ...
1 A.Luck 17 0 0 0 5 3 4 1 1 ...
2 A.Rodgers 41 3 3 3 5 2 2 5 4 ...
3 A.Smith 23 1 1 1 1 1 6 4 1 ...
4 B.Bortles 12 0 0 0 0 0 1 2 1 ...
2022 2023 Games Per Game Attempts Per 100 Att Sacked Per Sack \
0 2 0 170 0.15 5557 0.45 361 0.069
1 0 0 94 0.18 3620 0.47 186 0.091
2 3 0 228 0.18 7840 0.52 542 0.076
3 0 0 149 0.15 4648 0.49 367 0.063
4 0 0 79 0.15 2719 0.44 201 0.060
Sack Per Att Third Down %
0 0.065 40.00
1 0.051 29.41
2 0.069 39.02
3 0.079 26.09
4 0.074 33.33
[5 rows x 25 columns]
Quarterback of Color table:
Quarterback short_name Years active \
0 Fritz Pollard F.Pollard 1920–1926
1 Joe Lillard J.Lillard 1932–1933
2 George Taliaferro G.Taliaferro 1950–1955
3 Marlin Briscoe M.Briscoe 1968*
4 James Harris J.Harris 1969–1981
Team
0 Akron Pros, Milwaukee Badgers, Hammond Pros, P...
1 Chicago Cardinals
2 New York Yanks, Dallas Texans, Baltimore Colts...
3 Denver Broncos
4 Buffalo Bills, Los Angeles Rams, San Diego Cha...
Smushing the Data Together
Now that the two initial datasets have been read into Pandas dataframes, it's time to smush them together in a way that makes them useful for further analysis. The first step I take is adding a new column to the Quarterback Stats table (qbs) called 'qboc'. I set all the values to 0 for now.
qbs['qboc'] = 0
Once again, check to make sure the column was added to the dataframe as expected.
print(qbs.head())
#Output
Player Total 2009 2010 2011 2012 2013 2014 2015 2016 ... \
0 A.Dalton 25 0 0 4 1 1 5 0 1 ...
1 A.Luck 17 0 0 0 5 3 4 1 1 ...
2 A.Rodgers 41 3 3 3 5 2 2 5 4 ...
3 A.Smith 23 1 1 1 1 1 6 4 1 ...
4 B.Bortles 12 0 0 0 0 0 1 2 1 ...
2023 Games Per Game Attempts Per 100 Att Sacked Per Sack \
0 0 170 0.15 5557 0.45 361 0.069
1 0 94 0.18 3620 0.47 186 0.091
2 0 228 0.18 7840 0.52 542 0.076
3 0 149 0.15 4648 0.49 367 0.063
4 0 79 0.15 2719 0.44 201 0.060
Sack Per Att Third Down % qboc
0 0.065 40.00 0
1 0.051 29.41 0
2 0.069 39.02 0
3 0.079 26.09 0
4 0.074 33.33 0
[5 rows x 26 columns]
The 'qboc' column was added to the end of the dataframe. The next step I take is to make the qbs['Player'] and the qbs_oc['short_name'] columns the indexes of the dataframes. This will speed up the comparison process. Although the difference wouldn't be noticeable with datasets this small, matching on a plain column in a dataset with millions of records will bog down the system… ask me how I know.
qbs = qbs.set_index('Player')
qbs_oc = qbs_oc.set_index('short_name')
Now this is the really fun part. All I need to do now is compare the indexes of the two dataframes. If the indexes match, then I can assign a 1 to the qbs['qboc'] column to signify that that quarterback is indeed a quarterback of color.
qbs.loc[(qbs.index.isin(qbs_oc.index)), 'qboc'] = 1
This one line of code is pretty cool, I think. I want to dig into what this line actually does. I know this isn’t a Python coding post, but explaining will only take a minute and it unpacks some interesting concepts.
- qbs.index and qbs_oc.index specify that we are going to be doing something with the indexes of the two dataframes.
- .isin() is a method that returns a Boolean mask. What qbs.index.isin(qbs_oc.index) does is take each value in the qbs index and check whether it appears in the qbs_oc index. If it does, the Boolean value is True; if not, it is False. You can also save the mask to a variable and reuse it in other parts of your project (see the sketch below), but in this instance, we don't need to.
- qbs.loc locates all the rows in the qbs dataframe where the mask value is True.
- 'qboc' specifies the column we want to assign our value to. In this case, a 1 signifies that the quarterback is a person of color.
In short, qbs.loc designates the dataframe we intend to modify; (qbs.index.isin(qbs_oc.index)) designates the rows we want to work with; and 'qboc' designates the column.
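Since the mask is just a reusable object, here is the same operation broken into explicit steps (the printed values are what the first five rows of this dataset should produce):

mask = qbs.index.isin(qbs_oc.index)  # Boolean array, one entry per row of qbs
print(mask[:5])                      # [False False False  True False] -- A.Smith matches
qbs.loc[mask, 'qboc'] = 1            # same effect as the one-liner above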
print(qbs.head())
#Output
Total 2009 2010 2011 2012 2013 2014 2015 2016 2017 ... \
Player ...
A.Dalton 25 0 0 4 1 1 5 0 1 1 ...
A.Luck 17 0 0 0 5 3 4 1 1 0 ...
A.Rodgers 41 3 3 3 5 2 2 5 4 0 ...
A.Smith 23 1 1 1 1 1 6 4 1 5 ...
B.Bortles 12 0 0 0 0 0 1 2 1 7 ...
2023 Games Per Game Attempts Per 100 Att Sacked Per Sack \
Player
A.Dalton 0 170 0.15 5557 0.45 361 0.069
A.Luck 0 94 0.18 3620 0.47 186 0.091
A.Rodgers 0 228 0.18 7840 0.52 542 0.076
A.Smith 0 149 0.15 4648 0.49 367 0.063
B.Bortles 0 79 0.15 2719 0.44 201 0.060
Sack Per Att Third Down % qboc
Player
A.Dalton 0.065 40.00 0
A.Luck 0.051 29.41 0
A.Rodgers 0.069 39.02 0
A.Smith 0.079 26.09 1
B.Bortles 0.074 33.33 0
As you can see, A.Smith now has a 1 in the qboc column. However, the A.Smith in the qbs dataset is actually Alex Smith, not Akili Smith, who is the quarterback referenced in the qbs_oc dataset. This illustrates the importance of going through your data to ensure accuracy. Luckily, these datasets are small enough to go through and spot any errors.
double_check = qbs.loc[(qbs['qboc'] == 1)]
print(double_check.index)
#Output
Index(['A.Smith', 'C.Kaepernick', 'C.Newton', 'D.Prescott', 'D.Watson',
'G.Smith', 'J.Brissett', 'J.Campbell', 'J.Freeman', 'J.Hurts',
'J.Winston', 'K.Murray', 'L.Jackson', 'M.Vick', 'P.Mahomes',
'R.Griffin III', 'R.Wilson', 'T.Bridgewater', 'T.Taylor'],
dtype='object', name='Player')
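To make the review easier, you can also pull the full name behind each flagged short name from the qbs_oc table, so ambiguous abbreviations like 'A.Smith' stand out right away:

# Show the full name behind every short name that was flagged in qbs
print(qbs_oc.loc[qbs_oc.index.isin(double_check.index), 'Quarterback'])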
# Use .loc for the fix; chained indexing like qbs['qboc']['A.Smith'] = 0 can fail silently
qbs.loc['A.Smith', 'qboc'] = 0
print(qbs.loc[['A.Smith']])
#Output
Total 2009 2010 2011 2012 2013 2014 2015 2016 2017 ... \
Player ...
A.Smith 23 1 1 1 1 1 6 4 1 5 ...
2023 Games Per Game Attempts Per 100 Att Sacked Per Sack \
Player
A.Smith 0 149 0.15 4648 0.49 367 0.063
Sack Per Att Third Down % qboc
Player
A.Smith 0.079 26.09 0
[1 rows x 25 columns]
Next Steps
Get ready for some quarterback madness in the upcoming post! We’ll be diving into our fancy new quarterbacks table and unleashing some wild analysis on the gridiron.