First the Idea, then the Data: Navigating the Depths of Information

When working on a project that requires data, it is essential to consider the various sources and formats in which the information may be available. Often, the desired data cannot be found in a single location and has to be compiled carefully from multiple sources. This process can be time-consuming and challenging, as each source comes with its own complications.

While some data may be readily accessible through online databases, APIs, or official government reports, it may not always be presented in the most convenient or usable form. For instance, a dataset could be in a raw, unstructured format, making it difficult to analyze and extract meaningful insights. In such cases, data manipulation techniques are necessary to transform the data into a suitable format.

Data manipulation involves tasks such as cleaning, merging, and reshaping the data. Cleaning means correcting or removing inconsistencies, errors, and missing values, which ensures the data is accurate and reliable for analysis. Merging data from multiple sources means aligning and combining different datasets based on common attributes or keys, which allows for a comprehensive and holistic view of the data.

Reshaping the data involves rearranging its structure to fit the desired analysis or visualization needs. This can include tasks such as pivoting, aggregating, or disaggregating the data. By reshaping the data, researchers and analysts can uncover patterns, trends, and relationships that might otherwise go unnoticed.
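To make that a little more concrete, here is a minimal pandas sketch of a merge followed by a cleanup and a reshape. The dataframes, column names, and values are made up purely for illustration; they are not the tables used later in this post.

import pandas as pd

# Two hypothetical sources that share a common key ('player')
stats = pd.DataFrame({'player': ['A.Example', 'B.Sample', 'A.Example'],
                      'season': [2021, 2021, 2022],
                      'penalties': [3, 5, 2]})
bios = pd.DataFrame({'player': ['A.Example', 'B.Sample'],
                     'position': ['QB', 'QB']})

# Merging: align the two tables on the shared 'player' key
combined = stats.merge(bios, on='player', how='left')

# Cleaning: fill anything left missing after the merge
combined['position'] = combined['position'].fillna('Unknown')

# Reshaping: pivot so each season becomes its own column
by_season = combined.pivot_table(index='player', columns='season',
                                 values='penalties', aggfunc='sum')
print(by_season)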

In the given scenario, the QB Penalty Stats table lacked information on the racial background of the players. While this data may not be readily available in the original source, alternative sources can be explored. In this case, a list of black quarterbacks is available on Wikipedia. Caution is always advisable when using information from Wikipedia, but this particular list is well sourced and reliable enough for our purposes.

Working with data often goes beyond simply retrieving information from a single source. It requires diligent gathering, cleaning, merging, and reshaping of data to ensure accuracy and usability. With the right techniques and approaches, researchers can unlock the full potential of data and derive valuable insights.

Embarking on an Epic Journey

I now have a list of all QBs from 2009 until present (that’s 2023 for those of you who stumble upon my ramblings in the future) in the qb_rtp table. I also have a list of black QBs in the qbs_oc table. Sometimes I find it can be faster to do some of the data cleaning work in the native application. In this case, I needed to create a column in the qbs_oc table that I could compare against the ['Player'] column in the qb_rtp table. Luckily, Excel’s autofill feature made this particularly easy. I inserted a blank column next to ['Quarterback'] called ['short_name']. After typing a few examples, I selected the “Autofill” option in the Data drop-down menu, and voilà, I had the new column to use for comparison.
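For anyone who would rather stay in pandas for that step, a similar short_name column can be derived in code. This is only a rough sketch assuming the full names live in the ['Quarterback'] column (as they do in the table shown below); names with suffixes would still need manual fixes.

# A rough pandas alternative to the Excel autofill step:
# build 'F.Pollard'-style short names from the 'Quarterback' column.
def to_short_name(full_name):
    parts = full_name.split()
    return parts[0][0] + '.' + parts[-1]  # first initial + last word

# Note: a name like 'Robert Griffin III' would come out as 'R.III',
# so the results still need a manual once-over.
qbs_oc['short_name'] = qbs_oc['Quarterback'].apply(to_short_name)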

Starting to Use Notebooks

I haven’t used a Jupyter Notebook to code in a very long time. I usually use the Spyder IDE as I can see all my variables’ values at the same time. But I’m going to give it a shot. You can get the entire notebook here. The first step is to import the different libraries that I think I’ll need.

import pandas as pd
import openpyxl

Next, I will import the data sets.

qbs = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-1.xlsx', engine='openpyxl')
qbs_oc = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-2.xlsx', engine='openpyxl')

Make sure to verify the data has been loaded by viewing a few rows from each dataframe. If you are wondering what the '\n' means, it is the escape sequence for a newline character. It creates some extra space between the table headings and the table data to make them more readable. Dataquest has a great quick guide and cheat sheet available here.

print('Quarterback Stats table:\n', qbs.head(), '\n')
print('Quarterback of Color table:\n', qbs_oc.head())

#Output
Quarterback Stats table:
       Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2022  2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     2     0    170      0.15      5557         0.45     361     0.069   
1     0     0     94      0.18      3620         0.47     186     0.091   
2     3     0    228      0.18      7840         0.52     542     0.076   
3     0     0    149      0.15      4648         0.49     367     0.063   
4     0     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  
0         0.065         40.00  
1         0.051         29.41  
2         0.069         39.02  
3         0.079         26.09  
4         0.074         33.33  

[5 rows x 25 columns] 

Quarterback of Color table:
          Quarterback    short_name Years active  \
0      Fritz Pollard     F.Pollard    1920–1926   
1        Joe Lillard     J.Lillard    1932–1933   
2  George Taliaferro  G.Taliaferro    1950–1955   
3     Marlin Briscoe     M.Briscoe        1968*   
4       James Harris      J.Harris    1969–1981   

                                                Team  
0  Akron Pros, Milwaukee Badgers, Hammond Pros, P...  
1                                  Chicago Cardinals  
2  New York Yanks, Dallas Texans, Baltimore Colts...  
3                                     Denver Broncos  
4  Buffalo Bills, Los Angeles Rams, San Diego Cha...  

Smushing the Data Together

Now that the two initial datasets have been read into Pandas dataframes, it’s time to smush them together in a way that makes them useful for further analysis. The first step I take is adding a new variable to the Quarterback Stats Table (qbs) called ‘qboc’. I set all the values to 0 for right now.

qbs['qboc'] = 0

Once again check to make sure the variable was added to the dataframe as expected.

print(qbs.head())

#Output
Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   

   2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
0     0    170      0.15      5557         0.45     361     0.069   
1     0     94      0.18      3620         0.47     186     0.091   
2     0    228      0.18      7840         0.52     542     0.076   
3     0    149      0.15      4648         0.49     367     0.063   
4     0     79      0.15      2719         0.44     201     0.060   

   Sack Per Att  Third Down %  qboc  
0         0.065         40.00     0  
1         0.051         29.41     0  
2         0.069         39.02     0  
3         0.079         26.09     0  
4         0.074         33.33     0  

[5 rows x 26 columns]

The ‘qboc’ column was added to the end of the dataframe. The next step I take is to make the qbs['Player'] and the qbs_oc['short_name'] columns the indexes of their dataframes. This will speed up the comparison process. The difference wouldn’t be noticeable with datasets this size, but with millions of records an unindexed comparison will bog down the system… ask me how I know.

qbs = qbs.set_index('Player')
qbs_oc = qbs_oc.set_index('short_name')
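As with the earlier steps, it’s worth a quick sanity check that both indexes now hold the short player names before comparing them. Something simple like this does the job:

# Quick sanity check: both indexes should now contain short player names
print('qbs index:', qbs.index[:5].tolist())
print('qbs_oc index:', qbs_oc.index[:5].tolist())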

Now this is the really fun part. All I need to do now is compare the indexes of the two dataframes. If the indexes match, then I can assign a ‘1’ to the qbs[‘qboc’] variable to signify that that quarterback is indeed a quarterback of color.

qbs.loc[(qbs.index.isin(qbs_oc.index)), 'qboc'] = 1

This one line of code is pretty cool, I think. I want to dig into what this line actually does. I know this isn’t a Python coding post, but explaining will only take a minute and it unpacks some interesting concepts.

  1. qbs.index and qbs_oc.index specify that we are going to be doing something with the indexes of the two dataframes.
  2. .isin() is a method that returns a Boolean mask. What qbs.index.isin(qbs_oc.index) does is go through each value in the qbs index and check whether it also appears in the qbs_oc index. If it does, the Boolean value is True; if not, it is False. You can also save that mask to a variable and reuse it in other parts of your project (see the sketch after this list), but in this instance, we don’t need to.
  3. qbs.loc locates all the rows in the qbs dataframe where the mask value is True.
  4. ‘qboc’ specifies the column we want to assign our value to. In this case, a 1 signifies that the quarterback is a person of color.

The qbs.loc[] designates the dataframe we are intending to modify; (qbs.index.isin(qbs_oc.index)) designates the rows we want to work with; and ‘qboc’ designates the column.
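For example, saving the mask to a variable might look like this (is_qboc is just a name I’m using for illustration):

# Save the Boolean mask so it can be reused elsewhere in the project
is_qboc = qbs.index.isin(qbs_oc.index)

# The same assignment as the one-liner above, written with the saved mask
qbs.loc[is_qboc, 'qboc'] = 1

# The mask can be reused, e.g. to count how many QBs were flagged
print(is_qboc.sum(), 'quarterbacks matched the qbs_oc list')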

print(qbs.head())

#Output
           Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                  ...   
A.Dalton      25     0     0     4     1     1     5     0     1     1  ...   
A.Luck        17     0     0     0     5     3     4     1     1     0  ...   
A.Rodgers     41     3     3     3     5     2     2     5     4     0  ...   
A.Smith       23     1     1     1     1     1     6     4     1     5  ...   
B.Bortles     12     0     0     0     0     0     1     2     1     7  ...   

           2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player                                                                      
A.Dalton      0    170      0.15      5557         0.45     361     0.069   
A.Luck        0     94      0.18      3620         0.47     186     0.091   
A.Rodgers     0    228      0.18      7840         0.52     542     0.076   
A.Smith       0    149      0.15      4648         0.49     367     0.063   
B.Bortles     0     79      0.15      2719         0.44     201     0.060   

           Sack Per Att  Third Down %  qboc  
Player                                       
A.Dalton          0.065         40.00     0  
A.Luck            0.051         29.41     0  
A.Rodgers         0.069         39.02     0  
A.Smith           0.079         26.09     1  
B.Bortles         0.074         33.33     0 

As you can see, A.Smith now has a 1 in the qboc column. However, the A.Smith in the qbs dataset is actually Alex Smith, not Akili Smith, the quarterback referenced in the qbs_oc dataset. This illustrates the importance of going through your data to ensure accuracy. Luckily, these datasets are small enough to go through by hand to spot any errors.

double_check = qbs.loc[(qbs['qboc'] == 1)]
print(double_check.index)

#Output
Index(['A.Smith', 'C.Kaepernick', 'C.Newton', 'D.Prescott', 'D.Watson',
       'G.Smith', 'J.Brissett', 'J.Campbell', 'J.Freeman', 'J.Hurts',
       'J.Winston', 'K.Murray', 'L.Jackson', 'M.Vick', 'P.Mahomes',
       'R.Griffin III', 'R.Wilson', 'T.Bridgewater', 'T.Taylor'],
      dtype='object', name='Player')

# Use .loc for the assignment to avoid pandas' chained-assignment pitfalls
qbs.loc['A.Smith', 'qboc'] = 0
print(qbs.loc[['A.Smith']])

#Output
         Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
Player                                                                ...   
A.Smith     23     1     1     1     1     1     6     4     1     5  ...   

         2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
Player                                                                    
A.Smith     0    149      0.15      4648         0.49     367     0.063   

         Sack Per Att  Third Down %  qboc  
Player                                     
A.Smith         0.079         26.09     0  

[1 rows x 25 columns]
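To be safe, re-running the earlier check should show that A.Smith is no longer in the flagged list:

# Confirm the correction: A.Smith should no longer appear among the flagged QBs
double_check = qbs.loc[qbs['qboc'] == 1]
print('A.Smith' in double_check.index)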

Next Steps

Get ready for some quarterback madness in the upcoming post! We’ll be diving into our fancy new quarterbacks table and unleashing some wild analysis on the gridiron.