The Journey

Mapping Your Project Odyssey: Jira Tasks to Chart the Course for Web App Development

While I have used JIRA in a limited way in the past, I’m still pretty new to the platform. I wanted to see if ChatGPT could help with some of the tasks to make sure I’m hitting all the right points. This way I can get some practice with JIRA and integrating it into my development process. However, since I’m a team of one at the moment, I think some of the really important collaboration features JIRA offers will go to waste.

The first thing I did was to make sure that my JIRA project and my GitHub repo for the project were integrated with each other. I’m using the free version of JIRA, so I can’t actually set permissions for viewing, and you’ll have to forgive me if I don’t want to make everyone on earth an admin, so hopefully the integration with GitHub will serve as a stopgap. In case you’re wondering how things are going, check out the GECACS repo.

In my previous post, Crafting Clear Paths: A Comprehensive Guide to Defining Requirements for Your Web Application Project – My Data Science Journey (myordinaryjourney.com), ChatGPT outlined the steps I should take for developing and capturing system requirements. In this post, I’m going to be adding tasks into JIRA based on Chat’s previous step 1 recommendations.

My ChatGPT Prompt

For step one, can you recommend some JIRA issues I should be tracking?

ChatGPT Response

At the initial stage of defining requirements and planning your project, here are some Jira issues or tasks you might consider tracking:

  1. Epic Creation:
    • Create epics in Jira to represent major themes or sections of your project. For example:
      • Epic 1: Requirement Gathering
      • Epic 2: Database Setup
      • Epic 3: User Authentication
  2. User Stories:
    • Break down your requirements into user stories. These represent specific functionalities or features from a user’s perspective. For instance:
      • Story 1: As a guild librarian, I want to add new equipment to the system.
      • Story 2: As a guild member, I want to check out equipment for use.
      • Story 3: As a guild member, I want to see a history of equipment I’ve checked out.
  3. Technical Tasks:
    • Include technical tasks necessary for setting up your development environment, integrating Firebase, and configuring the backend. For instance:
      • Task 1: Set up Python and Flask development environment.
      • Task 2: Configure Firebase project for the database.
      • Task 3: Implement Firebase user authentication.
  4. Documentation:
    • Track tasks related to documentation, which is crucial for maintaining clear records and instructions for the project. For example:
      • Task 4: Create a requirements document based on stakeholder meetings.
      • Task 5: Draft user documentation for the guild librarian on using the application.
  5. Stakeholder Collaboration:
    • Create tasks related to stakeholder collaboration, such as scheduling meetings and gathering requirements. For instance:
      • Task 6: Schedule meeting with guild librarian to discuss equipment tracking needs.
      • Task 7: Review and validate requirements with stakeholders.
  6. Project Setup:
    • Include tasks related to setting up your project management system and initial configurations. For example:
      • Task 8: Set up Scrum/Kanban board in Jira for project management.
      • Task 9: Define workflow stages for user stories and epics in Jira.
  7. Research and Exploration:
    • Track tasks related to exploring different tools, technologies, or best practices that might be beneficial for your project. For instance:
      • Task 10: Research best practices for Firebase database design.
      • Task 11: Explore Flask extensions for Firebase integration.

These suggested issues and tasks in Jira can serve as a starting point for organizing your project and breaking down the work into manageable components.

The Result

Epic Creation

I just used what Chat gave me for the Epics. I will have to add more when I get to the other steps.

User Stories

In addition to the user stories Chat defined for me, I added:

  • As a guild member, I want to see where I am on the waitlist.
  • As a guild member, I was to reserve a piece of equipment.
  • As a guild librarian, I want to see what equipment is checked out.
  • As a guild librarian, I want to see where each piece of equipment is currently located.

Task Creation

For the most part I used that tasks that Chat recommended. However, I chose to omit the tasks defined in the project setup step, as I’ve already done those. Although thinking about it now, I wonder if I should add them and just mark them complete.

    Next Steps

    My next steps will be to do my task estimation using story points, associating the tasks and user stories with the epics, and creating any additional epics, user stories, or tasks that I may have missed. I will also need to define my sprint length. Right now, in my communications plan, I am set up to do weekly check-ins with the librarian to give status updates. I think I’ll change that to every two weeks so that I can align the communication plan with the sprint length.

    Feedback

    Crafting Clear Paths: A Comprehensive Guide to Defining Requirements for Your Web Application Project

    Turns out step one isn’t just step one.

    My ChatGPT Prompt

    How do I do Step One: Define Requirements

    ChatGPT Response

    Defining requirements is a crucial step in the project development process, and it involves gathering, documenting, and understanding what your stakeholders (in this case, the guild librarian) need from the web application. Here’s a guide on how to approach this:

    1. Identify Stakeholders:
      • Determine who the key stakeholders are. In this case, it’s likely the guild librarian, but there may be other individuals or groups with a vested interest in the project.
    2. Hold Stakeholder Meetings:
      • Schedule meetings with the stakeholders to discuss the project. During these meetings, ask open-ended questions to encourage discussions about their needs, expectations, and any specific functionalities they envision.
    3. Ask Specific Questions:
      • To gather detailed information, ask specific questions about the equipment tracking system. For example:
        • What types of equipment does the guild have?
        • What information needs to be tracked for each piece of equipment?
        • How should the system handle check-in and check-out processes?
        • Are there specific reports or notifications that would be helpful?
    4. Document Requirements:
      • Document all the information gathered in a structured manner. Create a Requirements Document or use a tool like Confluence, Google Docs, or even a simple spreadsheet. Ensure that the document is clear and accessible to all team members.
    5. Clarify Priorities:
      • Work with the stakeholders to prioritize the identified requirements. This will help you focus on the most critical features and functionalities, especially if there are time or resource constraints.
    6. Verify and Validate Requirements:
      • Once you have a draft of the requirements, go back to the stakeholders to verify and validate the information. This ensures that you’ve captured their needs accurately and haven’t missed anything.
    7. Consider Edge Cases:
      • Think about potential edge cases or scenarios that might not have been explicitly mentioned. Anticipate challenges or exceptions that could occur in real-world usage.
    8. Iterate as Necessary:
      • Requirements gathering is an iterative process. Be prepared to revisit and update the requirements document as the project progresses and new insights emerge.
    9. Create User Stories:
      • Based on the requirements, start creating user stories. User stories represent specific functionalities from an end-user perspective and will be useful when you set up your project management system.
    10. Communicate Clearly:
      • Ensure effective communication between the development team and stakeholders. Clearly articulate what is feasible within the project scope and timeline.

    By following these steps, you’ll establish a solid foundation for your project, ensuring that you have a clear understanding of what needs to be built and how it aligns with the expectations of the guild librarian and other stakeholders.

    The Results

    I have already collected some of the information already. So this is the inventory of what I have, what I think I know, and what I need to follow-up on.

    Identify Stakeholders: This is indeed pretty straight forward. The main stakeholder is going to be the guild librarian, but I also want to include the guild’s membership because they are going to be the ones that actually initiate the check-out process.

    Hold Stakeholder Meetings: This step is a little over kill for our little group. We only have about 55 members. During a previous meeting with the librarian I feel like we already completed this step. Follow-up meetings will probably need to happen, at least to demonstrate what’s been done in the development and to present my approach to finding the solution that works best for our needs.

    Ask Specific Questions:

    • What types of equipment does the guild have? We already have an inventory of the equipment. It includes looms, spinning wheels, and other tools.
    • What information needs to be tracked for each piece of equipment?
      • Where the item is stored?
      • What is Its status?
      • Who has it checked-out if it is checked-out?
      • Is there someone on the waitlist for the equipment?
      • When is it expected to be checked-in?
    • How should the system handle check-in and check-out processes?
      • –> I had to ask a follow up question on this one because this sounded a little vague to me. ChatGPT came back with the following:
        • User Interaction: Through a web form
        • Equipment Identification: Each piece of equipment has a unique serial number (SN)
        • Transaction Logging: Collection of data such as the equipment SN, when it was checked out and to whom.
        • Notification System: Several notifications will be generated for each check-out/in
        • Reservation/Waitlist System: I will need to follow up on the reservation system, but there will need to have a waitlist for items.
        • Overdue Alerts: Lending time is 30-days, but can be re-checked out if there is no waitlist.
        • Access Controls: Only guild members should have access.
        • Integration with Database: The DB should be updated whenever equipment is checked-out/in
    • Are there specific reports or notifications that would be helpful?
      • Yes, the librarian needs to be alerted when the equipment is returned and if the item has a waitlist, the next user is notified that the item is available. 30-Day Check-out renewal sent if no waitlist, otherwise a notification that the item needs to be returned because there is a waitlist.

    Document Requirements: Since I already have the project setup with JIRA I will use Confluence to document what those requirements are.

    Clarify Priorities: The most important feature for the librarian is to be able to accurately and easily track the check-in of the equipment. There are several other features that need to be developed that will be transparent to the end-users. The stakeholders just want the ability, they are not expecting an iterative process.

    Verify and Validate Requirements: Need to do

    Consider Edge Cases: The system is pretty straight forward, but some of the other things I can see needing:

    • A way for the members to check the status of the equipment
    • A way to submit a query about the equipment or about the process

    Iterate as Necessary: Self-explanatory

    Create User Stories: Need to do

    Communicate Clearly: Need to create a regular update schedule to let the librarian and officers know the project’s progress.

    Next Step(s)

    Clearly, I’m going to need to do some documentation and go back to communicate my assumptions with the stakeholders to ensure I’m going in the right direction. I need to add to Confluence and JIRA to track what I’ve been working on and my work with the stakeholders.


    Submit Feedback

    Crafting Efficiency: Building a Guild Equipment Tracking System with Flask, Firebase, and WordPress

    Like I said, I’ve never done something like this before, so I’m going to use ChatGPT as a teacher/project manager. So the first thing I did was told CGPT what it is I want to do, and how I want to use it.

    My ChatGPT Prompt

    I want to develop a web application for the Fredericksburg spinners and weavers guild called GECACS. It stands for Guild Equipment Check-out and Control System. The guild librarian needs a way to track the guild’s equipment. The guild has equipment stored at the studio and different member’s homes. The librarian needs a way to track the piece of equipment, when it is checked-out/in and to whom. I imagine the app will use Firebase for the databases. I will do coding in Python, and the app will be hosted on our WordPress website. I have never developed a web application and I would like for you the help guide me through the process.

    ChatGPT Response

    1. Define Requirements:
      • Collaborate with the guild librarian to define specific requirements for equipment tracking. Understand the necessary features and functionalities.
    2. Database Design (Firebase):
      • Utilize Firebase as the database platform. Design the data structure to accommodate equipment, member details, and transaction records in a NoSQL format.
    3. Choose a Web Framework:
      • Opt for Flask as the web framework for Python, considering your familiarity with the language. This choice aligns well with Firebase for backend development.
    4. Set Up Development Environment:
      • Install necessary tools, including Python, Flask, and Firebase SDK for Python. Configure your virtual environment to manage dependencies.
    5. Coding the Application:
      • Begin building the application by breaking down tasks. Develop routes, set up Firebase connections, and design templates for the frontend.
    6. Database Integration (Firebase):
      • Connect your Flask application to the Firebase NoSQL database. Utilize Firebase SDK or a similar library to facilitate seamless interactions.
    7. User Authentication (Firebase):
      • Implement user authentication using Firebase Authentication to control access. This ensures secure tracking of equipment check-ins and check-outs.
    8. Testing:
      • Regularly test the application, covering various scenarios to ensure its robustness. Test Firebase interactions and authentication thoroughly.
    9. Deployment (WordPress and Firebase Hosting):
      • Deploy the application to your WordPress website. Utilize Firebase Hosting for backend services and consider integrating the frontend into your WordPress site.
    10. User Training:
      • Provide comprehensive training for the guild librarian on how to effectively use the application, emphasizing the new Firebase backend.
    11. Project Management (Jira):
      • Implement a project management system using Jira. Define epics, user stories, and tasks. Choose between Scrum or Kanban based on your team’s preferences. Use Jira to track progress, assign tasks, and facilitate communication.
    12. Maintenance and Updates:
      • Plan for ongoing maintenance and updates. Utilize Jira to track issues, prioritize updates, and ensure the long-term stability of the application.

    Next Step

    Define Requirements – so let’s get started.

    Embarking on the Web Development Journey: GECACS – Weaving a Digital Tapestry for Guild Equipment Management

    In this blog series, I’ll delve into the exciting world of web development as I embark on the creation of GECACS (Guild Equipment Check-out and Control System) for the Fredericksburg Spinners and Weavers Guild. I have never developed a web application before, so I’m going to use ChatGPT 3.5 to guild me through the process.

    In addition to ChatGPT, some of the other tools I will use include GitHub, JIRA, Google Colab, Google Firebase, Python, and WordPress. Since I like to learn new things, I’m thinking of going with Google’s Firebase Realtime Database for my data handling needs.

    Why am I doing this?

    Many years ago I elected to use the Forminator plug-in to create a check-out form for our guild members to use when they wanted to check-out equipment. The problem is that there was no way for the member or librarian to document the equipment had been turned back in. Trying to associate the check-out form with a check-in was becoming difficult for our librarian to keep up with. Since I am the guild’s “technology officer” and with my data science background, it seamed like a natural option to develop a solutions that would showcase my current skills and help me develop new ones.

    Transportation and Traffic Congestion Analysis Project Plan

    Problem

    What are the main causes of traffic congestion? Can I predict traffic congestion? Can I optimize my commute for time and cost?

    Objective

    To analyze traffic congestion patterns in Washington, D.C metro area and identify factors contributing to congestion. Additionally, to develop predictive models for traffic congestion based on historical data.

    Project Steps

    Problem Definition and Data Collection

    • Define the scope of the project, including the target city or area.
    • Identify data sources:
      • Traffic data: Real-time traffic data from transportation authorities or APIs (e.g., Google Maps API, HERE API).
      • Weather data: Historical weather data from sources like NOAA or weather APIs.
      • Event data: Information on accidents, road closures, and special events affecting traffic.
      • Road infrastructure data: Road network maps and information about traffic signals.
      • Historical traffic data: Historical traffic flow data for model training.

    Data Collection and Preprocessing

    • Gather data from the identified sources, including API integration where applicable.
    • Clean and preprocess the data:
      • Handle missing values.
      • Standardize formats and units (e.g., time zones, measurement units).
      • Merge and aggregate data from different sources into a unified dataset.
      • Perform exploratory data analysis to understand data distributions and patterns.

    Feature Engineering

    • Create relevant features for analysis and modeling:
    • Time-based features (e.g., time of day, day of week, holidays).
    • Weather-related features (e.g., temperature, precipitation).
    • Road-specific features (e.g., road type, number of lanes).
    • Event-related features (e.g., accident occurrence).

    Data Analysis

    • Visualize traffic congestion patterns over time.
    • Explore correlations between traffic congestion and factors like weather, events, and road characteristics.
    • Conduct statistical analyses to identify significant contributors to congestion.

    Predictive Modeling

    • Split the dataset into training and testing sets.
    • Develop machine learning models to predict traffic congestion levels.
    • Experiment with various algorithms (e.g., regression, time series forecasting, neural networks).
    • Evaluate model performance using appropriate metrics (e.g., RMSE, MAE).

    Interpretation and Insights

    • Interpret model results to understand which factors are most influential in predicting traffic congestion.
    • Provide recommendations for congestion mitigation based on findings.

    Documentation and Reporting

    • Create a comprehensive report summarizing the project, methodology, and results.
    • Include visualizations and insights.
    • Share the code and documentation on platforms like GitHub.

    Presentation and Communication

    • Prepare a presentation to communicate findings and recommendations to stakeholders.
    • Future Work
    • Discuss potential extensions or improvements to the project, such as real-time congestion prediction or integration with traffic management systems.

    Final Portfolio Inclusion

    • Document the entire project, including code, datasets, reports, and presentations, in your data science portfolio.

    Possible Data Sources

    • Real-time Traffic Data API (e.g., Google Maps Traffic API)
    • Weather Data API (e.g., NOAA API)
    • Government Traffic Data Portals
    • Local Transportation Authorities
    • Event Data from News Sources or Event APIs
    • Road Network Data Providers
    Time-Driven Traffic Insights: A Deep Dive into Transportation Data

    Oh boy, time to dive into another crazy data science adventure! This time, we’re tackling the chaotic realm of traffic in good ol’ Washington, D.C. Brace yourself for the daily battle against bumper-to-bumper madness and the heart-stopping dance of merging lanes. As a brave commuter, I’ve had enough of this madness and I refuse to succumb to its soul-sucking ways. My mission: to outsmart the traffic gods and find that sweet spot of minimum congestion and maximum savings. Picture the infamous interchange of northbound I-95 Express Lanes and the 495 Inner Loop Express Lane as our arch-nemesis, and we, the fearless data scientists, are here to give it a taste of its own medicine. Buckle up, my friend, because this is going to be one wild ride!

    Although time series analysis is a major component of this project, several opportunities exist to use multiple analytic techniques to include:

    • Spatial Analysis: This involves analyzing the spatial distribution of traffic congestion. Geographic Information Systems (GIS) can be used to visualize traffic patterns on maps and identify congestion hotspots.
    • Machine Learning: Beyond time series analysis, various machine learning techniques can be applied for traffic prediction and congestion analysis. These include regression models, clustering algorithms, and neural networks.
    • Network Analysis: This method focuses on the structure of transportation networks. It can be used to analyze road connectivity, identify bottlenecks, and optimize traffic flow.
    • Simulation Modeling: Traffic simulation models like microsimulation or agent-based modeling can be used to simulate and analyze traffic behavior under different scenarios. This is particularly useful for studying the impact of infrastructure changes.
    • Statistical Analysis: Traditional statistical methods can be employed to analyze relationships between traffic congestion and various factors such as weather, time of day, or road type.
    • Deep Learning: Deep learning techniques, such as Convolutional Neural Networks (CNNs), can be applied to analyze traffic camera images or video feeds for real-time congestion detection.
    • Optimization Models: Mathematical optimization models can be used to optimize traffic signal timings, route planning, and congestion mitigation strategies.
    • Behavioral Analysis: Understanding driver behavior and decision-making processes can be crucial for predicting and managing congestion. Behavioral analysis methods, such as choice modeling, can be applied.
    Exploring the Bivariate Landscape: Navigating Relationships in EDA

    If you are still with me, we are now entering the phase of analysis where the rubber begins to meet the road to eventually get to the answer to the original research question: Is there a connection between a quarterback’s race and the number of times they get a Roughing the Passer (RtP) call in their favor? So far, we have completed the following steps in our EDA journey:

    Now, we are diving into Bivariate Analysis. In this step, we will examine relationships between pairs of variables. Utilizing scatter plots, correlation matrices, and other tools, we will investigate how variables interact with each other, aiming to identify correlations, dependencies, or trends of interest.
    For this phase, I will once again employ R to gain additional practice. The initial step involves importing the necessary libraries for this part of the analysis, as well as the QBs dataset we’ve been working with.

    R
    install.packages("ggcorrplot")
    library(ggcorrplot)
    library(ggplot2)
    
    url <- "https://raw.githubusercontent.com/lareynolds/QB-EDA/main/cleaned_qbs.csv"
    
    qbs = read.csv(file = url)

    Now it may seem pretty easy and straight forward to follow the steps and just do what I just did. But what you don’t see is the extra tabs I have open to help through the process. For example, At first I tried to import the dataset as an Excel file, but that made the system throw some errors. So then, I thought it would be easier to import the file as a .csv file, and it sort of was, but I still got a bunch of errors. Apparently Google Colab doesn’t like to import files straight from GitHub as is, so off to Stack Overflow or Toward Data Science I go. In case you haven’t read anything on TDS, you should. This is a gem of quick tips and tricks for everything in data science.

    Thanks to A Apte’s TDS article “Get Started: 3 Ways to Load CSV files into Colab”, the answer is:

    “Click on the dataset in your repository, then click on View Raw. Copy the link to the raw dataset and store it as a string variable called url in Colab.”

    “Get Started: 3 Ways to Load CSV files into Colab”
    R
    # Double Checking to make sure the data is loaded into the notebook
    str(qbs)
    
    #Output
    'data.frame':	66 obs. of  26 variables:
     $ Player      : chr  "A.Dalton" "A.Luck" "A.Rodgers" "A.Smith" ...
     $ Total       : int  25 17 41 23 12 9 6 18 2 30 ...
     $ X2009       : int  0 0 3 1 0 0 1 0 0 3 ...
     $ X2010       : int  0 0 3 1 0 0 1 0 0 0 ...
     $ X2011       : int  4 0 3 1 0 2 0 0 0 3 ...
     $ X2012       : int  1 5 5 1 0 1 0 0 0 1 ...
     $ X2013       : int  1 3 2 1 0 0 0 0 0 1 ...
     $ X2014       : int  5 4 2 6 1 0 1 0 0 4 ...
     $ X2015       : int  0 1 5 4 2 2 1 0 1 2 ...
     $ X2016       : int  1 1 4 1 1 1 0 0 0 1 ...
     $ X2017       : int  1 0 0 5 7 1 1 0 0 3 ...
     $ X2018       : int  1 3 1 2 1 0 0 3 1 4 ...
     $ X2019       : int  6 0 3 0 0 0 1 4 0 0 ...
     $ X2020       : int  0 0 2 0 0 1 0 4 0 2 ...
     $ X2021       : int  3 0 5 0 0 1 0 5 0 6 ...
     $ X2022       : int  2 0 3 0 0 0 0 2 0 0 ...
     $ X2023       : int  0 0 0 0 0 0 0 0 0 0 ...
     $ Games       : int  170 94 228 149 79 62 60 74 44 190 ...
     $ Per.Game    : num  0.15 0.18 0.18 0.15 0.15 0.15 0.1 0.24 0.05 0.16 ...
     $ Attempts    : int  5557 3620 7840 4648 2719 1534 1554 2330 1230 7035 ...
     $ Per.100.Att : num  0.45 0.47 0.52 0.49 0.44 0.59 0.39 0.77 0.16 0.43 ...
     $ Sacked      : int  361 186 542 367 201 139 92 171 84 395 ...
     $ Per.Sack    : num  0.069 0.091 0.076 0.063 0.06 0.065 0.065 0.105 0.024 0.076 ...
     $ Sack.Per.Att: num  0.065 0.051 0.069 0.079 0.074 0.091 0.059 0.073 0.068 0.056 ...
     $ Third.Down..: num  40 29.4 39 26.1 33.3 ...
     $ qboc        : int  0 0 0 0 0 0 0 0 0 0 ...
     
     # Checking the summary statistics of each of the variables in the dataset
    summary(qbs)
    
    #Output
        Player              Total           X2009           X2010       
     Length:66          Min.   : 2.00   Min.   :0.000   Min.   :0.0000  
     Class :character   1st Qu.: 8.25   1st Qu.:0.000   1st Qu.:0.0000  
     Mode  :character   Median :15.50   Median :0.000   Median :0.0000  
                        Mean   :18.09   Mean   :0.697   Mean   :0.8333  
                        3rd Qu.:24.75   3rd Qu.:1.000   3rd Qu.:1.0000  
                        Max.   :57.00   Max.   :5.000   Max.   :6.0000  
         X2011           X2012           X2013           X2014      
     Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
     1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
     Median :0.000   Median :0.000   Median :0.000   Median :0.500  
     Mean   :1.121   Mean   :1.258   Mean   :1.076   Mean   :1.333  
     3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
     Max.   :8.000   Max.   :6.000   Max.   :8.000   Max.   :7.000  
         X2015           X2016           X2017           X2018      
     Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
     1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
     Median :1.000   Median :1.000   Median :0.000   Median :1.000  
     Mean   :1.455   Mean   :1.288   Mean   :1.303   Mean   :1.561  
     3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000  
     Max.   :7.000   Max.   :6.000   Max.   :7.000   Max.   :7.000  
         X2019            X2020            X2021            X2022      
     Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
     1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.000  
     Median : 0.000   Median : 0.000   Median : 0.000   Median :0.000  
     Mean   : 1.773   Mean   : 1.545   Mean   : 1.742   Mean   :1.076  
     3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.:2.000  
     Max.   :10.000   Max.   :11.000   Max.   :10.000   Max.   :6.000  
         X2023            Games           Per.Game         Attempts   
     Min.   :0.0000   Min.   : 42.00   Min.   :0.0500   Min.   : 289  
     1st Qu.:0.0000   1st Qu.: 61.25   1st Qu.:0.1125   1st Qu.:1794  
     Median :0.0000   Median : 77.50   Median :0.1650   Median :2316  
     Mean   :0.0303   Mean   : 99.53   Mean   :0.1768   Mean   :3250  
     3rd Qu.:0.0000   3rd Qu.:138.00   3rd Qu.:0.2200   3rd Qu.:4476  
     Max.   :1.0000   Max.   :253.00   Max.   :0.3500   Max.   :9725  
      Per.100.Att         Sacked         Per.Sack        Sack.Per.Att    
     Min.   :0.1300   Min.   : 26.0   Min.   :0.02400   Min.   :0.03000  
     1st Qu.:0.3900   1st Qu.:131.5   1st Qu.:0.05850   1st Qu.:0.05800  
     Median :0.5500   Median :170.0   Median :0.07800   Median :0.06900  
     Mean   :0.5736   Mean   :211.2   Mean   :0.08342   Mean   :0.07009  
     3rd Qu.:0.7400   3rd Qu.:264.5   3rd Qu.:0.10725   3rd Qu.:0.08350  
     Max.   :1.2400   Max.   :542.0   Max.   :0.20200   Max.   :0.10300  
      Third.Down..        qboc       
     Min.   : 0.00   Min.   :0.0000  
     1st Qu.:25.00   1st Qu.:0.0000  
     Median :30.77   Median :0.0000  
     Mean   :31.14   Mean   :0.2727  
     3rd Qu.:39.76   3rd Qu.:1.0000  
     Max.   :77.78   Max.   :1.0000  
     
     # Creates a subset dataframe with the numeric variables
    df = subset(qbs, select = c(Per.Game, Per.100.Att, Per.Sack, Sack.Per.Att, qboc)) # To drop columns use a '-' before the c and list the columns to drop
    
    # Create the scatter plot
    x <- qbs$Per.Game # x is the column with the RtP values per game for each quarterback
    y <- qbs$qboc # y is either the 1 or 0 value to indicate if the quarterback is black or not
    
    plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "QB of Color", col = "blue", pch = 19)
    
    # Add the trend line
    abline(lm(y ~ x), col = "red") # the 'lm' in this line is calling a liner model to fit a line that best fits the data by reducing the error

    Notice that the qboc is either a 1 or a 0. This is a binary variable. This might cause some implications later on. But for right now, it does appear that there is a negative association between the quaterbacks’ race and the number of RtP calls made in the quarterback’s favor. This means that the number of RtP calls decrease for quarterbacks of color.

    In the Univariate Analysis post, I decided that I would examine the Per.Game and qboc variables in this analysis, but after thinking about it, I decided it might be worth taking a look to see if maybe there was another reason for the number of RtP calls. Could a quarterback that gets sacked a lot also draw more RtP calls? Could this be an indication that the offensive line is not as good and can’t protect the quarterback as well? So I’m also going to do a scatter plot for the percentage of sacks by the number of RtP calls per game.

    R
    # Create the scatter plot
    x <- qbs$Per.Game
    y <- qbs$Per.Sack
    plot(x, y, main = "Scatter Plot with Trend Line", xlab = "RtP Calls per Game", ylab = "Sacks per Game", col = "blue", pch = 19)
    
    # Add the trend line
    abline(lm(y ~ x), col = "red")
    # This is the same code as above, but I switched out the variable for y.

    There does indeed seem to be a strong positive linear connection between the number of sacks and the number of RtP calls. One other thing to notice is that there may be some outliers for these variables. I would like to see if this is true so I am going to use the Interquartile Range (IQR) Method to determine if this is the case. This method involves calculating the IQR for your data and identifying data points outside the “whiskers” of the box plot (usually 1.5 times the IQR).

    R
    # Calculate the IQR for 'y' variable
    q <- quantile(y, c(0.25, 0.75))
    iqr <- q[2] - q[1]
    
    # Set a threshold for outliers
    threshold <- 1.5
    
    # Identify outliers
    outliers <- y < (q[1] - threshold * iqr) | y > (q[2] + threshold * iqr)
    
    # Print the indices of outlier data points
    which(outliers)
    
    # Output
    56
    
    value = qbs[56, 1]
    
    value
    
    # Output
    'R.Fitzpatrick'

    Well that is interesting, apparently Ryan Fitzpatrick took enough sacks to literally send his stats off the charts. In this case identifying the outlier was a “good to know” exercise. However, in other datasets outliers may indicate errors in the data and should be examined further. Now let’s get back to our task at hand.

    Let’s see if there is a correlation between our numeric variables. It does look like there is some level of negative correlation between the number of RtP calls and the quarterback’s race. And the scatter plot of the Sacks and RtP calls seem to be very correlated. We can create a correlation matrix to visualize the correlation coefficient between each of the variables.

    R
    corr_matrix <- cor(df)
    corr_plot<- ggcorrplot(corr_matrix,
               type = "lower",  # "lower" displays the lower triangular part of the matrix
               hc.order = TRUE, # reorder variables based on hierarchical clustering
               lab = TRUE,      # display variable names
               title = "Correlation Matrix Plot",
               ggtheme = theme_minimal()) # choose a theme for the plot (optional)
    
    
    # Change the size of the display window
    options(repr.plot.width = 10, repr.plot.height = 10)
    
    # Display the correlation matrix plot
    print(corr_plot)
    R
    RtP_correlation <- cor(qbs$Per.Game, qbs$qboc)
    RtP_correlation
    
    # Output
    -0.0523628105182807
    
    RtP_Sack_correlation <- cor(qbs$Per.Game, qbs$Per.Sack)
    RtP_Sack_correlation
    
    # Output
    0.89240005509169

    We can draw some conclusions from the correlation matrix, such as it appears as though quarterbacks of color get sacked more, but benefit less from RtP calls than their white counterparts. But don’t forget, the qboc variable is binary, so we need to make sure that we aren’t drawing some erroneous conclusions by using statistical methods meant to be used on continuous data.

    For this reason, I am also going to also look at the point-biserial correlation coefficient. This is a correlation coefficient used to measure the strength and direction of the linear relationship between a binary (dichotomous) variable (0 or 1) and a continuous variable.. It is an adapted form of the Pearson correlation coefficient, which is used for two continuous variables. The point-biserial correlation is appropriate when one of the variables is binary, and the other is continuous.

    Here are the key points about the point-biserial correlation coefficient:

    Purpose: The point-biserial correlation is used to determine if there is a significant linear relationship between a binary predictor variable (e.g., yes/no, pass/fail) and a continuous outcome variable (e.g., test scores, income).

    Range: The point-biserial correlation coefficient ranges from -1 to 1, similar to the Pearson correlation coefficient. A positive value indicates a positive relationship (as the binary variable increases, the continuous variable tends to increase), while a negative value indicates a negative relationship (as the binary variable increases, the continuous variable tends to decrease).

    Interpretation:

    • A coefficient close to 1 or -1 suggests a strong linear relationship.
    • A coefficient close to 0 suggests a weak or no linear relationship.
    • The sign (+ or -) indicates the direction of the relationship.

    Assumptions: Like the Pearson correlation, the point-biserial correlation assumes linearity and normality. It is also sensitive to outliers.

    Use Cases:

    • In research, it is used to examine associations between a binary predictor variable (e.g., gender, treatment group) and a continuous outcome variable (e.g., test scores, response time).
    • Commonly used in educational research (e.g., comparing the performance of two groups).
    • Often used in psychology to assess the relationship between a binary variable (e.g., presence/absence of a condition) and a continuous measure (e.g., anxiety levels).
    R
    # Sample data: binary variable (0s and 1s) and continuous variable
    binary_variable <- qbs$qboc
    continuous_variable <- qbs$Per.Game
    
    # Calculate the mean and standard deviation for both groups
    mean_0 <- mean(continuous_variable[binary_variable == 0])
    mean_1 <- mean(continuous_variable[binary_variable == 1])
    
    sd_0 <- sd(continuous_variable[binary_variable == 0])
    sd_1 <- sd(continuous_variable[binary_variable == 1])
    
    # Calculate the difference in means
    mean_diff <- mean_1 - mean_0
    
    # Calculate the proportion of 1s in the binary variable
    p <- mean(binary_variable)
    
    # Calculate the pooled standard deviation
    pooled_sd <- sqrt(((sd_0^2 + sd_1^2) / 2))
    
    # Calculate the biserial correlation
    biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))
    
    # Print the result
    print(biserial_correlation)
    
    # Output
    -0.2702969

    The biserial_correlation does indicate that there is an association between the number of RtP called in favor of the quarterback is inversely related to their race. However, the strength of the correlation may be a little week given how far -0.27 is from -1. Biserial correlations are sensitive to outliers, so let’s see what happens if we remove Ryan Fitzpatrick from the dataset… just for giggles… and re-run the correlation before moving on to testing to see if the correlation is statistically significant.

    R
    row_index_to_remove <- 56
    new_qbs <- qbs[-row_index_to_remove, ]
    
    # Sample data: binary variable (0s and 1s) and continuous variable
    binary_variable <- new_qbs$qboc
    continuous_variable <- new_qbs$Per.Game
    
    # Calculate the mean and standard deviation for both groups
    mean_0 <- mean(continuous_variable[binary_variable == 0])
    mean_1 <- mean(continuous_variable[binary_variable == 1])
    
    sd_0 <- sd(continuous_variable[binary_variable == 0])
    sd_1 <- sd(continuous_variable[binary_variable == 1])
    
    # Calculate the difference in means
    mean_diff <- mean_1 - mean_0
    
    # Calculate the proportion of 1s in the binary variable
    p <- mean(binary_variable)
    
    # Calculate the pooled standard deviation
    pooled_sd <- sqrt(((sd_0^2 + sd_1^2) / 2))
    
    # Calculate the biserial correlation
    biserial_correlation <- mean_diff / (pooled_sd * sqrt(p * (1 - p)))
    
    # Print the result
    print(biserial_correlation)
    
    # Output
    -0.1595576

    So that is interesting that the strength of the correlation decreased by removing the outlier. Since I want as much data as possible, and there was no error in the data for the outlier, I am going to use the entire dataset when testing to see if the correlation is statistically significant.

    The next step in this process is hypothesis testing. In this study:

    • Ho = There is no correlation between a quarterback’s race and the number of RtP calls made in his favor
    • Ha = There is a correlation between race and RtP calls

    The alpha for this test will be 0.05.

    R
    categorical_variable <- qbs$qboc
    continuous_variable <- qbs$Per.Game
    
    # Compute the Point-Biserial Correlation and perform the test
    correlation_test <- cor.test(categorical_variable, continuous_variable)
    
    # Check the p-value
    p_value <- correlation_test$p.value
    
    # Set your significance level (e.g., 0.05)
    alpha <- 0.05
    
    # Determine if the correlation is statistically significant
    if (p_value < alpha) {
      cat("The correlation is statistically significant (p-value:", p_value, ")\n")
    } else {
      cat("The correlation is not statistically significant (p-value:", p_value, ")\n")
    }
    
    # Output
    The correlation is not statistically significant (p-value: 0.6762716)

    There you have it, folks. According to the findings of our analysis, the NFL is doing a good job of minimizing bias in penalty calls.

    Summary

    In this phase of the EDA, I examined the relationship between a quarterback’s race and the number of RtP calls made in favor of the QB. I utilized scatter plots with trend lines to visualize potential relationships, employed a correlation matrix to identify the strength and direction of these relationships, checked for outliers to assess their impact on the relationship’s strength, ensured the use of the appropriate correlation technique to account for a categorical independent variable, and conducted hypothesis testing to determine the statistical significance of the correlation.

    Voyage Through Univariate Analysis: Charting the Solo Attributes of Roughing the Passer Penalties in the NFL

    Univariate analysis is a crucial step in data analysis that focuses on examining and summarizing the characteristics of a single variable or attribute from the dataset. Univariate analysis provides a foundation for understanding the characteristics of individual variables, which is essential for more advanced multivariate analyses and modeling. It helps identify patterns, outliers, and potential data quality issues, making it a crucial step in the data analysis pipeline.

    Python
    import pandas as pd
    # import and store the dataset
    qbs = pd.read_excel("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
    print(qbs.head())
    
    #Output
          Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
    0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
    1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
    2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
    3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
    4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   
    
       2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
    0     0    170      0.15      5557         0.45     361     0.069   
    1     0     94      0.18      3620         0.47     186     0.091   
    2     0    228      0.18      7840         0.52     542     0.076   
    3     0    149      0.15      4648         0.49     367     0.063   
    4     0     79      0.15      2719         0.44     201     0.060   
    
       Sack Per Att  Third Down %  qboc  
    0         0.065         40.00     0  
    1         0.051         29.41     0  
    2         0.069         39.02     0  
    3         0.079         26.09     0  
    4         0.074         33.33     0  

    First, we should look to see what the variable datatypes are and if there are any null or missing values. 

    Python
    qbs.info(verbose=True)
    
    #Output
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 66 entries, 0 to 65
    Data columns (total 26 columns):
     #   Column        Non-Null Count  Dtype  
    ---  ------        --------------  -----  
     0   Player        66 non-null     object 
     1   Total         66 non-null     int64  
     2   2009          66 non-null     int64  
     3   2010          66 non-null     int64  
     4   2011          66 non-null     int64  
     5   2012          66 non-null     int64  
     6   2013          66 non-null     int64  
     7   2014          66 non-null     int64  
     8   2015          66 non-null     int64  
     9   2016          66 non-null     int64  
     10  2017          66 non-null     int64  
     11  2018          66 non-null     int64  
     12  2019          66 non-null     int64  
     13  2020          66 non-null     int64  
     14  2021          66 non-null     int64  
     15  2022          66 non-null     int64  
     16  2023          66 non-null     int64  
     17  Games         66 non-null     int64  
     18  Per Game      66 non-null     float64
     19  Attempts      66 non-null     int64  
     20  Per 100 Att   66 non-null     float64
     21  Sacked        66 non-null     int64  
     22  Per Sack      66 non-null     float64
     23  Sack Per Att  66 non-null     float64
     24  Third Down %  66 non-null     float64
     25  qboc          66 non-null     int64  
    dtypes: float64(5), int64(20), object(1)

    The variables I will examine more closely are the total number of Roughing the Passer (RTP) calls per quarterback, the number of RTP calls per year for each quarterback, and the number of RTP calls per games played for their distribution, central tendency, and variance. I will be using Python and Jupyter Notebooks.

    Python
    table_stats = qbs.describe()
    print(table_stats)
    
    #Output
               Total       2009       2010       2011       2012       2013  \
    count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000   
    mean   18.090909   0.696970   0.833333   1.121212   1.257576   1.075758   
    std    12.505663   1.176301   1.452672   1.767355   1.774266   1.825550   
    min     2.000000   0.000000   0.000000   0.000000   0.000000   0.000000   
    25%     8.250000   0.000000   0.000000   0.000000   0.000000   0.000000   
    50%    15.500000   0.000000   0.000000   0.000000   0.000000   0.000000   
    75%    24.750000   1.000000   1.000000   2.000000   2.000000   1.000000   
    max    57.000000   5.000000   6.000000   8.000000   6.000000   8.000000   
    
                2014       2015       2016      2017  ...       2023       Games  \
    count  66.000000  66.000000  66.000000  66.00000  ...  66.000000   66.000000   
    mean    1.333333   1.454545   1.287879   1.30303  ...   0.030303   99.530303   
    std     1.842518   1.832878   1.698553   1.75385  ...   0.172733   53.915952   
    min     0.000000   0.000000   0.000000   0.00000  ...   0.000000   42.000000   
    25%     0.000000   0.000000   0.000000   0.00000  ...   0.000000   61.250000   
    50%     0.500000   1.000000   1.000000   0.00000  ...   0.000000   77.500000   
    75%     2.000000   2.000000   2.000000   2.00000  ...   0.000000  138.000000   
    max     7.000000   7.000000   6.000000   7.00000  ...   1.000000  253.000000   
    
            Per Game     Attempts  Per 100 Att      Sacked   Per Sack  \
    count  66.000000    66.000000    66.000000   66.000000  66.000000   
    mean    0.176818  3250.303030     0.573636  211.212121   0.083424   
    std     0.073801  2085.250348     0.241951  117.910594   0.034212   
    min     0.050000   289.000000     0.130000   26.000000   0.024000   
    25%     0.112500  1794.500000     0.390000  131.500000   0.058500   
    50%     0.165000  2315.500000     0.550000  170.000000   0.078000   
    75%     0.220000  4476.000000     0.740000  264.500000   0.107250   
    max     0.350000  9725.000000     1.240000  542.000000   0.202000   
    
           Sack Per Att  Third Down %       qboc  
    count     66.000000     66.000000  66.000000  
    mean       0.070091     31.143788   0.272727  
    std        0.016449     13.872813   0.448775  
    min        0.030000      0.000000   0.000000  
    25%        0.058000     25.000000   0.000000  
    50%        0.069000     30.770000   0.000000  
    75%        0.083500     39.755000   1.000000  
    max        0.103000     77.780000   1.000000  
    
    [8 rows x 25 columns]

    Well, that was easy. Using the .describe() function allows for the examination of the data set’s values for central tendency (mean in this case) and the spread (standard deviation) and dispersion (technically). Although all the information is present to observe the dispersion of the information, it may hard to conceptualize the shape without using a visualization help. Histograms can aid in our observations.

    Python
    qbs.hist(figsize=(10, 15))

    Adjust the figure size to be able to view all the histogram outputs more clearly. The distributions for Per Game and Per 100 Attempts look nearly normal, so that will allow us to use some parametric tests for analysis.

    Another visual that is helpful to see the distribution is a boxplot (box-and whisker plot). This time let us just look at just the Per Game and Per 100 Attempts.

    Python
    qbs.boxplot(column=['Per Game', 'Per 100 Att'], figsize=(10,10))

    The wonderful thing about boxplots is that they make it easy to identify outliers – observations that are outside the whiskers (either top or bottom). You can also easily see the centrality and spread of the data.  

    One additional variable that we should look at is the ‘qboc’. Did you notice that there was an interesting distribution when the histograms plotted above? Let’s take a closer look.  

    Python
    qbs['qboc'].hist(figsize=(10, 15))

    As you can see there are only two values for this variable. And this makes sense since we are categorizing the quarterbacks based on if they are (1) or are not (0) a quarterback of color. I am going to do a bit of foreshadowing, but this means that if we wanted to do any sort of predictive analysis, we need to think about some additional models beyond regular linear regression, logistic regression to be specific… but that is a post for another day. For now, I have identified the variables we can use for some multivariate analysis:

    • Per Game
    • Per 100 Attempts
    • qboc

    In this data exploration process, univariate analysis was applied to understand and summarize the characteristics of individual variables, specifically focusing on those related to Roughing the Passer (RTP) calls. The dataset was examined using Python and Jupyter Notebooks. The summary statistics and visualizations were generated to gain insights into the central tendency, dispersion, and distribution of the data.

    The analysis revealed that variables such as “Per Game” and “Per 100 Attempts” exhibited nearly normal distributions, making them suitable for parametric tests. Additionally, the “qboc” variable, which categorizes quarterbacks based on their ethnicity, showed a binary distribution, indicating potential utility in predictive modeling.

    This initial exploration sets the stage for further multivariate analysis and modeling, offering a foundation for more in-depth investigations into the relationships between these variables and RTP calls in NFL quarterbacks.

    Cruising the Data Landscape: Exploring the Fundamentals of Data Exploration

    Recap

    Well, well, well! Looks like we’ve embarked on quite the adventure here! Our first task was to get to the bottom of the burning question: “Do quarterbacks of color get the short end of the stick when it comes to roughing the passer penalties compared to their fair-skinned counterparts?” Exciting stuff, huh?

    The next step in our journey involved gathering the all-important data. We needed to make sure we were dealing with top-notch, premium-quality data. None of that shady, questionable stuff for us! We demanded data that was reliable, traceable, and oh-so-accurate. Because in the world of data science, it’s all about that golden rule: “garbage in, garbage out!” Can’t have our analysis going awry now, can we?

    For this particular project, we scoured the depths of NFL Penalties and even tapped into the vast knowledge reserves of the mighty Wikipedia. We left no stone unturned, my friend!

    Now, onto the thrilling next stage of our expedition – data cleaning and reshaping! Brace yourself, because this is where things get spicy. We had to create a nifty variable to indicate the race of those quarterbacks. Easy-peasy, right? Well, almost. Turns out, we stumbled upon a little naming convention conundrum that resulted in a sneaky duplicate value. But fear not! With our eagle-eyed attention to detail, we swiftly detected and corrected that pesky error. Crisis averted!

    And now, dear explorer, we venture forth to the exhilarating realm of Data Exploration. Excited? You should be! There’s so much more to discover and uncover on this grand data-driven expedition! So, buckle up and get ready for the thrill ride ahead! Let’s dive into the vast ocean of information and see what fascinating insights await us!

    Data exploration is an essential step in the data analysis process. It allows us to gain an initial understanding of the dataset by examining its general properties and characteristics. By delving into the data, we can uncover valuable insights.

    One of the key aspects of data exploration involves assessing the size of the dataset. This includes understanding the number of observations or records and the number of variables or features present in the dataset. Knowing the scope of the dataset helps us gauge its comprehensiveness and potential usefulness for our analysis.

    Furthermore, examining the data types is crucial in data exploration. By identifying the types of data contained in the dataset, such as numerical, categorical, or textual, we can determine the appropriate statistical techniques or visualization methods to apply during analysis. This enables us to make accurate interpretations and draw meaningful conclusions from the data.

    Another crucial aspect is the identification of initial patterns or trends within the dataset. Exploring the data can reveal inherent relationships, correlations, or anomalies that may exist. By uncovering these patterns, we can develop hypotheses, generate insights, and pose relevant research questions for further investigation.

    You may have noticed that I talk about using Python a lot in my work. I actually prefer to use in my day job since it seems to get the most use and support from other data professionals. However, for this portion of my project, I’m going to be using R instead. It has been some time since I’ve used R (maybe before I even finished my master’s program) so full discloser, I’m pretty sure I had to look up most of that I’m doing to refresh my memory. But since it is a language, if you don’t use it, you lose it, so I’m going to shake things up a bit and revisit R.

    Data Exploration

    The first things I need to do is load in any packages I need and my dataset. I am loading the ‘cleaned_qbs.xlsx’ Excel file that I created during the data cleaning process.

    R
    # import and store the dataset
    library(openxlsx)
    
    qbs = read.xlsx("https://myordinaryjourney.com/wp-content/uploads/2023/09/cleaned_qbs.xlsx")
    
    # Check to make sure the data loaded correctly
    head(qbs, n=5)
    
    #Output
    A data.frame: 5 × 26
    Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
    <chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
    1	A.Dalton	25	0	0	4	1	1	5	0	1	...	0	170	0.15	5557	0.45	361	0.069	0.065	40.00	0
    2	A.Luck	17	0	0	0	5	3	4	1	1	...	0	94	0.18	3620	0.47	186	0.091	0.051	29.41	0
    3	A.Rodgers	41	3	3	3	5	2	2	5	4	...	0	228	0.18	7840	0.52	542	0.076	0.069	39.02	0
    4	A.Smith	23	1	1	1	1	1	6	4	1	...	0	149	0.15	4648	0.49	367	0.063	0.079	26.09	0
    5	B.Bortles	12	0	0	0	0	0	1	2	1	...	0	79	0.15	2719	0.44	201	0.060	0.074	33.33	0
    

    Looking over the raw data, it is hard to see any particular patterns in the data. However, a few data points do jump out as they are either larger or smaller than the data around the point. For example, the number of RTP penalties called each year numbers in the single digits with the exception of 2020 for the quarterback Josh Allen. That year he received 11 RTP called in his favor. Another example is the number of sacks recorded against Taysom Hill. Most of the QBs have a sack count in the triple digits, while only a few have double digit sack counts and Hill has the lowest of any QB in the data set. The next thing I want to do is check how many records are in my table and the total number of variables. It is helpful to know what your sample size is so you can select the correct analytic approach later on.

    R
    nrow(qbs)
    
    # Output
    66
    
    length(qbs)
    
    #Output
    26

    The next step is to understand the data types of each of the variables. Luckily, when the head of the table was printed, it also included that data types for each of the variables. In our cleaned_qbs table, we have one variable with a character data type (Player) and 25 variables with the data type of ‘dbl’, otherwise known float.

    R
    #Output
    A data.frame: 5 × 26
    Player	Total	2009	2010	2011	2012	2013	2014	2015	2016	...	2023	Games	Per.Game	Attempts	Per.100.Att	Sacked	Per.Sack	Sack.Per.Att	Third.Down.%	qboc
    <chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>

    Next Steps

    Now that we understand how much and what kind of data we have, we can develop a game plan on what analytic techniques we can take in the next step. With Univariate Analysis, Bivariate Analysis, and Multivariate Analysis we will begin to explore the data and any relationships between the variables that might exist. This will get us closer to a possible answer to our original question.

    First the Idea, then the Data: Navigating the Depths of Information

    When working on a project that requires data, it is essential to consider the various sources and formats in which the information may be available. Often, the desired data cannot be found in one single location, requiring careful compilation from multiple sources. This process can be time-consuming and challenging, as each source may present its own set of complications and intricacies.

    While some data may be readily accessible through online databases, APIs, or official government reports, it may not always be presented in the most convenient or usable form. For instance, a dataset could be in a raw, unstructured format, making it difficult to analyze and extract meaningful insights. In such cases, data manipulation techniques are necessary to transform the data into a suitable format.

    Data manipulation involves various tasks such as cleaning, merging, and reshaping the data. Cleaning the data involves removing any inconsistencies, errors, or missing values. This step ensures that the data is accurate and reliable for analysis. Merging data from multiple sources requires aligning and combining different datasets based on common attributes or keys. This consolidation allows for a comprehensive and holistic view of the data.

    Reshaping the data involves rearranging its structure to fit the desired analysis or visualization needs. This can include tasks such as pivoting, aggregating, or disaggregating the data. By reshaping the data, researchers and analysts can uncover patterns, trends, and relationships that might otherwise go unnoticed.

    In the given scenario, the QB Penalty Stats table lacked information on the racial background of the players. While this data may not be readily available in the original source, alternative sources can be explored. In this case, a reliable and well-sourced list of black quarterbacks could be found on Wikipedia. Although caution is advisable when using information from Wikipedia, the aforementioned table demonstrates reliability in this particular case.

    Working with data often goes beyond simply retrieving information from a single source. It requires diligent gathering, cleaning, merging, and reshaping of data to ensure accuracy and usability. With the right techniques and approaches, researchers can unlock the full potential of data and derive valuable insights.

    Embarking on an Epic Journey

    I now have a list of all QBs from 2009 until present (that’s 2023 for those of you that stumble upon my ramblings in the future) in the qb_rtp table. I also have a list of black QBs in the qbs_oc table. Sometimes I find it can be faster to do some of the data cleaning work in the native application. In this case, I needed to create a column in the black_qbs table that I could use to compare the qb_rtp[‘Player’] column in the other table. Luckily Excel’s autofill feature made this particularly easy. I inserted a blank column next to the [‘Quarterback’] called [‘short_name’]. After typing a few examples, I selected the “Autofill” option in the Data drop down menu, and voilà, I had the new column that I can use for comparison.

    Starting to Use Notebooks

    I haven’t used a Jupyter Notebook to code in a very long time. I usually use the Spyder IDE as I can see all my variables’ values at the same time. But I’m going to give it a shot. You can get the entire notebook here. The first step is to import the different libraries that I think I’ll need.

    Python
    import pandas as pd
    import openpyxl

    Next, I will import the data sets.

    qbs = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-1.xlsx', engine='openpyxl')
    qbs_oc = pd.read_excel('https://myordinaryjourney.com/wp-content/uploads/2023/09/QB_Table-2.xlsx', engine='openpyxl')

    Make sure to verify the data has been loaded by viewing a few rows from each dataframe. If you are wondering what the ‘\n’ means, it is the regular expression for a new line. It creates some extra space between the table headings and the table data to make them more readable. Dataquest has a great quick guide and cheat sheet available here.

    print('Quarterback Stats table:\n', qbs.head(), '\n')
    print('Quarterback of Color table:\n', qbs_oc.head())
    
    #Output
    Quarterback Stats table:
           Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
    0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
    1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
    2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
    3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
    4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   
    
       2022  2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
    0     2     0    170      0.15      5557         0.45     361     0.069   
    1     0     0     94      0.18      3620         0.47     186     0.091   
    2     3     0    228      0.18      7840         0.52     542     0.076   
    3     0     0    149      0.15      4648         0.49     367     0.063   
    4     0     0     79      0.15      2719         0.44     201     0.060   
    
       Sack Per Att  Third Down %  
    0         0.065         40.00  
    1         0.051         29.41  
    2         0.069         39.02  
    3         0.079         26.09  
    4         0.074         33.33  
    
    [5 rows x 25 columns] 
    
    Quarterback of Color table:
              Quarterback    short_name Years active  \
    0      Fritz Pollard     F.Pollard    19201926   
    1        Joe Lillard     J.Lillard    19321933   
    2  George Taliaferro  G.Taliaferro    19501955   
    3     Marlin Briscoe     M.Briscoe        1968*   
    4       James Harris      J.Harris    19691981   
    
                                                    Team  
    0  Akron Pros, Milwaukee Badgers, Hammond Pros, P...  
    1                                  Chicago Cardinals  
    2  New York Yanks, Dallas Texans, Baltimore Colts...  
    3                                     Denver Broncos  
    4  Buffalo Bills, Los Angeles Rams, San Diego Cha...  

    Smushing the Data Together

    Now that the two initial datasets have been read into Pandas dataframes, it’s time to smush them together in a way that makes them useful for further analysis. The first step I take is adding a new variable to the Quarterback Stats Table (qbs) called ‘qboc’. I set all the values to 0 for right now.

    qbs['qboc'] = 0

    Once again check to make sure the variable was added to the dataframe as expected.

    print(qbs.head())
    
    #Output
    Player  Total  2009  2010  2011  2012  2013  2014  2015  2016  ...  \
    0   A.Dalton     25     0     0     4     1     1     5     0     1  ...   
    1     A.Luck     17     0     0     0     5     3     4     1     1  ...   
    2  A.Rodgers     41     3     3     3     5     2     2     5     4  ...   
    3    A.Smith     23     1     1     1     1     1     6     4     1  ...   
    4  B.Bortles     12     0     0     0     0     0     1     2     1  ...   
    
       2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
    0     0    170      0.15      5557         0.45     361     0.069   
    1     0     94      0.18      3620         0.47     186     0.091   
    2     0    228      0.18      7840         0.52     542     0.076   
    3     0    149      0.15      4648         0.49     367     0.063   
    4     0     79      0.15      2719         0.44     201     0.060   
    
       Sack Per Att  Third Down %  qboc  
    0         0.065         40.00     0  
    1         0.051         29.41     0  
    2         0.069         39.02     0  
    3         0.079         26.09     0  
    4         0.074         33.33     0  
    
    [5 rows x 26 columns]

    The ‘qboc’ column was added to the end of the dataframe. The next step I take is to make the qbs[‘Player’] and the qbs_oc[‘short_name’] columns the indexes of the dataframes. This will speed up the comparison process. Although it wouldn’t be noticeable with these datasets, if you have a dataset will millions of records, it will bog down the system… ask me how I know.

    qbs = qbs.set_index('Player')
    qbs_oc = qbs_oc.set_index('short_name')

    Now this is the really fun part. All I need to do now is compare the indexes of the two dataframes. If the indexes match, then I can assign a ‘1’ to the qbs[‘qboc’] variable to signify that that quarterback is indeed a quarterback of color.

    qbs.loc[(qbs.index.isin(qbs_oc.index)), 'qboc'] = 1

    This one line of code is pretty cool, I think. I want to dig into what this line actually does. I know this isn’t a Python coding post, but explaining will only take a minute and it unpacks some interesting concepts.

    1. qbs.index and qbs_oc.index specify that we are going to be doing something with the indexes of the two dataframes.
    2. The .isin is a function that returns a Boolean mask. So what (qbs.index.isin(qbs_oc.index) does is takes the index values of the qbs_oc dataframe and look to see if those values are in the qbs dataframe index. If they are, then the Boolean value = True, if not, the value = False. You can also save that part of the code to a variable and use over and over in other parts of your project, but in this instance, we don’t need to.
    3. The qbs.loc is locating all the the indexes in the qbs dataframe where the mask value = True.
    4. ‘qboc’ specifies the column we want to assign our value to. In this case a 1 to signify that the quarterback is a person of color.

    The qbs.loc() designates the dataframe we are intending to modify; (qbs.index.isin(qbs_oc.index)) designates the row we want to work with; and ‘qboc’ designates the column.

    print(qbs.head())
    
    #Output
               Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
    Player                                                                  ...   
    A.Dalton      25     0     0     4     1     1     5     0     1     1  ...   
    A.Luck        17     0     0     0     5     3     4     1     1     0  ...   
    A.Rodgers     41     3     3     3     5     2     2     5     4     0  ...   
    A.Smith       23     1     1     1     1     1     6     4     1     5  ...   
    B.Bortles     12     0     0     0     0     0     1     2     1     7  ...   
    
               2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
    Player                                                                      
    A.Dalton      0    170      0.15      5557         0.45     361     0.069   
    A.Luck        0     94      0.18      3620         0.47     186     0.091   
    A.Rodgers     0    228      0.18      7840         0.52     542     0.076   
    A.Smith       0    149      0.15      4648         0.49     367     0.063   
    B.Bortles     0     79      0.15      2719         0.44     201     0.060   
    
               Sack Per Att  Third Down %  qboc  
    Player                                       
    A.Dalton          0.065         40.00     0  
    A.Luck            0.051         29.41     0  
    A.Rodgers         0.069         39.02     0  
    A.Smith           0.079         26.09     1  
    B.Bortles         0.074         33.33     0 

    As you can see, A.Smith now has a 1 in the qboc column. However, the A.Smith in the qbs data set is actually Alex Smith not Akili Smith who is the actual quarterback referenced in the qbs_oc dataset. This illustrates the importance of going through your data to ensure accuracy. Luckily these datasets are small enough to go through to spot any errors.

    double_check = qbs.loc[(qbs['qboc'] == 1)]
    print(double_check.index)
    
    #Output
    Index(['A.Smith', 'C.Kaepernick', 'C.Newton', 'D.Prescott', 'D.Watson',
           'G.Smith', 'J.Brissett', 'J.Campbell', 'J.Freeman', 'J.Hurts',
           'J.Winston', 'K.Murray', 'L.Jackson', 'M.Vick', 'P.Mahomes',
           'R.Griffin III', 'R.Wilson', 'T.Bridgewater', 'T.Taylor'],
          dtype='object', name='Player')
    
    qbs['qboc']['A.Smith'] = 0
    print(qbs)
    
    #Output
             Total  2009  2010  2011  2012  2013  2014  2015  2016  2017  ...  \
    Player                                                                ...   
    A.Smith     23     1     1     1     1     1     6     4     1     5  ...   
    
             2023  Games  Per Game  Attempts  Per 100 Att  Sacked  Per Sack  \
    Player                                                                    
    A.Smith     0    149      0.15      4648         0.49     367     0.063   
    
             Sack Per Att  Third Down %  qboc  
    Player                                     
    A.Smith         0.079         26.09     0  
    
    [1 rows x 25 columns]

    Next Steps

    Get ready for some quarterback madness in the upcoming post! We’ll be diving into our fancy new quarterbacks table and unleashing some wild analysis on the gridiron.