The Journey

The First Detour

Once upon a time, in the realm of weather-related nerdiness, I embarked on a quest to decipher the secrets of changing weather patterns. Armed with my mighty keyboard and a burning hatred for sweltering summers, I planned to uncover the truth about my local area’s climate evolution. You see, summer and I have never been the best of friends. The scorching heat and suffocating humidity make me cringe harder than a cat stuck in a cucumber maze. So, I figured, why not dive into the delightful world of data and investigate if there’s any hope for an early arrival of autumn? I dubbed it my “Project Meteorological Marvel.”

My cunning plan involved sifting through decades of weather records, gathering juicy tidbits on how temperatures have tortured us poor mortals over the years. I wanted to spot trends, make dazzling graphs, and perhaps even predict when autumn would grace us with its blessed presence. Oh, how I yearned for a reliable sign that summer’s reign of terror would soon be over! Of course, this was no ordinary undertaking. I needed a trustworthy data source, and what better place to turn to than the National Oceanic and Atmospheric Administration (NOAA)? If you can’t trust the NOAA to provide accurate historical weather data, well, I guess we’re all doomed!

Now, I must confess, I had no intention of becoming a weather forecaster during this escapade. That’s a whole different level of sorcery reserved for the truly brave and slightly crazy souls. No, my friends, my mission was solely to unravel the mysteries of the past, not predict the future. So, off I went, armed with my web-scraping skills and a fervent desire to put an end to endless summers. And thus, my epic journey into the realm of weather data began… but did it?

Well, it seems that once people discover your supreme data-crunching powers, they start throwing project ideas at you like confetti at a parade. Take my poor, football-obsessed husband for example. He came up with the brilliant notion of analyzing if there’s any connection between a quarterback’s race and the number of times they get a sweet, sweet roughing the passer call in their favor. And as if that wasn’t enough, I thought, why not spice it up even more and explore if the defender’s race also plays a role in how many roughing the passer flags rain down upon them? Heck, let’s even toss in the officials’ race for good measure. Who knew the football field could have so many hidden layers of sociology and statistics? But hey, I’ll play along and start with quarterbacks for now. Let the mind-boggling journey begin! Just don’t blame me if we end up in a statistical black hole of absurdity.

At first, I was going to look at NCAA football statistics, given that the sample size would be much larger than for the NFL. However, I couldn’t find a good source for the data to either download or extract; it just doesn’t seem like the NCAA collects that data down to the player level. As luck would have it, I was able to find a source for NFL penalty data. The aptly named NFL Penalties is a site dedicated to capturing penalty data so that users can settle disputes “over a player and their frequent ability to get away with murder, or not.” The site’s author does a good job of articulating problems with the data and any mitigation actions taken. Ultimately, the data on the site is provided by nflfastR. 1

Now that I’ve talked about the general concept and the search for a data source, here are my next steps:

  1. Collect Roughing the Passer (RTP) data for the quarterback from NFL Penalties.
  2. Collect the relevant biographical data on each of the quarterbacks.
  3. Use Python and relevant libraries such as Pandas2 and matplotlib3 to perform data cleaning, exploration, univariate and bivariate analysis, and visualizations.
  4. Publish the findings along with the documented methodology.
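The steps above can be sketched as a small pandas workflow. This is only an illustrative outline, not the real pipeline: the player names, counts, and column names (`rtp_drawn`, `race`) are placeholders I invented, since the actual NFL Penalties export format isn’t shown here.

```python
# Hypothetical sketch of steps 1-4; all data and column names are
# placeholders, not the actual NFL Penalties export format.
import pandas as pd
import matplotlib.pyplot as plt

# Steps 1-2: stand-ins for the scraped penalty counts and QB bios
rtp = pd.DataFrame({
    "player": ["QB A", "QB B", "QB C", "QB D"],
    "rtp_drawn": [12, 5, 9, 3],
})
bios = pd.DataFrame({
    "player": ["QB A", "QB B", "QB C", "QB D"],
    "race": ["White", "Black", "White", "Black"],
})

# Step 3: merge, then aggregate for a simple bivariate look
df = rtp.merge(bios, on="player", how="inner")
by_race = df.groupby("race")["rtp_drawn"].agg(["sum", "mean"])

# Step 4: a basic visualization of the aggregated result
by_race["mean"].plot(kind="bar", ylabel="Mean RTP calls drawn")
plt.tight_layout()
```

In practice the merge key would need care (player names are rarely clean join keys across sources), and raw counts would have to be normalized by snaps or dropbacks before any comparison means much.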

I’m not sure how long this will take me; it could be a day, could be weeks. Either way, check back often for updates on the project’s progress.

  1. Carl S, Baldwin B (2023). nflfastR: Functions to Efficiently Access NFL Play by Play Data. ↩︎
  2. The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.1.0). Zenodo. ↩︎
  3. J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007. ↩︎
Step One – Select Relevant Projects

So, the first step in building my data science portfolio was to “Select Relevant Projects,” at least according to ChatGPT. After thinking about this for some time, I decided I was wasting my time trying to think of projects, so back to ChatGPT I went and provided the following prompt:

Can you suggest some relevant data science projects to start with?

The next set of answers really started to provide me with some direction. If you are not an industry expert in a particular field, it can be difficult to see the possibilities in all the data that is available out there in the world. At work, I know what data I have available to me, and I can visualize in my mind’s eye the questions my clients haven’t even thought to ask. This way, I can proactively dig into my data and tease out details. Presenting the answers to the client before they even know to ask the question tends to make me look like a rockstar. However, coming up with practice projects and problems without being a subject matter expert tends to give me a bit of “writer’s block,” so to speak. Asking ChatGPT for help refined my focus. From the list below, I can now take each topic and develop it into an individual project to showcase my skills. These are the suggestions and explanations it provided:

  1. Exploratory Data Analysis (EDA) on a Dataset:
    Conduct a thorough analysis of a dataset, including data cleaning, visualization, and deriving meaningful insights. You can choose a dataset related to a topic of interest, such as finance, healthcare, or sports.
  2. Predictive Modeling with a Classic Dataset:
    Work with well-known datasets like the Iris dataset, Boston Housing dataset, or Titanic dataset. Build predictive models using techniques like linear regression, decision trees, or logistic regression.
  3. Natural Language Processing (NLP) Project:
    Analyze and extract insights from text data. You can start with sentiment analysis on Twitter data or build a simple chatbot using NLP libraries like NLTK or spaCy.
  4. Image Classification:
    Implement an image classification project using deep learning. The MNIST dataset for digit recognition or the CIFAR-10 dataset for object recognition are good starting points.
  5. Time Series Analysis:
    Work with time-series data to forecast trends or anomalies. You can use financial market data or weather data to build forecasting models.
  6. Customer Churn Prediction:
    Analyze customer data to predict customer churn for a business. This is a common problem in industries like telecommunications and subscription-based services.
  7. Recommendation System:
    Build a recommendation system, like the ones used by e-commerce platforms or streaming services. Collaborative filtering and content-based methods are popular approaches.
  8. Healthcare Data Analysis:
    Analyze healthcare data to identify trends, patient outcomes, or factors affecting disease prevalence. You can use datasets related to COVID-19 or electronic health records.
  9. A/B Testing:
    Design and analyze an A/B test for a hypothetical or real-world scenario. Determine the impact of changes in a website or app on user behavior.
  10. Fraud Detection:
    Build a fraud detection model using transaction data. This is valuable in the finance and e-commerce sectors.
  11. Social Media Analysis:
    Analyze social media data to gain insights into user behavior, sentiment, or trends. Use APIs to collect data from platforms like Twitter or Reddit.
  12. Geospatial Data Analysis:
    Work with geospatial data to solve location-based problems. You can analyze transportation data, plot maps, or predict real estate prices.
  13. Market Basket Analysis:
    Analyze transaction data to discover associations between products and make product recommendations.
  14. Text Generation with GANs:
    Experiment with Generative Adversarial Networks (GANs) to generate realistic text data, such as generating paragraphs in the style of famous authors.
  15. Time Series Anomaly Detection:
    Implement anomaly detection techniques on time series data to identify unusual patterns or outliers.
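To make the first suggestion concrete, here is a minimal EDA sketch of the clean-then-summarize loop it describes. The tiny dataset and its column names (`team`, `points`) are invented purely for illustration.

```python
# A minimal EDA sketch for suggestion 1; the dataset is synthetic and
# the columns are invented for illustration.
import pandas as pd

raw = pd.DataFrame({
    "team": ["A", "B", "A", None, "B"],
    "points": [21, 14, None, 28, 35],
})

# Cleaning: drop rows missing the grouping key, impute missing numerics
clean = raw.dropna(subset=["team"]).copy()
clean["points"] = clean["points"].fillna(clean["points"].median())

# Deriving insight: per-team summary statistics
summary = clean.groupby("team")["points"].describe()
print(summary[["count", "mean"]])
```

Even a skeleton like this exercises the core portfolio skills the list keeps coming back to: handling missing data, choosing an imputation strategy, and summarizing by group before reaching for anything fancier.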

Exploring the World of Data: Welcome to My Data Science Portfolio

Are you a new data professional trying to break into the data science space, or are you presently working in the field? Although I currently work as a data professional and have a master’s degree in data analytics, sharing my work with anyone outside my current employer or clients is an industry no-no. Intellectual property and non-disclosure agreements make the lawyers frown upon such actions. The solution: “Build a data science portfolio,” they say. I wish it were that easy. No one teaches you how to do that or what a data science portfolio even means. Questions like “Where do I start?” or “What do I include?” may nag at you and keep you from even getting started. Just take a look at the /datascience subreddit and you will find threads like What makes a good personal project – from the perspective of a hiring manager, People who make hiring decisions: what do you want to see in a portfolio?, and How to Build a Data Science Portfolio. As you can see, I was obviously wondering the same things, since I did the searches too. However, I tend to learn better by doing, so I’m planning on “winging it,” sort of.

What do I mean by winging it, you may wonder? I’m all in favor of using the tools that are already available, so I took to ChatGPT for some guidance. My first prompt was simple: “how to build a data scientist portfolio.” True to form, ChatGPT did not disappoint, and its advice was simple and concise:

  • Select Relevant Projects
  • Clean and Document Your Code
  • Create a Portfolio Website
  • Project Descriptions
  • Include Jupyter Notebooks
  • Visualize Data Effectively
  • Highlight Your Skills
  • Include a Blog Section
  • Add a Resume or CV
  • Engage in Open-Source Contributions
  • Seek Feedback
  • Update Regularly
  • Network and Share

Consulting ChatGPT will continue throughout the process.

To help manage the moving parts in this process, I’m relying on Atlassian’s Jira software to build a roadmap that will not only manage the process of standing up my portfolio but also keep track of the progress of my individual projects.

Jira Software is the #1 agile project management tool used by teams to plan, track, release and support world-class software with confidence. 

Welcome to Jira Software | Atlassian

As for sharing my work with the world, you’re here, so it must be working. As I work through the points laid out by ChatGPT above, I will document my journey and share my thoughts, successes, and frustrations here. Follow along to see the portfolio grow.