Newbie data scientists often ask me about portfolio project examples, on LinkedIn or elsewhere in my professional life. They ask,
“I did projects in college, and I also did a Kaggle competition. Is that enough?”
They mean – is that enough to learn applied data science, and be ready for a job that requires the kind of problem-solving data scientists do?
The answer is, unfortunately, “No” – those experiences are not really portfolio project examples. A portfolio project that counts towards preparing you for problem-solving in a real data science job is one where you did it almost completely independently. In other words, you did all of these four steps pretty much independently:
1. You came up with a research question or aim all by yourself in a domain that interests you.
2. You found a dataset you could use to answer it.
3. You did a scientifically acceptable job of answering the research question using data management and statistical skills.
4. You documented this experience – introduction, methods, results, discussion – in a white paper or a blog post (or series of blog posts).
This documentation – step 4 – is technically the “portfolio project” – and is where you can post code on GitHub, or display a dashboard or notebook, or showcase whatever technological output you got from your experience.
I used to see portfolio projects in data science as “the great equalizer”. That is because – with open source software like R and Python, and with datasets easily downloadable online – anyone can do one. Better yet, anyone can actually publish the results of their portfolio project examples on a blog post for the world to access and see.
However, I neglected to realize a practical problem with this approach. Creating and posting portfolio project examples is not that easy for a newbie in data science. The newbie data scientist may have learned about all of those four steps independently, but does not know how to put them together into a cohesive project they can showcase. Even learners who came to me after studying in college classes and boot camps see the same problem of fragmentation in data science education programs.
Mentoring is the Solution
If making portfolio project examples is your problem, mentoring is the solution. I am among the different data science mentors you can hire to help you individually – but for doing data science portfolio projects, it’s better to be in a group mentoring program like my “rebrand” online group mentoring program for health data scientists.
In this blog post, I will give a few portfolio project examples you could do if you participated in our mentoring program.
Portfolio Project Examples 1: Laboratory Value Correlations
Challenge
I had a learner who was working on some data about cytokines, which are typically described as “chemical messengers” in our central nervous system. Those of you who study cytokines know that they are often measured with a PCR test, and the resulting values are continuous variables. There are “inflammatory cytokines” and “anti-inflammatory cytokines”, and just by me telling you this, you can already guess that the different cytokine measurements are correlated with each other.
Having huge datasets of “correlated continuous variables” is a big problem in Big Data. Since they are all correlated, which ones should you use in your model? Or should you make an index of them? Or maybe classify them? What do you actually do with all these correlated measurements? Worse, what steps do you take to figure it out, and get endorsement along the way from your boss (who doesn’t understand any of this)?
She was the main statistician on her team, and was asked to “prepare” the cytokine data – or maybe, “analyze” the cytokine data? The researchers with her were surgeons, so they had no idea how to do any research (those four steps above). She basically had to “do something” with the cytokine data to make sense out of it.
Solution
So, to give her the skills she needed to figure that puzzle out, I recommended that she do a portfolio project with the NHANES Laboratory Data available in the 2017-2018 dataset. The laboratory data available includes lipid results (HDL, LDL, triglycerides) and other measurements that have a lot of biological evidence behind them you can study in the scientific literature.
By contrast, cytokines are basically “Twitter of the CNS”, so I can’t see why analyzing them would do any good – but it continues to be trendy for some reason. My thought was that the learner could practice by using continuous variables from real people that we actually understand, and figure out how to visualize them, categorize them, and describe them. Then, she can apply her skills to these cytokine data, which – if you believe the extensive scientific literature on cytokine analysis – will probably make little sense when she’s done. But since she practiced on lipids, which actually do make sense and have an evidence base, she’ll understand what it is “supposed” to look like when you can actually interpret the relationships between the correlated continuous measurements.
Project Details
The way my mentoring program works is that there are four self-paced phases. Once you have completed Phase 4, you will have completed all four steps above for a research question. You will have made several posts on a public blog to showcase your work. The blog is the home to your portfolio projects, and links to your GitHub account. Then, after you complete Phase 4, you can continue in the program doing follow-up analyses or fancier visualizations on the same dataset, and continue to post them on your blog to add to your portfolio project.
For this learner, I would try to have her produce output in the form of box plots and scatter plots to help us figure out the best way to characterize the relationship between these different laboratory values. Then, she could make several blog posts showcasing how she selected the lab values she analyzed, how she curated her data, what visualizations ultimately “spoke” to her and why, and what she finally decided was the best way to present the data so we understand the relationships between the variables.
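Before making any plots, it can help to put a number on how strongly each pair of lab values moves together. Here is a minimal sketch of that first step in Python, using only the standard library. The values are synthetic stand-ins I generated for illustration (they are not NHANES data), and the function names `pearson_r` and `correlation_table` are my own hypothetical helpers, not part of any package.

```python
import math
import random
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Synthetic stand-in values (NOT real NHANES data), seeded for repeatability.
rng = random.Random(42)
triglycerides = [rng.gauss(150, 40) for _ in range(200)]
# HDL is simulated to fall as triglycerides rise, plus random noise.
hdl = [70 - 0.12 * tg + rng.gauss(0, 6) for tg in triglycerides]
# LDL is simulated independently of the other two columns.
ldl = [rng.gauss(115, 30) for _ in range(200)]

labs = {"triglycerides": triglycerides, "hdl": hdl, "ldl": ldl}
table = correlation_table(labs)
for (a, b), r in table.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

With real laboratory data, the pairs with the largest absolute correlations are the ones worth looking at first in a scatter plot.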
Portfolio Project Examples 2: Pharmaceutical Adverse Events
Challenge
It’s not unusual for me to meet a new SAS learner on LinkedIn who wants to make a portfolio project example having to do with data from “Big Pharma”. As we know, pharmaceutical companies are famously jealous guardians of their proprietary data. Learners may erroneously believe that they cannot get experience with data to impress “Big Pharma” without actually having worked for “Big Pharma”.
Solution
But the fact that we cannot access private data from pharmaceutical companies should not stop us from doing pharmacy-related analyses. Just about every country has an adverse event reporting system for medications that are widely in use. In the US, we have the FDA Adverse Event Reporting System (FAERS), which publishes many resources online, including a dashboard to help you navigate their data.
Imagine a SAS learner in my program wanted to do a portfolio project on pharmaceutical data. They could start with this FAERS data. In the last example, I talked about lipids – now, let’s do an example with lipid-lowering drugs, like Lipitor. I looked up some common ones online, and entered that into the search engine on the dashboard. See what I found already?
As you can see in the annotated screen shot I made, for some reason, there was this huge spike in adverse events for these drugs in 2021 and 2022!
Project Details
As you can see by my little example above, we already have a potential portfolio project for the newbie pharma data scientist – figuring out why there is a spike in adverse events for lipid-lowering drugs for 2021 and 2022! Of course, the person doing the project would have to download and process the raw data from FAERS. That is possible with the FDA data – but the dashboard can allow you to do preliminary analyses and figure out what data you want, and what project you want to do. In this case, I would recommend the learner use SAS ODA for practice, because Big Pharma greatly values SAS skills.
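Once the raw reports are downloaded, a first pass at the "spike" question could be as simple as counting reports per year and flagging the years that sit well above the typical count. The sketch below shows that idea in Python (the same logic carries over to SAS). Everything specific in it is an assumption on my part: the records are made up, the field names `drug` and `received_date` are hypothetical rather than the actual FAERS column names, and the 1.5-times-the-median cutoff is an arbitrary choice for illustration.

```python
from collections import Counter
from statistics import median


def reports_per_year(records, date_field="received_date"):
    """Count reports by the 4-digit year at the start of a YYYYMMDD string."""
    return Counter(
        rec[date_field][:4] for rec in records if rec.get(date_field)
    )


def spike_years(counts, ratio=1.5):
    """Return the years whose count exceeds `ratio` times the median count."""
    baseline = median(counts.values())
    return sorted(year for year, n in counts.items() if n > ratio * baseline)


def fake_reports(year, n):
    """Build n made-up report records dated in the given year."""
    return [{"drug": "example statin", "received_date": f"{year}0101"}] * n


# Made-up records for illustration only (NOT real FAERS data).
records = (
    fake_reports(2018, 4)
    + fake_reports(2019, 5)
    + fake_reports(2020, 4)
    + fake_reports(2021, 9)
    + fake_reports(2022, 10)
)

counts = reports_per_year(records)
for year in sorted(counts):
    print(year, counts[year])
print("Years above the cutoff:", spike_years(counts))
```

Flagging a year is only the start of the project. The interesting work is figuring out whether the jump reflects the drugs themselves, a change in reporting, or something else entirely.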
Portfolio Project Examples 3: Impact of Health Insurance on Populations
Challenge
I have been working with some colleagues who are concerned about how health insurance in the US is draining its population of funds to support the corrupt privatized healthcare system. Even people on our public insurances – Medicaid and Medicare – are experiencing a post-pandemic healthcare system that crashed and is continuing to run in “crisis care” mode while everyone appears to be acting like nothing happened.
A good place to start with a portfolio project like this is the US census. It publishes many variables describing population characteristics, often as rates for specific areas. For example, I looked up their data tables on health insurance coverage.
Solution and Project Details
As I said on the graphic, when you get rates for certain areas, you can do an ecologic study – also called a correlational study. You can even make a dashboard and visualize the rates with a scatterplot or other interactive diagram. It’s a great excuse to practice your R or Python skills! See the visual below, which is a screen shot from this peer-reviewed article, where we used a scatter plot and a heat map to show associations.
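To make the ecologic study idea concrete, here is a minimal Python sketch that fits a straight line through area-level rates, which is the same line you would draw over the scatter plot. The five areas and both rates are made-up numbers (not census data), the outcome is a hypothetical placeholder, and `fit_line` is my own helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up area-level rates for illustration only (NOT census data).
# Each tuple is (percent uninsured, percent with a hypothetical outcome).
areas = {
    "Area A": (4.0, 7.0),
    "Area B": (6.0, 9.0),
    "Area C": (8.0, 10.0),
    "Area D": (10.0, 13.0),
    "Area E": (12.0, 14.0),
}

uninsured = [rates[0] for rates in areas.values()]
outcome = [rates[1] for rates in areas.values()]
slope, intercept = fit_line(uninsured, outcome)
print(f"outcome = {intercept:.1f} + {slope:.1f} * uninsured")
```

Keep in mind that each point in an ecologic study is an area, not a person, so a slope like this describes how areas differ and cannot tell you what happens to any individual.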
Portfolio project examples are sometimes needed for newbies in data science who are looking to complete independent projects. This blog post provides some great examples of independent projects you can do with datasets available online!