Curated Datasets: Great for Data Science Portfolio Projects!

If you need data to do a project, read this blog post for information.

Curated datasets are available on the internet for you to use in portfolio projects. This blog post was intended for our mentoring program, but anyone can benefit! I’ve separated the datasets into “health-related” and “non-health-related” and put them in alphabetical order by their abbreviation. I’ve also included some personal notes based on my own experience working with each dataset, or on what I have heard from others who have used it.

I will keep adding to this post as I learn more about these datasets and find out about other datasets you can use for your portfolio projects. If you have any comments about your experience with these datasets, please add them to this blog post! Good luck with your projects!

You can download BRFSS health survey datasets from the internet.

Formal name: Behavioral Risk Factor Surveillance System

Description: Cross-sectional annual dataset of an anonymous phone-based health survey in the United States (US).

Pros:

  • Well-documented
  • Datasets free to download online
  • Some extremely useful questions for analysis
  • Goes many years back so trending possible
  • Core dataset includes representative sample of US population by states and nationally, including weights for weighted analyses
  • Large (over 400,000 records)
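
Because the core dataset ships with survey weights, estimates should be weighted rather than computed from raw counts. Here is a minimal sketch of a weighted prevalence in Python. The respondents and the field names `smoker` and `weight` are invented for illustration (check the BRFSS codebook for the actual weight variable in your survey year), and the sketch does not compute design-based standard errors.

```python
def weighted_prevalence(records, flag_key, weight_key):
    """Return the weighted proportion of records where flag_key is truthy."""
    total_weight = sum(r[weight_key] for r in records)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    flagged_weight = sum(r[weight_key] for r in records if r[flag_key])
    return flagged_weight / total_weight


# Hypothetical respondents: each weight is the number of people in the
# population that the respondent represents.
respondents = [
    {"smoker": True, "weight": 100.0},
    {"smoker": False, "weight": 300.0},
    {"smoker": True, "weight": 50.0},
    {"smoker": False, "weight": 50.0},
]

unweighted = sum(r["smoker"] for r in respondents) / len(respondents)
weighted = weighted_prevalence(respondents, "smoker", "weight")
print(f"Unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
# prints "Unweighted: 0.50, weighted: 0.30"
```

The gap between the two numbers is the point: ignoring the weights can noticeably distort a population estimate.
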

Cons:

  • Some outdated questions
  • Lack of contemporary questions (e.g., e-cigarette use)
  • Cross-sectional, so cannot be used for longitudinal studies
  • Recent severe reduction in core questions asked (as of 2020)
  • Lacks clinical data or data that can be measured in-person (see NHANES)

Resources:

  • Guide to downloading BRFSS datasets
  • Monika’s LinkedIn Learning courses in R and SAS that use the BRFSS as a demonstration dataset
  • Monika’s SAS book that uses BRFSS as an example

The Centers for Medicare & Medicaid Services (CMS) administers public health insurance in the United States. Pharmacy-related data are available online.

Formal name: Centers for Medicare & Medicaid Services (CMS)

Description: CMS offers several online dashboards where you can view the pharmacy data in tabular form, apply filters, and download the raw data for analysis.

Pros:

  • Very easy to browse and download data using the dashboards
  • Well-documented
  • Up-to-date
  • Even if you do not have a background in pharmaceuticals, it is easy to look up information about drugs, the US, and other topics to be able to come up with a question to answer using some of these data.

Cons:

  • The data served up are not necessarily the data you want. Some calculated variables appear to exist for checking whether a policy is met, and they are not generally useful for analysis (e.g., counts of people aged 65 and over receiving a drug vs. everyone).
  • Even though it is easy to acquire the domain knowledge necessary to do an analysis, it is quite time-consuming. Drugs are complicated, Medicare and Medicaid are complicated, and US healthcare finance is also complicated.

Resources:

  • If you click on the Access Cost Data button, you will go to a page with links to four different pages: Medicare Part B Spending by Drug, Medicare Part D Spending by Drug, Medicaid Spending by Drug, and Medicare Part B Discarded Drug Units. On each of these pages, if you choose the “view data” button in the upper left, you will be brought to the dashboard to filter and download data.

DCCT is the baseline data, and EDIC is the longitudinal data.

Formal name: Diabetes Control and Complications Trial (DCCT) and Epidemiology of Diabetes Interventions and Complications Study (EDIC)

Description: The DCCT was originally a study of people with Type 1 diabetes who were trying to actively control their diabetes, and EDIC was the longitudinal follow-up done on this cohort. I have not used this dataset, so I’m basing my opinions on the documentation.

Pros:

  • Well-documented
  • This is a very high-quality dataset. The measurements seem very valid and reliable.
  • Very long-term follow-up, so survival analysis models can be used
  • For the entire trial the sample size is pretty big, and the data appear to be pretty clean.
  • Datasets are available at no cost but you must undergo authorization
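
To give a flavor of the survival analysis that long-term follow-up makes possible, here is a minimal sketch of a Kaplan-Meier estimator in plain Python. The follow-up times and event flags are invented and are not DCCT/EDIC data; for a real project you would more likely use an established survival analysis library.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time for each participant
    events: True if the event was observed, False if the participant was censored
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        observed = 0  # events at time t
        leaving = 0   # everyone leaving the risk set at time t
        while i < len(pairs) and pairs[i][0] == t:
            observed += int(pairs[i][1])
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in years; False marks a censored participant.
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
for time, probability in curve:
    print(f"t={time}: S(t)={probability:.2f}")
```

Censored participants still count as at risk up to the time they leave, which is what separates this from a simple proportion of events.
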

Cons:

  • Very old-fashioned dataset and structure. Non-relational, so very confusing.
  • Documentation page has all necessary information, but is clumsy, slow, and labyrinthian.
  • Need a lot of domain knowledge to understand the dataset. Measurements such as psychological instruments, labs, and genetics require advanced knowledge to analyze.

Resources:

  • I review the DCCT and EDIC datasets in my online course, “How to do Data Close-out: Boot Camp”.

The United States FDA Adverse Event Reporting System lets you download raw data so you can analyze it.

Formal name: United States Food and Drug Administration (FDA) Adverse Event Reporting System

Description: In the United States, after a drug goes on the market, adverse events that people experience can be reported into this system.

Pros:

  • Data are current and up-to-date
  • Easy to download in ASCII format free over the internet
  • Well-documented – zip file contains explanation of each dataset
  • Although somewhat confusing, it is not that challenging to understand the data from the point-of-view of domain knowledge
  • Very nice dashboard available to help you understand the data before you download it.

Cons:

  • Data are biased, in that many adverse events go unreported
  • Data need to be aggregated to be meaningful. For example, if you want to study “Lisinopril”, or “hypertension medication”, or certain classifications of adverse events, you will need to look at all the different formulations and develop classifications.
  • Multiple tables, so it will take some reading of documentation to figure out what you want to do with this dataset
  • No load code posted
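
The aggregation step described above can be sketched as follows: reported product names are normalized, then mapped to a single ingredient before counting. The lookup table and the sample reports are made up for illustration; a real project would need a much larger mapping, possibly built from a drug vocabulary.

```python
from collections import Counter

# Hypothetical lookup from reported product names to one active ingredient.
NAME_TO_INGREDIENT = {
    "lisinopril": "lisinopril",
    "lisinopril tablets": "lisinopril",
    "zestril": "lisinopril",
    "prinivil": "lisinopril",
}


def normalize(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def count_by_ingredient(reported_names):
    """Count reports per ingredient; names with no mapping go to 'unmapped'."""
    counts = Counter()
    for raw in reported_names:
        counts[NAME_TO_INGREDIENT.get(normalize(raw), "unmapped")] += 1
    return counts


# Invented examples of how one drug might appear in free-text name fields.
reports = ["Lisinopril", "ZESTRIL", "lisinopril  tablets", "Prinivil", "Aspirin"]
counts = count_by_ingredient(reports)
print(counts)
```

Keeping an explicit "unmapped" bucket makes it easy to see how much of the data your classification has not covered yet.
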

The Healthcare Cost and Utilization Project data can be used to analyze healthcare cost.

Formal name: Healthcare Cost & Utilization Project

Description: Cross-sectional annual datasets of a sample of data from United States (US) healthcare settings intended for analysis to understand healthcare use and cost. There are multiple databases, but the most popular is the National (formerly Nationwide) Inpatient Sample (NIS), which is an annual dataset about patients discharged from US hospitals. The pros and cons below refer specifically to the NIS, as I have little familiarity with the other datasets.

Pros:

  • Well-documented
  • Goes many years back so trending possible
  • Includes rare cost-related data so can be used for healthcare cost forecasting
  • Includes many calculated variables based on economics that are useful in forecasting models

Cons:

  • Datasets are for sale, and although prices are not that high, it takes a lot of research on their documentation web site to figure out what data you want to purchase
  • Due to lack of access in the US healthcare system and other features of US healthcare financing and delivery, the dataset is unbalanced and biased, and it is hard to use for health-related analyses. In fact, drawing inferences may actually perpetuate bias in the US healthcare system.
  • Many variables in the dataset are very hard to understand. They require knowledge of both the healthcare setting and healthcare economics.
  • Data prepped for SAS, but not for open-source applications

Resources:

  • I review HCUP documentation in my online course, “How to do Data Close-out: Boot Camp”.

Formal name: Military Health System Data Repository

Description: This repository is a data lake of processed production data from healthcare settings in the United States (US) military.

Pros:

  • Extremely well-documented and well-used
  • Goes many years back so trending possible
  • Analyses extremely helpful to the functioning of the US military
  • Servicemembers and veterans who want to do epidemiologic analyses: These are GREAT datasets for you, because you will already understand them, and will have less trouble getting access to them for projects

Cons:

  • Requires a lot of domain knowledge. If you have not worked for the military or been in the military, these will be very confusing datasets for you.
  • Difficult to get permission to access unless part of a research team that has approval. A good way to address this is to become part of a research team that gets a grant from the military or other organization to use these datasets to study the military.

Resources:

  • My colleagues and I published some injury papers where we combined the SIDR and SADR datasets from the MHS with other military datasets: one on ankle injury, one on knee injury, and one on rhabdomyolysis.

NHANES is a health surveillance dataset from the US that includes examination data.

Formal name: National Health and Nutrition Examination Survey

Description: Cross-sectional annual dataset of an in-person health survey in the United States (US).

Pros:

  • Datasets free to download online
  • Some extremely useful questions for analysis
  • Goes many years back so trending possible
  • Population-based sampling with weights
  • Includes rare in-person examination data, such as oral health data, and anthropometrics.
  • Also includes some laboratory measurements, and other hard-to-measure health data.

Cons:

  • Some outdated questions
  • Lack of contemporary questions (e.g., e-cigarette use)
  • Cross-sectional, so cannot be used for longitudinal studies
  • Dataset is small (<10,000 per year). For a larger health survey with similar measurements, see BRFSS.
  • Extremely fragmented dataset, in that each exam or questionnaire is stored in its own dataset, so hard to put together a dataset with all the covariates needed about every experimental unit. For this reason, many researchers aggregate multiple years of NHANES data to answer research questions.
  • Data prepped for SAS, but not for open-source applications
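
Putting the fragmented files back together usually means joining them on the respondent identifier. Here is a minimal sketch in plain Python, assuming each component file has already been loaded as a list of rows. `SEQN` is, to the best of my knowledge, the respondent sequence number NHANES uses, but verify it in the documentation for your cycle; the other column names and all values are invented.

```python
def merge_on_key(key, *tables):
    """Inner-join several lists of dict rows on a shared key."""
    indexed = [{row[key]: row for row in table} for table in tables]
    shared = set(indexed[0])
    for index in indexed[1:]:
        shared &= set(index)
    merged = []
    for value in sorted(shared):
        combined = {}
        for index in indexed:
            combined.update(index[value])
        merged.append(combined)
    return merged


# Hypothetical component files, one list per questionnaire or exam.
demographics = [{"SEQN": 1, "age": 34}, {"SEQN": 2, "age": 51}, {"SEQN": 3, "age": 67}]
body_measures = [{"SEQN": 1, "bmi": 24.1}, {"SEQN": 3, "bmi": 29.8}]
labs = [{"SEQN": 1, "hdl": 55}, {"SEQN": 2, "hdl": 48}, {"SEQN": 3, "hdl": 61}]

analytic = merge_on_key("SEQN", demographics, body_measures, labs)
print(analytic)
```

An inner join keeps only respondents present in every file, so the analytic dataset shrinks with each component you add. That is one reason researchers pool several years of data.
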

Warranty Week makes available data on warranties that you can purchase.

Description: Warranty Week is a newsletter for warranty management professionals. If you want to get into studying warranties (great for actuaries!), then this is the site for you. It has manufacturers’ product warranty expense reports available in an Excel spreadsheet format. Each spreadsheet includes warranty reserves, claims, and accrual figures for specific companies. Thanks to Joe Chantiny for turning me on to this data source!

Pros:

  • Voluminous, accurate data from a trusted provider.
  • It is easy to choose data from companies a la carte – so you can focus on a warranty domain (e.g., automotive, energy, manufacturing).

Cons:

  • You need to have a lot of domain knowledge to understand this dataset and figure out what to do with it. If you want to know more, please contact Joe Chantiny on LinkedIn and he can give you more information.
  • The spreadsheets are NOT tabular – each spreadsheet is essentially a report. To do anything with the data, you would have to process it into data tables.
  • Data are not free, but the cost is low, and you can pay by PayPal. However, you need to shop carefully on the site to decide exactly what you are buying. They have 394 datasets, and as of this writing, to purchase all of them is less than $2,000, which is actually a great deal in my opinion.
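
Turning a report-style spreadsheet into a data table typically means finding the header row and reshaping the figures into one row per metric and year. The sketch below assumes the sheet has already been read into a list of rows. The layout, company name, and numbers are hypothetical and are not taken from an actual Warranty Week file.

```python
# Hypothetical report layout: a title row, a blank row, a header row of
# years, then one row per metric.
report_rows = [
    ["Example Motors Warranty Report", None, None, None],
    [None, None, None, None],
    ["Metric", 2020, 2021, 2022],
    ["Claims paid", 120.0, 135.5, 128.0],
    ["Accruals made", 110.0, 140.0, 131.5],
    ["Reserves at year end", 300.0, 304.5, 308.0],
]


def report_to_long(rows, header_label="Metric"):
    """Reshape a report-style sheet into one record per metric and year."""
    header_index = next(
        i for i, row in enumerate(rows) if row and row[0] == header_label
    )
    years = rows[header_index][1:]
    tidy = []
    for row in rows[header_index + 1:]:
        if not row or row[0] is None:
            continue  # skip blank spacer rows
        metric, values = row[0], row[1:]
        for year, value in zip(years, values):
            if value is not None:
                tidy.append({"metric": metric, "year": year, "value": value})
    return tidy


tidy = report_to_long(report_rows)
print(len(tidy), tidy[0])
```

Once the data are in this long format, they can be filtered, plotted, or joined like any other table.
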

You can download reports online from the Massachusetts Casino Gaming Commission and analyze the data.

Description: Massachusetts has three casinos, and each has its own revenue report that is updated monthly. Each revenue report is in PDF format, and looks like an Excel table. Each row has data for one month the casino has been open.

Pros:

  • If you are a casino customer, it is very easy to understand the data.
  • Interesting to use in time trending, and could be useful if paired with casino data from nearby states to review competitive pressure as the different casinos in Massachusetts opened.
  • Great examples of continuous variables and monetary data for practice.
  • Opportunity to connect to other datasets about Massachusetts for richer information
  • Portfolio projects can be very interesting, flashy, and easy to understand

Cons:

  • Electronic datasets need to be developed – they are in PDF format. It is not hard to do data entry, but an enterprising programmer could scrape the data easily as well.
  • Small datasets, so very little data. Casinos have not been open in Massachusetts very long.
  • Only a few columns, so you need to be creative to develop research questions and analyses to answer them.
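
For the enterprising programmer mentioned above, here is a minimal sketch of the parsing step, assuming you have already extracted the text from the PDF with a tool of your choice. The column layout, the line format, and the figures are hypothetical, so the pattern would need adjusting to match the real reports.

```python
import re
from datetime import datetime

# Assumed line format: "<Month> <Year>  $<revenue>  $<taxes>"
LINE_PATTERN = re.compile(
    r"^(?P<month>[A-Z][a-z]+ \d{4})\s+"
    r"\$(?P<revenue>[\d,]+\.\d{2})\s+"
    r"\$(?P<taxes>[\d,]+\.\d{2})$"
)


def parse_revenue_lines(text):
    """Turn extracted report text into one dict per monthly row."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if not match:
            continue  # skip headers, totals, and blank lines
        rows.append({
            # %B expects English month names under the default locale
            "month": datetime.strptime(match["month"], "%B %Y").date(),
            "revenue": float(match["revenue"].replace(",", "")),
            "taxes": float(match["taxes"].replace(",", "")),
        })
    return rows


# Invented sample of extracted text
sample_text = """Month            Gross Gaming Revenue    Taxes Collected
January 2020     $20,000,000.00          $5,000,000.00
February 2020    $18,500,000.50          $4,625,000.13
Total            $38,500,000.50          $9,625,000.13
"""

rows = parse_revenue_lines(sample_text)
for row in rows:
    print(row)
```

Floats are fine for exploratory trending; if you need exact cents, `decimal.Decimal` is a safer choice.
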

Resources:

  • See my blog post for an example of a portfolio project I developed before the pandemic, so it is outdated now. Feel free to update it with your own analysis!

Last updated June 7, 2023. Revised banners June 17, 2023.

Census data can be available from many countries, but the United States census is great because the data are available online.

Description: Every 10 years, the US tries to do a census – meaning count everyone in the US. They have us all fill out a form – either a long form, providing a lot of data, or a short form, providing very basic demographic data. There is also a yearly survey called the American Community Survey that is done by phone to try to improve the biased estimates that come from the census (because not everyone can be contacted to fill out a form).

Pros:

  • They have a new rebuilt portal (goodbye American FactFinder)!
  • These data are very helpful when you are writing about a specific organization or location in the US. If I am analyzing data that came from students in a particular school, then I can use census data to characterize the region around that school.
  • If you understand how counties work in the US, and you are analyzing other data that has county variables (e.g., data from hospitals located in certain counties), then “hooking on” county-level estimates from the census is a great way to add value to your data.
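
“Hooking on” county-level estimates amounts to a lookup join on a county identifier. Here is a minimal sketch in plain Python. The hospitals, county codes, and estimates are all invented, and the column names are assumptions rather than actual census variable names. County codes are kept as strings so that leading zeros are not lost.

```python
# Hypothetical analysis data with a county code on each row.
hospitals = [
    {"hospital": "Hospital A", "county_code": "00001"},
    {"hospital": "Hospital B", "county_code": "00003"},
    {"hospital": "Hospital C", "county_code": "00005"},  # no estimate available
]

# Hypothetical county-level estimates keyed by county code.
county_estimates = {
    "00001": {"median_household_income": 70000, "population": 800000},
    "00003": {"median_household_income": 95000, "population": 1600000},
}


def attach_county_estimates(rows, estimates, key="county_code"):
    """Left-join county estimates onto each row and flag unmatched rows."""
    enriched = []
    for row in rows:
        extra = estimates.get(row[key], {})
        enriched.append({**row, **extra, "county_matched": row[key] in estimates})
    return enriched


enriched = attach_county_estimates(hospitals, county_estimates)
for row in enriched:
    print(row)
```

Flagging unmatched rows instead of dropping them makes it obvious when a county code is missing or suppressed in the census data.
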

Cons:

  • You really have to understand the US in terms of geographic regions in order to utilize the data. You need to understand states, counties, MSAs, census districts, and other regional definitions.
  • Some data are suppressed for privacy reasons.
  • It’s great that there are so many datasets and so much data available, but many are hard to understand. Beyond the simple analysis, it will take you some time to get to know what exactly is available, and what you can use for your purposes.

Resources:

  • If you click below, you will be directed to the census online table-builder. However, there are other data products you can explore from the census.

Last updated July 15, 2023.

