Classification Crosswalks: Strategies in Data Transformation

What do you do if a categorical variable has too many categories, so that its cardinality is too high for a chi-square analysis?

Classification crosswalks are useful when you want to reduce the number of levels in a categorical variable you want to analyze. In biostatistics, we would say that a categorical variable with 77 levels has too many levels to analyze, and we'd want to reduce the levels. If we were doing a chi-square analysis, even 10 levels would be too many. In data science, we would say that 77 is too high a "cardinality" for this variable, and we'd want to reduce the cardinality. Either way, classification crosswalks can help you.
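At its core, a classification crosswalk is just a lookup table that maps many raw levels onto a few analysis categories. Here is a minimal sketch in Python, using hypothetical provider types (not the actual levels from the dataset discussed below):

```python
# A classification crosswalk is just a lookup from raw levels to
# collapsed categories. These provider types are hypothetical examples.
crosswalk = {
    "Family Practice": "Primary Care",
    "Internal Medicine": "Primary Care",
    "Orthopedic Surgery": "Surgeon",
    "Nurse Practitioner": "Nurse",
}

raw_levels = ["Family Practice", "Internal Medicine",
              "Orthopedic Surgery", "Nurse Practitioner"]

# Recode each raw level through the crosswalk; unknown levels fall
# into an "Other" bucket so nothing is silently dropped.
collapsed = [crosswalk.get(level, "Other") for level in raw_levels]
print(sorted(set(collapsed)))  # → ['Nurse', 'Primary Care', 'Surgeon']
```

Four raw levels collapse to three analysis categories; the same idea scales from 77 levels down to a handful.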

Classification Crosswalks Example

This example comes from my work helping a member of our online data science mentoring program. She was considering a study of pharmacy claims data from the Medicare Part D Prescribers online database.

We had read a report online about prescribers of pain medication in New Hampshire (NH), so she downloaded the NH data from this database. Our next step was to make classification crosswalks, but we ended up changing the research question, so we never completed this step. I decided to use data from this download to demonstrate in this blog post how to make classification crosswalks for these data, so you can do something similar for your portfolio project!

Why We Need Classification Crosswalks

We were doing our initial data curation work in Microsoft Excel. If you want to download the spreadsheet, go to my GitHub here and look for “Provider Crosswalk”.

The mentoring program participant exported a spreadsheet of count data: the number of providers in NH who prescribed a certain pain medication in 2020 (a query filter had been applied). The output included only the provider types responsible for at least one claim (i.e., with a claim frequency greater than 0). The variable Prscrbr_Type was the native name of the provider type descriptor column. I made a graphic to summarize the output.

If you have too many levels of a categorical variable, you really need to make a crosswalk.

As you can see in the graphic, I added a primary key (PK) I called PrescID, which is just a sequence of numbers from 1 to 77 (the number of rows). That way, I could sort the spreadsheet by any column and then sort by PrescID to restore the original order.
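In code terms, this primary-key trick looks like the following sketch (the rows and claim counts are hypothetical, not the actual NH data):

```python
# Attach a sequential primary key so any sort can be undone by
# re-sorting on the key. Rows and counts are hypothetical examples.
rows = [
    {"Prscrbr_Type": "Internal Medicine", "claims": 900},
    {"Prscrbr_Type": "Family Practice", "claims": 1200},
    {"Prscrbr_Type": "Dentist", "claims": 300},
]
for i, row in enumerate(rows, start=1):
    row["PrescID"] = i  # 1, 2, 3, ... in original order

# Sort by claims (most to least) for inspection ...
by_claims = sorted(rows, key=lambda r: r["claims"], reverse=True)

# ... then recover the original order at any time via the key.
original = sorted(by_claims, key=lambda r: r["PrescID"])
print([r["Prscrbr_Type"] for r in original])
# → ['Internal Medicine', 'Family Practice', 'Dentist']
```

The same principle applies in Excel: as long as the key column travels with the rows, no sort is destructive.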

So, we had too many provider types, namely 77. The topic was pain medication, so I knew it was logical for certain types of providers, like surgeons, to prescribe pain medication. But it is not typical for mental health providers to prescribe pain medication. I kept that in mind for classification purposes.

Now that I had my PK, I sorted the data by the frequency of claims from most to least. Using the empirical logic described earlier, plus the frequency information in the spreadsheet, I created a new foreign key (FK) called PrescID2 (with a description column called PrescID2 Desc). This is the heart of a classification crosswalk: it let me collapse the 77 provider types into a smaller set of categories, as you'll see in the graphic.

You can create a classification crosswalk as part of your data curation in Excel.

As you can see in the graphic, I classified both Family Practice and Internal Medicine under Primary Care (PrescID2 = 1). I coded it 1 on purpose, because I wanted that to be the first category presented each time I did data analysis using PrescID2. How I knew what belonged under primary care comes from my professional background, as well as online lists of which professions fall under which classifications.

Also, it was important to me to keep "nurse" and "physician assistant" separate. It is not shown in the graphic, but other nurse entries were also coded as "nurse", while "physician assistant" remained its own category. These were very large categories that I kept separate because the two professions have different training pathways. In other words, if either group were prescribing "too much" pain medication, there could be a targeted training intervention. Again, these classifications are based on my empirical knowledge, as well as the frequencies I was seeing in the data.

Others may argue with my classifications, and even with my coding. But since I'm doing a portfolio project for data science, my focus is simply keeping documentation of what I am doing so I can share it.

Quick Analysis with a Pivot Table

Those of you familiar with Excel may know how to make a pivot table. I don't like to use them except as "temporary calculators". What I mean is that I will make a tab in an Excel spreadsheet called "Pivot Table", use it to make calculations, and then copy those results and "paste values" into some other tab. That's because pivot tables change as you adjust them; they are essentially queries that update as you change the criteria. So I treat them that way.
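Outside Excel, a pivot table that sums claims within each collapsed category is just a group-and-sum. A sketch with hypothetical counts:

```python
from collections import defaultdict

# Hypothetical (category, claims) rows after applying the crosswalk.
rows = [
    ("Primary Care", 1200), ("Primary Care", 900),
    ("Nurse", 800), ("Surgeon", 300),
]

# Equivalent of a pivot table summing claims within each category.
totals = defaultdict(int)
for category, claims in rows:
    totals[category] += claims

print(dict(totals))
# → {'Primary Care': 2100, 'Nurse': 800, 'Surgeon': 300}
```

Like a pivot table, this is a live computation over the source rows; copying the result out ("paste values") is what freezes it.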

I made a tab called “Pivot Table” and then set up the pivot table as you can see in the graphic.

As you can see, working with classification crosswalks allowed us to reduce the cardinality from 77 to 10. Was that good enough? To find out, I first copied the values out of the pivot table and pasted them into a new tab I called "analysis".

After taking a look at the new frequencies in the collapsed categories, I decided the cardinality was still too high. I collapsed them further into 7 categories, as you can see in the graphic.

As you can see in the graphic, after copying the data out of the pivot table, I added an "order" column and copied PrescID2 into it. That way, I could sort the table into the order I wanted, which made it easy to decide what to collapse further. As you can see, I combined four of the categories: I just changed the text under Row Labels and used Excel's SUM function to add up the number of claims.
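This second round of collapsing is just another, smaller crosswalk applied on top of the first. A sketch with hypothetical category names and counts:

```python
# Second-round crosswalk: fold several small categories into one.
# Category names and claim counts are hypothetical examples.
round1 = {"Dentist": 150, "Podiatrist": 90, "Optometrist": 40,
          "Chiropractor": 30, "Primary Care": 2100}

merge_into_other = {"Dentist", "Podiatrist", "Optometrist", "Chiropractor"}

round2 = {}
for category, claims in round1.items():
    target = "Other" if category in merge_into_other else category
    # Summing here mirrors using SUM in Excel after merging rows.
    round2[target] = round2.get(target, 0) + claims

print(round2)  # → {'Other': 310, 'Primary Care': 2100}
```

Because each round is a documented mapping, anyone reviewing the project can trace every final category back to the original 77 levels.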

Classification Crosswalks: Purpose for Portfolio Project

The original plan to do a portfolio project on these claims data is what prompted this data curation effort in the first place. The purpose of these classifications was to choose a provider type to focus on. The original topic was to look into which prescribers might be prescribing pain medication when they should not be.

So in other words, doing this classification was actually part of curating the data and planning for a portfolio project. It wasn’t the project itself. Therefore, using Excel to produce some results was not a bad idea. Excel can produce simple results quickly that can allow you to make study design decisions.

For example, I threw together a pie chart based on our final classifications.
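The numbers behind such a pie chart are just each category's share of total claims. A sketch with hypothetical final categories and counts (not the actual NH figures):

```python
# Hypothetical final categories and claim counts.
claims = {"Primary Care": 2100, "Nurse": 800,
          "Physician Assistant": 600, "Mental Health": 250}

total = sum(claims.values())  # 3750 in this example

# Percentage share of each slice, rounded to one decimal place.
shares = {k: round(100 * v / total, 1) for k, v in claims.items()}
print(shares)
# → {'Primary Care': 56.0, 'Nurse': 21.3,
#    'Physician Assistant': 16.0, 'Mental Health': 6.7}
```

Even without a charting tool, scanning these shares is enough to spot a slice that looks out of place, which is the study-design signal the pie chart gave us.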

In this pie chart, the largest three pieces, prescriptions written by primary care, nurses, and physician assistants, make sense. But mental health practitioners prescribing pain medication? That doesn't look right to me. Maybe it's fine, but I think that would be an interesting topic to examine.

This is where I’d do a deep dive, and try to figure out what is going on. That is an example of a topic that could be further explored in a data science portfolio project.

Added video January 19, 2024.
