ETL Pipeline Documentation: Here are my Tips and Tricks!

This blog post shows you how to properly document your extract, transform, and load code.

ETL pipeline documentation is necessary on all team data projects. In fact, I create ETL pipeline documentation for every project where I do ETL, even if I’m the only one doing it. Extract, transform, and load (ETL) protocols are typically quite complex. Even what initially looks like a simple process of transforming three native variables into nine analytic variables – as I demonstrate in this video – turns out to be rather complex.


Watch my video for a demonstration of how to document your ETL pipeline.

ETL Pipeline Documentation: Different Types

In the video, I classify ETL pipeline documentation into two categories: the minimum necessary, and additional helpful files that can be added. What I list under the “minimum necessary” is basically two items:

  1. A complete data dictionary, with well-documented picklists and available crosswalks, and
  2. A copy of the final analytic dataset.

You can download all of these files from GitHub.

Data Dictionary

The main part of a data dictionary lists the variables in a data table and provides necessary information about each one, such as what it means, how it is populated, and in what format it is stored. Picklists are like dropdown lists, where a value refers to a level in a categorical variable (e.g., 1 = Yes, 2 = No, 9 = Don’t Know). Crosswalks are like picklists on steroids – meaning they are very extensive picklists that can be joined to the main dataset to add columns that enhance its analytic utility.

This is an example of the type of documentation you can include in your data dictionary.
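
To make picklists and crosswalks concrete in code, here is a minimal SAS sketch. The variable and table names (smoke_status, zip, county_name, native_extract, zip_crosswalk) are hypothetical examples for illustration, not taken from any real data dictionary.

  /* Hypothetical picklist: label the levels of a categorical variable */
  proc format;
    value smokef
      1 = 'Yes'
      2 = 'No'
      9 = "Don't Know";
  run;

  /* Hypothetical crosswalk: join extra columns onto the main dataset */
  proc sql;
    create table work.native_enhanced as
    select a.*, b.county_name, b.region
    from work.native_extract as a
      left join work.zip_crosswalk as b
      on a.zip = b.zip;
  quit;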

Analytic Datasets for Research

When you engage in ETL, you transform the native extract you received from the data provider into an analytic dataset on which you can base your research analysis. Your ETL code accomplishes this through discrete transformation steps.
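
As a sketch of what one discrete transformation step might look like, here is a minimal SAS data step. The variable names (age, age_grp) and dataset names are assumptions for illustration, not the actual variables from my video.

  /* One discrete transformation step: derive an analytic variable */
  /* (age_grp) from a native variable (age); names are hypothetical */
  data work.analytic;
    set work.native_extract;
    if missing(age) then age_grp = .;
    else if age < 18 then age_grp = 1;   /* minor  */
    else if age < 65 then age_grp = 2;   /* adult  */
    else age_grp = 3;                    /* senior */
  run;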

Other ETL Pipeline Documentation

Other documentation that can be helpful includes:

  1. Well-organized ETL code
  2. An ETL pipeline diagram
  3. Example deidentified source data

Well-organized ETL Code

Well-organized ETL code is modular and includes only one or a few transformation steps per code file. The code files use naming conventions that make the code line up in order of operation, and other naming conventions (e.g., for datasets and variables) are applied consistently. To learn more about this, read Chapter 4 of my book, “Mastering SAS Programming for Data Warehousing”.
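
To show what such naming conventions could look like in practice, here is a hypothetical driver file in SAS. The folder path and file names are assumptions for illustration, not the exact structure from the book.

  /* Hypothetical driver file: numbered file names keep the steps in order */
  %let etl_path = /project/etl_code;                     /* assumed project folder */
  %include "&etl_path./010_extract_native.sas";          /* pull the native extract */
  %include "&etl_path./020_transform_demographics.sas";  /* one or two steps per file */
  %include "&etl_path./030_transform_visits.sas";
  %include "&etl_path./090_load_analytic.sas";           /* write the final analytic dataset */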

ETL Pipeline Diagram

An ETL pipeline diagram is basically a map that summarizes the code. It goes hand in hand with the code and the transformation steps. If you have trouble with a particular variable in the analytic dataset, the diagram can help you pinpoint which code file to open and modify. Well-organized code and diagrams like these are necessary if you want to automate your ETL pipeline using SAS or other programs. I go over ETL code automation in Chapter 8 of my SAS book. The video below from SAS provides an example of doing this in SAS Data Integration Studio.

SAS provided this video to show you how to automate your ETL pipeline in SAS DI Studio.
In SAS Data Integration Studio, a job refers to a pipeline you can create, save, and come back to edit.

Example of Deidentified Source Data

Sometimes, if you get confused about the source data, you can ask your data provider to make up a few fake records for you that follow the data’s validation rules.

These data are obviously fake, but they show the receiver the format, structure, and some of the business rules of the data. Receiving a fake dataset like this (even if it is only a few rows) gives you the opportunity to start building transformation code, especially for complex date handling, before you receive the real data from the provider.
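
If you want to mock up a few fake rows yourself while you wait, a minimal SAS sketch might look like this; the field names, values, and validation rules here are invented for illustration.

  /* Hypothetical fake source records that follow simple validation rules */
  data work.fake_source;
    length patient_id $8;
    format admit_date yymmdd10.;
    patient_id = 'P001'; admit_date = '15MAR2021'd; smoke_status = 1; output;
    patient_id = 'P002'; admit_date = '02APR2021'd; smoke_status = 9; output;
    patient_id = 'P003'; admit_date = '20MAY2021'd; smoke_status = 2; output;
  run;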

Updated February 22, 2023.


