ETL pipeline documentation is necessary on all team data projects. In fact, I make ETL pipeline documentation every time I have a project where I am doing ETL, even if I’m the only one doing it. Extract, transform, and load (ETL) protocols are typically quite complex. Even what initially looks like a simple process of transforming three native variables into nine analytic variables – as I demonstrate in this video – turns out to be actually rather complex.
ETL Pipeline Documentation: Different Types
In the video, I classify ETL pipeline documentation into two categories: the minimum necessary, and additional helpful files that can be added. What I list under the “minimum necessary” is basically two items:
- A complete data dictionary, with well-documented picklists and available crosswalks, and
- A copy of the final analytic dataset.
You can download all of these files from GitHub.
The main part of a data dictionary provides a list of names of variables in a data table, and provides necessary information about them, such as what they mean, how they are populated, and in what format they are stored. Picklists are like dropdown lists, where a value refers to a level in a categorical variable (e.g., 1 = Yes, 2 = No, 9 = Don’t Know). Crosswalks are like picklists on steroids – meaning they are very extensive picklists that can be joined to the main dataset to add other columns to enhance its analytic utility.
Analytic Datasets for Research
When you engage in ETL, you are transforming the native extract you received from the data provider into an analytic dataset on which you can base your analysis for research. Through discrete transformation steps, your ETL code transforms the native extract into the analytic dataset.
Other ETL Pipeline Documentation
Other documentation that can be helpful is:
- Well-organized ETL code
- An ETL pipeline diagram
- Example deidentified source data.
Well-organized ETL Code
Well-organized ETL code is modular, and includes only one or a few transformation steps per code file. The code files use naming conventions that make the code line up in order of operation. Other naming conventions (e.g., for datasets and variables) are applied. To learn more about this, read Chapter 4 of my book, “Mastering SAS Programming for Data Warehousing”.
ETL Pipeline Diagram
This ETL pipeline diagram is basically a map that summarizes the code. It goes hand in hand with the code and the transformation steps. If you have trouble with a particular variable in the analytic dataset, this diagram can help you troubleshoot which code to open and modify. Well-organized code and diagrams like these are necessary if you want to automate your ETL code using SAS or other programs. I go over ETL code automation in Chapter 8 of my SAS book. The videos below from SAS provide an example of doing this in SAS Data Integration Studio.
Example of Deidentified Source Data
These data are obviously faked – but they serve to show the receiver of this faked data the format, structure, and some business rules of the data. Receiving a faked dataset like this – even if it is only a few rows – can give you the opportunity to start building transformation code (especially complex date handling), even before you receive the real data from the provider.
Updated February 22, 2023. Revised Banners June 18, 2023.
ETL pipeline documentation is great for team communication as well as data stewardship! Read my blog post to learn my tips and tricks.