Querying the GHDx Database: Demonstration and Review of Application

Many data scientists interested in health are looking to query the Global Burden of Disease database, also known as the GHDx

Querying the GHDx database was requested by one of my customers. The GHDx database is online and is the data repository behind the Global Burden of Disease Study, which is sponsored by the Lancet and the Institute of Health Metrics and Evaluation (IHME) at the University of Washington.

Querying the GHDx database is very challenging, which is why my customer asked me for help. I could not believe how awkward it was. We fought with it for about an hour before I could come up with a process. The process I recommended included serial individual queries and abstracting data from the retrieved database results onto spreadsheets. You know if you are doing that, your GUI is a fail.

I’ll show you annotated screen shot below, but it might be easier for you to watch me demonstrate the GUI, which I did in a recent livestream.

Querying the GHDx Database Requires You to Understand the GUI

Querying the GHDx database is quite complicated, so it requires you to plan your actions before you take them. You can imagine why when you look at this annotated screen shot I made of the GHDx query interface.

This is a diagram of the web page where the GHDx query tool is. The entire web page has a lot of items on it besides the query interface.

As you can see, the query form is quite small. The rest of the real estate is taken up by visualizations and information that could be on other pages. Let’s take a close-up of the query form – see my annotated screen shot here.

Let me give you a quick unpacking of the annotated screen shot of the query form.

A

Context: This is a dropdown that you want to address first in your query. That is because if you change it from the default, which is “cause”, it changes the fields on the query form! I think there are different flat tables behind the GUI, and this dropdown toggles what set of tables you are querying. Of course, without knowing anything about the underlying dataset – like the source of it, or what columns it has – it’s not possible to figure out how to fill out the query form.

In my video above, I only demonstrate myself querying using the “cause” query form. That’s because it lets you query about rates of health conditions. I encourage you to explore the other forms yourself.

B

Base: This changes your results from being an estimate (“single”) to an estimate of change (“change”). As you probably know, when you calculate percent-change, you have to have a baseline, and the percent change is of that baseline.

Because this query form is so unclear, I would be hesitant to use the “change” function. I would instead suggest you just use the “base” function and calculate your own change percentages.

C

Cause: This is the list of conditions – but there are serious classification issues, as you can see by the screen shot. It would be very hard to prepare for a query of this database, because the results would be per entry, and the entries are stratified at very low levels. Further, there is not an obvious place to click to get the full list.

For example, if you only wanted some selections under “Cardiomyopathy and myocarditis” and not the whole category, your individual selections would not group together in the results pane after your query. Again, you would have to do some post-processing to calculate your estimates.

D

Measure: Here are where common “burden of disease” measures (such as DALYs) are, but there isn’t any readily-available documentation about how they were calculated. From this pane, I would only trust “prevalence” myself without looking more into the source data and the source calculations. Even “incidence” is questionable, because many countries do not do a good job of tracking incident cases of chronic disease.

E

Metric: If you noticed that the term “metric” and “measure” mean the same thing, then you are probably as confused as I am as to why there are two dropdowns. Since this includes “rate” and “percent”, it seems redundant to include “prevalence” under Measure – doesn’t it?

F

Location: This has countries in it, and it also has some higher-level groupings, such as “high-income” countries. Again, this suffers from classification limitations. It would be very hard to get an estimate for a group of countries selected by the user. You essentially would have to use this GUI for data collection into a spreadsheet and then make summary statistics.

Why is Querying the GHDx Database So Hard?

There are a few answers to this question. First, no one really cared to design a user interface, because if you do that, then you have to consider who your users are, and it appears that no one at the Lancet and IHME was interested in doing this. I guess I’m a user, and I find it horrendous. I cannot imagine a user who would find the experience satisfying. So, the first lesson is that if you don’t consider who your user is, your user interface will probably be bad.

Second, it looks to me like their back-end is in SAS, for whatever reason, when it probably should be in some flavor of SQL. That’s because SQL is for querying, and SAS is for analytics. SAS has some great tools for publishing to the web – however, whatever tool these people used is not one of them! Just because SAS makes a tool doesn’t mean you should use it. To me, this situation looks like a great use-case for some integration. SAS could still be used for creating estimates (like confidence intervals), but SQL could be used for storage, and R or Python could be used to create more customized output. There are so many possibilities for dashboards in R, and this GHDx could essentially be seen as a dashboard for querying a bunch of tables.

If you want more information on these concepts, here is what I recommend:

  • Please watch my recorded livestream where I demonstrate the GHDx. I also lecture a little about data warehouse principles I cover in my book “Mastering SAS Programming for Data Warehousing”.
  • Gain skills at design so that when it’s your turn to make a dashboard or a GUI, you can at least use your imagination and produce something better than this example. My course on LinkedIn Learning on data curation can help you with this, as well as my description of dashboard design in a book chapter I co-authored with my intern, “Framework to Evaluate Level of Good Faith in Implementations of Public Dashboards”

Read all of our data science blog posts!

Querying the GHDx database is challenging because of its difficult user interface, but mastering it will allow you to access country-level health data for comparisons! See my demonstration!

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.