What do I do when data is missing?

Recently, I’ve been working on calculating the Genuine Progress Indicator (GPI) for Ohio. GPI is an alternative measure to GDP that tries to capture what is going on in an economy while adding things like the value of having an educated workforce and subtracting things like the social costs of crime.

One addition GPI makes to GDP is the value of leisure time and of time spent on non-market work, such as the unpaid time we spend doing housework or caring for children. We include these indicators because we know these activities provide value to our economy, but because money never changes hands, they don’t get measured by GDP.

To estimate the value of these things in the economy, we use data from the American Time Use Survey, produced by the Bureau of Labor Statistics. This survey tells us how much time Americans ages 15 and older spend on different activities.

Unfortunately, the American Time Use Survey wasn’t conducted in 2020 because of the pandemic. To make matters worse, the pandemic also led to dramatic changes in what activities people spent their time on day-to-day.

Normally with missing data, we can use the observed data we have to make some estimate for what the missing value is. We might do this by assigning the missing value as the average of our observed data.
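As a rough illustration of that idea, here is a minimal sketch of mean imputation in Python. The yearly figures and the name impute_with_mean are made up for the example; they are not values from the American Time Use Survey.

```python
from statistics import mean

# Hypothetical average leisure hours per day, by year.
# These numbers are invented for illustration; None marks the missing year.
leisure_hours = {2016: 5.1, 2017: 5.2, 2018: 5.2, 2019: 5.3, 2020: None}


def impute_with_mean(series):
    """Replace missing (None) values with the mean of the observed values."""
    observed = [value for value in series.values() if value is not None]
    fill = mean(observed)
    return {year: (fill if value is None else value)
            for year, value in series.items()}


imputed = impute_with_mean(leisure_hours)
print(imputed[2020])
```

The filled-in value sits at the center of the observed years, which is only reasonable if the missing year looks like the others.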

But in this case we know that the average of the observed data is not representative of the missing data. We know that because of the shutdowns, people spent way more time at home.

In statistics, we call this type of missing data missing not at random (MNAR). Specifically, data is MNAR if whether a value is missing depends on the unobserved value itself, even after accounting for everything we did observe.

MNAR data is extremely hard to work with as a statistician. It essentially guarantees that there will be some bias in the final results.

One of the most common ways to deal with MNAR data is to perform sensitivity analysis. We can test what our results look like if the missing data is more or less similar to the observed data we have. This way, we can at least get an idea of what the range of reasonable results might be.
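One simple way to sketch this kind of sensitivity analysis is to shift the missing value away from the observed average by a range of percentages and see how the result moves. The hours, the dollar value per hour, and the range of shifts below are all assumptions chosen for illustration, not figures from the GPI study.

```python
from statistics import mean

# Hypothetical observed leisure hours per day for the years we do have.
observed_hours = [5.1, 5.2, 5.2, 5.3]

# Hypothetical dollar value assigned to one hour of leisure.
HOURLY_VALUE = 10.0


def annual_value(hours_per_day, hourly_value=HOURLY_VALUE):
    """Dollar value of one person's leisure time over a 365-day year."""
    return hours_per_day * 365 * hourly_value


def sensitivity(observed, shifts):
    """Value the missing year under each assumed shift from the observed mean."""
    baseline = mean(observed)
    return {shift: annual_value(baseline * (1 + shift)) for shift in shifts}


# Assume the missing year was anywhere from 10% below to 30% above average.
scenarios = sensitivity(observed_hours, [-0.1, 0.0, 0.1, 0.2, 0.3])
for shift, value in scenarios.items():
    print(f"{shift:+.0%} vs. observed average: ${value:,.0f} per person")
```

The spread between the lowest and highest scenario gives a sense of how much the final number depends on what we assume about the missing year.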

However, as is always the case with sensitivity analysis, it relies heavily on our assumptions as researchers. It is important to make those assumptions as clear as possible and to communicate how they affect the results.

In the context of the GPI study, I chose to extrapolate the data for 2020 from the other years of data. I know this is going to lead to biased results, but in the context of this particular report the estimate for a single year isn’t as important as the overall trend.
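The exact extrapolation method isn’t important here, but as one possible sketch, a straight line fit to the observed years can be evaluated at 2020. The least-squares fit, the helper name fit_line, and the yearly figures are assumptions for illustration, not the procedure or data used in the report.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical average leisure hours per day for the observed years.
years = [2016, 2017, 2018, 2019]
hours = [5.1, 5.2, 5.2, 5.3]

slope, intercept = fit_line(years, hours)

# Carry the pre-2020 trend forward one year.
estimate_2020 = slope * 2020 + intercept
print(round(estimate_2020, 2))
```

An estimate like this continues the earlier trend, so it cannot capture a sudden change in behavior. That is exactly the bias this approach accepts.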

Another reason I took this approach is what GPI is trying to measure. Specifically with leisure time, we are assuming that leisure time during work days could be replaced with additional work for a wage, and that people are choosing to take that leisure time instead.

During the pandemic, a lot of people weren’t really choosing to use that time for leisure. This means that not only would we have to make some assumptions about the additional time people spent at home, but we would also have to adjust the way we value that time.

In the end, I chose to acknowledge that we don’t have data for 2020 and that those particular indicators are flawed for that year. The overall story of GPI vs GDP remains unchanged, and I minimized the additional assumptions I had to make. Hopefully as more research about the pandemic becomes available, there will be a more rigorous way to address this specific problem.