Should I use a linear probability model or logistic regression?

Over the last month, I’ve been working on updating the Ohio Poverty Measure. The Ohio Poverty Measure is a poverty measurement tool calculated by Scioto Analysis and based on similar measures in California, New York City, Wisconsin, and other states. 

One of the biggest hurdles on this project is imputing information about specific additions to income that are not available in the American Community Survey (ACS), the main dataset used for calculating the Ohio Poverty Measure. Specifically, we need to impute which poverty units receive housing subsidies and free school lunches.

To work around this, we use answers from the Current Population Survey (CPS) to impute recipiency of housing and school lunch benefits to families who respond to the American Community Survey.

The Current Population Survey asks more questions of a smaller sample than the American Community Survey does. Our goal is to use Current Population Survey data to build a model that predicts the probability a poverty unit receives one of these benefits, then use that model to determine which poverty units in the ACS data get the benefits and estimate the size of those benefits.

This approach will not get the answer exactly correct for each individual family, but assuming the Current Population Survey population is similar to the American Community Survey population, this approach should give us a useful approximation of these benefits. 

So, let’s talk about how to best build this type of model. 

The simplest approach would be to define benefit recipiency as a numeric variable (e.g. one for people who receive the benefit and zero for those who don’t) and use ordinary linear regression to estimate recipiency. With binary outcome data, we call this a linear probability model.

Unfortunately, linear probability models have two main drawbacks. First, linear regression treats the outcome as continuous and unbounded, so nothing stops the model from making predictions below zero or above one. Since our predictions are supposed to represent probabilities, this is undesirable. There is no such thing, after all, as a -25% or 125% probability of housing subsidy receipt.
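
To make the first drawback concrete, here is a minimal sketch of a linear probability model fit by ordinary least squares in plain Python. The income and recipiency values are made up for illustration and are not drawn from the CPS or ACS, and the helper names (fit_ols, lpm_predict) are hypothetical.

```python
# Hypothetical toy data: income in $1,000s and a 0/1 flag for benefit receipt.
# These numbers are invented for illustration only.
incomes = [5, 10, 15, 20, 25, 30, 40, 50, 60, 80]
receives = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]


def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_ols(incomes, receives)


def lpm_predict(income):
    """Predicted 'probability' from the linear probability model."""
    return intercept + slope * income


# The fitted line is not constrained to the [0, 1] interval.
for income in (0, 30, 100):
    print(income, round(lpm_predict(income), 3))
```

On this toy data, the fitted line gives a value above one at an income of zero and a value below zero at an income of $100,000, neither of which can be read as a probability.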

Second, linear probability models, as the name suggests, assume the probability rises or falls in a straight line: a one-unit change in a predictor shifts the probability by the same amount everywhere. This is closely related to the first problem, since as we approach extremely likely or unlikely outcomes we don’t expect to see linear changes in the probability. It makes much more sense that our model should asymptotically approach one or zero in those cases.

The solution to this problem is to use a generalized linear model. For binary outcomes, the logistic or probit regression models are the most common choices. These models bound our outcome nonlinearly between zero and one, with our predictions asymptotically approaching those values. 
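
As a rough illustration of how a logistic model keeps predictions bounded, the sketch below fits a one-predictor logistic regression by plain gradient ascent on the log-likelihood. It uses the same kind of made-up income data as an assumption, and the function names and tuning values (learning rate, number of steps) are hypothetical. In practice a statistics package would do the fitting.

```python
import math

# Hypothetical toy data: income in $1,000s and a 0/1 flag for benefit receipt.
incomes = [5, 10, 15, 20, 25, 30, 40, 50, 60, 80]
receives = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]


def sigmoid(t):
    """Numerically stable logistic (inverse-logit) function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * z) by gradient ascent,
    where z is the standardized version of x."""
    n = len(x)
    mean_x = sum(x) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    z = [(xi - mean_x) / sd_x for xi in x]

    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the mean log-likelihood is the average of (y - p) * feature.
        errors = [yi - sigmoid(b0 + b1 * zi) for zi, yi in zip(z, y)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * zi for e, zi in zip(errors, z)) / n

    def predict(value):
        """Predicted probability for a raw (unstandardized) income value."""
        return sigmoid(b0 + b1 * (value - mean_x) / sd_x)

    return predict


predict = fit_logistic(incomes, receives)

# Predictions flatten out toward one and zero instead of crossing them.
for income in (0, 30, 100, 150):
    print(income, round(predict(income), 3))
```

However far income moves in either direction, the predicted value stays strictly between zero and one, which is the behavior we want from a probability.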

Because our data include some extreme probabilities (people with high income should be ineligible for these benefits and therefore have probability zero), the linear probability model is a poor choice for estimating recipiency. Linear probability models perform best when the predicted probabilities stay close to 50/50. Near the middle, all of these models give fairly similar answers. It’s as you get further into extreme probabilities that the shortcomings of the linear probability model really begin to show themselves.

For this project, I chose to use a logistic regression model. Still, this only tells us the probability that a poverty unit receives some benefit. We still need to figure out who receives benefits in our ACS data.

The solution to this is quite simple. First, we look at the CPS data and see what percentage of those respondents receive this benefit. Because the CPS is a random sample (after weighting it), we can treat this as an estimate of the proportion of people that actually receive these benefits. Next, we rank every person in the ACS data by their predicted probability of receiving these benefits. Finally, we give benefits to people in the ACS data with the highest probabilities until the same percentage of people receive benefits as in the CPS data.
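
The ranking procedure described above can be sketched in a few lines of Python. This is a simplified illustration rather than the code used for the Ohio Poverty Measure: the function name assign_benefits and all of the numbers are hypothetical, and survey weights are left out to keep the example short.

```python
def assign_benefits(probabilities, target_share):
    """Flag the highest-probability units as recipients until
    target_share of all units receive the benefit."""
    n_recipients = round(target_share * len(probabilities))
    # Indices of units, ordered from highest to lowest predicted probability.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: probabilities[i],
        reverse=True,
    )
    recipients = set(ranked[:n_recipients])
    return [i in recipients for i in range(len(probabilities))]


# Hypothetical CPS responses: 1 means the respondent reports receiving the benefit.
cps_receives = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
target_share = sum(cps_receives) / len(cps_receives)

# Hypothetical predicted probabilities for ACS units from the fitted model.
acs_probabilities = [0.05, 0.62, 0.10, 0.48, 0.02, 0.91, 0.33, 0.07, 0.15, 0.26]

acs_receives = assign_benefits(acs_probabilities, target_share)
print(acs_receives)
```

Here three of the ten made-up CPS respondents receive the benefit, so the three ACS units with the highest predicted probabilities are flagged as recipients. A fuller version would compute the CPS share with survey weights and count ACS units by their weights as well.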

By using a logistic regression instead of a linear probability model, we more accurately determine the probability of receiving benefits for people at the extremes of our survey. Because we are looking specifically at people near the ends of our predictions, it’s important that our model functions correctly in those places. For most binary outcome data, the choice is simple: use logistic regression.