Understanding uncertainty
Predicting the US election result wasn’t a Big Data problem. There had only ever been 57 presidential elections, and there’s good polling data for fewer than half of them. What it shares with a lot of Big Data problems is the difficulty of making sure you have thought about all the uncertainty, particularly when there’s a lot less information than there appears to be and the quality of that information is fairly low.
In particular, it’s a lot easier to get an accurate prediction of the mean opinion-poll result and a good estimate of its uncertainty than it is to translate that into uncertainty over the number of states won. It’s not hard to find out what your model thinks the uncertainty is; that’s just a matter of running the model over and over again in simulation. But simulation won’t tell you what sources of uncertainty you’ve left out.
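To make the simulation point concrete, here’s a minimal sketch in Python (a toy model, not anyone’s actual forecast, with all the numbers made up): give each state a hypothetical poll margin, add random polling error, count the states won, and repeat.

```python
import numpy as np

rng = np.random.default_rng(42)

n_states = 50     # hypothetical: treat every state as winner-take-all
n_sims = 10_000   # number of simulated elections

# Made-up poll margins for each state (candidate A minus candidate B,
# in percentage points); in a real forecast these come from poll averages.
poll_margin = rng.normal(loc=0, scale=8, size=n_states)

# Assume each state's polling error is independent with SD 3 points.
poll_error_sd = 3.0

states_won = np.empty(n_sims)
for i in range(n_sims):
    error = rng.normal(0, poll_error_sd, size=n_states)
    states_won[i] = np.sum(poll_margin + error > 0)

print(f"mean states won by A: {states_won.mean():.1f}")
print(f"central 90% interval: {np.percentile(states_won, [5, 95])}")
```

The spread of `states_won` is the model’s own estimate of its uncertainty; nothing in that loop can reveal an error source that was never put into the model.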
For the US elections, it turns out that one thing that matters is the amount of correlation between states in the polling errors. With 50 states there are 50×49/2 = 1225 pairwise correlations, and with maybe twenty elections’ worth of good polling data those correlations aren’t going to be empirically determinable even if you assume there’s nothing special about this election: you need to make assumptions about how the variables you have relate to the ones you’re trying to predict.
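To see why that correlation matters, here’s a continuation of the same toy sketch (again with invented numbers): split the polling error into a shared national component, which hits every state at once, plus independent state-level noise, keeping the total error in each state the same.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_sims = 50, 10_000

# 50 states give 50*49/2 = 1225 pairwise correlations to estimate.
print(n_states * (n_states - 1) // 2)  # 1225

poll_margin = rng.normal(0, 8, size=n_states)  # made-up margins, as before

def simulate(shared_sd, state_sd):
    """States won by A when polling error = shared national swing + state noise."""
    shared = rng.normal(0, shared_sd, size=(n_sims, 1))       # hits all states
    local = rng.normal(0, state_sd, size=(n_sims, n_states))  # state-specific
    return np.sum(poll_margin + shared + local > 0, axis=1)

# Keep the total per-state error SD at 3 points in both scenarios.
independent = simulate(shared_sd=0.0, state_sd=3.0)
correlated = simulate(shared_sd=2.0, state_sd=(3.0**2 - 2.0**2) ** 0.5)

for name, sims in [("independent errors", independent),
                   ("correlated errors", correlated)]:
    lo, hi = np.percentile(sims, [5, 95])
    print(f"{name}: 90% interval for states won = [{lo:.0f}, {hi:.0f}]")
```

With the same total error in each state, the shared component fattens the tails of the states-won distribution considerably: errors that all point the same way can flip many close states at once. How big to make that shared component is exactly the kind of assumption the published forecasts differed on.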
The predictions from 538 still might not have been based on correct assumptions, but they were good enough for their conclusions to be basically right — and no-one else’s were, apparently even including the Trump campaign.
It’s not that we should give up on modelling. As we saw last time, sitting around listening to experts pull numbers out of the air works rather worse. But it’s important to understand that the uncertainty in predictions can be a lot more than you’d get by asking the model — and the same is true, only much worse, when you’re modelling the effects of social or health interventions rather than just forecasting.
Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.