By NATE SILVER
Last week, Harry Enten, a blogger who studies statistics and elections, published a model that suggested that Republicans were likely to retain control of the House in 2012.
This is not, in and of itself, a remarkable conclusion. Incumbency is normally a fairly powerful advantage in Congressional elections. Republicans have a reasonably large majority in the House now, so the safe money is on their being able to keep it.
What was more daring, however, was the confidence that Mr. Enten was willing to assign to his projection. He stated that there was a 95 percent likelihood that Republicans would win between 228 and 248 seats (218 are needed for a majority). “Unless a historic event occurs,” Mr. Enten wrote, “Republicans will still be in control of the House of Representatives after the 2012 election.”
Just how historic an event are we talking about? If one takes his model literally, it implies that there is less than a 1-in-100,000 chance that the Democrats will regain a majority.
Mr. Enten’s model got picked up in several blogs that I read, and I’ve received other requests for comment about it. How can he be so confident when, among other things, we don’t know the identity of the Republican candidate for president or how strongly President Obama will perform against his opponent, the effects of redistricting, the state of the economy, the number of retirements in each party and a whole host of other things?
Actually, he shouldn’t be so confident. Mr. Enten normally does outstanding work and will probably have my job one day. But sometimes statistical models are as skin-deep as models on the runway, and this one is such an example.
The issue with this model, and some others like it, is what’s known in the statistical business as overfitting. This occurs when the number of variables is large relative to the sample size: in this case, the full version of Mr. Enten’s model contains six variables, but is used to explain only 15 cases (Congressional elections in presidential years since 1952).
A general rule of thumb is that you should have no more than one variable for every 10 or 15 cases in your data set. So a model to explain what happened in 15 elections should ideally contain no more than one or two inputs. By a strict interpretation, in fact, not only should a model like this one not contain more than one or two input variables, but the statistician should not even consider more than one or two variables as candidates for the model, since otherwise he can cherry-pick the ones that happen to fit the data the best (a related problem known as data dredging).
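To see why considering many candidate variables is risky with so few cases, here is a minimal simulation sketch. It is not Mr. Enten's procedure. It assumes 15 cases whose outcome is pure random noise and 50 candidate variables that are also pure noise (both numbers, and the function names, are illustrative choices), and compares the correlation of one variable chosen in advance with the best correlation found by searching through all 50.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def dredge(n_cases=15, n_candidates=50, trials=500, seed=1):
    """Average |r| for a pre-chosen noise variable vs. the best of many."""
    rng = random.Random(seed)
    first_total = best_total = 0.0
    for _ in range(trials):
        # The "election results" here are random noise by construction.
        outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
        strengths = [
            abs(pearson_r([rng.gauss(0, 1) for _ in range(n_cases)], outcome))
            for _ in range(n_candidates)
        ]
        first_total += strengths[0]    # variable picked before seeing the data
        best_total += max(strengths)   # variable picked because it fit best
    return first_total / trials, best_total / trials

honest, cherry_picked = dredge()
print(f"one pre-chosen variable: average |r| = {honest:.2f}")
print(f"best of 50 candidates:   average |r| = {cherry_picked:.2f}")
```

None of these variables has any real relationship to the outcome, so whatever correlation the search turns up is an artifact of having looked at many candidates.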
If you ignore these principles, you may wind up with a model that fits the noise in the data rather than the signal. Mr. Enten’s model, for instance, contains a variable for cases in which there has been an “unprovoked, hostile deployment of American armed forces in foreign conflict” that “has resulted in at least 1 fatality during the past term,” but it applies only when the war was started by a president of the same party as the one that currently occupies the White House, or when the current president is of a different party but has perpetuated the war for more than one term, conditional on the fact that the Congress and the White House are controlled by different parties.
Those of you who find this definition confusing have the right idea. It is so narrow that it applies to only two cases, 1976 and 2008, that the model would otherwise do a fairly poor job of explaining. While there is no doubt that wars can have some impact on voters’ assessment of the Congress, if one is willing to apply literally a half-dozen different qualifications to determine which wars he deems relevant, he could presumably come up with any number of other riffs on the definition that would apply to different years instead. (For example, if the elections of 1968 and 2004 were the ones the model had a rough time explaining, he could invent another version of the definition of “war” that would apply solely to those years.)
The problem with an overfit model is that, because it is so fussy about handling past cases, it tends to do a poor job of predicting future ones. Imagine that I were a petty criminal of some kind, and that I deputized you to come up with a way to help me pick combination locks. I also gave you three locks to experiment upon.
What I’d really be looking for would be some set of principles on how one picks locks: perhaps a certain type of paper clip works especially well, or a disproportionate number of combinations contain numbers like ‘7’ and ‘13’. Instead, after studying the issue for a few days, you report back to me that you’ve found the perfect solution. If the lock is blue, use the combination 45-12-26. If it’s red, use 33-9-16. And if it’s black, use 22-10-41. That would certainly be a very reliable way to pick these three particular locks, but it wouldn’t tell me anything about how to pick locks in general. This is essentially the same thing that happens when one produces an overfit statistical model.
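The statistical version of the lock problem can be sketched in a few lines. The example below is an illustration under stated assumptions, not a reconstruction of any real forecasting model: it fits an ordinary least-squares regression with six input variables to 15 cases where both the inputs and the outcome are random noise, then checks the fitted model against one fresh case. All function names are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, row):
    return sum(b * v for b, v in zip(beta, row))

def one_trial(rng, n_cases=15, n_vars=6):
    def make_row():
        # Intercept plus n_vars inputs that are unrelated to the outcome.
        return [1.0] + [rng.gauss(0, 1) for _ in range(n_vars)]
    rows = [make_row() for _ in range(n_cases)]
    y = [rng.gauss(0, 1) for _ in range(n_cases)]  # outcome is pure noise
    beta = fit_ols(rows, y)
    in_mse = sum((yi - predict(beta, r)) ** 2 for r, yi in zip(rows, y)) / n_cases
    new_row, new_y = make_row(), rng.gauss(0, 1)   # one case outside the sample
    out_sq_err = (new_y - predict(beta, new_row)) ** 2
    return in_mse, out_sq_err

def simulate(trials=2000, seed=0):
    rng = random.Random(seed)
    results = [one_trial(rng) for _ in range(trials)]
    in_avg = sum(r[0] for r in results) / trials
    out_avg = sum(r[1] for r in results) / trials
    return in_avg, out_avg

in_sample, out_of_sample = simulate()
print(f"average squared error on the 15 fitted cases: {in_sample:.2f}")
print(f"average squared error on a new case:          {out_of_sample:.2f}")
```

Because the outcome has a variance of 1 by construction, a model that had learned nothing would have a squared error of about 1 on both fitted and new cases. The fitted model looks better than that on the cases it has seen and worse on the one it has not.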
Sometimes, the person creating the model will not discover this until it’s too late, especially if the variables in the model otherwise seem reasonable. In this case, however, it is fairly easy to demonstrate the model’s limitations.
Mr. Enten’s model is built on data from 1952 through 2008. He’s using it to forecast the election outcome in 2012; we don’t know yet how that prediction is going to turn out.
What we can do, however, is see how the model would have done on a case outside of its sample: 1948. There’s not an obvious reason to include 1952 in the data but not 1948; both are in the postwar period.
If we use the model to make a “retrodiction” for 1948, it fares very poorly.
The model would have predicted that Republicans, who then held the majority in the House, would have maintained it by winning 241 seats. Instead, they lost 75 seats to Harry Truman’s Democrats and wound up with just 171. That’s a 70-seat error, for a model that supposedly had a 95 percent confidence interval of plus or minus 10 seats. The odds against that, if the model had been specified correctly, were roughly one tredecillion (a 1 followed by 42 zeroes) to one.
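The size of those odds can be checked with a short calculation. This sketch assumes the forecast errors follow a normal distribution and that "plus or minus 10 seats" is a two-sided 95 percent interval. The article gives only the interval, so the distributional assumption is mine, and the function name is hypothetical.

```python
import math

def two_sided_tail(error, half_width_95):
    """P(|forecast error| >= error) under a normal error distribution."""
    z95 = 1.959964                      # normal quantile for a 95% interval
    sd = half_width_95 / z95            # implied standard deviation, in seats
    z = abs(error) / sd                 # size of the miss in standard deviations
    # erfc keeps precision in the far tail, where 1 - cdf would round to zero.
    return math.erfc(z / math.sqrt(2))

p = two_sided_tail(error=241 - 171, half_width_95=10)
print(f"odds against a miss this large: about {1 / p:.1e} to one")
```

Under these assumptions a 70-seat miss is nearly 14 standard deviations from the forecast, and the odds come out on the order of 10 to the 42nd power to one.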
Perhaps 1948 was special in some way. But if 1948 was special, 2012 could be, too, and the model could be just as inaccurate.
One alternative is to use a model with the same structure as Mr. Enten’s, but to include 1948 in the data that we use to build it. If we do that, the mean projection isn’t terribly different (Republicans are projected to win 230 seats rather than 238), but the margin of error is much higher. Instead of a 95 percent confidence interval of plus or minus 10 seats, we’d have one of plus or minus roughly 43 seats, which means that anywhere from 187 seats to 273 would be a reasonable guess. That model implies that Democrats have about a 30 percent chance of regaining control of the House, which is in line with what betting markets think.
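As a rough check on how a wider interval translates into a chance of losing the majority, the sketch below treats the seat forecast as normally distributed. The normal assumption, the half-seat continuity adjustment and the function name are my own choices for illustration, so the result need not match the figure quoted above exactly.

```python
from statistics import NormalDist

def seat_forecast(mean_seats, half_width_95, majority=218):
    """Return the 95% interval and P(seats < majority) under a normal model."""
    z95 = NormalDist().inv_cdf(0.975)              # about 1.96
    dist = NormalDist(mu=mean_seats, sigma=half_width_95 / z95)
    interval = (mean_seats - half_width_95, mean_seats + half_width_95)
    # Losing the majority means 217 seats or fewer; 217.5 is a continuity
    # adjustment for approximating a seat count with a continuous curve.
    p_lose_majority = dist.cdf(majority - 0.5)
    return interval, p_lose_majority

interval, p_lose = seat_forecast(mean_seats=230, half_width_95=43)
print(f"95% interval: {interval[0]} to {interval[1]} seats")
print(f"chance Republicans fall short of 218: {p_lose:.0%}")
```

With a mean of 230 and a margin of 43 seats, the 218-seat threshold sits only about half a standard deviation below the projection, which is why the majority is far from safe under this version of the model.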
There are some other problematic elements in Mr. Enten’s model. For instance, whether we define Libya as an “unprovoked, hostile deployment of American armed forces in foreign conflict” — I have no idea whether it qualifies — makes a difference of 23 seats in the answer the model comes up with.
The fundamental error, however, is in assuming that a model that is optimized to explain the past will do an optimal job of predicting the future. These are two very different things. When data sets are extremely large, the distinction may not be so important; in fact, people are probably too reluctant to indulge some complexity in their models in such cases. But small data sets, like one containing just 15 elections, are much less forgiving, and almost always require extremely simple specifications that are well-grounded in political theory.
It’s easy to forget about these distinctions when you’re trying to squeeze every last bit of juice out of a model (I’ve certainly made these mistakes myself on occasion). Several Congressional forecasting models, for instance, are based on the generic ballot poll, and the Gallup generic ballot poll in particular. Why use the Gallup poll alone instead of an average of polls? We know that, over the long run, polling averages are almost certainly better than any one poll taken alone. But the Gallup poll had happened to be a little closer to the mark in the small sample of recent midterm elections, and that made it a popular choice among those who were trying to optimize the fit on past data.
Those models paid the price last year; whereas the polling average did a good job of predicting the size of the Republican wave, Gallup massively overestimated it by forecasting a 15-point Republican win, about double the actual margin. As a result, one popular model that relied on Gallup data and that billed itself as being able to predict these outcomes called for almost exactly a 77-seat Republican gain in the House, well above the actual total of 63.
This is also a problem to a greater or lesser degree with the many other models that are used to forecast Congressional and presidential elections based on fundamental variables like economic growth, wars, incumbency, the results of the previous election and so forth. Within the category of economic variables alone, there are at least five or six plausible metrics to consider (unemployment, G.D.P., personal income, inflation, consumer confidence), and each of these can be specified in any number of different ways (for instance, the absolute level of unemployment, or the change in unemployment as compared to some prior period).
When you have so many candidate variables and so few elections (just 16 presidential years since World War II), the potential for overfitting is enormous. And the consequences of overfitting are much greater than people realize. In Mr. Enten’s case, when we added just one more data point, his model’s margin of error increased not by 10 or 20 percent, but by more than 300 percent, from plus or minus 10 seats to plus or minus roughly 43.
That doesn’t mean that models like these are completely hopeless. For instance, a model to predict presidential elections called “Bread and Peace”, which uses personal income and war casualties as its only two inputs, seems to be sensibly designed. But when you come across a model that seems implausible or overconfident, you absolutely ought to be skeptical of it: there often isn’t much there when you look under the hood.