ICEWS is an early warning system designed to help US policy analysts predict a variety of international crises. The project was created at the Defense Advanced Research Projects Agency in 2007 and has since been funded (through 2013) by the Office of Naval Research. ICEWS has not been widely written about, in part because of its operational nature, and in part because articles about prediction in politics face special hurdles in the publication process. An academic article (gated) described the early phase of the project in 2010, including assessments of its accuracy, and a WIRED article in 2011 criticized ICEWS for missing the Arab Spring, at a time when the project focused only on Asia.
In an article (here for now) forthcoming in the International Studies Review, we, as one of the original teams on the ICEWS project, highlight the basic framework used in the more recent, worldwide version of ICEWS. Specifically, we discuss our forecasting model, which we call CRISP and which is our main contribution to the larger project. We argue that forecasting not only increases the dialogue between academia and the policy community, but also provides a gold standard for evaluating the empirical content of models. This gold standard thus improves the dialogue and augments the science itself. In an earlier article in Foreign Policy, with Nils Metternich, we compared Billy Beane and Lewis Frye Richardson (sort of).
In this article (here for now), we estimate a fairly catholic model of conflict onset for 1999 and use these estimates to predict civil war onsets annually from 2000 to 2009. We show that this model gets a remarkable number of onsets correct, including Nigeria, but misses a number of others. Depending on the probability threshold used to classify onsets, as many as 24 of 40 onsets are predicted correctly. However, lowering the threshold to capture more true positives also produces many false positives, usually an order of magnitude more than the correctly predicted events. Using a probability threshold of 0.5, the model predicts only two of the 40 onsets, but has no false positives. With a cutoff of 0.1, 15 onsets are correctly predicted, but the model also forecasts 245 onsets that did not happen. Once again, we see the tradeoff between crying wolf and fiddling while the flames consume you.
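The threshold tradeoff is easy to see in miniature. Here is a minimal sketch with synthetic data (these are simulated probabilities, not our actual model estimates): a rare-event setting with 40 onsets among 4000 cases, classified at two different cutoffs.

```python
# Illustrative sketch of the classification-threshold tradeoff.
# Synthetic data only -- not the actual onset model's predictions.
import random

random.seed(0)

# Simulate a rare-event setting: 40 onsets among 4000 country-years.
n = 4000
outcomes = [1] * 40 + [0] * (n - 40)
# Onsets get somewhat higher, but noisy, predicted probabilities.
probs = [min(1.0, max(0.0, random.gauss(0.25, 0.15))) if y == 1
         else min(1.0, max(0.0, random.gauss(0.05, 0.05)))
         for y in outcomes]

def confusion_at(threshold, probs, outcomes):
    """Count true and false positives when flagging cases above threshold."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    return tp, fp

for t in (0.5, 0.1):
    tp, fp = confusion_at(t, probs, outcomes)
    print(f"threshold={t}: true positives={tp}, false positives={fp}")
```

Lowering the threshold can only add flagged cases, so both counts rise together; the question is how fast the false positives grow relative to the true ones.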
In terms of modeling, one thing we found is that structural variables are mainly useful in panel models as a means of garnering statistical significance, but they do not prove very effective at forecasting onsets over time, in part because they rarely change much within any given country. For example, the proportion of a country’s area that is mountainous varies across countries (Bolivia vs. Paraguay), but changes little within a country over short periods of time. Event data about protests and conflict activities, on the other hand, are quite fluid within any given country. As such, they tend to be powerful in statistical models developed to map the ebb and flow of conflicts that are often quite volatile. Our models were estimated on a training data set and then evaluated on new data in a test set. The tradeoff between true positives and false positives is evident here as well, although on the whole the predictions from models that include rapidly changing independent variables are more accurate. For example, at a probability threshold of 0.5, 199 of 286 conflict onsets were correctly predicted, with only 33 false positives among the remaining 1781 cases.
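Those test-set counts translate into the standard evaluation rates with a little arithmetic; the sketch below just works through the numbers reported above (the variable names are ours, for illustration).

```python
# Back-of-envelope rates from the test-set counts at threshold 0.5:
# 199 of 286 onsets predicted, 33 false positives among 1781 non-onset cases.
tp, onsets = 199, 286
fp, non_onsets = 33, 1781

sensitivity = tp / onsets          # share of actual onsets correctly flagged
false_pos_rate = fp / non_onsets   # share of quiet cases wrongly flagged
precision = tp / (tp + fp)         # share of flagged cases that were onsets

print(f"sensitivity {sensitivity:.2f}")          # prints 0.70
print(f"false positive rate {false_pos_rate:.3f}")  # prints 0.019
print(f"precision {precision:.2f}")              # prints 0.86
```

In other words, at this threshold the model catches about seven in ten onsets while flagging under two percent of the quiet cases, a far better balance than the structural-variables-only model.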
Even with this greater predictive performance, prediction is hard. Only four civil wars began during the out-of-sample period, and none of them had especially high predicted probabilities. Senegal shows a predicted probability of civil war in November 2011 of about one in four; the civil war started there in December. Nigeria, Syria, and Libya, however, have low predicted probabilities, the highest being one in ten for Syria in September 2010, the month before all hell broke loose.
Part of the problem lies in the rare nature of most conflict events. Unlike, say, classifying whether a voter will vote Democrat or Republican, where one could expect roughly balanced outcomes, the vast majority of our data are 0’s, points where no conflict occurs. As with many medical tests for rare diseases, this greatly amplifies the tradeoff between true and false positives, i.e., the number of conflict onsets that a model correctly predicts versus the number of false predictions. Another complication, related to model evaluation, is that the rare nature of conflict events can make it difficult to find test data that are comparable to the training data used to estimate a model. This won’t be news to conflict researchers, but it puts some of the criticisms of efforts like ICEWS into context.
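The rare-disease analogy can be made precise with Bayes’ rule. The sketch below uses hypothetical numbers (a fixed sensitivity and false positive rate, not figures from our models) to show how the same test that looks reliable on balanced outcomes produces mostly false alarms when the outcome is rare.

```python
# Why rare events amplify the true/false positive tradeoff:
# a Bayes'-rule sketch with hypothetical rates, not our model's.
def precision(base_rate, sensitivity, false_pos_rate):
    """P(event | flagged): share of flagged cases that are real events."""
    flagged_true = base_rate * sensitivity
    flagged_false = (1 - base_rate) * false_pos_rate
    return flagged_true / (flagged_true + flagged_false)

# Balanced outcomes (like party vote choice): flags are mostly right.
print(round(precision(0.5, 0.8, 0.05), 2))   # prints 0.94
# Rare outcomes (conflict onset at ~1% of cases): most flags are false alarms.
print(round(precision(0.01, 0.8, 0.05), 2))  # prints 0.14
```

Holding the test itself fixed, moving the base rate from one-half to one percent drops the share of correct flags from about 94% to about 14%, which is why lowering a threshold on rare events buys each extra true positive at the cost of many false ones.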
Our broader experience with quantitative forecasting provides three important takeaways. First, because our current predictions leave a lot to be desired, we have a lot of work ahead of us if we want accurate predictions. Second, predictions are improving quite a bit, and we are much better off being transparent about them. Finally, pretending that our explanations do not have to supply accurate predictions, i.e., that we are explaining rather than predicting, leads to worse understanding. Rather than ignoring or hiding predictions, we should put them front and center, so that they can help us evaluate how well our understandings play out in political events and remind us that those understandings are incomplete as well as uncertain.
Our assertion is that real understanding will involve both explanation and prediction. Time to get on with it rather than pretending that these two goals are polar opposites. We have a long way to go.