GDELT and ICEWS, a short comparison

GDELT (gdelt.utdallas.edu) is a global database of events coded from vast quantities of publicly available text produced by the world’s news media. It has created a great deal of excitement in the social science community, especially within the field of international relations. But it has had wider visibility as well: in August 2013, there were 150,000 views of a map of protest activity around the world, based on the GDELT database. Event data have been around for several decades, but the GDELT project has generated new interest.

ICEWS is an early warning system designed to help US policy analysts predict a variety of international crises to which the US might have to respond. These include international and domestic crises, ethnic and religious violence, as well as rebellion and insurgency. This project was created at the Defense Advanced Research Projects Agency, but has since been funded (through 2013) by the Office of Naval Research. ICEWS also produces a rich corpus of text, which is analyzed with powerful techniques of automated event-data production. Since GDELT and ICEWS are based on similar, though not identical, methods and sources, it is interesting to compare them.


ICEWS event data, gray line for stories and black line for events, 2001-2013

The area in which they differ most conceptually is that ICEWS follows a more traditional approach to event data: it seeks to encode a chronology of events that reflects, in some sense, the putative ground truth of what occurred. The ICEWS figure shows the corpus of stories in ICEWS (gray) and the resulting events (black): total events are fairly stable over time even though the number of media stories increases. GDELT is more concerned with getting a comprehensive catalogue of all media stories (and other text) on reported events, and the corpus of those media stories is increasing exponentially, as the figure below shows. As a result, the number of events in GDELT is also increasing over time, much more so than in ICEWS.

Aside from this major difference, both GDELT and ICEWS use ontologies that are based on Phil Schrodt’s CAMEO framework. The two projects use different Natural Language Processing techniques, however. The ICEWS approach has been validated by human coders, and is about 75% accurate in identifying events that trained human coders view as correct. ICEWS goes back to 2001 (at present) and GDELT goes back to 1979 (at present). One clear difference is that the GDELT data grow exponentially, but the ICEWS data are relatively stable in terms of the number of events.


GDELT shows exponential growth over time

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period. A dynamic visualization can be found at http://mdwardlab.com/gdelt-and-icews. What did we learn from these limited comparisons? First, we found out firsthand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision-making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data, though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good at this, even with a lot of false positives.
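
For readers who want to try a comparison of this kind, here is a minimal Python sketch of the counting step. It is not the code behind our manuscript. It assumes each event has already been reduced to a date, a country label, and a CAMEO code; the record layout, the country labels, and the helper name `monthly_counts` are all hypothetical, and the records are toy values rather than real GDELT or ICEWS data. The only thing taken from CAMEO itself is that root code 14 covers protest and root code 19 covers fighting.

```python
from collections import Counter

# CAMEO root codes: 14 = protest, 19 = fight.
PROTEST, FIGHT = "14", "19"


def monthly_counts(events, country, root_code):
    """Count events per (year, month) for one country and one CAMEO root code."""
    counts = Counter()
    for ev in events:
        # The first two digits of a CAMEO code are its root category.
        if ev["country"] == country and ev["cameo"][:2] == root_code:
            year, month = int(ev["date"][:4]), int(ev["date"][5:7])
            counts[(year, month)] += 1
    return counts


# Hypothetical toy records (illustration only, not real data).
gdelt_like = [
    {"date": "2011-01-25", "country": "EGY", "cameo": "141"},
    {"date": "2011-01-25", "country": "EGY", "cameo": "145"},
    {"date": "2011-01-28", "country": "EGY", "cameo": "14"},
    {"date": "2011-02-01", "country": "TUR", "cameo": "141"},
]
icews_like = [
    {"date": "2011-01-25", "country": "EGY", "cameo": "141"},
    {"date": "2011-02-02", "country": "EGY", "cameo": "190"},
]

g = monthly_counts(gdelt_like, "EGY", PROTEST)
i = monthly_counts(icews_like, "EGY", PROTEST)

# Print the two series side by side; a Counter returns 0 for missing months.
for key in sorted(set(g) | set(i)):
    print(key, "gdelt-like:", g[key], "icews-like:", i[key])
```

Counting by month and country is only one possible choice of aggregation; a finer grid (by day, or by city using the geolocation fields) would use the same pattern with a different key.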

The bottom line is that GDELT overstates the number of events by a substantial margin, but ICEWS misses some events as well. Like many decision-making problems, the choice is between a willingness to be wrong and a desire to be right.
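
One common way to put numbers on this trade-off is precision (how many coded events are real) and recall (how many real events were coded). The Python sketch below is an illustration under a strong assumption: that a hand-coded reference set exists, which, as noted above, nobody currently has. The event identifiers and both "coders" are hypothetical and do not represent measured GDELT or ICEWS performance.

```python
def precision_recall(coded, reference):
    """Precision and recall of machine-coded events against a reference set."""
    coded, reference = set(coded), set(reference)
    true_pos = len(coded & reference)
    # Guard against empty sets so the function never divides by zero.
    precision = true_pos / len(coded) if coded else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical event identifiers (illustration only).
reference = {"e1", "e2", "e3", "e4"}

# A coder that finds every real event but also reports many spurious ones.
over_coder = {"e1", "e2", "e3", "e4", "x1", "x2", "x3", "x4"}

# A coder that reports only real events but misses one of them.
strict_coder = {"e1", "e2", "e3"}

print("over-coder:  ", precision_recall(over_coder, reference))
print("strict coder:", precision_recall(strict_coder, reference))
```

In this toy setup the over-coder has perfect recall but only half of its events are real, while the strict coder has perfect precision but misses a quarter of the reference events. Which of the two is preferable depends on whether false alarms or missed events are more costly for the decision at hand.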

Our manuscript describing these comparisons is available at http://mdwardlab.com/biblio/comparing-gdelt-and-icews-event-data and the visualization at http://mdwardlab.com/gdelt-and-icews/index.html. It was authored by Michael D. Ward, Andreas Beger, Josh Cutler, Matthew Dickenson, Cassy Dorff, and Ben Radford.  This post was written by Michael D. Ward, who recently was reassigned from the control group to the blogging group.
