This is a guest post by Simon Munzert, PhD student at the University of Konstanz, who is currently on a visit at the Lab.
It’s not that the people here at Duke’s Department of Political Science, and the WardLab members in particular, risk running out of hot data anytime soon. As somebody primarily concerned with research on public opinion and election forecasting, I was stunned by the masses of high-quality event data and their potential for so many applications. Still, during my short stay at the Lab as a visiting scholar I had the opportunity to give a little introduction to various web scraping techniques using R.
Why web scraping? The rapid growth of the World Wide Web over the past two decades has tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behavior. Since much of the data on the Web is a product of social interaction, it is of immediate interest to us as social scientists. Over the past years, research on computer-based methods for classifying and analyzing large amounts of existing data has been booming across all disciplines, and political scientists are contributing heavily to this process.
However, vast amounts of unstructured data, possibly spread over hundreds of webpages, can pose a challenge: collecting such data by hand is time-consuming and error-prone. If we have identified online data as an appropriate resource for our project, it may be a good decision to automate the data collection and tidying procedure, especially if we plan to update databases regularly, if the collection task is non-trivial in scope and complexity, and if we want others to be able to replicate our data collection process. We are used to automating the data preparation and analysis parts of quantitative analyses with statistical software like Stata or R, but automation is also possible for data collection if the data of interest are available on the Web. In fact, over the past few years many people in the R community have worked hard to make R a very flexible tool for collecting web data and communicating with web services. For us as substantively oriented researchers this is extraordinarily valuable, because R has become one of the most popular software packages in the profession, so it is good news that R provides these capabilities as well.
So what are the main techniques for scraping data from the Web? Essentially, classical web scraping from HTML pages with R is a six-step process (see also the figure on the right). First, we identify the desired information and locate it within an HTML document. HTML source code is highly hierarchical, and interesting data are often laid out as news headlines, tables, or lists, so we can address the markup that is used to structure and display content in a browser. Second, we download the documents that contain the information of interest. This can be one HTML page or hundreds of them; R is capable of doing this for us in an automated manner. Third, we import the HTML code into R using HTML parsing software implemented in R and store the documents in a list or some other useful data structure. Fourth, to extract the information (certain lines of text, numbers stored in tables, etc.) from the documents, we exploit the markup nature of HTML, specifically the fact that it is essentially XML-based, and construct XPath queries. XPath is a little language of its own that lets us access structured information in XML-style documents, asking R, for example, to “return all bold-formatted headlines in the politics section of this online newspaper front page.” Fifth, we merge and tidy all extracted raw data. Finally, we debug our code to keep it robust for future scraping tasks.
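The parsing and extraction steps can be sketched in a few lines of R. This is a minimal illustration using the XML package, with a made-up HTML snippet standing in for a page that would normally be downloaded first (e.g., with download.file() or RCurl); the tag structure and class names are invented for the example:

```r
# Minimal sketch of the parse / extract / tidy steps with the XML package.
# The HTML snippet is a made-up stand-in for a downloaded page.
library(XML)

html <- '
<html><body>
  <div class="politics">
    <h2><b> Election forecast updated </b></h2>
    <h2><b> New poll released </b></h2>
  </div>
  <div class="sports">
    <h2><b> Local team wins </b></h2>
  </div>
</body></html>'

# Parse the HTML into a document tree
doc <- htmlParse(html, asText = TRUE)

# XPath query: "all bold headlines in the politics section"
headlines <- xpathSApply(doc, "//div[@class='politics']//b", xmlValue)

# Tidy the raw results, e.g. strip surrounding whitespace
headlines <- gsub("^\\s+|\\s+$", "", headlines)
print(headlines)
```

In a real scraping task the same XPath query would simply run over each downloaded document in a loop, with the tidied results bound together into a data frame.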
In practice, there are some pitfalls in more sophisticated scenarios (e.g., when we are interested in data from dynamic webpages), but essentially there is nothing that cannot be done with R in terms of tapping web resources. At Duke, Michael Ward and Kyle Beardsley gave me the opportunity to demonstrate in two sessions of their methods workshop what is possible with R. This is hardly enough time to become an expert in web scraping, so I decided to give a broad overview of the technologies and tools that are currently available. I demonstrated the basics of the following topics:
- How to scrape static content from HTML code using
  - the fabulous XML package to parse XML and HTML code,
  - the XPath query language to extract well-chosen content from the document tree,
  - SelectorGadget and the browser inspector tools as convenient instruments for constructing XPath queries, and
  - regular expressions to scrape data from a website as if it were merely text.
- How to scrape dynamic content from AJAX-enriched webpages with Selenium WebDriver and the RSelenium package.
- How to tap REST web services and access the popular Twitter APIs using the twitteR and streamR packages.
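To illustrate the last of the static-scraping tools listed above, here is a minimal base-R sketch of regex-based extraction. The HTML fragment and the email pattern are made up for the example; the point is simply that regmatches() and gregexpr() let us treat a page as plain text:

```r
# Minimal sketch of regex-based scraping in base R. The HTML below is a
# made-up stand-in for a downloaded page, treated as mere text.
page <- '<p>Contact: <a href="mailto:jane.doe@uni.edu">Jane</a></p>
<p>Press office: press@example.org</p>'

# A deliberately naive email pattern, good enough for illustration
pattern <- "[[:alnum:].]+@[[:alnum:].]+\\.[[:alpha:]]+"

# gregexpr() finds all matches; regmatches() extracts them
emails <- regmatches(page, gregexpr(pattern, page))[[1]]
print(emails)
```

Regular expressions ignore the document tree entirely, which makes them fragile for structured content but handy for patterns like email addresses, phone numbers, or dates that XPath cannot easily describe.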
If you want to learn more about how to scrape web data with R, there is a book coming out in January 2015 in which my colleagues and I have compiled useful fundamentals and many practical web scraping and text mining applications. Over the past few years the market for books on data science and data mining has grown quickly. What is surprisingly often missing from these introductions, though, is how the data for data science applications are actually acquired. In this sense our book serves as a preparatory step for data analysis, but it also provides guidance on how to manage available information and keep it up to date. We have tried to make it accessible to autodidacts with little or no prior knowledge of web technologies. For more information on the book, click here.