Today I had the opportunity to attend a great talk by Jürgen Pfeffer, Assistant Research Professor at Carnegie Mellon’s School of Computer Science. Pfeffer talked broadly about the methodological challenges of big data social science research.
Increasingly, he argued, social scientists rely on data collected and curated by third-party – often private – sources. As researchers, we are less intimately connected with our data, less aware of the biases that went into its collection and cleaning. Instead, in the era of social media and big data, we turn on some magical data source and watch the data flow in.
Take, for instance, Twitter – a platform whose prevalence and open API make it a popular source for scraping big datasets.
In a 2013 paper with Fred Morstatter, Huan Liu, and Kathleen M. Carley, Pfeffer assessed the representativeness of Twitter’s streaming API. As the authors explain:
The “Twitter Streaming API” is a capability provided by Twitter that allows anyone to retrieve at most a 1% sample of all the data by providing some parameters…The methods that Twitter employs to sample this data is currently unknown.
Using Twitter’s “Firehose” – an expensive service that allows 100% access – the researchers compared the data provided by Twitter’s API to representative samples collected from the Firehose.
In news disturbing for computational social scientists everywhere, they found that “the Streaming API performs worse than randomly sampled data…in that case of top hashtag analysis, the Streaming API sometimes reveals negative correlation in the top hashtags, while the randomly sampled data exhibits very high positive correlation with the Firehose data.”
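The comparison behind that finding can be sketched with a toy simulation. Everything here is hypothetical – the hashtags, the popularity curve, and the deliberately skewed sampling rule standing in for an opaque Streaming API are all invented for illustration, not drawn from the paper’s data:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical "firehose": 100k tweets drawn from 50 hashtags
# with a Zipf-like popularity distribution.
HASHTAGS = [f"#tag{i}" for i in range(50)]
POPULARITY = [1.0 / (i + 1) for i in range(50)]
firehose = random.choices(HASHTAGS, weights=POPULARITY, k=100_000)

def top_hashtags(tweets, n=10):
    """The n most frequent hashtags, most frequent first."""
    return [tag for tag, _ in Counter(tweets).most_common(n)]

def top_overlap(full, sample, n=10):
    """Fraction of the full dataset's top-n hashtags the sample recovers."""
    return len(set(top_hashtags(full, n)) & set(top_hashtags(sample, n))) / n

# An unbiased 1% random sample -- the paper's baseline comparison.
random_sample = random.sample(firehose, k=1_000)

# A sample drawn under an unknown, skewed selection rule, standing in
# for a black-box Streaming API whose sampling method is undisclosed.
skewed_sample = random.choices(
    HASHTAGS, weights=list(reversed(POPULARITY)), k=1_000
)

print(top_overlap(firehose, random_sample))  # near 1.0: random sampling works
print(top_overlap(firehose, skewed_sample))  # far lower: opaque sampling misleads
```

The point of the toy: a true random sample recovers the firehose’s top hashtags almost perfectly, while a sample with an unknown selection rule can rank them arbitrarily badly – and without access to the full data, the researcher cannot tell which situation they are in.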
In one particularly telling example, the team compared the raw counts of tweets about “Syria” from both the API and the Firehose. The API data shows high initial interest, tapering off around Christmas and seemingly starting to pick up again mid-January. You may be prepared to draw conclusions from these data: people are busy over the holidays; they are not on Twitter, or not attentive to international issues, at this time. It seems reasonable that there might be a lull.
But the Firehose data tell a different story: the API initially provides a good sample of the full dataset, but then, as the API shows declining mentions, the Firehose shows a dramatic rise in mentions.
Rather than indicating a change in user activity, the decline in the streaming data is most likely due to a change in Twitter’s sampling methods. But since neither the methods nor announcements of changes to them are publicly available, it’s impossible for a researcher to know for certain.
While these results are disconcerting, Pfeffer was quick to point out that all is not lost. Bias in research methods is an old problem; indeed, bias is inherent to the social science process. The real goal isn’t to eradicate all bias, but rather to be aware of its existence and influence. To, as his talk was titled, know your data and know your methods.