Before the break, I had the opportunity to hear Brendan O’Connor talk about his recent paper with Su Lin Blodgett and Lisa Green: Demographic Dialectal Variation in Social Media: A Case Study of African-American English.
Imagine an algorithm designed to classify sentences. Perhaps it identifies the topic of the sentence or perhaps it classifies the sentiment of the sentence. These algorithms can be really accurate – but they are only as good as the corpus they are trained on.
If you train an algorithm on the New York Times and then try to classify tweets, for example, you may not have the kind of success you might like – the language and writing style of the Times differ so much from those of a typical tweet.
There’s a lot of interesting stuff in the Blodgett et al. paper, but perhaps most notable to me is their comparison of the quality of existing language identification tools on tweets by race. They find that these tools perform poorly on text associated with African Americans while performing better on text associated with white speakers.
In other words, if you took a big set of Twitter data and filtered out the non-English tweets, that algorithm would disproportionately identify tweets from black authors as not being in English, and those tweets would then be removed from the dataset.
Such an algorithm, trained on white language, has the unintentional effect of literally removing voices of color.
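The filtering step works roughly like this sketch. Everything here is hypothetical: `toy_detector` is a stand-in for a real language-ID tool (the paper evaluates actual tools like langid.py), and the point is only that tweets a biased detector mislabels are silently dropped.

```python
def filter_english(tweets, detect_language):
    """Keep only the tweets the detector labels as English.

    Anything the detector gets wrong is silently removed -- there is no
    error or warning, which is why this kind of bias is easy to miss.
    """
    return [t for t in tweets if detect_language(t) == "en"]


# Toy detector for illustration only: it labels text "en" if it contains
# the word "the". A real tool is far more sophisticated, but the failure
# mode is the same -- text in an unfamiliar variety gets mislabeled.
def toy_detector(text):
    return "en" if "the" in text.lower() else "und"


tweets = ["the game was great", "so good lol"]
kept = filter_english(tweets, toy_detector)
# The second tweet is English, but the detector mislabels it and it is
# dropped from the dataset: kept == ["the game was great"]
```

The pattern to notice is that the filter never reports what it discarded; validating the detector on the kind of text you actually have is the only way to catch this.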
Their paper presents a classifier to eliminate that disparity, but the finding itself is eye-opening – a cautionary tale for anyone undertaking language analysis. If you’re not thoughtful and careful in your approach, even the most validated classifier may bias your data sample.