I recently read Eisenstein’s excellent What to do about bad language on the internet, which explores the challenge of applying Natural Language Processing to “bad” (that is, non-standard) text.
I take Eisenstein’s use of the normative word “bad” here somewhat ironically. He argues that researchers dislike non-standard text because it complicates NLP analysis, but it is only “bad” in this narrow sense. Furthermore, while the effort required to analyze such text may be frustrating, efforts to normalize these texts are potentially worse.
It has been well documented that NLP approaches trained on formal texts, such as the Wall Street Journal, perform poorly when applied to less formal texts, such as Twitter data. Intuitively this makes sense: most people don’t write like the Wall Street Journal on Twitter.
Importantly, Eisenstein quickly does away with common explanations for the prevalence of poor language on Twitter. Citing Drouin and Davis (2009), he notes that there are no significant differences in the literacy rates of users who do or do not use non-standard language. Further studies also dispel notions of users being too lazy to type correctly, Twitter’s character limit forcing unnatural contractions, and phone auto-correct going out of control.
In short, most users employ non-standard language because they want to. Their grammar and word choice intentionally convey meaning.
In normalizing this text, then, in moving it towards the unified standards on which NLP classifiers are trained, researchers explicitly discard important linguistic information. This approach has implications not only for research, but for language itself. As Eisenstein argues:
By developing software that works best for standard linguistic forms, we throw the weight of language technology behind those forms, and against variants that are preferred by disempowered groups. …It strips individuals of any agency in using language as a resource to create and shape their identity.
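To make the information loss concrete, here is a minimal sketch of a toy normalizer. It is not Eisenstein’s method or any real library’s pipeline; the `TOY_LEXICON` table, the `normalize` function, and the `collisions` helper are all hypothetical names invented for illustration. The point is only that distinct inputs, which a writer chose deliberately, collapse to the same output.

```python
import re

# Hypothetical lookup table, invented for illustration only.
TOY_LEXICON = {"u": "you", "ur": "your", "gonna": "going to"}

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, squeeze long character runs, apply the lookup table."""
    text = text.lower()
    # Collapse any run of 3+ identical characters to a single one ("sooooo" -> "so").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in text.split())

def collisions(texts):
    """Group inputs by normalized form; keep forms reached by 2+ distinct inputs."""
    groups = {}
    for t in texts:
        groups.setdefault(normalize(t), set()).add(t)
    return {form: originals for form, originals in groups.items() if len(originals) > 1}
```

Running `collisions` over a corpus before committing to a normalization step gives a rough list of exactly which distinctions that step would erase, such as the emphasis carried by a lengthened word.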
This concern is reminiscent of James C. Scott’s Seeing Like a State, which raises deep concerns about the power of a centralized, administrative state. In order to function effectively and efficiently, an administrative state needs to be able to standardize certain things – weights and measures, property norms, names, and language all have implications for taxation and distribution of resources. As Scott argues, this tendency towards standardization isn’t inherently bad, but it is deeply dangerous – especially when combined with things like a weak civil society and a powerful authoritarian state.
Scott argues that state imposition of a single, official language is “one of the most powerful state simplifications,” which lays the groundwork for additional normalization. The state process of normalizing language, Scott writes, “should probably be viewed, as Eugen Weber suggests in the case of France, as one of domestic colonization in which various foreign provinces (such as Brittany and Occitanie) are linguistically subdued and culturally incorporated. …The implicit logic of the move was to define a hierarchy of cultures, relegating local languages and their regional cultures to, at best, a quaint provincialism.”
This is a bold claim, yet not entirely unfounded.
While there is further work to be done in this area, there is good reason to think that the “normalization” of language disproportionately affects people who are outside the norm along other social dimensions. These marginalized communities (marginalized, incidentally, because they fall outside whatever is defined as the norm) develop their own linguistic styles. Those linguistic styles are then in turn disparaged and even erased for falling outside the norm.
Perhaps one of the most well documented examples of this is Su Lin Blodgett and Brendan O’Connor’s study on Racial Disparity in Natural Language Processing. As Eisenstein points out, it is trivially true that Twitter cannot represent a coherent linguistic domain: users around the globe use Twitter in numerous languages.
The implicit pre-processing step, then, before even normalizing “bad” text to be in line with dominant norms, is to restrict analysis to English-language text. Blodgett and O’Connor find that tweets from African-American users are over-represented among the tweets that are thrown out for being non-English.
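One way to check your own pipeline for this kind of skew is to measure how often the language filter rejects text from each group. The sketch below is a rough illustration under stated assumptions, not the method used in the study: `discard_rates` is a hypothetical helper, `is_english` stands in for whatever language-identification step you actually use, and the group labels and records are synthetic placeholders rather than real data.

```python
from collections import Counter

def discard_rates(records, is_english):
    """Return {group: fraction of that group's texts rejected by is_english}.

    records: iterable of (group_label, text) pairs.
    is_english: any callable mapping text -> bool (assumed; plug in your own filter).
    """
    total, dropped = Counter(), Counter()
    for group, text in records:
        total[group] += 1
        if not is_english(text):
            dropped[group] += 1
    return {group: dropped[group] / total[group] for group in total}

# Synthetic placeholders: the "filter" here just rejects the literal string "drop".
records = [("A", "keep"), ("A", "keep"), ("B", "keep"), ("B", "drop")]
rates = discard_rates(records, lambda text: text != "drop")
print(rates)  # {'A': 0.0, 'B': 0.5}
```

A large gap between groups does not by itself say why the filter behaves differently, but it is a cheap signal that the “English only” step is discarding more than foreign-language text.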
Dealing with non-standard text is not easy. Dealing with a living language that can morph in a matter of days or even hours (#covfefe) is not easy. There’s no getting around the fact that researchers will have to make difficult calls in how to process this information and how to appropriately manage dimensionality reduction.
But the worst thing we can do is to pretend that it is not a matter of concern; to begin our work by thoughtlessly filtering and normalizing without giving significant thought to what we’re discarding and what that discarded data represents.