I recently read Keith et al.’s excellent paper, Identifying civilians killed by police with distantly supervised entity-event extraction, which was presented this year at the Conference on Empirical Methods in Natural Language Processing, better known as EMNLP.
The authors present an initial framework for tackling an important real-world question: how can you automatically extract from a news corpus the names of civilians killed by police officers? Their study focuses on the U.S. context, where there are no complete federal records of such killings.
Filling this gap, human rights organizations and journalists have attempted to compile such a list through the arduous – and emotionally draining – task of reading millions of news articles to identify victim names and event details.
Given the salience of this problem, Keith et al. set out to develop a more streamlined solution.
The event-extraction problem is furthermore an interesting NLP challenge in itself – there are non-trivial disambiguation problems as well as semantic variability around indicators of civilians killed by police. Common false positives in their automated approaches, for example, include officers killed in the line of duty and non-fatal encounters.
Their approach relies on distant supervision – using existing databases of civilians killed to derive mention-level labels. They implement this labeling under both a “hard” and a “soft” assumption. The hard labeling assumes that every mention of a person (matched by name and location) from the gold-standard database corresponds to a mention of a police killing. This assumption proves too strict, and it models the textual input inaccurately.
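To make the hard assumption concrete, here is a minimal sketch of what that labeling might look like. It is not the authors' code: the `Mention` type, the `hard_labels` function, the placeholder names, and the example sentences are all made up for illustration, and it matches on name only rather than name and location.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical container for one sentence that mentions a person."""
    name: str      # person name found in the sentence
    sentence: str  # the sentence text


def hard_labels(mentions: List[Mention], known_victims: List[str]) -> List[int]:
    """Label every mention 1 if its name appears in the database, else 0.

    This is the "hard" assumption: the label depends only on who is
    mentioned, never on what the sentence actually says.
    """
    known = {name.lower() for name in known_victims}
    return [1 if m.name.lower() in known else 0 for m in mentions]


# Invented examples with placeholder names.
mentions = [
    Mention("John Roe", "John Roe was shot by police officers."),
    Mention("John Roe", "John Roe's family spoke at the vigil."),
    Mention("Jane Doe", "Jane Doe was arrested on Tuesday."),
]
labels = hard_labels(mentions, known_victims=["John Roe"])
print(labels)
```

The second sentence gets a positive label even though it says nothing about a killing, which is exactly the kind of noise the hard assumption introduces.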
The “soft” models perform better. Rather than assume that every relevant sentence corresponds to a mention of a police killing, soft models assume that at least one of the sentences does. That is, if you take all the sentences in the corpus which mention an individual known to have been killed by police, at least one of those sentences directly conveys information about the killing.
Intuitively, this makes sense – while the hard assumption takes every mention of Alton Sterling, Michael Brown, or Philando Castile to occur in a sentence mentioning a police killing, we know from simply reading the news that some of those sentences will talk about their lives, their families, or the larger context in which their killing took place.
For both assumptions, Keith et al. compare performance between a convolutional neural net and a logistic regression model – ultimately finding that the regression, with the soft assumption, outperforms the neural net.
There’s plenty of room for improvement and future work on their model, but overall, this paper presents a clever NLP application to a critical, real-world problem. It’s a great example of the broad and important impact NLP approaches can have.
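One common way to formalize an “at least one” assumption is a noisy-or: if a classifier gives each sentence a probability of describing a police killing, the probability that the person was killed by police is one minus the probability that none of the sentences describe it. The sketch below illustrates that idea only. The function name and the probabilities are invented, and it treats sentences as independent, which is a simplifying assumption.

```python
import math
from typing import Iterable


def entity_probability(mention_probs: Iterable[float]) -> float:
    """Noisy-or: P(at least one mention describes a killing).

    Assumes the per-sentence probabilities are independent, so the
    chance that no sentence describes a killing is the product of
    (1 - p) over all sentences.
    """
    return 1.0 - math.prod(1.0 - p for p in mention_probs)


# Invented classifier scores for three sentences about the same person:
# one clearly about the killing, two about family or context.
probs = [0.9, 0.05, 0.1]
print(entity_probability(probs))
```

A single confident sentence is enough to push the person-level probability high, while the background sentences are free to score low without contradicting the database.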