Reading articles skeptical of the veracity of topic model outputs has reminded me of this passage from Wittgenstein’s Philosophical Investigations:
Our language can be seen as an ancient city: a maze of little streets and squares, of old and new houses, and of houses with additions from various periods; and this surrounded by a multitude of new boroughs with straight regular streets and uniform houses.
In short: words are complicated. Their meanings and uses shift over time, building a complex infrastructure which can be difficult to interpret. Indeed, humanists can spend a whole career examining and arguing over the implications of words.
In theory, topic models can provide a solution to this complication: if a “topic” accurately represents a “concept,” then it dramatically reduces the dimensionality of a set of documents, eliciting the core concepts while moving beyond the complication of words.
Of course, topics are also complicated. As Ben Schmidt argues in Words Alone: Dismantling Topic Models in the Humanities, topics are even more complicated – words, at least, are complicated in an understood and accessible way. Topic models, on the other hand, are abstract and potentially inaccessible to people without the requisite technical knowledge.
To really understand a topic returned by a topic model, it is not enough to look at the top N words – a common practice for evaluating and presenting topics – you need to look at the full distribution.
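To make the gap between a top-N list and the full distribution concrete, here is a minimal sketch in Python. It does not use a real topic model: the topic is a made-up distribution over 1,000 hypothetical words with Zipf-like weights, and the helper name top_n_mass is my own. The only point is to show how much of a topic can sit outside the words that usually get reported.

```python
def top_n_mass(topic, n):
    """Share of a topic's probability mass held by its n most probable words."""
    ranked = sorted(topic.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)


# Hypothetical topic (an assumption, not model output): 1,000 words whose
# weight falls off as 1/rank, normalized so the probabilities sum to 1.
vocab_size = 1000
weights = {f"word_{rank}": 1.0 / rank for rank in range(1, vocab_size + 1)}
total = sum(weights.values())
topic = {word: weight / total for word, weight in weights.items()}

for n in (10, 50, vocab_size):
    share = top_n_mass(topic, n)
    print(f"top {n:>4} words cover {share:.0%} of the topic")
```

Under this toy assumption, the ten most probable words account for well under half of the topic's mass. Real topics may be more or less concentrated, which is exactly why the tail is worth checking rather than assuming.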
But what does it even look like to examine the distribution of words returned by a topic model? The question itself defies easy understanding.
While “words” are generally complicated, Schmidt finds a clever opportunity to examine a distribution of “words” using ships' logs. Each text contains the voyage of a single ship and each “word” is given as a single longitude and latitude. The “words” returned by the topic model can then be plotted precisely in 2D space.
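As a rough illustration of how a position can stand in for a word, here is a small Python sketch. It is not Schmidt's actual encoding, which is not described here: the one-degree grid, the token format, and the function names to_token, to_point, and voyage_to_document are all assumptions of mine. The idea is only that each log entry becomes a token a topic model can count, and that each token can be turned back into a point on a map.

```python
from collections import Counter


def to_token(lat, lon, cell=1.0):
    """Snap a position to a grid cell and name that cell as a 'word'."""
    row = round(lat / cell)
    col = round(lon / cell)
    return f"{row}_{col}"


def to_point(token, cell=1.0):
    """Recover the (lat, lon) of a cell's center from its token."""
    row, col = token.split("_")
    return int(row) * cell, int(col) * cell


def voyage_to_document(positions, cell=1.0):
    """Treat one voyage as one document: a bag of position 'words'."""
    return Counter(to_token(lat, lon, cell) for lat, lon in positions)


# Hypothetical voyage: a few made-up (lat, lon) log entries.
voyage = [(41.3, -70.2), (41.4, -70.1), (38.7, -27.2)]
document = voyage_to_document(voyage)
print(document)
print([to_point(token) for token in document])
```

Because every token decodes to a coordinate, the full word distribution of a topic can be drawn directly as a scatter of points weighted by probability, rather than summarized as a list of top words.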
With these visualizations of topic distributions, Schmidt raises important questions about the coherence and stability that topic models are assumed to have.
He doesn’t advocate entirely against topic models, but he does warn humanists to be wary. And, importantly, he puts forth a call for new methods to bring the words back to topic models – to find ways to visualize and understand entire distributions of words rather than simply truncating topics to lists of top words.