A common technique in natural language processing involves treating a text as a bag of words. That is, rather than preserving the order in which words appear, these automated approaches begin by simply examining words and word frequencies. In this sense, the document is reduced from a well-ordered, structured object to a metaphorical bag of words from which order has been discarded.
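As a minimal sketch of the idea, a bag-of-words representation can be built by tokenizing a text and counting word frequencies; the tokenization rule below (lowercasing and splitting on non-letter characters) is one simple choice among many, not a standard:

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    # Lowercase and split into word tokens; all ordering information is
    # discarded, leaving only word counts.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "The cat sat on the mat. The mat sat on the floor."
print(bag_of_words(doc))
```

Note that any two texts containing the same words in any order produce exactly the same representation, which is the reduction the rest of this essay is concerned with.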
Numerous studies have found the bag of words approach to be sufficient for most tasks, yet this finding is somewhat surprising – even shocking, as Grimmer and Stewart note – given how much information this act discards.
Other pre-processing steps for dimensionality reduction seem intuitively less dramatic. Removing stop words like “the” and “a” seems a reasonable way of focusing on core content words without getting bogged down in the details of grammar. Lemmatization, which maps each word to a base form, also makes sense – assuming it’s done correctly. Most of the time, it doesn’t matter much whether I say “community” or “communities.”
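These two steps can be sketched as follows. The stop-word list here is a tiny illustrative sample, and the lemmatizer is a toy suffix rule rather than a real dictionary-backed one (production systems typically use lexical resources like WordNet), but it captures the “community”/“communities” example:

```python
# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and"}

def lemmatize(word: str) -> str:
    # Toy suffix-stripping rule, not a real lemmatizer:
    # "communities" -> "community", "cats" -> "cat".
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(tokens):
    # Drop stop words, then reduce each remaining word to its base form.
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "communities", "share", "a", "common", "language"]))
# -> ['community', 'share', 'common', 'language']
```

Crude suffix rules like this go wrong on irregular words (“series”, “glass”), which is exactly the “assuming it’s done correctly” caveat above.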
But reducing a text – which presumably has been well-written and carefully follows the rules of its language’s grammar – seems surprisingly profane. Do you lose so little when taking Shakespeare or Homer as a bag of words? Even the suggestion implies a disservice to the poetry of language. Word order is important.
Why, then, is a bag of words approach sufficient for so many tasks?
One possible explanation is that computers and humans process information differently. For humans reading or hearing a sentence, word order helps them predict what is to come next. It helps them process and make sense of what they are hearing as they are hearing it. To make sense of this complex input, human brains need this structure.
Computers may have other shortcomings, but they don’t feel the anxious need to understand input and context as it is received. Perhaps bag of words works because – while word order is crucial for the human brain – it provides unnecessary detail for the processing style of a machine.
I suspect there is truth in that explanation, but I find it unsatisfactory. It implies that poetry and beauty are relevant to the human mind alone – that these are artifacts of processing rather than inherent features of a text.
I prefer to take a different approach: the fact that bag of words models work actually emphasizes the flexibility and beauty of language. It highlights the deep meaning embedded in the words themselves and illustrates just how much we communicate when we communicate.
Linguistic philosophers often marvel that we can manage to communicate at all – the words we exchange may not mean the same thing to me as they do to you. In fact, they almost certainly do not.
In this sense, language is an architectural wonder, a true feat of engineering. We convey so much with subtle meanings of word choice, order, and grammatical flourishes. And somehow through the cacophony of this great symphony – which we all experience uniquely – we manage to schedule meetings, build relationships, and think critically together.
Much is lost in translating the linguistic signal between me and you. We miss each other’s context and reinterpret the subtle flavors of each word. We can hear a whole lecture without truly understanding, even if we try.
And that, I think, is why the bag of words approach works. Linguistic signals are rich: they are fiercely high-dimensional and full of more information than any person can process.
Do we lose something when we reduce dimensionality? When we discard word order and treat a text as a bag of words?
Of course we do. But that loss isn’t an indication of the gaudiness of language; rather, it is a tribute to its profound persistence.