Category Archives: Computer Science

Interaction Dynamics and Persuasion Strategies

I recently read Chenhao Tan et al’s 2016 WWW paper Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions, which presents an interesting study of the linguistic features of persuasion.

Coming from a deliberative background, I find that the word ‘persuasion’ has negative connotations. Indeed, Habermas and others strongly argue that deliberation must be free from persuasion – defined roughly as an act of power that causes an artificial opinion change.

In its more colloquial sense, however, persuasion needn’t be so negatively defined. Within the computer science literature on argument mining and detection, persuasion is generally more benignly considered as any catalyst causing opinion change. If I “persuade” you to take a different route because the road you were planning to take is closed, that persuasion is not problematic in the Habermasian sense as long as I’m not distorting the truth in order to persuade you.

Furthermore, Tan et al gather a very promising data set for this investigation – a corpus of “good faith online discussions” as the title says. Those discussions come from Reddit’s Change My View forum, a moderated platform with explicit and enforced norms for sharing reasoned arguments.

Each thread starts with a user who explicitly states they want to have their opinion changed. That user then shares said opinion and outlines their reasoning behind the opinion. Other users then present arguments to the contrary. The original poster then has the opportunity to award a “delta” to a response if it succeeded in changing their opinion.

So there’s a lot to like about the structure of the dataset.

I have a lot of questions, though, about the kinds of opinion which are being shared and changed. Looking through the site today, posts cover a mix of serious political discussion, existential crises, and humorous conundrums.

The all time most highly rated post on the site begins with the opinion, “Strange women lying in ponds distributing swords is no basis for a system of government.” So it’s unclear just how much we can infer about debate more broadly from these users.

However, Tan et al intentionally restrict their analysis to linguistic features, carefully comparing posts which ultimately win a “delta” to the most similar non-delta post responding to the same opinion. In this way, they aim to “de-emphasize what is being said in favor of how it is expressed.”

There’s a lot we lose, of course, by not considering content, but this paper makes valuable contributions in disambiguating the effects of content from the effects of syntactic style.

Interestingly, they find that persuasive posts – those which earn a delta from the original poster – are more dissimilar from the originating post in content words, while being more similar in stop words (common words such as “a”, “the”, etc). The authors are careful not to make causal claims, but I can’t help but wonder what the causal mechanism behind that might be. The dissimilarity of content words matched by the similarity of stop words seems to imply that users are talking about different things, but in similar ways.
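To make the content word/stop word comparison concrete, here is a minimal sketch. The tiny stop word list, the two example sentences, and the choice of Jaccard similarity as the overlap measure are all my own assumptions for illustration, not necessarily what the paper uses.

```python
import re

# Tiny illustrative stop word list; an assumption, not the list used in the paper.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "it",
              "that", "and", "but", "i", "you", "we"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def interplay(original_post, reply):
    """Return (content-word similarity, stop-word similarity) for two texts."""
    op, rp = set(tokenize(original_post)), set(tokenize(reply))
    content = jaccard(op - STOP_WORDS, rp - STOP_WORDS)
    stops = jaccard(op & STOP_WORDS, rp & STOP_WORDS)
    return content, stops

# Made-up example texts.
op = "I think that the city is a bad place to live"
reply = "But the countryside is a lonely place to grow old in"
content_sim, stop_sim = interplay(op, reply)
print(f"content words: {content_sim:.2f}, stop words: {stop_sim:.2f}")
```

In this toy pair the reply shares few content words with the original post but many stop words, which is the pattern the authors associate with delta-winning replies.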

There’s a lot of debate, though, about exactly what should count as a “stop word” – and whether stop word lists should be specially calibrated for the content. Furthermore, I’m not familiar with any deep theory on the use of stop words, so I’m not sure this content word/stop word disjunction really tells us much at all.

The authors also investigate usage of different word categories – finding, for example, that posts tend to begin and end with tangible arguments while becoming more abstract in the middle.

Finally, they investigate the features of users who award deltas – i.e., users who do change their mind. In this setting, they find that people who use more first person singular pronouns are more likely to change, while those using more first person plurals are less likely to change. They posit that the first person plural indicates a sort of diffuse sense of responsibility for a view, indicating that the person feels less ownership and is therefore less likely to change.

I’d love to see an extension of this work which dives into the content and examines, for example, what sorts of opinions people are likely to change – but this paper presents a thought-provoking look at the persuasive effects of linguistic features themselves.


Text as Data Conference

At the end of last week, I had the pleasure of attending the eighth annual conference on New Directions in Analyzing Text as Data, hosted by Princeton University and organized by Will Lowe, John Londregan, Marc Ratkovic, and Brandon Stewart.

The conference had a truly excellent program, and was packed with great content on a wide variety of text analysis challenges.

There were a number of papers on topic modeling, including work from my colleague Ryan Gallagher on Anchored correlation explanation: Topic modeling with minimal domain knowledge – a really cool, information-theoretic approach to topic modeling.

Luke Miratrix also presented joint work with Angela Fan and Finale Doshi-Velez on Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling, an approach which aims to improve upon standard LDA by using priors to promote informative words.

I also really enjoyed Hanna Wallach’s presentation on A network model for dynamic textual communications with application to government email corpora, which introduces the Interaction-Partition Topic Model (IPTM), a model combining elements of LDA with ERGMs.

There were also a number of talks reflecting and improving upon the ways in which we approach the methodological challenges of textual data.

Laura Nelson argued for a process of computational grounded theory, in which textual analysis helps guide and direct deep reading, but in which the researcher stays intimately familiar with her corpus.

Justin Grimmer presented the great paper, How to make causal inferences using texts, which presents a conceptual framework for making causal inference using text.

For my own research, Will Hobbs might get the prize for method I’d most like to use, with his paper on Latent dimensions of attitudes on the Affordable Care Act: An application of stabilized text scaling to open-ended survey responses. He presents a very clever method for scaling common and uncommon words in order to extract latent dimensions from short text. It’s really cool.

And, of course, Nick Beauchamp presented work done jointly with myself and Peter Levine on mapping conceptual networks. In this work, we present and validate a model for measuring the conceptual network an individual uses when reasoning. In these networks, nodes are concepts and edges represent the connections between those concepts. More on this in future posts, I’m sure.

Finally, the session titles were the absolute best. See, for example:

  • How Does This Open-Ended Question Make You Feel?
  • Fake Pews! (a session on religiosity)
  • America’s Next Top(ic) Model
  • Fwd: Fw: RE: You have to see this paper!

Well played, well played.

Many thanks to all the conference organizers for a truly engaging and informative couple of days.


Automated Methods for Identifying Civilians Killed by Police

I recently read Keith et al’s excellent paper, Identifying civilians killed by police with distantly supervised entity-event extraction, which was presented this year at the Conference on Empirical Methods in Natural Language Processing, or, as it’s more commonly known, EMNLP.

The authors present an initial framework for tackling an important real world question: how can you automatically extract from a news corpus the names of civilians killed by police officers? Their study focuses on the U.S. context, where there are no complete federal records of such killings.

Filling this gap, human rights organizations and journalists have attempted to compile such a list through the arduous – and emotionally draining – task of reading millions of news articles to identify victim names and event details.

Given the salience of this problem, Keith et al set out to develop a more streamlined solution.

The event-extraction problem is furthermore an interesting NLP challenge in itself – there are non-trivial disambiguation problems as well as semantic variability around indicators of civilians killed by police. Common false positives in their automated approaches, for example, include officers killed in the line of duty and non-fatal encounters.

Their approach relies on distant supervision – using existing databases of civilians killed as mention-level labels. They implement this labeling with both “hard” and “soft” assumption models. The hard labeling assumes that every mention of a person (name and location) from the gold-standard database corresponds to a mention of a police killing. This assumption proves to be too strict and an inaccurate model of the textual input.

The “soft” models perform better. Rather than assume that every relevant sentence corresponds to a mention of a police killing, soft models assume that at least one of the sentences does. That is, if you take all the sentences in the corpus which mention an individual known to have been killed by police, at least one of those sentences directly conveys information of the killing.
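The difference between the two assumptions can be sketched in a few lines. Here the “at least one” idea is written as a noisy-or over per-sentence probabilities, which is one common way to formalize it; the function names and the classifier scores are hypothetical, made up purely for illustration.

```python
from math import prod

def hard_labels(num_mentions):
    """Hard assumption: every mention of a known victim is a positive example."""
    return [1] * num_mentions

def soft_entity_probability(mention_probs):
    """Soft ("at least one") assumption, written as a noisy-or:
    P(at least one mention describes the killing) = 1 - prod(1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in mention_probs)

# Hypothetical classifier scores for three sentences mentioning the same person.
mention_probs = [0.9, 0.1, 0.05]
print(hard_labels(len(mention_probs)))
print(round(soft_entity_probability(mention_probs), 4))
```

Under the soft assumption, a single confident sentence is enough to support the entity-level label, while the other mentions are free to score low.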

Intuitively, this makes sense – while the hard assumption takes every mention of Alton Sterling, Michael Brown, or Philando Castile to occur in a sentence mentioning a police killing, we know from simply reading the news that some of those sentences will talk about their lives, their families, or the larger context in which their killing took place.

For both assumptions, Keith et al compare performance between a convolutional neural net and a logistic regression model – ultimately finding that the regression, with the soft assumption, outperforms the neural net.

There’s plenty of room for improvement and future work on their model, but overall, this paper presents a clever NLP application to a critical, real world problem. It’s a great example of the broad and important impact NLP approaches can have.


A Living Language

Languages which are still being spoken are generally referred to as living languages. The metaphor is apt – languages are “living” not only insofar as their speakers are biologically living, but in that each language itself grows and changes throughout time. In a genuinely meaningful sense of the word, the language is alive.

This is a beautiful metaphor, but problematic for text analysis. It is, after all, difficult to model something which is changing while you observe it.

Language drift can be particularly problematic for digital humanities projects with corpora spanning a century or more. As Ben Schmidt has pointed out, topic models trained on such corpora produce topics which are not stable over time – i.e., a single topic represents different or drifting concepts during different windows of time.

But the changes of a language are not restricted to such vast time scales. On social media and other online platforms, words and meanings come and go, sometimes quite rapidly. Indeed, there’s no a priori reason to think such rapid change isn’t a feature of all everyday language – it is simply better documented through digital records.

This raises interesting questions and problems for scholars doing text analysis – at what time scales do you need to worry about language change? What does language change indicate for an individual or for a society?

One particularly interesting paper which tackles some of these questions is Danescu-Niculescu-Mizil et al’s No country for old members: User lifecycle and linguistic change in online communities.

Studying users of two online beer discussion forums, they find, remarkably, that users have a consistent life cycle – new users adopt the language of the community, getting closer and closer to linguistic norms. At a certain point, however, their similarity peaks – users cease changing with the community and move further and further away linguistically as a result.

The language of the community continues changing, but the language of these “older” users does not.

This finding is reminiscent of earlier studies on organizational learning, such as those by James March – in which employees learn from an organization while the organization simultaneously learns from the employees. In his simulations, organizations in which people learn too quickly fail to converge on optimal information. Organizations in which people learn more slowly – or in which employees come and go – ultimately converge on better solutions.

Both these findings reflect the sociolinguistic theory of adult language stability – the idea that your learning, and specifically your language, stays steady after a certain age. The findings from Danescu-Niculescu-Mizil, however, suggest something more interesting: your language becomes stable over time in a given community. It’s not clear that your overall language will stabilize; rather, you learn the norms of a given community. Since these communities may change over time, your overall language may still be quite dynamic.


Words and Topics

Reading articles skeptical of the veracity of topic model outputs has reminded me of this passage from Wittgenstein’s Philosophical Investigations:

Our language can be seen as an ancient city: a maze of little streets and squares, of old and new houses, and of houses with additions from various periods; and this surrounded by a multitude of new boroughs with straight regular streets and uniform houses.

In short: words are complicated. Their meaning and use shifts over time, building a complex infrastructure which can be difficult to interpret. Indeed, humanists can spend a whole career examining and arguing over the implications of words.

In theory, topic models can provide a solution to this complication: if a “topic” accurately represents a “concept,” then it dramatically reduces the dimensionality of a set of documents, eliciting the core concepts while moving beyond the complication of words.

Of course, topics are also complicated. As Ben Schmidt argues in Words Alone: Dismantling Topic Models in the Humanities, topics are even more complicated – words, at least, are complicated in an understood and accessible way. Topic models, on the other hand, are abstract and potentially inaccessible to people without the requisite technical knowledge.

To really understand a topic returned by a topic model, it is not enough to look at the top N words – a common practice for evaluating and presenting topics – you need to look at the full distribution.
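A quick way to see why the top N words can mislead is to ask how much of a topic’s probability mass they actually cover. The sketch below uses a made-up topic (the words and probabilities are illustrative assumptions, not output from any real model).

```python
def top_n_mass(topic, n):
    """Fraction of a topic's probability mass covered by its n most probable words."""
    ranked = sorted(topic.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)

# Made-up topic: three prominent words plus a long tail of 97 rare ones.
topic = {"ship": 0.10, "sea": 0.08, "port": 0.07}
topic.update({f"word{i}": 0.75 / 97 for i in range(97)})

print(f"top 3 words cover {top_n_mass(topic, 3):.0%} of the topic")
```

In a topic shaped like this one, the words you would show in a “top words” list account for only a quarter of the distribution; the rest lives in the tail that never gets displayed.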

But what does it even look like to examine the distribution of words returned by a topic model? The question itself defies easy understanding.

While “words” are generally complicated, Schmidt finds a clever opportunity to examine a distribution of “words” using ships’ logs. Each text contains the voyage of a single ship and each “word” is given as a single longitude and latitude. The “words” returned by the topic model can then be plotted precisely in 2D space.

With these visualizations of topic distributions, Schmidt raises important questions about the assumptions of coherence and stability which topic models make.

He doesn’t advocate entirely against topic models, but he does warn humanists to be wary. And, importantly, he puts forth a call for new methods to bring the words back to topic models – to find ways to visualize and understand entire distributions of words rather than simply truncating topics to lists of top words.


Gender and Language

Both gender and language are social constructs, and sociological research indicates a link between the two.

In Lakoff’s classic 1973 paper, Language and woman’s place, she argues that “the marginality and powerlessness of women is reflected in both the ways women are expected to speak, and the ways in which women are spoken of.” This socialization process achieves its end in two ways: teaching women the ‘proper’ way to speak while simultaneously marginalizing the voices of women who refuse to follow the linguistic norms dictated by society. As Lakoff writes:

So a girl is damned if she does, damned if she doesn’t. If she refuses to talk like a lady, she is ridiculed and subjected to criticism as unfeminine; if she does learn, she is ridiculed as unable to think clearly, unable to take part in a serious discussion: in some sense, as less than fully human. These two choices which a woman has – to be less than a woman or less than person – are highly painful.

Lakoff finds numerous lexical and syntactic differences between the speech of men and women. Women tend to use softer, more ‘polite’ language and are more likely to hedge or otherwise express uncertainty within their comments. While she acknowledges that – as of the early 70s – these distinctions have begun to blur, Lakoff also notes that the blurring comes almost entirely in the direction of “women speaking more like men.” I.e., language is still gendered, but acceptable language has grown in breadth for women, while ‘male’ language remains narrow and continues to be taken as the norm.

A more recent study by Sarawgi et al looks more closely at algorithmic approaches to identifying gender. They present a comparative study using both blog posts and scientific papers, examining techniques which learn syntactic structure (using a context-free grammar), lexico-syntactic patterns (using n-grams), and morphological patterns (using character-level n-grams).

Sarawgi et al further argue that previous studies made the gender-identification task easier by neglecting to account for possible topic bias, and they therefore carefully curate a dataset of topic-balanced corpora. Additionally, their model allows for any gamma number of genders, but the authors reasonably restrict this initial analysis to the simpler binary classification task, selecting only authors who fit a woman/man gender dichotomy.

Lakoff’s work suggests that there will be lexical and syntactic differences by gender, but surprisingly, Sarawgi et al find that the character-level n-gram model outperformed the other approaches.

This, along with the fact that the finding holds in both formal and informal writing, seems to suggest that gender-socialized language may be more subtle and profound than previously thought. It is not just about word choice or sentence structure; it is more deeply about the very sounds and rhythm of speech.

The character n-gram approach used by Sarawgi is taken from an earlier paper by Peng et al which uses character n-grams for the more specific task of author attribution. They test their model on English, Greek, and Chinese corpora, achieving impressive accuracy on each. For the English corpus, they are able to correctly identify the author of text 98% of the time, using a 6-gram character model.

Peng et al make an interesting case for the value of character n-grams over word n-grams, writing:

The benefits of the character-level model in the context of author attribution are that it avoids the need for explicit word segmentation in the case of Asian languages, it captures important morphological properties of an author’s writing, it can still discover useful inter-word and inter-phrase features, and it greatly reduces the sparse data problems associated with large vocabulary models.
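For readers who haven’t worked with them, character n-grams are simple to extract. The sketch below is only the feature-extraction step, under my own assumptions about tokenization (none is done; spaces and punctuation are kept); it is not the classifier used in either paper.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all overlapping character n-grams, keeping spaces and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("the theory", 3)
print(counts.most_common(2))
```

Note that n-grams such as “e t” and “ th” span the word boundary, which is how a character-level model can pick up the inter-word features the quote mentions without any explicit word segmentation.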

While I initially found it surprising that a character level n-gram approach would perform best at the task of gender classification, the Peng et al paper seems to shed computational light on this question – though the area is still under-theorized. If character n-grams are able to so accurately identify the single author of a document, and that author has a gender, it seems reasonable that this approach would be able to infer the gender of an author.

Still, the effectiveness of character n-grams in identifying an author’s gender indicates an interesting depth to the gendered patterns of language. Even as acceptable language for women converges to the acceptable language of men, the subtleties of style and usage remain almost subconsciously gendered – even in formal writing.


Robot Humor

Text processing algorithms are notoriously bad at processing humor. The subtle, contradictory humor of irony and sarcasm can be particularly hard to automatically detect.

If, for example, I wrote, “Sharknado 2 is my favorite movie,” an algorithm would most likely take that statement at face value. It would find the word “favorite” to be highly correlated with positive sentiment. Along with some simple parsing, it might then reasonably infer that I was making a positive statement about an entity of type “movie” named “Sharknado 2.”
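A toy version of that face-value reading might look like the following. The lexicon and its weights are made up for illustration; real sentiment systems are more sophisticated, but a purely lexical scorer fails in the same basic way.

```python
# Toy sentiment lexicon; the words and weights are illustrative assumptions.
LEXICON = {"favorite": 1.0, "great": 1.0, "love": 1.0,
           "terrible": -1.0, "bad": -1.0}

def naive_sentiment(text):
    """Sum lexicon weights over the words in the text, ignoring context entirely."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(word, 0.0) for word in words)

score = naive_sentiment("Sharknado 2 is my favorite movie.")
print(score)  # positive: the scorer has no way to notice possible irony
```

Nothing in this scorer knows anything about Sharknado 2, so it has no basis for doubting the sincerity of the statement.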

Yet, if I were indeed to write “Sharknado 2 is my favorite movie,” you, a human reader, might think I meant the opposite. Perhaps I mean “Sharknado 2 is a terrible movie,” or, more generously, “Sharknado 2 is my favorite movie only insofar as it is so terrible that it’s enjoyably bad.”

This broader meaning is not indicated anywhere in the text, yet a human might infer it from the mere fact that…why would Sharknado 2 be my favorite movie?

There was nothing deeply humorous in that toy example, but perhaps you can see the root of the problem.

Definitionally, irony means expressing meaning “using language that normally signifies the opposite,” making it a linguistic maneuver which is fundamentally difficult to operationalize. A priori, how can you tell when I’m being serious and when I’m being ironic?

Humans are reasonably good at this task – though, suffering from resting snark voice myself, I do often feel the need to clarify when I’m not being ironic.

Algorithms, on the other hand, perform poorly on this task. They just can’t tell the difference.

This is an active area of natural language processing research, and progress is being made. Yet it seems a shame for computers to be missing out on so much humor.

I feel strongly that, should the robot uprising come, I’d like our new overlords to appreciate humor.

Something would be lost in a world without sarcasm.


Normalizing the Non-Standard

I recently read Eisenstein’s excellent What to do about bad language on the internet, which explores the challenge of using Natural Language Processing on “bad” – i.e., non-standard – text.

I take Eisenstein’s use of the normative word “bad” here somewhat ironically. He argues that researchers dislike non-standard text because it complicates NLP analysis, but it is only “bad” in this narrow sense. Furthermore, while the effort required to analyze such text may be frustrating, efforts to normalize these texts are potentially worse.

It has been well documented that NLP approaches trained on formal texts, such as the Wall Street Journal, perform poorly when applied to less formal texts, such as Twitter data. Intuitively this makes sense: most people don’t write like the Wall Street Journal on Twitter.

Importantly, Eisenstein quickly does away with common explanations for the prevalence of poor language on Twitter. Citing Drouin and Davis (2009), he notes that there are no significant differences in the literacy rates of users who do or do not use non-standard language. Further studies also dispel notions of users being too lazy to type correctly, Twitter’s character limit forcing unnatural contractions, and phone auto-correct going out of control.

In short, most users employ non-standard language because they want to. Their grammar and word choice intentionally convey meaning.

In normalizing this text, then, in moving it towards the unified standards on which NLP classifiers are trained, researchers explicitly discard important linguistic information. Importantly, this approach has implications not only for research, but for language itself. As Eisenstein argues:

By developing software that works best for standard linguistic forms, we throw the weight of language technology behind those forms, and against variants that are preferred by disempowered groups. …It strips individuals of any agency in using language as a resource to create and shape their identity.

This concern is reminiscent of James C. Scott’s Seeing Like a State, which raises deep concerns about the power of a centralized, administrative state. In order to function effectively and efficiently, an administrative state needs to be able to standardize certain things – weights and measures, property norms, names, and language all have implications for taxation and distribution of resources. As Scott argues, this tendency towards standardization isn’t inherently bad, but it is deeply dangerous – especially when combined with things like a weak civil society and a powerful authoritarian state.

Scott argues that state imposition of a single, official language is “one of the most powerful state simplifications,” which lays the groundwork for additional normalization. The state process of normalizing language, Scott writes, “should probably be viewed, as Eugen Weber suggests in the case of France, as one of domestic colonization in which various foreign provinces (such as Brittany and Occitanie) are linguistically subdued and culturally incorporated. …The implicit logic of the move was to define a hierarchy of cultures, relegating local languages and their regional cultures to, at best, a quaint provincialism.”

This is a bold claim, yet not entirely unfounded.

While there is further work to be done in this area, there is good reason to think that the “normalization” of language disproportionately affects people who are outside the norm along other social dimensions. These marginalized communities – marginalized, incidentally, because they fall outside whatever is defined as the norm – develop their own linguistic styles. Those linguistic styles are then in turn disparaged and even erased for falling outside the norm.

Perhaps one of the most well documented examples of this is Su Lin Blodgett and Brendan O’Connor’s study on Racial Disparity in Natural Language Processing. As Eisenstein points out, it is trivially impossible for Twitter to represent a coherent linguistic domain – users around the globe use Twitter in numerous languages.

The implicit pre-processing step, then, before even normalizing “bad” text to be in line with dominant norms, is to restrict analysis to English-language text. Blodgett and O’Connor find that tweets from African-American users are over-represented among the tweets that are thrown out for being non-English.

Dealing with non-standard text is not easy. Dealing with a living language that can morph in a matter of days or even hours (#covfefe) is not easy. There’s no getting around the fact that researchers will have to make difficult calls in how to process this information and how to appropriately manage dimensionality reduction.

But the worst thing we can do is to pretend that it is not a matter of concern; to begin our work by thoughtlessly filtering and normalizing without giving significant thought to what we’re discarding and what that discarded data represents.


Social and Algorithmic Bias

A commonly lamented problem in machine learning is that algorithms are biased. This bias can come from different sources and be expressed in different ways, sometimes benignly and sometimes dramatically.

I don’t disagree that there is bias in these algorithms, but I’m inclined to argue that in some senses, this is a feature rather than a bug. That is: all methodological choices are biased, all data are biased, and all models are wrong, strictly speaking. The problem of bias in research is not new, and the current wave of despair is simply a reframing of this problem with automated approaches as the culprit.

To be clear, there are serious cases in which algorithmic biases have led to deeply problematic outcomes. For example, when a proprietary, black box algorithm regularly suggests stricter sentencing for black defendants and those suggestions are taken to be unbiased, informed wisdom – that is not something to be taken lightly.

But what I appreciate about the bias of algorithmic methods is the visibility of their bias; that is – it gives us a starting point for questioning, and hopefully addressing, the inherent social biases. Biases that we might otherwise be blind to, given our own personal embedding in the social context.

After all, strictly speaking, an algorithm isn’t biased; its human users are. Humans choose what information becomes recorded data and they choose which data to feed into an algorithm. Fundamentally, humans – both specific researchers and through the broader social context – choose what counts as information.

As urban planner Bent Flyvbjerg writes: Power is knowledge. Those with power not only hold the potential for censorship, but they play a critical role in determining what counts as knowledge. In his ethnographic work in rural Appalachia, John Gaventa similarly argues that a society’s power dynamics become so deeply entrenched that the people embedded in that society no longer recognize these power dynamics at all. They take for granted a shared version of fact and reality which is far from the unbiased Truth we might hope for – rather it is a reality shaped by the role of power itself.

In some ways, algorithmic methods may exacerbate this problem – as algorithmic bias is applied to documents resulting from social bias – but a skepticism of automated approaches opens the door to deeper conversations about biases of all forms.

Ted Underwood argues that computational algorithms need to be fundamentally understood as tools of philosophical discourse, as “a way of reasoning.” These algorithms – even something as seemingly benign as rank-ordered search results – deeply shape what information is available and how it is perceived.

I’m inclined to agree with Underwood’s sentiment, but to expand his argument broadly to a diverse set of research methods. Good scientists question their own biases and they question the biases in their methods – whether those methods are computational or not. All methods have bias. All data are biased.

Automated methods, with their black-box aesthetic and hopefully well-documented Git pages, may make it easier to do bad science, but for good scientists, they convincingly raise the specter of bias, implicit and explicit, in methods and data.

And those are concerns all researchers should be thinking about.

 


Bag of Words

A common technique in natural language processing involves treating a text as a bag of words. That is, rather than preserving the order in which words appear, these automated approaches begin by simply examining words and word frequencies. In this sense, the document is reduced from a well-ordered, structured object to a metaphorical bag of words from which order has been discarded.
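As a minimal sketch of what this reduction does (the tokenizer here is a simple assumption of mine, not a standard one), note that two sentences with opposite meanings can collapse into the same bag:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Map a text to word counts, discarding the order in which words appear."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

a = bag_of_words("The dog bit the man.")
b = bag_of_words("The man bit the dog.")
print(a == b)  # True: the two sentences are indistinguishable as bags of words
```

Who bit whom is exactly the kind of information this representation throws away.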

Numerous studies have found the bag of words approach to be sufficient for most tasks, yet this finding is somewhat surprising – even shocking, as Grimmer and Stewart note – given the reduction of information represented by this act.

Other pre-processing steps for dimensionality reduction seem intuitively less dramatic. Removing stop words like “the” and “a” seems a reasonable way of focusing on core content words without getting bogged down in the details of grammar. Lemmatization, which assigns words to a base family, also makes sense – assuming it’s done correctly. Most of the time, it doesn’t matter much whether I say “community” or “communities.”

But reducing a text – which presumably has been well-written and carefully follows the rules of its language’s grammar – to a bag of words seems surprisingly profane. Do you lose so little when taking Shakespeare or Homer as a bag of words? Even the suggestion implies a disservice to the poetry of language. Word order is important.
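These two steps can be sketched with a hand-written stop word list and a lookup table standing in for a real lemmatizer. Both are toy assumptions for illustration; in practice you would use a proper NLP library for each.

```python
# Toy resources; both are illustrative assumptions, not a real lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "and"}
LEMMAS = {"communities": "community", "studies": "study"}

def preprocess(tokens):
    """Drop stop words, then map each remaining token to its base form if known."""
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

tokens = "the study of a community and the studies of communities".split()
print(preprocess(tokens))
```

Ten tokens shrink to four, drawn from just two distinct base forms, which gives a sense of how much these steps reduce dimensionality.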

Why, then, is a bag of words approach sufficient for so many tasks?

One possible explanation is that computers and humans process information differently. For a human reading or hearing a sentence, word order helps them predict what is to come next. It helps them process and make sense of what they are hearing as they are hearing it. To make sense of this complex input, human brains need this structure.

Computers may have other shortcomings, but they don’t feel the anxious need to understand input and context as it is received. Perhaps bag of words works because – while word order is crucial for the human brain – it provides unnecessary detail for the processing style of a machine.

I suspect there is truth in that explanation, but I find it unsatisfactory. It implies that poetry and beauty are relevant to the human mind alone – that these are artifacts of processing rather than inherent features of a text.

I prefer to take a different approach: the fact that bag of words models work actually emphasizes the flexibility and beauty of language. It highlights the deep meaning embedded in the words themselves and illustrates just how much we communicate when we communicate.

Linguistic philosophers often marvel that we can manage to communicate at all – the words we exchange may not mean the same thing to me as they do to you. In fact, they almost certainly do not.

In this sense, language is an architectural wonder; a true feat of achievement. We convey so much with subtle meanings of word choice, order, and grammatical flourishes. And somehow through the cacophony of this great symphony – which we all experience uniquely – we manage to schedule meetings, build relationships, and think critically together.

Much is lost in translating the linguistic signal between me and you. We miss each other’s context and reinterpret the subtle flavors of each word. We can hear a whole lecture without truly understanding, even if we try.

And that, I think, is why the bag of words approach works. Linguistic signals are rich; they are fiercely high-dimensional and full of more information than any person can process.

Do we lose something when we reduce dimensionality? When we discard word order and treat a text as a bag of words?

Of course.

But that isn’t an indication of the gaudiness of language; rather it is a tribute to its profound persistence.
