Both gender and language are social constructs, and sociological research indicates a link between the two.
In Lakoff’s classic 1973 paper, Language and woman’s place, she argues that “the marginality and powerlessness of women is reflected in both the ways women are expected to speak, and the ways in which women are spoken of.” This socialization process achieves its end in two ways: teaching women the ‘proper’ way to speak while simultaneously marginalizing the voices of women who refuse to follow the linguistic norms dictated by society. As Lakoff writes:
So a girl is damned if she does, damned if she doesn’t. If she refuses to talk like a lady, she is ridiculed and subjected to criticism as unfeminine; if she does learn, she is ridiculed as unable to think clearly, unable to take part in a serious discussion: in some sense, as less than fully human. These two choices which a woman has – to be less than a woman or less than person – are highly painful.
Lakoff finds numerous lexical and syntactic differences between the speech of men and women. Women tend to use softer, more ‘polite’ language and are more like to hedge or otherwise express uncertainty with in their comments. While she acknowledges that – as of the early 70s – these distinctions have begun to blur, Lakoff also notes that the blurring comes almost entirely in the direction of “women speaking more like men.” Eg, language is still gendered, but has acceptable language grown in breadth for women, while ‘male’ language remains narrow and continues to be taken as the norm.
A more recent study by Sarawgi et al looks more closely at algorithmic approaches to identifying gender. They present a comparative study using both blog posts and scientific papers, examining techniques which learn syntactic structure (using a context-free grammar), lexis-syntatic patterns (using n-grams), and morphological patterns using character-level n-grams.
Sarawgi et al further argue that previous studies made the gender-identification task easier by neglecting to account for possible topic bias, and they therefore carefully curate a dataset of topic-balanced corpora. Additionally, their model allows for any gamma number of genders, but the authors reasonably restrict this initial analysis to the simpler binary classification task, selecting only authors who fit a woman/man gender dichotomy.
Lakoff’s work suggests that there will be lexical and syntactic differences by gender, but surprisingly, Sarawgi et al find that the character-level n-gram model outperformed the other approaches.
This, along with the fact that the finding holds in both formal and informal writing, seems to suggest that gender-socialized language may be more subtle and profound than previously thought. It is not just about word choice or sentence structure, it is more deeply about the very sounds and rhythm of speech.
The character n-gram approach used by Sarawgi is taken from an earlier paper by Peng et al which uses character n-grams for the more specific task of author attribution. They test their model on English, Greek, and Chinese corpora, achieving impressive accuracy on each. For the English corpus, they are able to correctly identify the author of text 98% of the time, using a 6-gram character model.
Peng et al make an interesting case for the value of character n-grams over word n-grams, writing:
The benefits of the character-level model in the context of author attribution are that it avoids the need for explicit word segmentation in the case of Asian languages, it captures important morphological properties of an author’s writing, it can still discover useful inter-word and inter-phrase features, and it greatly reduces the sparse data problems associated with large vocabulary models.
While I initially found it surprising that a character level n-gram approach would perform best at the task of gender classification, the Peng et al paper seems to shed computation light on this question – though the area is still under theorized. If character n-grams are able to so accurately identify the single author of a document, and that author has a gender, it seems reasonable that this approach would be able to infer the gender of an author.
Still, the effectiveness of character n-grams in identifying an author’s gender indicates an interesting depth to the gendered patterns of language. Even as acceptable language for women converges to the acceptable language of men, the subtleties of style and usage remain almost subconsciously gendered – even in formal writing.