One way to compare the similarity of documents is to examine the comparative log-likelihood of word frequencies.
This can be done with any two documents, but it is a particularly interesting way to compare the similarity of a smaller document with the larger body of text it is drawn from. For example, with access to the appropriate data, you may want to know how similar Shakespeare was to his contemporaries. The Bard is commonly credited with coining a large number of words, but it’s unclear exactly how true this is – after all, the work of many of his contemporaries has been lost.
But, imagine you ran across a treasure trove of miscellaneous documents from 1600 and you wanted to compare them to Shakespeare’s plays. You could do this by calculating the expected frequency of a given word and comparing this to the observed frequency. First, you can calculate the expected frequency as:
Where Ni is the total number of words in document i and Oi is the observed frequency of a given word in document i. That is, the expected frequency of a word is: (number of words in your sub corpus) * (sum of observed frequency in both corpora) / the number of words in both corpora.
Then, you can use this expectation to determine a word’s log-likelihood given the larger corpus as:
Sorting words by their log-likelihood, you can then see the most unlikely – eg, the most unique – words in your smaller corpus.