Sunday, December 13, 2009

On Linguistic Fingerprinting

Can an author's writing style be defined by the frequency of unique words in their writings? According to physicist Sebastian Bernhardsson, the answer is yes. He found a couple of interesting facts: 1) the more we write, the more we repeat words and 2) the rate of repetition (or rate of change) seems to be unique to individual authors (creating a "linguistic fingerprint"... literally his words, not mine). Let me walk through his claims and findings, just a bit.

Bernhardsson et al. have a corpus linguistics study in press that compares rates of unique words across short and long form writing (short stories vs. novels vs. corpora). I stumbled onto this research earlier this week when a BBC News headline caught my eye: Rare words 'author's fingerprint': Analyses of classic authors' works provide a way to "linguistically fingerprint" them, researchers say.

The idea of linguistically fingerprinting authors has been around for a while. In some ways it acted as a loss leader decades ago, piquing interest in the use of corpora and statistical methods to study language; now there is even a whole journal called Literary and Linguistic Computing. Plus, there is an established practice of forensic linguistics, where linguistic methods are used to establish authorship of critical legal documents.

However, Bernhardsson makes a bold claim. He claims that the process of writing (a cognitively complex process) can be described as the process of pulling chunks out of a large meta-book which shows the same statistical regularities as an author's real work (he hedges on this a bit, of course). I always shiver when I run across a non-linguist jumping headfirst into linguistics making bold claims like this, but I also recognize that Bernhardsson and his co-authors are pretty smart folks, so I gave them the benefit of the doubt and skimmed one of their two available papers (freely available here).
  • Sebastian Bernhardsson, Luis Enrique Correa da Rocha, and Petter Minnhagen. "The meta book and size-dependent properties of written language." New Journal of Physics (2009), accepted.
I concentrated on the first section because the rest of the paper goes in a direction that wasn't necessary for me to cover (and has lots of scary algorithms; it is Sunday and I do want to watch football, hehe). What they did was count the number of words in a text, then count the number of unique words (the classic type/token distinction). Here's what they found:

When the length of a text is increased, the number of different words is also increased. However, the average usage of a specific word is not constant, but increases as well. That is, we tend to repeat the words more when writing a longer text. One might argue that this is because we have a limited vocabulary and when writing more words the probability to repeat an old word increases. But, at the same time, a contradictory argument could be that the scenery and plot, described for example in a novel, are often broader in a longer text, leading to a wider use of ones vocabulary. There is probably some truth in both statements but the empirical data seem to suggest that the dependence of N (types) on M (tokens) reflects a more general property of an authors language. (my emphasis and additions).

First, let's make sure we get what the authors did. We have to use words more than once, right? I've already repeated the word "we" in just the last two sentences. And we repeat words like "the" and "of" all the time. We have to. So there are types of words, like "the", but there is also the number of times those words get repeated (tokens). It's pretty straightforward to count the total number of words in a story, then count the number of unique word types, giving us a ratio. For example, let's say we have a short story by Author X with 1,000 words in it (= tokens). Then we count how many times each word is repeated and find that there are only 250 unique words (= types). This gives a ratio of 1000/250, or 100/25 (for comparison's sake I'm using this normalized ratio). It means only 25% of the words are unique, which also means that, on average, a word is repeated 4 times in this story.

Now let's take a novel by Author X with 100,000 words (= tokens). After counting repetitions we find it has 11,000 unique words. Our token/type ratio = 100,000/11,000, or 100/11. This means that only 11% of the words are unique, which means that, on average, a word gets repeated about 9 times. That's higher than in the short story: words are being repeated more in the novel. Now let's imagine we take all of Author X's written work, put it together into a single corpus, repeat the process, and discover that the ratio is 100/7 (on average, a word gets repeated about 14 times).

UPDATE: whoa, my maths was off a bit the first time I did this. That'll teach me to write a blog post while watching Indy crush Denver. Sorry, eh.
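For anyone who wants to double-check the arithmetic, here's a quick sanity check in Python (a minimal sketch; the token and type counts are my made-up example numbers from above, not data from the paper):

# Sanity check of the toy type/token arithmetic above.
# The (tokens, types) pairs are the made-up example numbers from this
# post, not data from Bernhardsson et al.
examples = {
    "short story": (1000, 250),
    "novel": (100000, 11000),
}

for name, (tokens, types) in examples.items():
    percent_unique = 100 * types / tokens
    avg_repetitions = tokens / types
    print(f"{name}: {percent_unique:.0f}% unique, "
          f"each word used ~{avg_repetitions:.1f} times on average")

# The hypothetical whole-corpus ratio of 100/7 works out to:
print(f"corpus at 100/7: each word used ~{100 / 7:.1f} times on average")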

This is what the author's found: "The curve shows a decreasing rate of adding new words which means that N grows slower than linear (α less than 1)."

They discovered something potentially even more interesting: the rate of change between these ratios seems to be unique to each author. Here is their graph from the article (H = Thomas Hardy, M = Herman Melville, and L = D.H. Lawrence):
FIG. 1: The number of different words, N, as a function of the total number of words, M, for the authors Hardy, Melville and Lawrence. The data represent a collection of books by each author. The inset shows the exponent α = ln N / ln M as a function of M for each author.
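To make that inset concrete: for a given corpus size M you count the distinct words N and compute α = ln N / ln M. If N grew linearly with M, α would sit near 1; the sublinear growth the authors describe shows up as α drifting downward as M gets larger. Here's a minimal sketch of that calculation (the (M, N) pairs below are invented for illustration, not numbers from the paper):

import math

# Invented (M, N) pairs for illustration only: total tokens M and
# distinct word types N at increasing corpus sizes for one hypothetical author.
samples = [
    (10000, 2300),
    (50000, 7100),
    (200000, 18500),
    (1000000, 55000),
]

for m_tokens, n_types in samples:
    alpha = math.log(n_types) / math.log(m_tokens)
    print(f"M = {m_tokens:>9,}  N = {n_types:>7,}  alpha = {alpha:.3f}")

# With sublinear growth (alpha < 1), N/M shrinks as M grows:
# each additional word is more and more likely to be a repeat.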

Their conclusions about the meta-book and linguistic fingerprint:

These findings lead us towards the meta book concept: The writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book which gives a representation of the word frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather to the extent of the vocabulary, the level and type of education and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function N(M) for the different authors, opens up for the speculation that every person has its own and unique meta book, in which case it can be seen as a fingerprint of an author. (my emphasis)

They are quick to point out that this finding says nothing about the semantic content of the writings. So what does it say? I admit I had a hard time seeing any conclusion about cognition or the writing process. Even though I find the methodology interesting, I'm just not at all sure what it really says about the human brain and language, if anything at all. The speculation that "every person has their own unique meta book" is bold. Unfortunately, it is also almost entirely untestable. Keep in mind that this research had zero psycholinguistic component; they were just counting words on pages. I'd caution against drawing any conclusions about the human language system based solely on this work. (I should note that I skipped one of the most interesting findings: that which section of a work you sample doesn't matter, only its size. Meaning, if I understood that part correctly, they took random chunks from their corpora and found the same patterns; there's a rough sketch of that idea below.) Which raises the question: why is this being published in a physics journal? It's appearing in the New Journal of Physics, and a quick perusal of articles from previous issues doesn't show anything remotely similar to this work (no surprise).
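Just to illustrate that random-chunk idea, here's a toy reconstruction of what I take the procedure to be (my own rough sketch, not their code or their exact method): pull same-sized chunks from different places in a text, count tokens and types in each, and check whether the type/token ratio depends on where the chunk came from or only on how big it is.

import random
import re

def type_token_counts(words):
    """Return (tokens, types) for a list of word strings."""
    return len(words), len(set(words))

def random_chunk_ratios(text, chunk_size, n_chunks=5, seed=0):
    """Sample random contiguous chunks of chunk_size words and
    report the fraction of unique words in each chunk."""
    words = re.findall(r"[a-z']+", text.lower())  # crude, unstemmed tokenizer
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_chunks):
        start = rng.randrange(0, len(words) - chunk_size)
        chunk = words[start:start + chunk_size]
        tokens, types = type_token_counts(chunk)
        ratios.append(types / tokens)
    return ratios

# Example usage (assumes a plain-text novel saved locally, e.g. from
# Project Gutenberg; the file name is just a placeholder):
# text = open("moby_dick.txt", encoding="utf-8").read()
# print(random_chunk_ratios(text, chunk_size=5000))
# If the size-dependence claim holds, the ratios should cluster tightly
# for a fixed chunk size, wherever in the book the chunks come from.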

I'm a fan of corpus linguistics, but I'm also a fan of caution. I'm not convinced any conclusions about the psycholinguistics of the complex writing process can be drawn from this work. Not as yet. But interesting, nonetheless.

FYI: it's easy enough to fact-check some of these results using freely available tools, namely KWIC Concordance. This tool will take any text and count the total tokens and the number of repeats for us. I did this for Melville's Bartleby, the Scrivener and Moby Dick. I got text versions of each from Project Gutenberg, then ran the wordlist function within KWIC, and here are my results:

Bartleby
Total Tokens: 18111
Total Types: 3462
Type-Token Ratio: 0.191155

Moby Dick
Total Tokens: 221912
Total Types: 17354
Type-Token Ratio: 0.078202

Bartleby = 0.191155
Moby Dick = 0.078202

Yep, the short story Bartleby has a higher proportion of unique words (a higher type-token ratio) than the much longer Moby Dick, even though the novel has more unique words in absolute terms. FYI, this is a weak test simply because the tokens are not stemmed, meaning morphological variants are treated as different words. I don't know whether this is consistent with Bernhardsson's methodology or not.
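If you don't want to install a concordancer, a few lines of code will do a similar rough count. This is just my own crude, unstemmed tokenization (an approximation that almost certainly won't match KWIC's counts exactly, since tokenization rules differ; the file names are placeholders for wherever you saved the Gutenberg texts):

import re

def type_token_ratio(path):
    """Crude type/token ratio: lowercase, split on letters and apostrophes,
    no stemming, so 'whale' and 'whales' count as different types."""
    text = open(path, encoding="utf-8").read()
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

for path in ["bartleby.txt", "moby_dick.txt"]:  # placeholder file names
    tokens, types, ttr = type_token_ratio(path)
    print(f"{path}: {tokens} tokens, {types} types, TTR = {ttr:.6f}")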


2 comments:

Anonymous said...

As I see these kinds of articles coming out more and more, I have a suggestion. Would one want to collect these (and similar) as a list? Of course, from a linguistic point of view quasi linguistic works are not interesting, but one could just point people who take these kinds of articles seriously to a list and to some discussion? My recent "find" that inspired this comment is the following blog post by "Dusk in Autumn":

http://akinokure.blogspot.com/2009/12/great-moderation-of-ego-during-1980s.html

Unknown said...

I am working with High School students and find it useful to use Wordle online as a visual and analytical tool for their own writing and for rapid take on readings they have to do in terms of word frequency and salience. We have discussed the proposition that everyone has their own writing personality which now I will refer to as fingerprint as it seems to fit quite well. They use the tool to add more power and precision to their writing by identifying and replacing terms that we identify as non-useful (words like nice, good, prepositional verbs rather than their more academic alternatives, it, this, do, make...)
