What’s Your “Cinnamon Word”? The Stats on How Authors Use Language

Grady Hendrix’s recent stats-focused piece on Stephen King’s body of work reminded me of a volume I’ve been meaning to recommend publicly for some time. Back in May, browsing the “Essays/Literary Criticism” section of a local bookstore, I chanced upon a book that so thoroughly captivated me I spent nearly an hour turning its pages while standing in the exact same spot I’d been standing when I first pulled it off the shelf. Fortunately—or so I like to tell myself—it was a slow day at the lit crit section, and I didn’t impede access to these shelves while I rapturously bounded from one enthralling section of the book to the next, from one hypnotic table to another, from one dazzling bar chart to another.

Tables? Bar charts? In a book of literary criticism, you ask? Indeed, for this one is a rare specimen, a marriage of literary analysis and… statistics.

In Nabokov’s Favorite Word is Mauve: What the Numbers Reveal About the Classics, Bestsellers, and Our Own Writing, statistician and journalist Ben Blatt seeks to answer a number of fascinating questions about writers and their various techniques through sophisticated statistical analyses. And for the most part, he does. Bravo!

I’ll give you an example, related to Stephen King. In his book On Writing, King suggests that writers should use adverbs (meaning specifically adverbs ending in “-ly”) sparingly. Other writers, both preceding and following King, have shared this same advice. Blatt wonders how well the fiction of these writers measures up to their exacting standards for “-ly” adverb usage, and he uses data analysis to find out. Crunching the numbers for the bodies of work of fifteen writers, a mix of popular and award-winning folks, Blatt calculates that Hemingway uses “-ly” adverbs the most sparingly, at a rate of only 81 per 10,000 words throughout ten major works. Stephen King, meanwhile, is roughly in the middle of the list, with a usage of 105 “-ly” adverbs per 10,000 words over the course of 51 novels. J. K. Rowling, for another genre comparison, is much higher, at 140 per 10,000 words.
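For readers who like to tinker, here is a rough sketch of how a rate like “-ly adverbs per 10,000 words” might be computed. To be clear, this is not Blatt’s method, which this piece doesn’t reproduce: the suffix matching is a crude stand-in for real part-of-speech tagging, and the NOT_ADVERBS exclusion list and the function names are my own hypothetical choices.

```python
import re

# Hypothetical, deliberately incomplete list of "-ly" words that are
# not adverbs. A serious analysis would use a part-of-speech tagger.
NOT_ADVERBS = {
    "family", "ugly", "holy", "lonely", "lovely", "silly", "belly",
    "fly", "july", "reply", "supply", "apply", "italy", "rely",
    "ally", "bully", "jelly",
}


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ly_rate_per_10k(text):
    """Return the number of likely "-ly" adverbs per 10,000 words."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.endswith("ly") and w not in NOT_ADVERBS)
    return 10_000 * hits / len(words)


sample = "She quickly and quietly left the family home"
print(ly_rate_per_10k(sample))
```

On a one-sentence sample the rate is absurdly high, of course; the figure only becomes meaningful over a novel-length text, where it can be set beside numbers like the ones quoted above.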

After seeing this first table on page 13, I was hooked. Immediately, questions popped up in my mind, such as, does “-ly” adverb usage change over time for individual authors? (In some cases, most certainly. Pages 15-16 of the book explore this for Hemingway, Steinbeck, and Faulkner). Is there some correlation between “-ly” adverb frequency and “greatness,” as defined by books making it onto various lists of classics? (See pages 17-19 for the answer). How about correlations with popularity, as measured for example by Goodreads ranking? (Pages 19-25). Do fan-fiction writers tend to deploy “-ly” adverbs with the same frequency as professional authors? (Pages 26-29).

Blatt, by the way, is helpfully transparent with the assumptions he makes, the methodology he uses, and the limitations he himself is aware of in the results. Time and again he cautions us not to read too much into a particular statistical finding and to consider other factors that may be at play.

With the same irrepressible enthusiasm displayed in that opening chapter, Blatt proceeds to apply data analysis to research gender differences in fiction, whether authors can be said to have a numerically measurable literary “fingerprint,” whether authors tend to follow in their own works the advice they dole out to others, the complexity and grade levels of bestsellers over time, differences between U. K. and U. S. usage, authors’ use of clichés, the various percentages of cover space taken up by authors’ names, and the use of specific techniques to start and end sentences, as well as the general properties of classic opening lines.

It’s intoxicating stuff. You can practically flip to any page of Blatt’s book and discover something compelling about language usage. The chapter on clichés, I’ll admit, quickly became a favorite. I’ve often witnessed discussions on social media, usually initiated by writers, about over-used words. One stylistic device—sometimes implemented knowingly, sometimes not—is the repetition of a word or phrase at the start of consecutive sentences (this is called anaphora). I love the table on page 150 that shows some of the books with the highest percentage of one-word anaphora. Virginia Woolf’s The Waves is at 16%! If you’ve read The Waves, that won’t be shocking, but it’s a cool way to quantify part of Woolf’s technique. (Page 151, if you’re curious, features a table of two-word anaphora percentages, to eliminate the simple repetition, for example, of sentences that begin with “the.” The Waves is still at the top of the list.) Can you guess the bestselling genre author who also has a high percent of one-word anaphora?

(Okay, I’ll reveal the answer: Neil Gaiman. Again, if you’ve read The Ocean at the End of the Lane, that’s not surprising.)
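If you’d like to try the anaphora count on a text of your own, here is a minimal sketch. It rests on assumptions of mine rather than on anything Blatt specifies: sentences are split naively on end punctuation, a sentence counts as anaphoric when its first n words match the first n words of the sentence before it, and the percentage is taken over all sentences. The function names are hypothetical.

```python
import re


def sentence_openers(text, n=1):
    """Return the first n words of each sentence as a tuple."""
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    openers = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if words:
            openers.append(tuple(words[:n]))
    return openers


def anaphora_percent(text, n=1):
    """Percentage of sentences that repeat the previous sentence's opener."""
    openers = sentence_openers(text, n)
    if len(openers) < 2:
        return 0.0
    repeats = sum(
        1
        for prev, cur in zip(openers, openers[1:])
        if len(cur) == n and prev == cur
    )
    return 100 * repeats / len(openers)


sample = "I came. I saw. I conquered. The end."
print(anaphora_percent(sample, n=1))  # one-word anaphora
print(anaphora_percent(sample, n=2))  # two-word anaphora
```

Passing n=2 gives the stricter two-word version, which filters out runs of sentences that merely happen to start with “the.”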

Beyond simple anaphora, Blatt tackles actual clichés. As usual in most of these analyses, he uses an external reference as an authority, rather than attempting to define terms—in this case the cliché—for himself. Here he leans on Christine Ammer’s The Dictionary of Clichés (2013), which compiles some 4,000 clichés. Examining hundreds of novels by fifty authors, Blatt then calculates the number of clichés per 100,000 words (p. 158). Top of the list: James Patterson, with 160. At the other extreme is Jane Austen, with only 45. Stephen King is on the high end, with 125, while J. K. Rowling, with 92, is roughly at the same level as Dan Brown, with 93. What about clichés used by authors in more than half of their works (p. 156)? Ray Bradbury, for example, really likes “at long last”; George R. R. Martin relishes “black as pitch”; Rick Riordan tends to repeat “from head to toe”, and Tolkien gravitates towards “nick of time.”

Blatt also explores the usage frequency of different types of similes, like animal-related similes, and then moves on to the type of word that gives this piece its title, the “cinnamon word.” This refers to a specific word used by an author much more frequently than other authors use it, and stems from Bradbury’s affinity for the word “cinnamon”, which he uses 4.5 times more often than the word appears in the Corpus of Historical American English (a repository of over 400 million words of searchable text from the 1810s through the 2000s). It turns out that Bradbury uses spice-related words quite often: he uses “spearmint,” for instance, 50 times more often than it appears in the Corpus of Historical American English. Bradbury also uses the word “ramshackle” more often than at least fifty other writers Blatt considers. Blatt’s criteria for cinnamon words exclude proper nouns and demand that the words occur in at least half of an author’s works, that they appear at least once per 100,000 words, and that they’re not super-obscure (he defines this). But what about non-proper-nouns appearing at the rate of at least 100 per 100,000 and occurring in all of an author’s works? These Blatt terms “nod” words. These are closer to tics, if you will.
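To make those thresholds concrete, here is a toy sketch of how one might hunt for cinnamon and nod words. It is an illustration under assumptions of mine, not Blatt’s procedure: each book arrives as a list of lowercase tokens with proper nouns already removed, reference_rates is a hypothetical dictionary of per-100,000-word rates standing in for a reference corpus, the obscurity filter is skipped entirely, and nod words are simply ranked by raw rate.

```python
from collections import Counter


def word_stats(works):
    """Map each word to (rate per 100,000 words, share of works using it).

    works: a list of token lists, one list per book.
    """
    total = sum(len(book) for book in works)
    counts = Counter(tok for book in works for tok in book)
    in_works = Counter(tok for book in works for tok in set(book))
    return {
        word: (100_000 * count / total, in_works[word] / len(works))
        for word, count in counts.items()
    }


def cinnamon_candidates(works, reference_rates, top=3):
    """Rank words by how much more often the author uses them than the reference."""
    ranked = []
    for word, (rate, share) in word_stats(works).items():
        ref = reference_rates.get(word)
        # Thresholds from the review: at least once per 100,000 words,
        # and present in at least half of the author's works.
        if ref and rate >= 1 and share >= 0.5:
            ranked.append((rate / ref, word))
    return [word for _, word in sorted(ranked, reverse=True)[:top]]


def nod_words(works, top=3):
    """Words at 100+ per 100,000 that occur in every work (ranked by rate)."""
    hits = [
        (rate, word)
        for word, (rate, share) in word_stats(works).items()
        if rate >= 100 and share == 1.0
    ]
    return [word for _, word in sorted(hits, reverse=True)[:top]]


books = [
    ["cinnamon", "the", "the", "dog"],
    ["cinnamon", "the", "cat", "the"],
    ["the", "the", "the", "bird"],
]
reference = {"cinnamon": 1.0, "the": 50_000.0, "dog": 10.0}
print(cinnamon_candidates(books, reference))
print(nod_words(books))
```

With three four-word “books” the rate thresholds are trivially met, so the toy data mostly exercises the share-of-works test; on real novels the per-100,000 cutoffs are what do the filtering.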

The four-page table (!) on pp. 173-176, a true thing of beauty, summarizes the top three cinnamon words and the top three nod words for fifty authors. Some genre examples: Ray Bradbury’s nod words are “someone, cried, boys”, Cassandra Clare’s are “blood, hair, looked”, George R. R. Martin’s are “lady, red, black” and Lemony Snicket’s are “siblings, orphans, children.” (Of course, these results are influenced by which books Blatt included in the analyses; these don’t always extend to complete bibliographies, sometimes focusing only on popular series. Since he only looks at Asimov’s Foundation series, for example, it makes sense that Asimov’s top three cinnamon words would come up as “galactic, terminus, councilman”.)

There’s so much more of interest, but I don’t want to spoil too much. I’ll mention one more counting exercise I found intriguing. About a decade ago I was reading an essay on effective openings, and the discussion included some thoughts on the pros and cons of using description and weather imagery in an opening. Ever since then, I’ve wondered which authors tend to open with descriptions more than others. The answer is spelled out on p. 207. It turns out that romance is big on weather-related openings. A whopping 46% of the 92 novels by Danielle Steel do so, and 22% of Nicholas Sparks’ 18 novels have the distinction as well. In between them? John Steinbeck, at 26%. Hmmm.

As you’ve been reading about some of these statistical exercises, you’ve probably started to formulate your own objections or caveats. What about X or Y, you say? In the counts on “-ly” adverb usage, for instance, I wondered if the study should be historically normalized in some sense, since it’s not clear a priori that “-ly” adverb usage has held steady across historical periods; if it hasn’t, some books would automatically be weighed more heavily than others based on their date of composition. When Blatt discusses Goodreads rankings on p. 21, it occurred to me that these ratings are merely a reflection of contemporary taste, rather than a proxy measure of a book’s success over its lifetime. When Blatt points to Khaled Hosseini’s The Kite Runner as a work in which the author “offers a defense of clichés” on p. 161, I think we shouldn’t lose sight of the fact that Hosseini himself isn’t defending anything, but describing the position of one of his characters. And so on. Indeed, the very title of the book invites disputation: just because “mauve” is Nabokov’s top cinnamon word (followed by “banal” and “pun”, oh dear), can we really say it’s his favorite? Some writers grow to dislike words they use often. Maybe Nabokov’s favorite word is one he hardly ever used, reserving it for special occasions. Who can tell? Still, rather than looking on these objections as flaws, I believe that this is one of the book’s pleasures: it invites us to engage in critical thinking about the subject matter.

Throughout the book—and in some of the examples I’ve mentioned—Blatt includes science fiction and fantasy writers in his surveys. Science fiction authors often like to claim dibs on popular scientific/technical notions, and when I first mentioned Blatt’s book I said it was “a rare specimen” rather than one-of-a-kind. That’s because I’m aware of at least one earlier volume of data analysis applied to literary matters, a precedent that concerns a well-known science fiction writer. The book in question is Asimov Analyzed (1970) by Neil Goble. I haven’t read it in thirteen years and can’t vouch for its charm. Even at my most enthusiastic I think I’d endorse it only to hardcore Asimov fans with time and patience on their hands. Goble, working on this project in the 1970s, couldn’t benefit from the mass text-digitization and sophisticated software at Blatt’s disposal. His work is consequently more limited, with most of his “conclusions” based on small word samples within larger works. On the other hand, he considers some issues that Blatt doesn’t touch on (but only in the context of Asimov’s work), and there’s something to be said in favor of being a pioneer, at least within our genre.

While their methods and scopes are radically different, Blatt and Goble both illustrate how data analysis and literary criticism can be allies rather than foes. These books are motivated by an inquisitive and thoughtful spirit. The goal is to better understand writers and their works via non-traditional, but empirically reproducible, means.

For those of you with an analytical bent, Blatt’s numerous “literary experiments” will inform and amuse, and perhaps provoke curiosity about authors you haven’t read. For the writers among you, it’s sure to generate heightened awareness of the many writing-related choices that go into the assemblage of a text.

What’s your cinnamon word?

Nabokov’s Favorite Word Is Mauve is published by Simon & Schuster.

Alvaro Zinos-Amaro is the author of the Hugo- and Locus-finalist Traveler of Worlds: Conversations With Robert Silverberg (2016). Alvaro has published many stories, essays, reviews, and interviews, as well as Rhysling-nominated poetry.


