Vocabulary

06 December 2009
archiving

One day, driving home after work, I was thinking about the monstrous collections of personal logs, emails, etc., that I've accumulated over the years. At the same time, it came back to me that, on average, people use around 2,000-3,000 words in their "day-to-day" speech. And I thought to myself, "I wonder how many words I use?"

So this evening I ran an experiment. I took my database of instant message logs, queried it for only my messages, split them into words, filtered out the crap (common words, contractions, single-character punctuation marks, etc.), and ran some numbers. I worked from a dictionary of 97,070 words (extracted from an Ubuntu Linux installation), a list of 564 common words and contractions, and 90,264 messages (549,619 words) that I sent to various recipients from 2003 to 2008. The results:

My longest word is "transubstantiation" (used once).
I used 5,242 words exactly once.
I used 9,736 words that could not be found in the dictionary (possibly misspellings, acronyms, etc.).
I made use of 13,239 dictionary words.
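
The counting steps described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the script that produced the numbers: the function name vocabulary_stats, the tokenizing regex, and the shape of the inputs (plain strings and sets of lowercase words) are all made up for illustration.

```python
import re
from collections import Counter

def vocabulary_stats(messages, dictionary, common_words):
    """Compute rough vocabulary metrics over an iterable of message strings.

    dictionary and common_words are sets of lowercase words (hypothetical
    stand-ins for the system word list and the common-word list).
    """
    counts = Counter()
    for message in messages:
        # Assumed tokenization: lowercase, keep runs of letters and
        # apostrophes, which drops punctuation and digits.
        for word in re.findall(r"[a-z']+", message.lower()):
            word = word.strip("'")
            if word and word not in common_words:
                counts[word] += 1

    in_dictionary = {w for w in counts if w in dictionary}
    return {
        "dictionary_words": len(in_dictionary),
        "unknown_words": len(counts) - len(in_dictionary),
        "used_once": sum(1 for n in counts.values() if n == 1),
        "longest_word": max(counts, key=len) if counts else None,
    }
```

Fed the real logs, the dictionary, and the common-word list, a function like this would return the same kinds of figures as the list above, though the exact counts depend heavily on how words are split and filtered.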

Initially I thought to myself, "Man, I'm sort of a big deal." 13,000+ unique words used? Maybe I was off in my calculations. I tried it with a few other people in the database. Everyone I tried fell between 2,000 and 5,000 words, with similarly scaled numbers for the other metrics. I started to wonder if there was a problem with my method. I can think of a few potential problems right off:

Instant messaging isn't Shakespeare to most people, so, in all likelihood, people are going to spell poorly and use a limited vocabulary.
Because the database was compiled from my IM conversations, I have by far the most messages of any user (90,000+ for me, followed by 20,000, then 11,000). It makes sense that a larger set of data would turn up more unique words.
IM conversations tend to be short, exchange relatively little information (compared to, say, emails, phone conversations, blog entries), and are not subject to analysis either before or after a message is sent (most people don't record their conversations). In general, I would suppose that this limits the topics, and consequently, the vocabulary used to conduct an IM conversation.
My algorithm can't take into account whether a word I use is a base word or a form derived from one. For instance, the way it works currently, if I've used the word "definite," it will be counted separately from "definitely" or "indefinite."
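
One way to check the sample-size concern above would be to compare users on equally sized random samples of messages instead of on everything they sent. The following is a rough Python sketch, assuming each user's messages are available as a list of strings; the function name unique_words_in_sample, the regex, and the fixed seed are hypothetical choices, not part of the original experiment.

```python
import random
import re

def unique_words_in_sample(messages, sample_size, seed=0):
    """Count distinct words in a random sample of up to sample_size messages."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    sample = rng.sample(messages, min(sample_size, len(messages)))
    words = set()
    for message in sample:
        # Same assumed tokenization as before: lowercase letters and apostrophes.
        words.update(re.findall(r"[a-z']+", message.lower()))
    return len(words)
```

Drawing, say, 11,000 messages from each of the three heaviest users would put everyone on the same footing, so any remaining gap in unique words would be less likely to come from message volume alone.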

According to Wikipedia, and somewhat confirmed by a book or two I've read on language, children at the age of 5 or 6 have a full vocabulary of about 3,000 words. In addition, children learn on average 3,000 words per year while they are in school. Starting from my 3,000-word base, from kindergarten until graduation from high school (17 years), I should have somehow accrued 3,000 + 17*3,000 = 54,000 words. According to my chat logs, I'm pretty far behind.

Given that my data set is suspect and not perfectly relevant to the question I started with, and that my calculations may be inflated and/or inaccurate, I'm not sure that this experiment has really yielded anything useful. About the only thing I can take away from it is that I treat instant message conversations differently from everyone else, in that I frequently use uncommon words, or at least use them at a greater rate than the people I converse with.

I guess that's something.

(written around 24.4 years old)
