Escape Characters

Vocabulary

One day, driving home after work, I was thinking about the monstrous collections of personal logs, emails, etc., that I've collected over the years. At the same time, I remembered reading that, on average, people use around 2,000 to 3,000 words in their "day-to-day" speech. And I thought to myself, "I wonder how many words I use?"

So this evening I ran an experiment. I took my database of instant message logs, queried it for only my messages, split the messages into words, filtered out the crap (common words, contractions, single-character punctuation marks, etc.), and ran some numbers. Given a dictionary of 97,070 words (extracted from an Ubuntu Linux installation), a list of 564 common words and contractions, and 90,264 messages (549,619 words) sent by me from 2003 to 2008 to various recipients:
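For the curious, the filtering and counting steps described above might look something like this in Python. This is only a rough sketch, not the script that produced the numbers in this post: the function names (`tokenize`, `vocabulary_stats`), the regular expression, and the tiny sample data are all made up for illustration, and loading the real dictionary, common-word list, and message logs is left out.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters and apostrophes. Everything else
# (spaces, digits, punctuation marks) is treated as a separator and dropped.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(message):
    """Lowercase a message and split it into word-like tokens."""
    tokens = (t.strip("'") for t in WORD_RE.findall(message.lower()))
    return [t for t in tokens if t]


def vocabulary_stats(messages, dictionary, common_words):
    """Count total words and the distinct uncommon dictionary words used."""
    counts = Counter()
    total_words = 0
    for message in messages:
        for word in tokenize(message):
            total_words += 1
            if word in common_words:
                continue  # skip common words and contractions
            if word in dictionary:
                counts[word] += 1  # only count words the dictionary knows
    return {
        "total_words": total_words,
        "unique_words": len(counts),
        "counts": counts,
    }


# Hypothetical sample data, standing in for the real logs and word lists.
sample_messages = [
    "I don't think the zeitgeist is that obvious.",
    "The zeitgeist, obviously!",
]
sample_dictionary = {"think", "zeitgeist", "obvious", "obviously",
                     "the", "is", "that"}
sample_common = {"i", "don't", "the", "is", "that"}

stats = vocabulary_stats(sample_messages, sample_dictionary, sample_common)
print(stats["total_words"], stats["unique_words"])
```

Checking words against a dictionary keeps typos and chat slang out of the count, at the cost of also dropping real words the dictionary happens not to include.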

Initially I thought to myself, "Man, I'm sort of a big deal." 13,000+ unique words used? Maybe my calculations were off. I tried it with a few other people in the database. Everyone I tried fell between 2,000 and 5,000 words, with similarly scaled numbers for the other metrics. I started to wonder if there was a problem with my method. I can think of a few potential problems right off:

According to Wikipedia, and somewhat confirmed by a book or two I've read on language, children at the age of 5 or 6 have a full vocabulary of about 3,000 words. In addition, children learn on average 3,000 words per year while they are in school. Starting from my 3,000-word base, from kindergarten until graduation from high school (17 years), I should have somehow accrued 3,000 + 17*3,000 = 54,000 words. According to my chat logs, I'm pretty far behind.
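The back-of-the-envelope estimate above is easy to replay with different inputs. Here is a small sketch; the function name `expected_vocabulary` is made up for illustration, and the number of school years is a parameter so you can try other assumptions (13 years, for kindergarten through 12th grade, gives 42,000 instead).

```python
def expected_vocabulary(base_words, words_per_year, years_in_school):
    """Starting vocabulary plus a constant number of new words per year."""
    return base_words + words_per_year * years_in_school


estimate = expected_vocabulary(3_000, 3_000, 17)
observed = 13_000  # the "13,000+" unique words from the chat logs, as a floor

print(estimate)             # 54000
print(estimate - observed)  # 41000
```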

Given that my data set is suspect and not perfectly relevant to the question I started with, and that my calculations may be inflated and/or inaccurate, I'm not sure that this experiment has really yielded anything useful. About the only thing I can take away from it is that I treat instant message conversations differently from everyone else, in that I frequently use uncommon words, or at least use them at a greater rate than the people I converse with.

I guess that's something.