TL;DR; Jump straight to the visualisation
Let’s say you wanted to briefly know what was happening in the last couple of years (think any number).
If the number is in range of millions, or billions, then Stephen Hawking's Universe and Neil deGrasse Tyson's Cosmos series, or A Short History of Nearly Everything might be good starting point.
If the number is in range of thousands, or hundreds, some options might be to binge watch Crash Course World History series, listen to Dan Carlin's Hardcore History podcast, or perhaps read Sapiens: A Brief History of Humankind.
If the number is in smaller range and you’re interested more in the US, A People's History of the United States and Economix can give some answers.
So that’s what happened. But what was being written about in the news?
There are many online newspaper archives to scan through, to get in-depth look.
Fortunately, computers can do the hard work. One of the archives that is both extensive and understandable by computer is New York Times TimesMachine which offers digitized prints since 1851 until today. The archive is also available for search in the online version.
For developers (and computers), there is New York Times Archive API.
Every article contains it’s headline and excerpt of the text. Some of the texts were extracted from digital scans of the prints, so there are some mispellings.
Both headline and excerpt of the text are split into words based on following rules
- words are separated by space, or dash (therefore Hong Kong would be split to “Hong” and “Kong”)
- typos are not corrected
- numbers, emails, phone numbers, websites are ignored
Applying the rules to all 14,076,651 articles published between 1851 and 2022 yields 1,689,503 unique words.
For comparison, Oxford English Dictionary contains over 600,000 words.
GPT-Neo (Aritificial Intelligence trained by reading a lot of text) estimates the number of words english speaker knows ranges from 20,000 to 100,000.
According to the Global Language Monitor, the current estimated number of English-language words is 1,066,095 as of June 2022, since a new word is created every 98 minutes.
The Newspapers section of the Corpus of Contemporary American English contains 122,959,393 words.
The NYT dataset contains many names, locations, business names, mispelled words, words in different languages, words with different accents, therefore the high count of unique words.
Spell correction (huor -> hour), accent removal (über -> uber), stemming (iraqi -> iraq) is omitted for now, because all of the tested algorithms wiped out some relevant words.
The result of visualising the counts of words from 1851 until today can be found at wordeebee.com.
wordeebee visualises the history of each word found in the dataset and keeps record of the most mentioned words since 1851.
Following visualisation shows top 6 most mentioned words in the last 20 years. Note, that commonly used words (and, the, today, …) are ignored. (Although you can still find them on wordeebee)
Click any word to open the visualisation in wordeebee. To see more words, click any year to open a detailed view.
Be sure to check out wordeebee.com and share if you find something interesting.
What will happen next?#
We cannot know, but if you are interested in some ideas, try
Also, if you liked the article, please share it on
For any questions/suggestions, give a shout to @the_datanaut on Twitter, Instagram, or Reddit.