ngram.sh: a script for extracting Google Ngram data

What is ngram.sh?

The Google Ngram Viewer is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is Google Charts. It’s cool. It’s pretty. It’s hard to use for academic work because it doesn’t easily give up the raw data.

This page explains how to run a script — ngram.sh — on a *NIX shell account and extract your keywords into spreadsheet-ready files, without buying a terabyte hard drive.

Before mapping out your research project, you should spend some time testing out search terms in the Google Ngram Viewer, reading this page’s bibliography and selecting one of the Ngram datasets.

How ngram.sh works

By default, ngram.sh is configured for 1gram searches of the English Version 20090715 dataset. It downloads ten ~210 MB ZIP files, one at a time, unpacks the ~1000 MB CSV inside each one, and searches it for each keyword with grep. It writes the matches to a CSV named after the keyword, and then deletes the source file. You need a MINIMUM of 2 GB of free disk space to run the default script (sorry, your SFU diskshare isn’t big enough).
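The cycle described above can be sketched roughly as follows. This is an illustration, not the contents of ngram.sh: the function names, the BASE_URL variable (which you would have to point at the dataset's download location yourself) and the file-name pattern in PREFIX are all assumptions.

```shell
#!/bin/bash
# Sketch of the download / unpack / search / delete cycle.
# PREFIX is an assumed file-name pattern; BASE_URL must be supplied by you.
PREFIX="googlebooks-eng-all-1gram-20090715"

# search_file FILE WORD...: append lines containing each keyword to WORD.csv
search_file() {
    local file="$1" word
    shift
    for word in "$@"; do
        # grep exits 1 when nothing matches; that is not an error here
        grep -- "$word" "$file" >> "${word}.csv" || true
    done
}

# process_part N WORD...: fetch part N, unpack it, search it, delete it
process_part() {
    local part="$1"
    shift
    if [ -z "${BASE_URL:-}" ]; then
        echo "set BASE_URL to the dataset download location first" >&2
        return 1
    fi
    local zip="${PREFIX}-${part}.csv.zip"
    local csv="${PREFIX}-${part}.csv"
    curl -fsSL -o "$zip" "${BASE_URL}/${zip}" || return 1
    unzip -o -q "$zip" || return 1
    rm -f "$zip"                 # the archive is no longer needed
    search_file "$csv" "$@"
    rm -f "$csv"                 # free the ~1000 MB before the next part
}

# Example (downloads about 2 GB in total, so it is left commented out):
# for part in 0 1 2 3 4 5 6 7 8 9; do process_part "$part" telegraph telephone; done
```

Because grep matches substrings, a keyword such as "cat" also pulls in "category"; that is one reason the results need cleaning afterwards.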

This script can take quite a while to run. In part that’s because it’s searching large amounts of data, but mostly it’s because it’s downloading lots of data. Unless you’re running on a very fast Internet connection, bandwidth is your bottleneck. You can speed things up for future searches by removing the lines that delete the source (ZIP or CSV) files. You could then modify the script to run off your hard drive without downloading anew. However, you must have upwards of 10 GB available to do this with 1grams. If you change the script to process 2-grams or higher, WATCH OUT! A multi-keyword search of 5-grams without deletes can easily top a terabyte of data!
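One way to run off the hard drive is to download each archive only when it is not already on disk. The sketch below assumes a cache directory and a helper named fetch_cached; both are hypothetical and do not appear in ngram.sh.

```shell
# Sketch: keep downloaded archives in a local cache so later searches skip the download.
# CACHE_DIR and fetch_cached are hypothetical names, not part of ngram.sh.
CACHE_DIR="${CACHE_DIR:-$HOME/ngram-cache}"

# fetch_cached URL: print the path of the cached copy, downloading only if it is missing
fetch_cached() {
    local url="$1"
    local target="${CACHE_DIR}/${url##*/}"   # cache file is named after the last URL segment
    mkdir -p "$CACHE_DIR" || return 1
    if [ ! -s "$target" ]; then
        # download under a temporary name so an interrupted transfer
        # is never mistaken for a complete file
        curl -fsSL -o "${target}.part" "$url" && mv "${target}.part" "$target" || return 1
    fi
    printf '%s\n' "$target"
}
```

The first run still pays the full bandwidth cost; only repeat searches against the same dataset get faster.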

It is really, REALLY easy to fill every last sector of your hard drive — or bust your bandwidth cap! — with this script. Be cautious and pay attention.
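A guard along these lines could be placed before any download starts. It is a minimal sketch, assuming a POSIX df; the function name require_free_gb is made up for illustration.

```shell
# require_free_gb N [DIR]: succeed only if DIR (default: current directory)
# has at least N GB free. Hypothetical helper, not part of ngram.sh.
require_free_gb() {
    local need_gb="$1" dir="${2:-.}" free_kb
    # df -P gives one portable line per filesystem; -k reports 1024-byte blocks
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example: refuse to start the default 1gram run with less than 2 GB free
# require_free_gb 2 || { echo "not enough disk space" >&2; exit 1; }
```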

How to run ngram.sh

You can run this script on any UNIX or *NIX system, including a Mac running OS X or later.

Download ngram.sh. The script will download after you click through the registration.

Permissions: Some people like to run their scripts with “bash ngram.sh” and leave the file permissions unchanged; others prefer to set executable permissions with “chmod 755 ngram.sh” and run with “./ngram.sh”.

Configuration: The very first line will need to be edited with your path to bash. Replace the 1grams on line 41 (beginning “for word in”) with your search keywords. Save and run the script.
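After editing, the two places mentioned above might look something like this. The bash path and the keywords are examples only; running “command -v bash” in your shell prints the path to use on the first line.

```shell
#!/bin/bash
# ^ replace /bin/bash with whatever "command -v bash" prints on your system

# Hypothetical keyword list; in ngram.sh the real one is on the line beginning "for word in"
for word in telegraph telephone radio; do
    echo "would search for: $word"
done
```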

Cleaning the results: I like to manipulate the CSVs in MS Excel, but any spreadsheet application will do (even Google Docs). When reading the results of your query, you’re likely to discover that you grabbed a bunch of words you didn’t intend to, that you have to collapse a bunch that are similar, and that you missed a few you wanted. Keyword selection is a science and an art. Just modify your script, set it to run, make yourself some tea and hope you don’t cause your ISP to blow a gasket.
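Some of this cleaning can be done in the shell before the files reach a spreadsheet. The sketch below assumes the output rows are tab-separated with the columns ngram, year, match_count, page_count, volume_count; check one of your own files first, because that layout is an assumption here. The function names are made up for illustration.

```shell
# exact_rows WORD FILE: keep only rows whose first column is exactly WORD
# (drops accidental substring matches such as "category" for "cat")
exact_rows() {
    awk -F '\t' -v w="$1" '$1 == w' "$2"
}

# collapse_counts FILE...: add up match_count per year across all given files,
# e.g. to merge "cat" and "Cat" into one series; prints "year<TAB>total"
collapse_counts() {
    awk -F '\t' '{ total[$2] += $3 } END { for (y in total) print y "\t" total[y] }' "$@" | sort -n
}

# Example:
# exact_rows cat cat.csv > cat_lower.tsv
# exact_rows Cat cat.csv > cat_upper.tsv
# collapse_counts cat_lower.tsv cat_upper.tsv > cat_by_year.tsv
```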

Happy counting!

Bibliography

Bentley, J., & Knuth, D. (1986). Programming pearls: literate programming. Communications of the ACM, 29(5), 364-369. doi:10.1145/5689.315644

Bentley, J., Knuth, D., & McIlroy, D. (1986). Programming pearls: a literate program. Communications of the ACM, 29(6), 471-483. doi:10.1145/5948.315654

Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying the evolutionary dynamics of language. Nature, 449(7163), 713-716. doi:10.1038/nature06137

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., et al. (2010). Quantitative Analysis of Culture Using Millions of Digitized Books. Science. doi:10.1126/science.1199644

van Dijck, J. (2010). Search engines and the production of academic knowledge. International Journal of Cultural Studies, 13(6), 574-592. doi:10.1177/1367877910376582