Wikipedia & Project Gutenberg Ngram databases

I thought I’d follow up ngram.sh: a script for extracting Google Ngram data with two more data sources: Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.
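
For instance, keyword extraction from any of these dumps mostly reduces to a filtered read of a large text file. A minimal sketch, not code from the original script; the file name and keywords are placeholders, and the exact row layout varies by database:

```bash
#!/bin/sh
# Sketch: pull every line mentioning one of your keywords out of a
# downloaded ngram file. "2gram.gz" and the keywords are placeholders;
# use plain grep instead of zgrep for uncompressed files.
zgrep -w -E 'linguistics|grammar' 2gram.gz > keyword-hits.txt
```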

Number of publications offered by Project Gutenberg, 1994-2008.

Wikipedia Ngram data

Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Author: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.
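
Since no extraction code ships with these files, a short shell pipeline covers most lookups. A minimal sketch, assuming tab-separated "ngram<TAB>count" rows and the file name 1gram.txt; both are assumptions, so check the layout of the actual download:

```bash
#!/bin/sh
# Sketch: total the count column for a single word in the 1gram file.
# Assumes "ngram<TAB>count" rows; file name and word are placeholders.
awk -F'\t' -v w="wikipedia" '$1 == w { n += $2 } END { print w, n+0 }' 1gram.txt
```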

Number of articles in the English Wikipedia, 2001-2012.

Project Gutenberg Ngram data

Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram & 3gram (624 MB)
Also: Three tarballs (5.3 GB each) of the “complete” text database.
Note: This post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg Mirroring How-To (a minimal rsync sketch follows below).
Code: Yes, step-by-step instructions.
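
In practice the How-To boils down to a single rsync invocation. A minimal sketch; the mirror address below is my assumption of a commonly listed module, so confirm it against the How-To’s current mirror list before running, and expect tens of gigabytes:

```bash
#!/bin/sh
# Sketch: mirror the Project Gutenberg collection locally via rsync,
# ready for building fresh ngrams. The host/module is an assumption
# taken from the Mirroring How-To; verify it before running.
rsync -av --del aleph.gutenberg.org::gutenberg ./gutenberg-mirror/
```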