Wikipedia & Project Gutenberg Ngram databases
I thought I’d follow-up ngram.sh: a script for extracting Google Ngram data with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.
Wikipedia Ngram data
Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Author: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.
Project Gutenberg Ngram data
Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram %amp; 3gram (624mb);
Also: three tarballs (5.3gb each) of the “complete” text database.
Note: This post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? The Project Gutenberg Mirroring How-To.
Code: Yes, step-by-step instructions.


