
{"id":850,"date":"2012-02-13T08:42:04","date_gmt":"2012-02-13T08:42:04","guid":{"rendered":"http:\/\/opendna.com\/blog\/?p=850"},"modified":"2022-11-08T11:29:51","modified_gmt":"2022-11-08T11:29:51","slug":"850","status":"publish","type":"post","link":"https:\/\/opendna.com\/blog\/2012\/02\/13\/850\/","title":{"rendered":"ngram.sh: a script for extracting Google Ngram data"},"content":{"rendered":"<h2>What is ngram.sh?<\/h2>\n<p>The <a href=\"http:\/\/books.google.com\/ngrams\">Google Ngram Viewer<\/a> is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is <a href=\"http:\/\/code.google.com\/apis\/chart\/\">Google Charts<\/a>. It&#8217;s cool. It&#8217;s pretty. It&#8217;s hard to use for academic work because it doesn&#8217;t easily give up the raw data.<\/p>\n<p>This page explains how to run a script \u2014 ngram.sh \u2014 on a *NIX shell account and extract your keywords into spreadsheet-ready files, <em>without<\/em> buying a terabyte hard drive.<\/p>\n<p>Before mapping out your research project, you should spend some time testing out search terms in the <a href=\"http:\/\/books.google.com\/ngrams\">Google Ngram Viewer<\/a>, reading this page&#8217;s <a href=\"#bibliography\">bibliography<\/a> and selecting one of the <a href=\"http:\/\/books.google.com\/ngrams\/datasets\">Ngram datasets<\/a>.<\/p>\n<h2>How ngram.sh works<\/h2>\n<p>By default, ngram.sh is configured for 1gram searches of the English Version 20090715 dataset. It will download ten ~210mb ZIP files, one at a time, unpack the ~1000mb CSV inside, and (grep) search for each keyword. It will write the output to a keyword-named CSV, and then delete the source file. You must have a MINIMUM 2gb to run the default script (sorry, your SFU diskshare isn&#8217;t big enough).<\/p>\n<p>This script can take quite a while to run. 
In part that&#8217;s because it&#8217;s searching large amounts of data, but mostly it&#8217;s because it&#8217;s <em>downloading<\/em> lots of data. Unless you&#8217;re running on a very fast Internet connection, bandwidth is your bottleneck. You can speed things up for future searches by removing the lines that delete the source (or CSV) files. You could then modify the script to run off your hard drive without downloading anew. However, you must have upwards of 10gb available to do this with 1grams. If you change the script to process 2-grams or higher, WATCH OUT! A multi-keyword search of 5-grams without deletes can easily top a terabyte of data!<\/p>\n<p>It is really, REALLY easy to fill every last sector of your hard drive (or bust your bandwidth cap!) with this script. Be cautious and <strong>pay attention<\/strong>.<\/p>\n<h2>How to run ngram.sh<\/h2>\n<p>You can run this script on any UNIX or *NIX system, including an <a href=\"http:\/\/youtu.be\/nZqi3BqqeqI\">Apple\/Mac personal computer<\/a> (YouTube) running OS X or later.<\/p>\n<p><strong><a href=\"http:\/\/opendna.com\/download.php?f=ngram.sh.txt\">Download ngram.sh<\/a><\/strong>. The script will download after you click through the registration.<\/p>\n<p>Permissions: Some people like to run their scripts with &#8220;<em>bash ngram.sh<\/em>&#8221; and file permissions unchanged; others prefer to set executable permissions with chmod 755 and run with &#8220;<em>.\/ngram.sh<\/em>&#8221;.<\/p>\n<p>Configuration: The very first line will need to be edited to point to your path to bash. Replace the 1grams on line 41 (beginning &#8220;for word in&#8221;) with your search keywords. Save and run the script.<\/p>\n<p>Cleaning the results: I like to manipulate the CSVs in MS Excel, but any spreadsheet application will do (even GoogleDocs). 
When reading the results of your query, you&#8217;re likely to discover that you grabbed a bunch of words you didn&#8217;t intend to, that you will have to collapse a bunch that are similar, and that you might have missed a few that you wanted. Keyword selection is a science and an art. Just modify your script, set it to run, make yourself some tea and hope you don&#8217;t cause your ISP to blow a gasket.<\/p>\n<p>Happy counting!<\/p>\n<p><a name=\"bibliography\"><\/a><\/p>\n<h2>Bibliography<\/h2>\n<p>Bentley, J., &amp; Knuth, D. (1986). <a title=\"When was the last time you spent a pleasant evening in a comfortable chair, reading a good program? I don\u2019t mean the slick subroutine you wrote last summer, nor even the big system you have to modify next week. I\u2019m talking about cuddling up with a classic, and starting to read on page one. Sure, you may spend more time studying this elegant routine or worrying about that questionable decision, and everybody skims over a few parts they find boring. But let\u2019s get back to the question: when was the last time you read an excellent program? Until recently, my answer to that question was, &quot;Never.&quot;\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=5689.315644\" target=\"_blank\" rel=\"noopener\">Programming pearls: literate programming<\/a>. <em>Communications of the ACM<\/em>, <em>29<\/em>(5), 364-369. doi:10.1145\/5689.315644<\/p>\n<p>Bentley, J., Knuth, D., &amp; McIlroy, D. (1986). 
<a title=\"The purpose of this program is to solve the following problem posed by Jon Bentley: &quot;Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.&quot; Jon intentionally left the problem somewhat vague, but he stated that &quot;a user should be able to find the 100 most frequent words in a twenty-page technical paper (roughly a 50K byte file) without undue emotional trauma.&quot;\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=5948.315654\" target=\"_blank\" rel=\"noopener\">Programming pearls: a literate program<\/a>. <em>Communications of the ACM<\/em>, <em>29<\/em>(6), 471-483. doi:10.1145\/5948.315654<\/p>\n<p>Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., &amp; Nowak, M. A. (2007). <a title=\"Abstract: Human language is based on grammatical rules. Cultural evolution allows these rules to change over time. Rules compete with each other: as new rules rise to prominence, old ones die away. To quantify the dynamics of language evolution, we studied the regularization of English verbs over the past 1,200 years. Although an elaborate system of productive conjugations existed in English's proto-Germanic ancestor, Modern English uses the dental suffix, '-ed', to signify past tense. Here we describe the emergence of this linguistic rule amidst the evolutionary decay of its exceptions, known to us as irregular verbs. We have generated a data set of verbs whose conjugations have been evolving for more than a millennium, tracking inflectional changes to 177 Old-English irregular verbs. Of these irregular verbs, 145 remained irregular in Middle English and 98 are still irregular today. We study how the rate of regularization depends on the frequency of word usage. The half-life of an irregular verb scales as the square root of its usage frequency: a verb that is 100 times less frequent regularizes 10 times as fast. 
Our study provides a quantitative analysis of the regularization process by which ancestral forms gradually yield to an emerging linguistic rule.\" href=\"http:\/\/www.nature.com\/nature\/journal\/v449\/n7163\/abs\/nature06137.html\">Quantifying the evolutionary dynamics of language<\/a>. <em>Nature<\/em>, <em>449<\/em>(7163), 713-716. doi:10.1038\/nature06137<\/p>\n<p>Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., et al. (2010). <a title=\"Abstract: We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of \u2018culturomics,\u2019 focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.\" href=\"http:\/\/www.sciencemag.org\/content\/331\/6014\/176.abstract\">Quantitative Analysis of Culture Using Millions of Digitized Books<\/a>. <em>Science<\/em>. doi:10.1126\/science.1199644<\/p>\n<p>van Dijck, J. (2010). <a title=\"This article argues that search engines in general, and Google Scholar in particular, have become significant co-producers of academic knowledge. Knowledge is not simply conveyed to users, but is co-produced by search engines\u2019 ranking systems and profiling systems, none of which are open to the rules of transparency, relevance and privacy in a manner known from library scholarship in the public domain. 
Inexperienced users tend to trust proprietary engines as neutral mediators of knowledge and are commonly ignorant of how meta-data enable engine operators to interpret collective profiles of groups of searchers. Theorizing search engines as nodal points in networks of distributed power, based on the notions of Manuel Castells, this article urges for an enriched form of information literacy to include a basic understanding of the economic, political and socio-cultural dimensions of search engines. Without a basic understanding of network architecture, the dynamics of network connections and their intersections, it is hard to grasp the social, legal, cultural and economic implications of search engines.\" href=\"http:\/\/ics.sagepub.com\/content\/13\/6\/574.abstract\" target=\"_blank\" rel=\"noopener\">Search engines and the production of academic knowledge<\/a>. <em>International Journal of Cultural Studies<\/em>, <em>13<\/em>(6), 574-592. doi:10.1177\/1367877910376582<\/p>\n\n","protected":false},"excerpt":{"rendered":"<p>The Google Ngram Viewer is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database and the graphic engine is Google Charts. It&#8217;s cool. It&#8217;s pretty. It doesn&#8217;t easily give up the raw data. 
This script helps.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"webmentions_disabled_pings":false,"webmentions_disabled":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[190],"tags":[157,158,159,160,161,43,162,51,163,45],"class_list":["post-850","post","type-post","status-publish","format-standard","hentry","category-builds","tag-academic-work","tag-data-source","tag-datasets","tag-n-gramn-gram","tag-ngram","tag-programming","tag-relative-frequency","tag-research","tag-shell-account","tag-unix"],"_links":{"self":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/comments?post=850"}],"version-history":[{"count":1,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/850\/revisions"}],"predecessor-version":[{"id":1858,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/850\/revisions\/1858"}],"wp:attachment":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/media?parent=850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/categories?post=850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/tags?post=850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}