Catastrophic Frequencies


The source data for this study consisted of all articles published in Section A (the front-section) of the New York Times (NYT) for each of the ninety-six (96) months from January 2001 to December 2008 (inclusive). The New York Times was selected because of its “generally recognized influence on other media” (West, p.8).

Month-by-month searches of the Lexis-Nexis database of NYT articles were limited to “Section A” and excluded “paid notices”. Search results were downloaded in plain text for maximum compatibility with analysis software. Because the typical search returned between 2000 and 2500 articles per month, and Lexis-Nexis limits downloads to batches of 500 articles, raw data for each month consisted of four to five separate text files. Mean hard disk use was just over seven (7) megabytes (MB) per month, or 87 MB per year, for a total of roughly 700 MB over the eight-year period.

During initial trials, TextSTAT (Hüning 2008) was run on each month to produce word frequency counts for “Hurricane Katrina”, “Katrina”, “9/11”, “Iraq” and “fail”. Results were verified using TextSTAT’s keyword-in-context (KWIC) tool. These trials demonstrated that TextSTAT is capable of (1) combining several data files into a single ‘corpus’, (2) rapidly producing KWIC lists from large data sets, (3) producing accurate frequency lists, and (4) exporting results in widely compatible file formats. While TextSTAT handles a single document nimbly, the specific demands of this project required a search for alternative software: (a) producing word frequency lists is processor- and time-intensive, requiring as much as an hour per month of data; (b) exported KWIC lists do not include tallies and are ill-suited to spreadsheets; and (c) there is no scripting or job-queuing feature.

In the course of evaluating competing applications which could produce word frequency lists, it was discovered that the problem is a staple of Programming 101 texts. The widely published bash scripts (Cooper 2007) apparently originated with a challenge issued to Don Knuth by Jon Bentley (Bentley, May 1986, p.368): “Given a text file and an integer K, you are to print the K most common words in the file (and the number of their occurrences) in decreasing frequency.” Doug McIlroy offered a six-line solution for Unix systems (Bentley, June 1986, p.471):

  1. Replace all spaces with line breaks and remove duplicate line breaks
  2. Transliterate upper case to lower case
  3. Sort alphabetically to bring identical words together
  4. Replace each run of duplicate words with a single representative and include a count
  5. Sort in reverse numeric order
  6. Print K lines
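
The six steps above can be sketched in Python as follows. This is a minimal illustration of the listed steps, not McIlroy's shell pipeline itself: the function name `top_k_words` is hypothetical, and `split()` breaks on any whitespace rather than on spaces alone.

```python
from itertools import groupby


def top_k_words(text, k):
    """Return the k most frequent words in text as (word, count) pairs."""
    # Step 1: break the text into words, collapsing runs of whitespace.
    words = text.split()
    # Step 2: transliterate upper case to lower case.
    words = [word.lower() for word in words]
    # Step 3: sort alphabetically to bring identical words together.
    words.sort()
    # Step 4: replace each run of duplicates with one (word, count) pair.
    counted = [(word, sum(1 for _ in run)) for word, run in groupby(words)]
    # Step 5: sort in reverse numeric order (stable, so ties stay alphabetical).
    counted.sort(key=lambda pair: pair[1], reverse=True)
    # Step 6: keep the first k entries.
    return counted[:k]


if __name__ == "__main__":
    for word, count in top_k_words("the cat and the dog and the bird", 2):
        print(count, word)
```

Because this sketch splits only on whitespace, a word followed by a comma or period is counted as a different word from its bare form.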

Initial tests revealed that, in just seconds, a McIlroy-derived script produced word frequency lists similar to those of TextSTAT. Errors in the frequency lists resulted primarily from the inability to parse punctuation (i.e. words were counted separately if joined to a comma or period) and from non-printing ASCII characters present in Lexis-Nexis exports. Using loops and standard Unix commands, a daisy-chain of scripts based on the McIlroy-derived script was written to process the data files and produce longitudinal keyword frequency information:

For each month in each year:

  1. Combine the multiple downloaded data files by month ($year$mo.txt)
  2. Output to a working directory

For each file in the working directory:

  3. Transliterate upper case to lower case letters
  4. Harmonize non-printable ASCII characters
  5. Transliterate common multi-word signs to single words (e.g. “New York” to “newyork”)
  6. Transliterate all punctuation to spaces

For each file:

  7. Sort alphabetically to bring identical words together
  8. Replace each run of duplicate words with a single representative and include a count
  9. Sort in reverse numeric order
  10. Print a list for each month in each year

For each defined keyword:

  11. Search within each month-year list for the keyword
  12. Output the line including keyword, count and month-year
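
The processing chain described above might look like the following Python sketch. It is an illustration under stated assumptions rather than the scripts used in the study: the function names, the `MULTIWORD_SIGNS` table (apart from the “New York” entry) and the in-memory `corpus` dictionary standing in for the monthly files are all hypothetical, and non-printable characters are simply replaced with spaces.

```python
import string
from collections import Counter

# Hypothetical table of multi-word signs; only "new york" comes from the text.
MULTIWORD_SIGNS = {"new york": "newyork", "abu ghraib": "abughraib"}

# Translation table mapping every ASCII punctuation mark to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lower-case, clean control characters, join signs, strip punctuation."""
    text = text.lower()
    # Assumption: "harmonizing" non-printable characters means blanking them.
    text = "".join(ch if ch.isprintable() or ch in "\n\t" else " " for ch in text)
    for phrase, token in MULTIWORD_SIGNS.items():
        text = text.replace(phrase, token)
    return text.translate(PUNCT_TO_SPACE)


def monthly_counts(batches):
    """Combine one month's download batches and count every word."""
    return Counter(normalize("\n".join(batches)).split())


def keyword_table(corpus, keywords):
    """Return (keyword, count, month) rows for each month and keyword."""
    rows = []
    for month in sorted(corpus):
        counts = monthly_counts(corpus[month])
        for keyword in keywords:
            rows.append((keyword, counts[keyword], month))
    return rows


if __name__ == "__main__":
    corpus = {"200509": ["Hurricane Katrina hit.", "KATRINA; New York"]}
    for row in keyword_table(corpus, ["katrina", "newyork"]):
        print(*row)
```

Joining multi-word signs before stripping punctuation keeps a phrase such as “New York” together as one countable token.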

The initial seven keywords were chosen as representative of generally recognizable catastrophes or events of the 2001-2008 period:

  • Enron – a multinational energy-trading corporation with strong political ties to President Bush, which was accused of causing the California Energy Crisis (June 2001) and became one of the largest bankruptcies in American history (December 2001).
  • Tsunami – the December 26, 2004 Indian Ocean earthquake, which triggered a tsunami that became one of the deadliest natural disasters in recorded history.
  • Katrina, Hurricane – the 2005 Atlantic hurricane which made landfall in southeast Louisiana (USA), producing storm surges which breached levees and flooded the city of New Orleans, and which became one of the costliest natural disasters in American history.
  • 9/11 – the September 11, 2001 terrorist attacks against the World Trade Center (WTC) in New York, which caused the WTC towers to collapse and are widely regarded as among the worst terrorist attacks in history.
  • Iraq – the nation against which the United States launched a military invasion in March 2003, ending the ceasefire that had concluded the Gulf War (1990-1991).
  • Abu Ghraib – the Iraqi prison which was used to torture enemies of the Baathist regime prior to the US invasion in 2003, and which was later revealed to be one location where US soldiers tortured Iraqi detainees.

Because the program’s default output would fail to distinguish between “9/11”, the sign for a terrorist attack, and “911”, the phone number for emergency services in most of North America, a series of transliterations was coded into the script to distinguish the two. Further additions to the script aimed to transliterate “September 11th” to “9/11” while excluding newspaper datelines, which might otherwise inflate word frequencies during the month of September.
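
One way such transliterations could be expressed is sketched below in Python. The patterns are assumptions made for illustration, not the rules coded for this study: the placeholder token `nineeleven`, the function name `protect_911` and the assumed dateline shape (an upper-case place name at the start of a line followed by a September date) are all hypothetical.

```python
import re

# Hypothetical placeholder that survives later punctuation stripping.
TOKEN = "nineeleven"

# Assumed dateline shape, e.g. "WASHINGTON, Sept. 11" at the start of a line.
DATELINE = re.compile(r"^[A-Z][A-Z .]+, (?:Sept\.|September) \d{1,2}\b", re.MULTILINE)

# Spelled-out references such as "September 11th", "September 11", "Sept. 11".
SEPT_11 = re.compile(r"\bSept(?:ember|\.)? 11(?:th)?\b", re.IGNORECASE)

# The slash form "9/11"; the bare digits "911" are deliberately not matched.
SLASH_FORM = re.compile(r"\b9/11\b")


def protect_911(text):
    """Drop September datelines, then map 9/11 variants to one token."""
    text = DATELINE.sub(" ", text)
    text = SEPT_11.sub(TOKEN, text)
    return SLASH_FORM.sub(TOKEN, text)


if __name__ == "__main__":
    sample = "WASHINGTON, Sept. 11 - Callers dialed 911 after 9/11."
    print(protect_911(sample))
```

Datelines are removed before the date patterns run, so a story filed on September 11 does not add to the keyword count, while “911” passes through unchanged and is tallied separately.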

The resulting data files were then manually examined for errors and inappropriate conjunctions, corrected and harmonized to the queried keywords. A data table of these results is included in Appendix A. In the following section the notation {word} is used to denote a keyword and its variants. For example, {WTC} denotes both “WTC” and “World Trade Center”. Ordinals followed by “wf” denote word frequency counts.
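
The {word} notation amounts to summing the counts of a keyword's variants. A minimal Python sketch follows, assuming the variants have already been reduced to single tokens; the token spellings and the names `KEYWORD_VARIANTS` and `group_frequency` are hypothetical.

```python
# Hypothetical variant table: {WTC} covers "WTC" and "World Trade Center".
KEYWORD_VARIANTS = {"WTC": ["wtc", "worldtradecenter"]}


def group_frequency(counts, keyword):
    """Sum the word frequencies of every variant of a {keyword} group."""
    return sum(counts.get(variant, 0) for variant in KEYWORD_VARIANTS[keyword])


if __name__ == "__main__":
    # Illustrative counts only, not data from the study.
    month_counts = {"wtc": 3, "worldtradecenter": 5, "iraq": 9}
    print(group_frequency(month_counts, "WTC"), "wf")
```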