An Iterative Process for Dictionary Construction

Drawing inspiration from the General Inquirer (Stone, 1966) and keyword-in-context (KWIC) techniques, this post proposes an iterative hybrid of available methods in a quest for a more flexible and robust machine-assisted content analysis system.

Word Frequency

“The use of frequencies to make inferences from the text is based in part on the assumption that frequency is a function of intensity” (Diefenbach, 2001, p.17), which is to say that the more often a word is used, the more important it is to the meaning of the document. Agenda-setting research maintains that “media coverage and public recognition go hand-in-hand” (Scharl & Weichselbraun, 2008, p.124) and that there are “strong correlations between the attention of news media and both public salience and attitudes toward presidential candidates” (ibid). Consequently, the methodology of the University of Vienna’s U.S. Election 2004 Web Monitor uses the frequency with which presidential candidates’ names occur as a measure of media attention: its web-mining software “identified groups of identical units and counted their occurrences, thereby creating an inventory of words (“word list”) or multi-word units of meaning” (ibid).
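
As a minimal illustration of this counting step (the corpus file and the list of surnames below are placeholders of my own, not details of the Web Monitor’s actual software), a small shell loop can tally how often each candidate’s name appears:

    # Count whole-word occurrences of each candidate's surname in a
    # plain-text corpus; corpus.txt and the name list are illustrative only.
    for name in bush kerry nader; do
        count=$(tr 'A-Z' 'a-z' < corpus.txt | grep -ow "$name" | wc -l)
        echo "$name $count"
    done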

The production of word frequency lists is a staple of Programming 101 texts, owing in part to examples contained in royalty-free textbooks like Cooper’s Advanced Bash-Scripting Guide (2007). The widely quoted wf.sh and wf2.sh bash scripts originate in a challenge printed in Jon Bentley’s column in Communications of the ACM: “Given a text file and an integer K, you are to print the K most common words in the file (and the number of their occurrences) in decreasing frequency” (Bentley, May 1986, p.368). Doug McIlroy offered a six-line solution for Unix systems (Bentley, June 1986, p.471):

  1. Replace all spaces with line breaks and remove duplicate line breaks
  2. Transliterate upper case to lower case
  3. Sort alphabetically to bring identical words together
  4. Replace each run of duplicate words with a single representative and include a count
  5. Sort in reverse numeric order
  6. Print K lines
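
Rendered as a shell pipeline, the six steps map onto six commands. The rendering below follows the list above and may differ in small details from the version printed in the column; K is supplied as the script’s first argument:

    # Print the K most common words (with counts) from standard input.
    # Illustrative usage: sh wordfreq.sh 25 < corpus.txt
    tr -cs 'A-Za-z' '\n' |   # 1. one word per line; non-letters become line breaks
    tr 'A-Z' 'a-z' |         # 2. transliterate upper case to lower case
    sort |                   # 3. bring identical words together
    uniq -c |                # 4. collapse each run into one line with a count
    sort -rn |               # 5. order by count, largest first
    sed "${1}q"              # 6. print the first K lines, then quit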

Catastrophic Word Frequencies: A Pilot Project

In late 2008, an informal study conducted at Simon Fraser University produced longitudinal word frequencies for several pre-identified catastrophes. The study used ninety-six (96) corpora, each consisting of the aggregated Section A of the New York Times for one month, covering the eight (8) years from January 2001 to December 2008. In preliminary stages, several ‘word list’-producing software applications were tested and compared. Only TextSTAT (Hüning, 2008) and the McIlroy-derived wf.sh were capable of processing the corpus files, which averaged seven megabytes each. Few applications were royalty-free and fewer still were open source. Modifications were made to the McIlroy code to address its failure to parse punctuation and non-printing ASCII characters, and a primitive dictionary was developed to identify known noun phrases (e.g. “World Trade Center”) and alternative symbolic constructions (e.g. “9/11”).
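
A preprocessing stage of roughly the following shape can address both problems ahead of the counting pipeline; the phrase substitutions and token names here are illustrative stand-ins, not the pilot project’s actual dictionary:

    # Delete non-printing characters, then collapse known noun phrases and
    # symbolic variants into single tokens so they survive word-splitting.
    tr -cd '[:print:]\n' < corpus.txt |
    sed -e 's/World Trade Center/World_Trade_Center/g' \
        -e 's|9/11|September_11|g' \
        -e 's/September 11/September_11/g' |
    tr -cs 'A-Za-z0-9_' '\n' |   # punctuation and whitespace become line breaks
    tr 'A-Z' 'a-z' |
    sort | uniq -c | sort -rn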

The study confirmed that catastrophes produce substantial spikes in the frequency of related signs and keywords, that patterns of occurrence are consistent between natural and anthropogenic disasters, and that the explanatory power of an event is visible in the “latency” of the word frequency over a period of several years. The findings illustrated the myth-creation process described by Barthes (1972) and suggested a relationship between U.S. political elections and modern myth-creation (Friedman, 2007). The pilot project also demonstrated that misleading results can be produced when inadequate attention is paid to noun phrases or to initial keyword selection, or when source data are insufficiently examined.

Concordance/Collocation

Because “simple frequency counts are limited by the issue of context” (Diefenbach, 2001, p.17), many software applications provide ‘key word in context’ (KWIC): the leading and following text surrounding the search term. ‘Concordance’ (Hüning, 2008) and ‘Collocation’ (Berry-Rogghe, 1973) are two machine-assisted approaches that can be understood through J.R. Firth’s contextual account of meaning: “‘One of the meanings of ass is its habitual collocation with an immediately preceding you silly…’” (Berry-Rogghe, 1972, p.103). Whereas concordance is concerned with identifying instances of co-existence of words (“When does x appear within 100 words of y?”), collocation is concerned with the “probability of the item x co-occurring with the a, b, c” (Berry-Rogghe, 1972, p.103). Berry-Rogghe conducted a pilot study comparing word frequency lists produced with various “spans,” or running lengths of context, and Tollenaere, in a search for the optimum running length of context suitable for literary dictionary development, proposed a flexible typographical definition of context which varied between 120 and 360 characters on either side of the keyword, depending on sentence construction (Tollenaere, 1972, p.29).
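
A crude keyword-in-context listing along these lines can be produced with grep alone; the keyword and the 120-character window below are placeholders chosen for illustration (120 being the lower bound of Tollenaere’s span):

    # Show each occurrence of the keyword with up to 120 characters of
    # leading and trailing context (context is limited to a single line).
    grep -o '.\{0,120\}hurricane.\{0,120\}' corpus.txt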

Scoring Dictionaries

The second broad family of machine-assisted content analysis techniques compares one text to another. The oldest and most developed techniques in this area use pre-coded dictionaries to score a corpus. Because researchers pre-define the meaning and value of terms, “dictionary construction for computer-assisted content analysis is theoretically motivated” and can provide “the vital link between the theoretical formulation of the problem and the mechanics of analysis” (Diefenbach, 2001, p.17). The open nature of The General Inquirer (Stone, 1966), which allows users to substitute their own dictionary for the default, inspired generations of dictionary-based projects and encouraged both the development of a large number of new dictionaries and the migration of existing projects, including the Lasswell Value Dictionary for political analysis, to the platform (Diefenbach, 2001, p.17). The General Inquirer enjoys continued use (Lim, 2008; Hall, 2005), and the many category schemes and dictionaries developed for its open format have been incorporated into new projects. The “predefined list of sentiment words known to have positive or negative connotations,” used by the U.S. Election 2004 Web Monitor to score the ‘attitude’ of news stories, originates in the General Inquirer dictionary (Scharl et al., 2008, p.125).
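
As a sketch of the basic scoring step (the word lists and file names below are placeholders rather than the General Inquirer’s own categories), a document can be scored by counting its tokens that match positive and negative dictionary entries:

    # Score a document against two one-word-per-line dictionary files.
    # positive.txt and negative.txt stand in for pre-coded category lists.
    tokens=$(tr -cs 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z')
    pos=$(echo "$tokens" | grep -cxFf positive.txt)
    neg=$(echo "$tokens" | grep -cxFf negative.txt)
    echo "positive tokens: $pos, negative tokens: $neg"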

Probabilistic Modeling

A relatively new addition to machine-assisted content analysis systems, pioneered by Wordscores (Benoit, Laver & Lowe, n.d.), involves probabilistic calculations to compare texts, most commonly partisan manifestos (Klemmensen, Hobolt & Hansen, 2007). In “Understanding Wordscores,” Lowe describes a process by which the content of a sample of known texts with defined positions or scores is used to estimate a score for each word, and those word scores are in turn used to estimate the positions (or scores) of unknown “virgin” texts (Lowe, 2008, p.357). Among the many criticisms of this method are that the scores for virgin texts must be “rescaled” to show significance (ibid), that common words may be scored in such a way as to skew results (Martin & Vanberg, 2008), and that variation in known source texts may frustrate comparison between studies (Martin & Vanberg, 2007).
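
A compressed sketch of the arithmetic Lowe describes, in notation of my own choosing rather than that of the original papers: let F_{wr} be the relative frequency of word w in reference text r, and A_r the position assigned in advance to reference text r. Then, in LaTeX notation,

    P_{wr} = \frac{F_{wr}}{\sum_{r'} F_{wr'}}, \qquad
    S_w = \sum_{r} P_{wr} \, A_r, \qquad
    \hat{A}_v = \sum_{w} F_{wv} \, S_w

so each word’s score S_w is a frequency-weighted average of the reference positions, and the raw estimate for a virgin text v is the average of its words’ scores, weighted by how often they occur; these raw estimates are then rescaled before interpretation.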
