An Iterative Process for Dictionary Construction

Drawing inspiration from the General Inquirer (1966) and Key Word In Context (KWIC) indexing, this post proposes an iterative hybrid of available methods in a quest for a more flexible and robust machine-assisted content analysis system.

This post draws on previously identified typologies of irregular keyword frequency in coverage of catastrophic disasters, together with early machine-assisted lexicographic methods, to synthesize a technique for assisting in the analysis of a massive – and potentially expanding – corpus of text. It is hoped that KWIC and collocation queries, first developed for quantitative text analysis, can be repurposed as steps in an iterative cycle and used to assist qualitative researchers in pattern recognition and subject-specific dictionary construction.
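To make those two query types concrete, the sketch below shows a minimal KWIC concordance and collocation count in Python. The function names, tokenizer, window size and sample corpus file are illustrative assumptions rather than an existing tool; the point is only that each pass over the corpus surfaces candidate terms that can be added to a working dictionary before the next pass.

```python
import re
from collections import Counter

def kwic(text, keyword, window=5):
    """Return keyword-in-context lines: each occurrence of `keyword`
    with `window` words of context on either side."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40}  [{tok}]  {right}")
    return lines

def collocates(text, keyword, window=5, top=20):
    """Count the words that co-occur with `keyword` inside the same window,
    a rough first pass at candidate terms for the working dictionary."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    counts.pop(keyword.lower(), None)
    return counts.most_common(top)

# One pass of the iterative cycle: inspect contexts, note recurring
# collocates, add promising terms to the dictionary, then re-query.
sample = open("corpus.txt", encoding="utf-8").read()  # hypothetical corpus file
for line in kwic(sample, "evacuation")[:10]:
    print(line)
print(collocates(sample, "evacuation"))
```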

The proliferation of consumer-grade computing equipment and digital information networks has produced an explosion of instantaneously available textual data, a flood which has undermined received knowledge of opinion formation in modern mass society, including broadcast-era models of mass persuasion, framing and agenda setting. The relative decline in influence of the American Big Three TV networks has led some researchers to suggest the arrival of a “New Era of Minimal Effects” (Bennett & Iyengar, 2008), in which swarms of micro-media cumulatively exert more influence than the media behemoths of the mid-20th century. The Information Age – an era in which more audio, visual and textual data is produced each day than could be consumed in a rapidly increasing number of lifetimes – offers media researchers an effectively infinite body of research material while overwhelming most traditional, labour-intensive content analysis systems.

A Brief History of Machine-Assisted Content Analysis

The practice of content analysis has gone through three distinct methodological eras, each enabled by technological developments – the pre-machine era, the mainframe era and the distributed/desktop era. While the technical capabilities of each era are remarkably different, the theoretical questions remain substantively unchanged. Despite decades of research into automatic translation and artificial intelligence, the challenges of machine-assisted textual analysis today are essentially the same as those faced by researchers in the 1960s and 1970s: applying automatic processes to the analysis of natural language texts saves human labour while sacrificing comprehension. Because computers still cannot reliably understand unknown phrases, differentiate homographs or parse metaphors (Diefenbach, 2001, p. 15), it is safe to say that machines which understand language remain the unrealized promise of computational linguistics.

In the pre-machine era, researchers used teams of assistants to measure column inches and count word frequencies in newspaper clippings. In the mainframe era, content analysis was assisted by machines that processed large volumes of text rapidly, but it was hampered by the high costs of processing power, data input and storage: optical character recognition (OCR) equipment cost upwards of $1.3 million (Diefenbach, 2001, p. 19), rented time on mainframe computers cost around $75 an hour (Stewart, n.d.), and two “technological feasibility studies of computer analysis of media content” found that “a large-scale project would cost $3 million per billion words” (Diefenbach, 2001, p. 19).

By the early 21st century, Moore’s law had driven the cost of computer processing low enough that consumer-grade desktop computers outperformed earlier mainframe supercomputers and, while retail OCR software cost less than most textbooks, online article databases (e.g. LexisNexis and Canadian Newsstand) and internet news sources effectively eliminated the need to digitize print media by hand. At present, the proliferation of UNIX-like operating systems (e.g. Linux, FreeBSD and Mac OS X) has made it possible for virtually anyone to program equivalents to mainframe-era applications and conduct multi-billion-word content analysis projects on consumer-grade computers.
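As a rough illustration of how little infrastructure such a project now requires, the following sketch tallies word frequencies across a folder of plain-text articles on an ordinary desktop, streaming one file at a time so the corpus never has to fit in memory; the folder name and file layout are assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

def word_frequencies(folder):
    """Stream every .txt file in `folder` and tally word frequencies,
    one line at a time, so the corpus never needs to fit in memory."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        with open(path, encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

if __name__ == "__main__":
    freqs = word_frequencies("articles")  # hypothetical folder of exported articles
    for word, n in freqs.most_common(25):
        print(f"{word}\t{n}")
```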

While Diefenbach identifies “two types of computer-assisted content analysis systems” (Diefenbach, 2001, p. 17) – those providing word counts and those using pre-set dictionaries – we can usefully subdivide further into those systems which (1) count word frequencies or (2) identify the collocation or co-occurrence of words, and those which (3) score texts with pre-coded dictionaries or (4) compare one text with another. A fifth approach, presently enjoying a renaissance in the field of computer science, involves translating Structuralist theories of linguistics into software and using those models to parse grammar and identify the relationships between words in a text. While the first four approaches are well explored and their limits are generally recognized, the complexity of the fifth approach and the controversial nature of its underlying theories of meaning are sufficient to exclude it from the present project.
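To make the third type concrete, here is a minimal sketch of dictionary-based scoring; the category names and term lists are invented placeholders, far smaller than a real coding scheme such as the General Inquirer’s tag categories.

```python
import re
from collections import Counter

# Hypothetical pre-coded dictionary: category -> terms. Real schemes
# (e.g. the General Inquirer's tag categories) are far larger.
DICTIONARY = {
    "disaster": ["earthquake", "flood", "hurricane", "wreckage", "evacuation"],
    "recovery": ["rebuild", "relief", "aid", "donation", "volunteer"],
}

def score_text(text, dictionary=DICTIONARY):
    """Count how many tokens in `text` fall into each dictionary category."""
    token_counts = Counter(re.findall(r"[\w'-]+", text.lower()))
    return {category: sum(token_counts[term] for term in terms)
            for category, terms in dictionary.items()}

print(score_text("A volunteer arrived to rebuild homes after the flood and earthquake."))
# {'disaster': 2, 'recovery': 2}
```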
