
{"id":1715,"date":"2009-04-22T12:01:38","date_gmt":"2009-04-22T12:01:38","guid":{"rendered":"http:\/\/opendna.com\/blog\/?p=3"},"modified":"2022-11-08T12:10:16","modified_gmt":"2022-11-08T12:10:16","slug":"an-iterative-process-for-dictionary-construction","status":"publish","type":"post","link":"https:\/\/opendna.com\/blog\/2009\/04\/22\/an-iterative-process-for-dictionary-construction\/","title":{"rendered":"An Iterative Process for Dictionary Construction"},"content":{"rendered":"<p>This paper draws on previously identified irregular keyword frequency typologies of catastrophic disasters and early machine-assisted lexicographic methods to synthesize a technique for assisting in the analysis of a massive \u2013 and potentially expanding \u2013 corpus of text. It is hoped that Key Word In Context (KWIC) and collocation queries, first developed for quantitative text analysis, can be repurposed as steps in an iterative cycle and used to assist qualitative researchers in pattern-recognition and subject-specific dictionary construction.<\/p>\n<p>The proliferation of consumer-grade computing equipment and digital information networks has resulted in an explosion of instantaneously available textual data, the flood of which has undermined received knowledge of opinion-formation in modern mass society, including broadcast-era models of mass persuasion, framing and agenda setting. The relative decline in influence of the American Big Three TV networks has led some researchers to suggest the arrival of a \u201cNew Era of Minimal Effects\u201d (Bennett &amp; Iyengar, 2008) in which swarms of micro-media cumulatively exert more influence than the media behemoths of the mid-20th Century. 
The Information Age \u2013 an era in which the daily production of audio, visual and textual data exceeds that which can be consumed in a rapidly increasing number of lifetimes \u2013 offers media researchers an effectively infinite body of research material while overwhelming most traditional, labour-intensive, content analysis systems.<\/p>\n<h3><a name=\"ipdc-1\"><\/a>A Brief History of Machine-Assisted Content Analysis<\/h3>\n<p>The practice of content analysis has gone through three distinct methodological eras, each enabled by technological developments \u2013 the pre-machine era, the mainframe era and the distributed\/desktop era. While the technical capabilities of each era are remarkably different, the theoretical questions remain substantively unchanged. Despite decades of research into automatic translation and artificial intelligence, the challenges of machine-assisted textual analysis today are essentially the same as those faced by researchers in the 1960s and 1970s: the application of automatic processes to the analysis of natural language texts saves human labour while sacrificing comprehension. Because computers still cannot reliably understand unknown phrases, differentiate homographs or parse metaphors (Diefenbach, p.15), it is safe to say that machines which understand language remain the unrealized promise of computational linguistics.<\/p>\n<p>In the pre-machine era researchers used teams of assistants to measure column inches and count word frequency in newspaper clippings. In the mainframe era, content analysis was assisted by machines which processed large volumes of text rapidly, but was hampered by the high costs of processing power, data input and storage: Optical character recognition (OCR) equipment cost upwards of $1.3 million (Diefenbach, p. 
19), rented time on mainframe computers cost around $75 an hour (Stewart, n.d.), and two \u201ctechnological feasibility studies of computer analysis of media content\u201d found that \u201ca large-scale project would cost $3 million per billion words\u201d (Diefenbach, p. 19).<\/p>\n<p>By the early 21st century, Moore\u2019s law had driven the cost of computer processing down low enough that consumer-grade desktop computers out-performed earlier mainframe super-computers and, while retail OCR software cost less than most textbooks, online article databases (e.g. LexisNexis and Canadian Newsstand) and internet news sources effectively eliminated the need to input print media. At present, the proliferation of UNIX-clone operating systems (e.g. Linux, FreeBSD and MacOS X) has made it possible for virtually anyone to program equivalents to mainframe era applications and conduct multi-billion word content analysis projects on consumer grade computers.<\/p>\n<p>While Diefenbach identifies \u201ctwo types of computer-assisted content analysis systems\u201d (Diefenbach, 2001, p.17) \u2013 those providing word counts and those using pre-set dictionaries \u2013 we can usefully subdivide further into those systems which (1) count word frequencies or (2) identify collocation or concurrence of words, and those which (3) score texts with pre-coded dictionaries or (4) compare one text with another. A fifth approach, presently enjoying a renaissance in the field of computer science, involves translating Structuralist theories of linguistics into software and using those models to parse grammar and identify the relationships between words in a text. 
While the first four approaches are well explored and their limits are generally recognized, the complexity of the fifth approach and the controversial nature of the underlying theories of meaning are sufficient to exclude it from the present project.<br \/>\n<!--nextpage--><\/p>\n<h3><a name=\"ipdc-2\"><\/a>Word Frequency<\/h3>\n<p>\u201cThe use of frequencies to make inferences from the text is based in part on the assumption that frequency is a function of intensity\u201d (Diefenbach, 2001, p.17), which is to say that the more often a word is used, the more important the word is to the meaning of the document. Agenda-setting research maintains that \u201cmedia coverage and public recognition go hand-in-hand\u201d (Scharl &amp; Weichselbraun, 2008, p.124) and that there are \u201cstrong correlations between the attention of news media and both public salience and attitudes toward presidential candidates\u201d (ibid). Consequently, the methodology of the University of Vienna\u2019s U.S. Election 2004 Web Monitor uses the frequency with which presidential candidates\u2019 names occur as a measure of media attention: web-mining software \u201cidentified groups of identical units and counted their occurrences, thereby creating an inventory of words (&#8220;word list&#8221;) or multi-word units of meaning\u201d (ibid).<\/p>\n<p>The production of word frequency lists is a staple of Programming 101 texts, owing in part to examples contained in royalty-free textbooks like Cooper\u2019s Advanced Bash-Scripting Guide (2007). The widely quoted wf.sh and wf2.sh bash scripts originate in a challenge printed in Jon Bentley\u2019s column in Communications of the ACM: \u201cGiven a text file and an integer K, you are to print the K most common words in the file (and the number of their occurrences) in decreasing frequency\u201d (Bentley, May 1986, p.368). 
Doug McIlroy offered a six-line solution for Unix systems (Bentley, June 1986, p.471):<\/p>\n<ol>\n<li>Replace all spaces with line breaks and remove duplicate line breaks<\/li>\n<li>Transliterate upper case to lower case<\/li>\n<li>Sort alphabetically to bring identical words together<\/li>\n<li>Replace each run of duplicate words with a single representative and include a count<\/li>\n<li>Sort in reverse numeric order<\/li>\n<li>Print K lines<\/li>\n<\/ol>\n<h3><a name=\"ipdc-3\"><\/a>Catastrophic Word Frequencies: a Pilot Project<\/h3>\n<p>In late 2008, an informal study conducted at Simon Fraser University, using ninety-six (96) corpuses, each consisting of the aggregated Section A of the New York Times for one month and covering the eight (8) years spanning January 2001 to December 2008, produced longitudinal word frequencies of several pre-identified catastrophes. In preliminary stages, several \u2018word list\u2019-producing software applications were tested and compared. Only TextSTAT (H\u00fcning, 2008) and the McIlroy-derived wf.sh were capable of processing the corpus file sizes, which averaged seven megabytes each. Few applications were royalty-free and fewer still were open source. Modifications were made to the McIlroy code to address the failure to parse punctuation and non-printing ASCII characters, and a primitive dictionary was developed to identify known noun phrases (e.g. \u201cWorld Trade Center\u201d) and alternative symbolic constructions (e.g. \u201c9\/11\u201d).<\/p>\n<p>The study confirmed that catastrophes produce substantial spikes in the frequency of related signs and keywords, that patterns of occurrence are consistent between natural and anthropogenic disasters, and that the explanatory power of an event is visible in the \u201clatency\u201d of the word frequency over a period of several years. The findings demonstrated the myth-creation process described by Barthes (Barthes, 1972) and suggested a relationship between U.S. 
political elections and modern myth-creation (Friedman, 2007). The pilot project demonstrated that misleading results can be produced if inadequate attention is paid to noun phrases or initial keyword selection, or if the source data is insufficiently examined.<\/p>\n<h3><a name=\"ipdc-4\"><\/a>Concordance\/Collocation<\/h3>\n<p>Because \u201csimple frequency counts are limited by the issue of context\u201d (Diefenbach, 2001, p.17) many software applications provide \u2018key word in context\u2019 (KWIC): the leading and following text surrounding the search term. \u2018Concordance\u2019 (H\u00fcning, 2008) and \u2018Collocation\u2019 (Berry-Rogghe, 1973) are two machine-assisted approaches which can be understood through J.R. Firth\u2019s contextual use: \u201c\u2018One of the meanings of ass is its habitual collocation with an immediately preceding you silly&#8230;\u2019\u201d (Berry-Rogghe, 1972, p.103). Whereas concordance is concerned with identifying instances of co-existence of words (\u201cWhen does x appear within 100 words of y?\u201d), collocation is concerned with the \u201cprobability of the item x co-occurring with the a, b, c\u201d (Berry-Rogghe, 1972, p. 103). Berry-Rogghe conducted a pilot study which compared frequency lists of words where various \u201cspans\u201d or running lengths of context were processed, and Tollenaere, in a search for the optimum running length of context suitable for literary dictionary development, proposed a flexible typographical definition of context which varied between 120 and 360 characters on either side of the keyword, depending on sentence construction (Tollenaere, 1972, p.29).<\/p>\n<h3><a name=\"ipdc-5\"><\/a>Scoring Dictionaries<\/h3>\n<p>The second hemisphere of machine-assisted content analysis systems compares one text to another. The oldest and most developed techniques in this area use pre-coded dictionaries to score a corpus. 
Because researchers pre-define the meaning and value of terms, \u201cdictionary construction for computer-assisted content analysis is theoretically motivated\u201d and can provide \u201cthe vital link between the theoretical formulation of the problem and the mechanics of analysis\u201d (Diefenbach, 2001, p.17). The open nature of The General Inquirer (Stone, 1966) application, which allows users to substitute their own dictionary for the default, inspired generations of dictionary-based projects and encouraged both the development of a large number of new dictionaries and the migration of existing projects, including the Lasswell Value Dictionary for political analysis, to the platform (Diefenbach, 2001, p.17). The General Inquirer application enjoys continued use (Lim 2008; Hall 2005) and the many category schemes and dictionaries developed on its open format have been incorporated in new projects. The \u201cpredefined list of sentiment words known to have positive or negative connotations,\u201d used by the U.S. Election 2004 Web Monitor to score the \u2018attitude\u2019 of news stories, originates in the General Inquirer dictionary (Scharl, et al, 2008, p.125).<\/p>\n<h3><a name=\"ipdc-6\"><\/a>Probabilistic Modeling<\/h3>\n<p>A relatively new addition to machine-assisted content analysis systems \u2013 pioneered by Wordscores (Benoit, Laver &amp; Lowe, n.d.) \u2013 involves probabilistic calculations to compare texts, most commonly partisan manifestos (Klemmensen, Hobolt &amp; Hansen, 2007). In \u201cUnderstanding Wordscores\u201d, Lowe describes a process by which the content of a sample of known texts with defined positions or scores is used to estimate the scores for each word, which are in turn used to estimate the positions (or scores) of unknown \u201cvirgin\u201d texts (Lowe, 2008, p.357). 
Among the many criticisms of this method are that the scores for virgin texts must be \u201crescaled\u201d to show significance (ibid), that common words may be scored in such a way as to skew results (Martin &amp; Vanberg, 2008), and that variation in known source texts may frustrate comparison between studies (Martin &amp; Vanberg, 2007).<br \/>\n<!--nextpage--><\/p>\n<h3><a name=\"ipdc-7\"><\/a>The Mechanics of Iterative Dictionary Discovery<\/h3>\n<p>When using automated techniques to assist in large-scale content analysis, particularly when engaging with dynamic corpuses, the researcher would be well served by tools which reported changes and identified anomalies in the corpus. While the four systems described above can be (and are) used independently in content analysis studies, a researcher more concerned with qualitative phenomena than computational comparison can apply them collectively and obtain a more comprehensive understanding of how the texts change over time. This paper proposes an automatic system for dictionary discovery, and having reviewed the four component systems of word frequency, concordance\/collocation, scoring dictionaries and probabilistic modelling, we can now turn to their integration in an iterative process of content analysis.<\/p>\n<p>In a large, English language corpus, the words used most often exhibit a power-law distribution of frequency. Jonathan Harris\u2019s Wordcount.org project (2003) visually relays this phenomenon by graphing the frequencies of the 86,800 most common words in the British National Corpus (Harris, 2003), and confirms the often repeated observation that the most common words are those which contribute the least specific meaning in context. 
The top thirty ranked words play a grammatical, not a semantic, role: the, of, and, to, a, in, that, is, was, I, for, on, you, he, be, with, as, by, at, have, are, this, not, but, had, his, they, from, she.<\/p>\n<p>If we applied the Pareto Principle (or \u201c80-20 Rule\u201d) we would suggest that 80% of the word count in a corpus will come from the 20% most frequently used words and, if the Wordscores system can be abstracted, that the remaining 80% of words are more likely to meaningfully distinguish between texts. If we were to produce word frequency lists from snapshots of a very large corpus (e.g. from Section A of the New York Times for each month over the course of eight years) we would expect the top 20% of words on the monthly rank-ordered lists to appear static from month to month while high-impact news stories cause substantial movement in the ranking of relevant key words. Those volatile, low-frequency terms may be used as subject tags for texts or, as suggested by the Catastrophic Frequency pilot project, they may indicate historical terms with symbolic value (e.g. \u201c9\/11\u201d or \u201cNuremberg\u201d) that should be incorporated into scoring dictionaries.<\/p>\n<p>If we were then to investigate a specific topic within the full corpus (e.g. \u201cnetwork neutrality\u201d in the New York Times) we would probably start by limiting our corpus through the use of keywords. Our corpus might express researcher bias by excluding relevant texts simply because they failed to contain our initial keywords. We might overcome this risk by comparing word lists of the limited corpus to those of the general corpus in order to identify all keywords which show a relatively high frequency within our sample, and so discover similes and political euphemisms (e.g. 
\u201ctraffic management\u201d).<\/p>\n<p>The examples in the previous paragraph use noun phrases (\u201cnetwork neutrality\u201d and \u201ctraffic management\u201d) which pose the same challenge to machine-assisted content analysis systems as proper nouns (\u201cNew York Times\u201d and \u201cWorld Trade Center\u201d):<\/p>\n<blockquote><p>Only identifying occurrences of george w. bush, for example, would ignore equally valid references to president bush and george walker bush. Yet, a general query for bush would fail to distinguish the president&#8217;s last name from references to wilderness areas or woody perennial plants. (Scharl &amp; Weichselbraun, 2008)<\/p><\/blockquote>\n<p>The methodology described for the U.S. Election 2004 Web Monitor addresses this issue by listing the alternative equivalents of candidates\u2019 names, but results reveal the weakness of not extending the method to other noun phrases: the top ten keywords for coverage of President George W. Bush include both \u201ciraq\u201d and \u201cwar\u201d as distinct terms, but not the phrase \u201cIraq War\u201d (Scharl, Weichselbraun, &amp; Bauer, 2004). If the principles of collocation are valid, then this deficiency could be addressed by running word frequencies on small KWIC samples; if the probability that \u201cIraq\u201d will precede \u201cWar\u201d is greater than 25% then the researcher may want to investigate whether \u201cIraq War\u201d is a third term which merits a separate entry in the dictionary. If repeated for multiple iterations or with varying KWIC samples, clusters of concordance could identify larger noun phrases (e.g. \u201cweapons of mass destruction\u201d or \u201cenhanced interrogation technique\u201d), wordplay or a popular metaphor. 
Deviations from normal concordance may indicate the development of a similar but unrelated subject, as would be expected if Wordscores for Hurricane Gustav (2008 in Texas) were compared with those of Hurricane Katrina (2005 in Louisiana), and suggest a possible application of Bayesian statistics (Schrodt, 2006) to dictionary development.<\/p>\n<h3><a name=\"ipdc-8\"><\/a>Conclusion<\/h3>\n<p>While machine-assisted content analysis generally relies on one of four systems, the procedures described in this paper attempt to integrate frequency, co-occurrence, pre-coded dictionaries, and comparisons between texts into a single flexible package for qualitative measurement of extremely large and dynamic corpuses. The iterative process begins with the creation of a massive (and potentially growing) date-stamped corpus, from which frequency-ranked word lists can be extracted and compared; the full vocabulary of the corpus constitutes the un-scored dictionary of known words. High-frequency\/low-volatility words are classified as value-less and highly volatile words are classified as such in the dictionary. The corpus is sub-divided by volatile keyword and the identification of high- and low- volatility words is repeated. High-volatility keywords may identify changes in the subject\u2019s treatment over time and highly concordant words, along with irregular punctuation (i.e. capital letters found mid-sentence), may indicate the presence of noun phrases. In this fashion, a custom word list complete with Berry-Rogghe-approved context samples suitable for human-scoring can be produced very rapidly, and new, current event-related additions can be immediately brought to the attention of the researcher. 
If initial scoring is left to existing General Inquirer-inspired dictionaries, the human labour required is reduced to coding noun phrases, coinages, wordplays and metaphors which are machine-identified and reported as unknown but influential to the texts.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Drawing inspiration from General Inquirer (1966) and KWIC, this post proposes an iterative hybrid of available methods in a quest for a more flexible and robust machine-assisted content analysis system.<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"webmentions_disabled_pings":false,"webmentions_disabled":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[210],"tags":[101,43,42,44,45,23,102],"class_list":["post-1715","post","type-post","status-publish","format-standard","hentry","category-essays","tag-iterative-dictionary-construction","tag-programming","tag-scripting","tag-software","tag-unix","tag-usa","tag-word-frequency"],"_links":{"self":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/1715","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/comments?post=1715"}],"version-history":[{"count":1,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/1715\/revisions"}],"predecessor-version":[{"id":1841,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/posts\/1715\/revisions\/1841"}],"wp:attachment":[{"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\
/media?parent=1715"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/categories?post=1715"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opendna.com\/blog\/wp-json\/wp\/v2\/tags?post=1715"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}