An Iterative Process for Dictionary Construction

Drawing inspiration from the General Inquirer (1966) and KWIC, this post proposes an iterative hybrid of available methods in a quest for a more flexible and robust machine-assisted content analysis system.

The Mechanics of Iterative Dictionary Discovery

When using automated techniques to assist in large-scale content analysis, particularly with dynamic corpuses, the researcher is well served by tools that report changes and identify anomalies in the corpus. While the four systems described above can be (and are) used independently in content analysis studies, a researcher more concerned with qualitative phenomena than with computational comparison can apply them collectively to obtain a more comprehensive understanding of how the texts change over time. This paper proposes an automatic system for dictionary discovery; having reviewed the four component systems of word frequency, concordance/collocation, scoring dictionaries and probabilistic modelling, we can now turn to their integration in an iterative process of content analysis.

In a large English-language corpus, word frequencies exhibit a power-law distribution. Jonathan Harris’s Wordcount.org project (2003) visually relays this phenomenon by graphing the frequencies of the 86,800 most common words in the British National Corpus (Harris, 2003), and confirms the often-repeated observation that the most common words are those which contribute the least specific meaning in context. The highest-ranked words play grammatical, not semantic, roles: the, of, and, to, a, in, that, is, was, I, for, on, you, he, be, with, as, by, at, have, are, this, not, but, had, his, they, from, she.
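A rank-ordered frequency list of this kind is straightforward to produce. The sketch below, using a hypothetical toy text, shows how grammatical words dominate the top of the list even at a tiny scale:

```python
# Sketch: rank-frequency counts for a corpus; the sample text here is
# a hypothetical stand-in for real newspaper text.
from collections import Counter
import re

def rank_frequencies(text):
    """Return (word, count) pairs sorted by descending frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

sample = "the cat sat on the mat and the dog sat by the door"
ranks = rank_frequencies(sample)
# Grammatical words ("the") dominate the top of the list even in a toy text.
print(ranks[:3])
```

On a real corpus the same routine, run over millions of tokens, reproduces the power-law head that Wordcount.org visualises.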

If we applied the Pareto Principle (or “80-20 Rule”) we would suggest that 80% of the word count in a corpus comes from the 20% most frequently used words and, if the Wordscores system can be abstracted, that the remaining 80% of words are more likely to meaningfully distinguish between texts. If we were to produce word frequency lists from snapshots of a very large corpus (e.g. from Section A of the New York Times for each month over the course of eight years) we would expect the top 20% of words on the monthly rank-ordered lists to appear static from month to month, while high-impact news stories cause substantial movement in the ranking of relevant keywords. Those volatile, low-frequency terms may be used as subject tags for texts or, as suggested by the Catastrophic Frequency pilot project, they may indicate historical terms with symbolic value (e.g. “9/11” or “Nuremberg”) that should be incorporated into scoring dictionaries.
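The monthly comparison described above amounts to measuring rank shift between snapshots. A minimal sketch, assuming two frequency lists already exist (the word lists and the shift threshold below are hypothetical):

```python
# Sketch: flag volatile terms by comparing rank-ordered frequency lists
# from two monthly snapshots of a corpus.
def rank_map(freq_list):
    """Map each word to its rank (0 = most frequent)."""
    return {word: rank for rank, (word, _) in enumerate(freq_list)}

def volatile_terms(month_a, month_b, min_shift=2):
    """Words whose rank moved by at least `min_shift` between snapshots."""
    ra, rb = rank_map(month_a), rank_map(month_b)
    shared = ra.keys() & rb.keys()
    return {w for w in shared if abs(ra[w] - rb[w]) >= min_shift}

jan = [("the", 900), ("of", 700), ("in", 500), ("storm", 40), ("levee", 5)]
feb = [("the", 905), ("levee", 600), ("of", 698), ("in", 495), ("storm", 90)]
print(volatile_terms(jan, feb))  # "levee" jumps in rank; "the" stays put
```

The static high-frequency head ("the", "of") never trips the threshold, while a news-driven term surfaces immediately, which is exactly the behaviour the subject-tagging step relies on.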

If we were then to investigate a specific topic within the full corpus (e.g. “network neutrality” in the New York Times) we would probably start by limiting our corpus through the use of keywords. Such a corpus might express researcher bias by excluding relevant texts simply because they fail to contain our initial keywords. We might overcome this risk by comparing word lists of the limited corpus to those of the general corpus in order to identify all keywords which show a relatively high frequency within our sample, and so discover synonyms and political euphemisms (e.g. “traffic management”).
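That comparison of sample against general corpus can be sketched as a relative-frequency ratio test; the counts and the ratio cutoff below are hypothetical:

```python
# Sketch: surface candidate keywords that are proportionally over-represented
# in a keyword-limited sample relative to the full corpus.
def distinctive_words(sample_counts, corpus_counts, ratio=5.0):
    """Words at least `ratio` times more frequent, proportionally, in the sample."""
    sample_total = sum(sample_counts.values())
    corpus_total = sum(corpus_counts.values())
    out = []
    for word, n in sample_counts.items():
        base = corpus_counts.get(word, 0) / corpus_total
        rel = n / sample_total
        if base == 0 or rel / base >= ratio:
            out.append(word)
    return out

sample = {"the": 50, "neutrality": 8, "traffic": 6}
corpus = {"the": 50000, "neutrality": 10, "traffic": 40, "weather": 900}
print(distinctive_words(sample, corpus))
```

A word like "the" is frequent everywhere and drops out; "traffic" survives despite being common in the general corpus, which is how a euphemism such as "traffic management" would first come to the researcher's attention.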

The examples in the previous paragraph use noun phrases (“network neutrality” and “traffic management”) which pose as great a challenge to machine-assisted content analysis systems as proper nouns (“New York Times” and “World Trade Center”) do:

Only identifying occurrences of george w. bush, for example, would ignore equally valid references to president bush and george walker bush. Yet, a general query for bush would fail to distinguish the president’s last name from references to wilderness areas or woody perennial plants. (Scharl & Weichselbraun, 2008)

The methodology described for the U.S. Election 2004 Web Monitor addresses this issue by listing the alternative equivalents of candidates’ names, but the results reveal the weakness of not extending the method to other noun phrases: the top ten keywords for coverage of President George W. Bush include both “iraq” and “war” as distinct terms, but not the phrase “Iraq War” (Scharl, Weichselbraun, & Bauer, 2004). If the principles of collocation are valid, then this deficiency could be addressed by running word frequencies on small KWIC samples; if the probability that “Iraq” will precede “War” is greater than 25%, then the researcher may want to investigate whether “Iraq War” is a third term which merits a separate entry in the dictionary. If repeated for multiple iterations or with varying KWIC samples, clusters of concordance could identify larger noun phrases (e.g. “weapons of mass destruction” or “enhanced interrogation technique”), wordplay or a popular metaphor. Deviations from normal concordance may indicate the development of a similar but unrelated subject, as would be expected if Wordscores for Hurricane Gustav (2008, Louisiana) were compared with those of Hurricane Katrina (2005, Louisiana), and suggest a possible application of Bayesian statistics (Schrodt, 2006) to dictionary development.
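The 25% collocation test above reduces to a conditional successor probability over a token stream. A minimal sketch, with a hypothetical hand-built token list standing in for a KWIC sample:

```python
# Sketch: estimate how often "iraq" is immediately followed by "war" in a
# tokenised sample; the 25% threshold comes from the text, the tokens are
# hypothetical.
def successor_probability(tokens, first, second):
    """P(next token is `second` | current token is `first`)."""
    firsts = [i for i, t in enumerate(tokens[:-1]) if t == first]
    if not firsts:
        return 0.0
    hits = sum(1 for i in firsts if tokens[i + 1] == second)
    return hits / len(firsts)

tokens = ["the", "iraq", "war", "began", "in", "iraq", "last", "year",
          "iraq", "war", "coverage", "grew"]
p = successor_probability(tokens, "iraq", "war")
if p > 0.25:
    print("candidate dictionary entry: 'iraq war'")
```

In this sample two of the three occurrences of "iraq" precede "war", so the pair clears the 25% bar and is flagged for the researcher as a possible third term.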

Conclusion

While machine-assisted content analysis generally relies on one of four systems, the procedures described in this paper attempt to integrate frequency, co-occurrence, pre-coded dictionaries, and comparisons between texts into a single flexible package for qualitative measurement of extremely large and dynamic corpuses. The iterative process begins with the creation of a massive (and potentially growing) date-stamped corpus, from which frequency-ranked word lists can be extracted and compared; the full vocabulary of the corpus constitutes the unscored dictionary of known words. High-frequency, low-volatility words are classified in the dictionary as value-less, and highly volatile words are flagged as volatile. The corpus is then sub-divided by volatile keyword and the identification of high- and low-volatility words is repeated. High-volatility keywords may identify changes in the subject’s treatment over time, and highly concordant words, along with irregular punctuation (e.g. capital letters found mid-sentence), may indicate the presence of noun phrases. In this fashion, a custom word list complete with Berry-Rogghe-approved context samples suitable for human scoring can be produced very rapidly, and new, current-event-related additions can be immediately brought to the attention of the researcher. If initial scoring is left to existing General Inquirer-inspired dictionaries, the human labour required is reduced to coding the noun phrases, coinages, wordplay and metaphors which are machine-identified and reported as unknown but influential in the texts.
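The classification step at the heart of the loop can be sketched as a simple rule over the rank and volatility statistics gathered in earlier passes; the cutoff values below are hypothetical placeholders for thresholds the researcher would tune:

```python
# Sketch of the dictionary-classification step: each word is bucketed as
# value-less (high frequency, low volatility), volatile, or left unscored.
# The cutoffs are illustrative, not prescribed by the method.
def classify(freq_rank, rank_shift, top_cutoff=100, shift_cutoff=10):
    """Assign a dictionary class from a word's frequency rank and rank shift."""
    if freq_rank < top_cutoff and rank_shift < shift_cutoff:
        return "value-less"
    if rank_shift >= shift_cutoff:
        return "volatile"
    return "unscored"

print(classify(3, 0))      # a grammatical word like "the"
print(classify(5000, 40))  # a news-driven keyword
```

Everything left "unscored" after repeated passes is exactly the residue of noun phrases, coinages and metaphors that the conclusion assigns to human coders.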
