Thursday, September 14, 2017

Text-Mining Project

This fall I am working with Pradeep Maddipatla, a graduate student in Computer Science at Northern Illinois University, on a text-mining project in my field of historical research: nineteenth-century American economic and social policymaking, specifically the protective tariff. Our project will use topic modeling to explore how American legislators discussed this policy, but we also hope to shed light on the broader question of how they characterized state involvement in the economy and society, in both positive and negative terms.

This work uses a database of text materials drawn from the Congressional Record, 1876-1896, which was organized and made ready for text mining activities by Adam Frieberg, a graduate student in Geography at Northern Illinois University who is also employed full-time as a programmer/developer.

Pradeep Maddipatla is assisted in this work by Professor Hamed Alhoori of Northern Illinois University's Department of Computer Science.

We are working with the following proposal:

“Topic Modeling Tariff Debates in the United States Congress, 1876-1896”

Drew VandeCreek, Northern Illinois University Libraries
Adam Frieberg, Northern Illinois University Department of Geography
Pradeep Maddipatla, Northern Illinois University Department of Computer Science

This project will employ text-mining technology to explore the arguments that members of the United States Congress used to support and promote legislation setting tariffs in the period 1876-1896. Historians and political scientists have identified tariffs, which set a fee or tax to be paid on imported goods, as a significant political issue in the nineteenth-century United States. One scholar has called it “the most important economic policy of the nineteenth-century federal government” and, save slavery, the most consequential matter facing the American state in the nineteenth century overall.[1] Questions of tariff policy often captured Americans’ ambitions and anxieties about the nation’s future course of economic and political development. They also provided an opportunity to discuss the federal government’s proper role in society.

The United States Congress considered major tariff bills on many occasions in the nineteenth century, but the issue took a central place in American political discourse after the Civil War. The Union’s need for revenue (and Southern legislators’ absence from Congress) led Lincoln and congressional Republicans to enact high tariffs during the conflict. Postwar Republicans took an increasingly assertive protectionist stance, and successfully resisted Democrats’ corresponding attempts to reduce tariffs. In this context the policy became a virtual litmus test of party identification. Republicans repulsed reformers’ attempts to cut tariffs in the mid-1880s, and pushed still higher duties through Congress in 1890 and, after a modest setback in 1894, again in 1897.

Although the tariff played a prominent role in the late nineteenth century’s electoral politics, scholars have paid relatively scant attention to protectionists’ and their opponents’ arguments. Of those considering the matter, the political scientist Judith Goldstein has asserted that postwar tariff proponents relied on what scholars have called Free Labor appeals, which maintained that tariff-protected industrial workers’ high wages allowed them to save the money necessary to open their own businesses, thus achieving social mobility, or what Abraham Lincoln called the “right to rise.”[2] A leading intellectual historian has suggested that this argument became discredited and was abandoned in this period, however.[3] The political scientist John Gerring has emphasized Republicans’ other appeals to labor, as well as neo-mercantilism and statism, in defense of the policy, providing brief lists of words associated with each argument.[4] Scholars analyzing tariff reformers’ attacks on the policy have mentioned their description of it as a federal grant of special privilege to manufacturers at the expense of other members of the national community, especially in the postwar period’s context of industrial consolidation and increasingly public political corruption. Some nineteenth-century tariff critics also attacked the measure as undermining individual responsibility and encouraging workers to expect something for nothing.[5]  

These interpretations of tariff debates are built on a limited evidentiary base. The Congressional Record’s verbatim account of remarks on the floor of Congress begins in 1873. It consists of well over two million individual speeches or other utterances, totaling over 2.5 million sentences. Any scholar trained in the traditional analysis of political texts (i.e., reading them her or himself) would be hard-pressed to review, much less consider and evaluate, this mass of data in the period of time traditionally devoted to a dissertation or book project. In this light, scholars’ analyses of arguments and debates over the protective tariff have focused on assorted individual works of tariff boosters and opponents, including speeches in Congress and works of journalism, as well as early works of economics and the period’s broader discourse of social science. 

The Congressional Record is today available as a database of digital full-text materials, and scholars of literature, together with humanities-computing programmers and developers, have in recent years developed a methodology that can provide a new perspective on it. Using an approach that has proved useful in the analysis of a broad range of other very large data sets, they have turned computing power and algorithms to the examination of digital text collections, comprising many thousands of titles, that have recently become available from a number of sources. Where traditional practitioners devoted to the close reading of a limited number of selected texts have focused on specific, particular uses of language and shades of meaning to produce detailed, highly nuanced accounts and interpretations of the texts’ arguments, advocates of what Franco Moretti has called “distant reading” and Matthew Jockers “macroanalysis” seek to discover, visualize and explore quantifiable evidence of significant patterns within these much larger collections.[6] Jockers has emphasized that the analysis of literary work at scale allows researchers to move their studies beyond a focus on the very few works that critics and scholars have acclaimed as classic or otherwise outstanding examples of literary craft to include a larger cross-section of materials, “an aggregated ecosystem or ‘economy’ of texts.”[7] He goes on to conclude that computational work often supports what many perceive to be common knowledge about literary works, yet provides evidence for it, as opposed to casual observations.[8] He emphasizes the prospect of using close and distant reading together, exploring the relationships between specific expressions of belief or creativity and the larger context in which individual authors situate their arguments or stories.

Intellectual historians have long turned their attention to the close reading of specific texts, often focusing especially on individuals and works for which they can demonstrate subsequent influence. Political historians and political scientists have consistently studied beliefs and ideologies as important aspects of the history of electoral activity and governance, with an equal emphasis on tracing their genealogy and influence. The proposed project will use text-mining technology to build on these disciplines’ traditional practice in several ways.

The project will build on and use a set of applications and scripts developed in R by Adam Frieberg, as follows.

Congressional Record text materials prepared by ProQuest are stored in a relational database with an internal index system, built on Microsoft SQL Server Express with Advanced Services.  The R code is written in modules that have already structured much of the data.
Module: Ingester – R scripts used regular-expression pattern matching to recursively search the directory of files and find all .xml files in the ProQuest data source that match peer full-text PDF files. From what we can tell, the ProQuest XML files contain the full text of the PDFs, generated via OCR (Optical Character Recognition). The R code then built an index of the files by date and focused on the entire Congressional Record from 1876 to 1896. These two decades were chosen because of the “full text”/verbatim nature of the printed Congressional Record at the time, as well as their being the zenith of tariff debates in the late nineteenth century. The R code combed each file and identified speakers as well as the content of their speeches. This identification relied on the assumption that speeches always start with the string “Mr. ”. Candidates for speeches were then filtered to exclude the sections that began with procedural words (examples: “presented”, “introduced”, “submitted”, “a bill”, “petition”, “by unanimous”). The separated speeches were stored in a database table called Speeches1876to1896 and indexed by date, by speaker name, and by the full text of the speech. The scripts were run as a single-threaded process so that the stored records preserve the order in which the speeches appear in the Congressional Record.
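The speech-detection step described for the Ingester can be pictured with a short sketch. It is written in Python rather than the project's R, and it is not the project's code: the regular expression, the function name `split_speeches`, and the sample lines are all invented for illustration. Only the “Mr. ” prefix rule and the list of procedural words come from the description above.

```python
import re

# Procedural openers listed in the module description; a candidate whose
# text begins with one of these is not treated as a speech.
PROCEDURAL = ("presented", "introduced", "submitted", "a bill",
              "petition", "by unanimous")

# Assumption: a speech opens with "Mr. " followed by a surname, optionally
# followed by punctuation. The real OCR text is likely messier than this.
SPEECH_START = re.compile(
    r"^Mr\. (?P<speaker>[A-Z][A-Za-z'-]*)[.,:]?\s*(?P<rest>.*)$"
)

def split_speeches(lines):
    """Group lines into speeches, preserving their original order."""
    speeches = []
    current = None  # the speech that continuation lines are appended to
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = SPEECH_START.match(line)
        if match:
            rest = match.group("rest")
            if rest.lower().startswith(PROCEDURAL):
                current = None  # procedural entry: drop it and what follows
                continue
            current = {"speaker": match.group("speaker"), "text": rest}
            speeches.append(current)
        elif current is not None:
            current["text"] += " " + line
    return speeches

# Invented sample lines, not taken from the Congressional Record.
sample = [
    "Mr. SMITH. The tariff protects the wages of labor.",
    "It has built up our furnaces and mills.",
    "Mr. JONES presented a petition of citizens of his district.",
    "Referred to the Committee on Ways and Means.",
    "Mr. BROWN. I object to the amendment.",
]
speeches = split_speeches(sample)
```

Processing the lines in a single pass, as here, is one simple way to keep the stored speeches in the same sequence as the source text.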

Module: Sentiment Analyzer – R scripts produced a more granular resolution that separated every speech by sentence. The sentences were split on the standard period (“.”) character. The sentences were then quality controlled by filtering out fragments produced by abbreviations and by OCR errors. The exclusionary rules included filtering out any sentences that began with numeric characters (splitting on periods breaks a designation such as “H.R. 234,” a typical label for a House measure, into fragments, one of which begins with the number). The rules also excluded sentences beginning with the standard Congressional Record headers (“CONGRESSIONAL RECORD – SENATE” and “Also, a bill”). Sentences were then filtered to only those longer than 10 characters. This was a subjective threshold intended to retain sentences such as “Mr. COGHLAN: I concur” but not shorter utterances such as “Mr. Smith: Aye”. The R script then used an external third-party API (Microsoft’s Cognitive Services API) to generate sentiment analysis scores for every sentence surviving those filters in the 20 years of the Congressional Record. Those sentences are stored in the SpeechFragments20Yr database table and the sentiment analysis scores are stored in the SpeechFragments20YrSentimentAnalysis table.
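A minimal sketch of the sentence-splitting and filtering rules just described, again in Python rather than the project's R. The function name `split_sentences`, the constant names, and the sample text are assumptions made for illustration; the call to the sentiment API is deliberately left out.

```python
# Headers named in the module description. Matching only the start of the
# first header avoids depending on which dash character the OCR produced.
HEADER_PREFIXES = ("CONGRESSIONAL RECORD", "Also, a bill")
MIN_LENGTH = 10  # keep only sentences longer than 10 characters

def split_sentences(speech_text):
    """Split on periods, then apply the exclusion rules in turn."""
    kept = []
    for piece in speech_text.split("."):
        sentence = piece.strip()
        if len(sentence) <= MIN_LENGTH:
            continue  # too short (also drops empty fragments)
        if sentence[0].isdigit():
            continue  # e.g. the "234 ..." left over from "H.R. 234"
        if sentence.startswith(HEADER_PREFIXES):
            continue  # page header or procedural boilerplate
        kept.append(sentence)
    return kept

# Invented sample text, not taken from the Congressional Record.
text = ("I rise to oppose the bill H.R. 234 as reported. "
        "CONGRESSIONAL RECORD - SENATE. Aye. "
        "The duty on wool should stand.")
sentences = split_sentences(text)
# sentences == ["I rise to oppose the bill H",
#               "The duty on wool should stand"]
```

The first retained fragment ends at “H” because the split treats every period as a sentence boundary. The filters remove the numeric remainder but cannot rejoin the sentence, which is one reason spot checks of the stored fragments remain useful.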

Module: Index Database Views - The combination of the three prior-mentioned database tables yields a corpus of text that is indexed by speaker, time, and sentiment.  Many of the over two million individual speeches reflected in the speech indexes are clearly portions of back-and-forth utterances. This module provides a way to diagnose these speeches. The views link individual fragments of speech with parent speech objects that are then identifiable by speaker. Records have ID fields to keep them in the sequence they appeared within the print version of the Congressional Record, moving forward in time. 
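To illustrate how such a view might link fragments to their parent speeches, the sketch below uses Python's built-in SQLite module as a stand-in for Microsoft SQL Server Express. The three table names come from the description above; every column name, the view name `FragmentIndex`, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table names follow the proposal; all column names are assumed.
CREATE TABLE Speeches1876to1896 (
    SpeechId INTEGER PRIMARY KEY, SpeechDate TEXT, Speaker TEXT, FullText TEXT);
CREATE TABLE SpeechFragments20Yr (
    FragmentId INTEGER PRIMARY KEY, SpeechId INTEGER, Sentence TEXT);
CREATE TABLE SpeechFragments20YrSentimentAnalysis (
    FragmentId INTEGER PRIMARY KEY, Score REAL);

-- Hypothetical view: each sentence with its speaker, date, and score.
-- LEFT JOIN keeps fragments that have no sentiment score yet.
CREATE VIEW FragmentIndex AS
SELECT f.FragmentId, s.SpeechId, s.SpeechDate, s.Speaker, f.Sentence, a.Score
FROM SpeechFragments20Yr AS f
JOIN Speeches1876to1896 AS s ON s.SpeechId = f.SpeechId
LEFT JOIN SpeechFragments20YrSentimentAnalysis AS a
       ON a.FragmentId = f.FragmentId;
""")

# Invented rows; the scores are placeholders, not real API output.
conn.executemany("INSERT INTO Speeches1876to1896 VALUES (?, ?, ?, ?)", [
    (1, "1888-05-01", "SMITH", "The duty on wool should stand. It protects American labor."),
    (2, "1888-05-01", "BROWN", "I object to the amendment."),
])
conn.executemany("INSERT INTO SpeechFragments20Yr VALUES (?, ?, ?)", [
    (1, 1, "The duty on wool should stand"),
    (2, 1, "It protects American labor"),
    (3, 2, "I object to the amendment"),
])
conn.executemany("INSERT INTO SpeechFragments20YrSentimentAnalysis VALUES (?, ?)", [
    (1, 0.8), (2, 0.9),
])

# Ordering by the ID fields reproduces the sequence of the printed record.
rows = conn.execute(
    "SELECT Speaker, Sentence, Score FROM FragmentIndex "
    "ORDER BY SpeechId, FragmentId"
).fetchall()
```

Ordering by sequential ID fields is what lets a reader follow a back-and-forth exchange in the order it occurred on the floor.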

Module: Topic Modeler – Pradeep will investigate modern topic modeling approaches, including Mallet and Gensim. He will consult with Dr. VandeCreek and provide sample output. Together, they will select the approach to be used in the final analysis. The goals of this topic modeling are to: 1) inform Dr. VandeCreek’s navigation of the full corpus in further research; 2) identify prominent topics as they may correspond to existing historical and Political Science scholarship’s description of pro- and anti-tariff arguments in this period; 3) determine whether the prominence of specific topics changes over time; 4) use visualization applications to illustrate these changes for an audience unfamiliar with data science.
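For readers unfamiliar with topic modeling, the toy sketch below fits a latent Dirichlet allocation (LDA) model with a collapsed Gibbs sampler in plain Python. It is a teaching illustration only: it does not use Mallet or Gensim or imitate their interfaces, and the function name `fit_lda`, the parameter values, and the four tiny documents are all invented. A real analysis would rely on one of those toolkits.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed share of each topic in each document (each row sums to 1).
    doc_shares = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Most frequent words per topic, skipping words whose count fell to zero.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return doc_shares, top_words

# Invented, pre-tokenized mini-documents.
docs = [
    "protection wages labor industry wages".split(),
    "labor wages protection home industry".split(),
    "monopoly privilege tax farmer tax".split(),
    "tax monopoly farmer privilege burden".split(),
]
doc_shares, top_words = fit_lda(docs, num_topics=2)
```

The two outputs correspond to what a historian would inspect in practice: the words that characterize each topic, and the share of each document attributed to each topic.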

Using the above techniques, the project will first address the challenge of identifying which of the available congressional text materials discussed tariff legislation, and whether each supported or opposed a tariff bill, by using basic word search functionality, text classification, sentiment analysis, and a freely available API providing information about members of Congress and their voting histories. A machine-generated review of the Congressional Record for the period under consideration has identified a specific set of speeches, inserted documents and other utterances including the word “tariff” and/or several synonymous or related terms, including “duty/duties,” “impost(s),” “levy,” and “excise,” as well as the words “protection” and “protective,” which scholarship in History and Political Science shows were widely used to describe the policy. Project participants will next move to create two sets of documents: those supporting the tariff and those opposing it. As a first approach, Dr. VandeCreek will assemble training sets of speeches and other documents known to express pro- and anti-tariff arguments, and then ask text mining software (which?) to compare the words and patterns of words in each to those found in a set of unclassified works. This will produce a result in which the software predicts the likelihood that each unclassified document argues for or against the tariff. As a second approach, Microsoft Azure’s sentiment analysis application will measure the degree to which speeches discussing the tariff express positive or negative sentiment, with the working hypothesis that pro-tariff speeches will express more positive sentiment and anti-tariff speeches more negative sentiment. Project participants will check these results against each other and make use of the ProPublica Congress API to ascertain how the member of Congress responsible for a given speech, utterance or other text voted on the legislation that it addressed.
Dr. VandeCreek will also make close readings of a number of randomly selected texts in the sets produced by the above means in order to determine if they have produced sufficiently accurate collections of pro- and anti-tariff text. 
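The training-set approach to classification can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is only one possible choice, since the proposal leaves the software open, and the function names and the four training sentences below are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count words per label from (label, text) pairs."""
    word_counts, doc_counts = {}, Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def predict_proba(model, text):
    """Return the estimated probability of each label for one document."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, counts in model["word_counts"].items():
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        log_scores[label] = score
    # Convert log scores to probabilities in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: e / norm for label, e in exp_scores.items()}

# Invented training sentences standing in for hand-labeled speeches.
training = [
    ("pro", "protection secures high wages for american labor"),
    ("pro", "the duty shields our home industry and labor"),
    ("anti", "the tariff is a tax that grants special privilege to monopoly"),
    ("anti", "this tax burdens the farmer and enriches monopoly"),
]
model = train(training)
probs = predict_proba(model, "protection for american labor and home industry")
```

Here `probs` maps each label to an estimated probability, which mirrors the proposal's aim of predicting the likelihood that an unclassified document argues for or against the tariff. Close reading of a random sample, as described above, remains the check on whether those estimates can be trusted.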

Having produced a set of pro- and anti-tariff documents, project staff members will next use the topic modeling software Mallet and/or Gensim to examine the sets of words that tariff proponents and opponents used to praise or condemn the policy in the period 1876-1896. Project staff members will identify individual pieces of tariff legislation that came to the floor of Congress for debate in this period, and separate those texts identified as discussing the tariff into sub-sets of materials specifically pertaining to each bill (for example, the Tariff of 1883, also known as the Mongrel Tariff due to its tepid reforms; the Mills Bill of 1888, which unsuccessfully proposed lower tariffs; and the McKinley Tariff of 1890, which produced dramatically increased tariffs). This will produce a division of materials reflecting the progress of tariff debates over time.

Project participants will construct several topic models for pro- and anti-tariff speeches for each bill, and analyze whether, and if so how, the arguments that members of Congress made for and against the policy changed over time. Using visualization software, they will present this data for review by historians and other interested parties who are likely to be unfamiliar with topic modeling or other text mining technologies.
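One simple way to compare topic prominence across bills is to average each topic's share over the speeches attached to a bill. The sketch below assumes that each speech already carries a bill label and a list of topic shares from a fitted topic model; the function name `mean_topic_share_by_bill` is hypothetical, and the numbers are placeholders rather than results.

```python
def mean_topic_share_by_bill(records):
    """Average the topic shares of all speeches attached to each bill.

    records: iterable of (bill_name, [share_topic_0, share_topic_1, ...]).
    """
    sums, counts = {}, {}
    for bill, shares in records:
        if bill not in sums:
            sums[bill] = [0.0] * len(shares)
            counts[bill] = 0
        for topic, share in enumerate(shares):
            sums[bill][topic] += share
        counts[bill] += 1
    return {
        bill: [total / counts[bill] for total in totals]
        for bill, totals in sums.items()
    }

# Placeholder values only: two topics, a handful of imaginary speeches.
records = [
    ("Tariff of 1883", [0.75, 0.25]),
    ("Tariff of 1883", [0.25, 0.75]),
    ("Mills Bill of 1888", [0.5, 0.5]),
    ("McKinley Tariff of 1890", [0.25, 0.75]),
]
averages = mean_topic_share_by_bill(records)
# averages["Tariff of 1883"] == [0.5, 0.5]
```

A table of this kind, with one row per bill in chronological order, is the sort of summary that visualization software could then chart for readers unfamiliar with topic modeling.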

More specific research questions to be explored may include:

What topics most characterized pro- and anti-tariff arguments in the period 1876-1896?

Did these topics or arguments change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Goldstein describes as the Free Labor appeal? If so, how many? Does their prominence change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Gerring describes as the labor, neo-mercantilist and statist appeals? If so, how many? Does their prominence change over time?

Of the topics produced from a review of anti-tariff texts, do any include references to special privilege? To political corruption? To the undermining of individual responsibility and self-reliance? If so, how many? Does their prominence change over time?

These results will provide an opportunity to explore how postwar members of Congress discussed the prospect of a federal activity directing the course of economic and social change in the United States as it related to a policy that historians and political scientists have identified as among the century’s most significant. Project participants will present data addressing the above questions in a series of conference presentations, publications and/or reports to an audience of historians, political scientists and digital humanities scholars. They will use visualization software to present findings and illustrate interpretive discussion, especially in work directed toward the first two groups, members of which are likely to be unfamiliar with topic modeling or other text mining technologies. 

[1] J. J. Pincus “Tariffs” Encyclopedia of American Economic History (New York: Charles Scribner’s Sons, 1980) 439; “Tariff Policies” Encyclopedia of American Political History (New York: Charles Scribner’s Sons, 1984) 1259. Other works emphasizing the tariff’s importance in nineteenth-century American politics include Charles and Mary Beard The Rise of American Civilization (New York: Macmillan, 1927); H. Wayne Morgan From Hayes to McKinley (Syracuse: Syracuse University Press, 1969); Lewis Gould “The Republican Search for a National Majority” in The Gilded Age: A Reappraisal H. Wayne Morgan, ed., (Syracuse: Syracuse University Press, 1970); Morton Keller Regulating a New Economy: Public Policy and Economic Change in America, 1900-1933 (Cambridge: Harvard University Press, 1990); Richard F. Bensel Yankee Leviathan: The Origins of Central State Authority in America, 1859-1877 (New York: Cambridge University Press, 1990) and The Political Economy of American Industrialization, 1877-1900 (New York: Cambridge University Press, 2000); Judith Goldstein Ideas, Interests and American Trade Policy (Ithaca: Cornell University Press, 1993); Joanne Reitano The Tariff Question: The Great Debate of 1888 (University Park, PA: Penn State University Press, 1994); John Gerring “Party Ideology in America: The National Republican Chapter, 1828-1924” Studies in American Political Development, 11 (Spring, 1997) 44-108; Rebecca Edwards Angels in the Machinery: Gender in American Party Politics from the Civil War to the Progressive Era (New York: Oxford University Press, 1997); Morton Keller “Trade Policy in Historical Perspective” in Taking Stock: American Government in the Twentieth Century, Morton Keller and R. Shep Melnick, eds. (New York: Cambridge University Press, 1999); Charles W. Calhoun “James G. Blaine and the Republican Party Vision” in The Human Tradition in the Gilded Age and Progressive Era, Ballard Campbell, ed., (Wilmington, DE: SR Books, 2000).

[2] Authors emphasizing the Free Labor argument for the tariff include Goldstein Ideas, Interests and American Trade Policy; George B. Mangold, “The Labor Argument in the American Protective Tariff Discussion,” Bulletin of the University of Wisconsin, no. 246 (1906): passim; Frank Taussig, The Tariff History of the United States, 8th ed. (New York, 1931), 65–6; Eric Foner, Free Soil, Free Labor, Free Men: The Ideology of the Republican Party Before the Civil War (New York, 1970), 20–1; Dorothy Ross, Origins of American Social Science (New York, 1990), 47–8; Michael Holt, The Rise and Fall of the American Whig Party (New York, 1999), 69–70, 952 (quotation at 69); Gabor Boritt, Lincoln and the Economics of the American Dream (Memphis, 1978), 99, 113, 139.

[3] Dorothy Ross states that Free Labor ideology quickly faded from use after the Civil War in Origins of American Social Science, 48.

[4] Gerring, “Party Ideology in America: The National Republican Chapter.”

[5] Keller “Trade Policy in Historical Perspective” 19. 

[6] Matthew Jockers Macroanalysis: Digital Methods and Literary History (Urbana: University of Illinois Press, 2013) 20; Franco Moretti Distant Reading (London: Verso, 2013).

[7] Jockers Macroanalysis, 32.

[8] Jockers Macroanalysis, 30.
