Thursday, September 14, 2017

Text-Mining Project

This fall I am working with Pradeep Maddipatla, a graduate student in Computer Science at Northern Illinois University, on a text mining project involving my field of historical research: nineteenth-century American economic and social policymaking, namely the protective tariff. Our project will use topic modeling to explore how American legislators discussed this policy, but we also hope to shed light on the broader question of how they characterized state involvement in the economy and society, in both positive and negative terms.

This work uses a database of text materials drawn from the Congressional Record, 1876-1896, which was organized and made ready for text mining activities by Adam Frieberg, a graduate student in Geography at Northern Illinois University who is also employed full-time as a programmer/developer.

Pradeep Maddipatla is assisted in this work by Professor Hamed Alhoori of Northern Illinois University's Department of Computer Science.

We are working with the following proposal:

“Topic Modeling Tariff Debates in the United States Congress, 1876-1896”

Drew VandeCreek, Northern Illinois University Libraries
Adam Frieberg, Northern Illinois University Department of Geography
Pradeep Maddipatla, Northern Illinois University Department of Computer Science

This project will employ text-mining technology to explore the arguments that members of the United States Congress used to support and promote legislation setting tariffs in the period 1876-1896. Historians and political scientists have identified tariffs, which set a fee or tax to be paid on imported goods, as a significant political issue in the nineteenth-century United States. One scholar has called it “the most important economic policy of the nineteenth-century federal government” and, save slavery, the most consequential matter facing the American state in the nineteenth century overall.[1] Questions of tariff policy often captured Americans’ ambitions and anxieties about the nation’s future course of economic and political development. They also provided an opportunity to discuss the federal government’s proper role in society.

The United States Congress considered major tariff bills on many occasions in the nineteenth century, but the issue took a central place in American political discourse after the Civil War. The Union’s need for revenue (and Southern legislators’ absence from Congress) led Lincoln and congressional Republicans to enact a high tariff during the conflict. Postwar Republicans took an increasingly assertive protectionist stance, and successfully resisted Democrats’ corresponding attempts to reduce tariffs. In this context the policy became a virtual litmus test of party identification. Republicans repulsed reformers’ attempts to cut tariffs in the mid-1880s, and pushed still higher duties through Congress in 1890 and, after a modest setback in 1894, again in 1897.

Although the tariff played a prominent role in the late nineteenth century’s electoral politics, scholars have paid relatively scant attention to protectionists’ and their opponents’ arguments. Of those considering the matter, the political scientist Judith Goldstein has asserted that postwar tariff proponents relied on what scholars have called Free Labor appeals, which maintained that tariff-protected industrial workers’ high wages allowed them to save the money necessary to open their own businesses, thus achieving social mobility, or what Abraham Lincoln called the “right to rise.”[2] A leading intellectual historian has suggested that this argument became discredited and was abandoned in this period, however.[3] The political scientist John Gerring has emphasized Republicans’ other appeals to labor, as well as neo-mercantilism and statism, in defense of the policy, providing brief lists of words associated with each argument.[4] Scholars analyzing tariff reformers’ attacks on the policy have mentioned their description of it as a federal grant of special privilege to manufacturers at the expense of other members of the national community, especially in the postwar period’s context of industrial consolidation and increasingly public political corruption. Some nineteenth-century tariff critics also attacked the measure as undermining individual responsibility and encouraging workers to expect something for nothing.[5]  

These interpretations of tariff debates are built on a limited evidentiary base. The Congressional Record’s verbatim account of remarks on the floor of Congress begins in 1873. It consists of well over two million individual speeches or other utterances, totaling over 2.5 million sentences. Any scholar trained in the traditional analysis of political texts (i.e., reading them her or himself) would be hard-pressed to review, much less consider and evaluate, this mass of data in the period of time traditionally devoted to a dissertation or book project. In this light, scholars’ analyses of arguments and debates over the protective tariff have focused on assorted individual works of tariff boosters and opponents, including speeches in Congress and works of journalism, as well as early works of economics and the period’s broader discourse of social science. 

The Congressional Record is today available as a database of digital full-text materials, and scholars of literature, working with humanities computing programmers and developers, have in recent years developed a methodology that can provide a new perspective on it. Using an approach that has proved useful in the analysis of a broad range of other very large data sets, they have turned computing power and algorithms to the examination of digital text collections, comprising many thousands of titles, that have recently become available from a number of sources. Where traditional practitioners devoted to the close reading of a limited number of selected texts have focused on specific, particular uses of language and shades of meaning to produce detailed, highly nuanced accounts and interpretations of the texts’ arguments, advocates of what Franco Moretti has called “distant reading” and Matthew Jockers “macroanalysis” seek to discover, visualize and explore quantifiable evidence of significant patterns within these much larger collections.[6] Jockers has emphasized that the analysis of literary work at scale allows researchers to move their studies beyond a focus on the very few works that critics and scholars have acclaimed as classic or otherwise outstanding examples of literary craft to include a larger cross-section of materials, “an aggregated ecosystem or ‘economy’ of texts.”[7] He goes on to conclude that computational work often supports what many perceive to be common knowledge about literary works, but grounds it in evidence rather than casual observation.[8] He emphasizes the prospect of using close and distant reading together, exploring the relationships between specific expressions of belief or creativity and the larger context in which individual authors situate their arguments or stories.

Intellectual historians have long turned their attention to the close reading of specific texts, often focusing especially on individuals and works for which they can demonstrate subsequent influence. Political historians and political scientists have consistently studied beliefs and ideologies as important aspects of the history of electoral activity and governance, with an equal emphasis on tracing their genealogy and influence. The proposed project will use text-mining technology to build on these disciplines’ traditional practice in several ways.

The project will build on and use a set of applications and scripts developed in R by Adam Frieberg, as follows.

Congressional Record text materials prepared by ProQuest are stored in a relational database with an internal index system, built on Microsoft SQL Server Express with Advanced Services.  The R code is written in modules that have already structured much of the data.
Module: Ingester – R scripts used regular-expression pattern matching to search the directory of files recursively and find all .xml files in the ProQuest data source that have matching full-text PDF files. From what we can tell, the ProQuest XML files contain the full text of the PDFs, generated via OCR (Optical Character Recognition). The R code then built an index of the files by date and focused on the entire Congressional Record from 1876 to 1896. These two decades were chosen because of the “full text”/verbatim nature of the printed Congressional Record at the time, and because they marked the zenith of tariff debates in the late nineteenth century. The R code combed each file and identified speakers as well as the content of their speeches. This identification relied on speeches reliably starting with the string “Mr. ”. Candidates for speeches were then filtered to exclude the sections that began with procedural words (examples: “presented”, “introduced”, “submitted”, “a bill”, “petition”, “by unanimous”). The separated speeches were stored in a database table called Speeches1876to1896 and indexed by date, by speaker name, and by the full text of the speeches. The scripts ran as a single-threaded process so that the stored records would preserve their order within the Congressional Record.
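The speech-splitting step described above can be sketched roughly as follows. This is an illustration in Python rather than the project's R code, and it makes several assumptions not stated in the proposal: the function name split_speeches is hypothetical, and the pattern assumes that a capitalized surname follows “Mr. ” and that a speech runs until the next such marker.

```python
import re

# Procedural openers taken from the examples listed in the module description.
PROCEDURAL = ("presented", "introduced", "submitted", "a bill", "petition", "by unanimous")

# Assumption: a speech candidate starts at "Mr. " plus a capitalized surname.
SPEAKER = re.compile(r"Mr\. ([A-Z][A-Za-z'\-]+)[.:,]?\s+")

def split_speeches(page_text):
    """Return (speaker, text) pairs in the order they appear in page_text."""
    matches = list(SPEAKER.finditer(page_text))
    speeches = []
    for i, match in enumerate(matches):
        # A candidate runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        body = page_text[match.end():end].strip()
        if body.lower().startswith(PROCEDURAL):
            continue  # procedural entry, not a speech
        speeches.append((match.group(1), body))
    return speeches

sample = (
    "Mr. MILLS. I oppose this duty on wool. "
    "Mr. KELLEY presented a petition of iron workers. "
    "Mr. MCKINLEY. The tariff protects American labor."
)
speeches = split_speeches(sample)
```

A rule this simple also splits a speech wherever one member mentions another by name (“as Mr. Smith said”), which is consistent with the many short back-and-forth fragments the index views are meant to help sort out.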

Module: Sentiment Analyzer – R scripts produced a more granular resolution by separating every speech into sentences. The sentences were split on the standard period (“.”) character. They were then quality controlled by filtering out fragments produced by abbreviations and OCR errors. The exclusionary rules removed any sentences that began with numeric characters (a designation such as “H.R. 234,” typical for House bills, leaves a fragment beginning with a number when split on periods). They also excluded sentences beginning with the standard Congressional Record headers (“CONGRESSIONAL RECORD – SENATE” and “Also, a bill”). The remaining sentences were then limited to those longer than 10 characters. This was a subjective way to ensure it would retain sentences such as “Mr. COGHLAN: I concur” but not include shorter utterances such as “Mr. Smith: Aye”. The R script then used an external third-party API (Microsoft’s Cognitive Services API) to generate sentiment analysis scores for every sentence surviving those filters in the 20 years of the Congressional Record. Those sentences are stored in the SpeechFragments20Yr database table and the sentiment analysis scores are stored in the SpeechFragments20YrSentimentAnalysis table.
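As a rough illustration of the sentence filters described above, here is a minimal Python sketch (the project's own scripts are in R). The function name split_sentences and the order in which the rules are applied are assumptions. The call to the sentiment API is deliberately left out.

```python
# Header strings named in the module description.
HEADERS = ("CONGRESSIONAL RECORD", "Also, a bill")
MIN_LENGTH = 10  # characters; the threshold named in the description

def split_sentences(speech_text):
    """Split on periods, then drop fragments that match the exclusion rules."""
    kept = []
    for fragment in speech_text.split("."):
        sentence = fragment.strip()
        if len(sentence) <= MIN_LENGTH:
            continue  # too short: stray abbreviation pieces, one-word replies
        if sentence[0].isdigit():
            continue  # e.g. the "234 ..." fragment left over from "H.R. 234"
        if sentence.startswith(HEADERS):
            continue  # running header or procedural boilerplate
        kept.append(sentence)
    return kept

sample = (
    "I rise to oppose the bill. H.R. 234 was read twice. Aye. "
    "CONGRESSIONAL RECORD - SENATE. The duty on wool burdens the farmer."
)
sentences = split_sentences(sample)
```

In this sample the pieces “H” and “R” fall to the length rule and the “234 was read twice” fragment falls to the numeric rule, which shows how the filters work together to clean up after a naive split on periods.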

Module: Index Database Views – The three database tables described above combine to yield a corpus of text that is indexed by speaker, time, and sentiment. Many of the over two million individual speeches reflected in the speech indexes are clearly portions of back-and-forth utterances. This module provides a way to identify these exchanges. The views link individual fragments of speech with parent speech objects that are then identifiable by speaker. Records have ID fields to keep them in the sequence they appeared within the print version of the Congressional Record, moving forward in time.
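The sketch below shows one way such a view could link the three tables. It uses Python's built-in sqlite3 module as a stand-in for SQL Server. The table names come from the description above, but every column name, the view name FragmentIndex, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Speeches1876to1896 (SpeechID INTEGER PRIMARY KEY, SpeechDate TEXT, Speaker TEXT);
CREATE TABLE SpeechFragments20Yr (FragmentID INTEGER PRIMARY KEY, SpeechID INTEGER, Sentence TEXT);
CREATE TABLE SpeechFragments20YrSentimentAnalysis (FragmentID INTEGER PRIMARY KEY, Score REAL);

-- Hypothetical view: each sentence with its parent speech and its score, if any.
CREATE VIEW FragmentIndex AS
SELECT f.FragmentID, s.SpeechID, s.SpeechDate, s.Speaker, f.Sentence, a.Score
FROM SpeechFragments20Yr AS f
JOIN Speeches1876to1896 AS s ON s.SpeechID = f.SpeechID
LEFT JOIN SpeechFragments20YrSentimentAnalysis AS a ON a.FragmentID = f.FragmentID;
""")

# Invented placeholder rows, not drawn from the Congressional Record.
conn.executemany("INSERT INTO Speeches1876to1896 VALUES (?, ?, ?)",
                 [(1, "1888-01-01", "SMITH"), (2, "1888-01-01", "JONES")])
conn.executemany("INSERT INTO SpeechFragments20Yr VALUES (?, ?, ?)",
                 [(1, 1, "The duty on wool burdens the farmer"),
                  (2, 1, "I ask that it be reduced"),
                  (3, 2, "Protection sustains American wages")])
conn.executemany("INSERT INTO SpeechFragments20YrSentimentAnalysis VALUES (?, ?)",
                 [(1, 0.21), (3, 0.88)])

# Ordering by the ID fields reproduces the printed sequence.
rows = conn.execute(
    "SELECT Speaker, Sentence, Score FROM FragmentIndex ORDER BY SpeechID, FragmentID"
).fetchall()
```

The LEFT JOIN keeps sentences that have no sentiment score yet, so gaps in the scoring show up as empty values rather than as missing rows.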

Module: Topic Modeler – Pradeep will investigate modern topic modeling approaches, including Mallet and Gensim. He will consult with Dr. VandeCreek and provide sample output. Together, they will select the approach to be used in the final analysis. The goals of this topic modeling are to 1) inform Dr. VandeCreek’s navigation of the full corpus in further research; 2) identify prominent topics as they may correspond to existing historical and Political Science scholarship’s description of pro- and anti-tariff arguments in this period; 3) determine if the prominence of specific topics changes over time; and 4) use visualization applications to illustrate these changes for an audience unfamiliar with data science.
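For readers unfamiliar with topic modeling, the toy Python sketch below fits latent Dirichlet allocation (LDA), a widely used topic model, by collapsed Gibbs sampling. It is not the Mallet or Gensim implementation and is far too slow for a corpus of this size. The function names, parameter defaults and sample documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling over a list of token lists.

    Returns (topic_word, doc_topic): word counts per topic, topic counts per doc.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{} for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    def adjust(d, w, z, delta):
        # Add or remove one token of word w in document d from topic z.
        doc_topic[d][z] += delta
        topic_word[z][w] = topic_word[z].get(w, 0) + delta
        topic_total[z] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, z in zip(doc, assignments[d]):
            adjust(d, w, z, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                adjust(d, w, assignments[d][i], -1)
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                adjust(d, w, z, +1)
    return topic_word, doc_topic

def top_words(topic_word, n=5):
    """List the n words most often assigned to each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n] for counts in topic_word]

# Invented token lists standing in for pre-processed speeches.
docs = [
    ["protection", "wages", "labor", "industry", "wages"],
    ["monopoly", "privilege", "tax", "consumer", "tax"],
    ["labor", "wages", "protection", "home", "market"],
    ["tax", "privilege", "trusts", "monopoly", "consumer"],
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word, n=3))
```

Each topic comes back as a ranked list of words, which a historian must then interpret. That is the point at which the model's output would be compared with the arguments identified in existing scholarship.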

Using the above techniques, the project will first address the challenge of identifying which of the available congressional text materials discussed tariff legislation, and whether each supported or opposed a tariff bill, by using basic word search functionality, text classification, sentiment analysis, and a freely available API providing information about members of Congress and their voting histories. A machine-generated review of the Congressional Record for the period under consideration has identified a specific set of speeches, inserted documents and other utterances including the word “tariff” and/or several synonymous or related terms, including “duty/duties,” “impost(s),” “levy,” and “excise,” as well as the words “protection” and “protective,” which scholarship in History and Political Science shows were widely used to describe the policy. Project participants will next move to create two sets of documents: those supporting the tariff and those opposing it. They will take two approaches. In the first, Dr. VandeCreek will assemble training sets of speeches and other documents known to express pro- and anti-tariff arguments, and then use text mining software (yet to be selected) to compare the words and patterns of words in each to those found in a set of unclassified works. This will produce a result in which the software predicts the likelihood that each unclassified document argues for or against the tariff. In the second, Microsoft Azure’s sentiment analysis application will measure the degree to which speeches discussing the tariff express positive or negative sentiment, with the working hypothesis that pro-tariff speeches will express more positive sentiment and anti-tariff speeches more negative sentiment. Project participants will check these results against each other and make use of the ProPublica Congress API to ascertain how the member of Congress responsible for a given speech, utterance or other text voted on the legislation that it addressed.
Dr. VandeCreek will also make close readings of a number of randomly selected texts in the sets produced by the above means in order to determine if they have produced sufficiently accurate collections of pro- and anti-tariff text. 
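The word search and the training-set classification described above could look roughly like the following Python sketch. The classifier shown is a multinomial naive Bayes model, chosen here only as a simple, well-known example. The proposal does not name the software to be used, and the function names and the two training sentences are invented for illustration.

```python
import math
import re
from collections import Counter

# Search terms listed in the proposal.
TARIFF_TERMS = re.compile(
    r"\b(tariffs?|duty|duties|imposts?|levy|excise|protection|protective)\b",
    re.IGNORECASE,
)

def mentions_tariff(text):
    """Basic word search: does the text contain any of the tariff terms?"""
    return TARIFF_TERMS.search(text) is not None

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_texts):
    """Count words and documents per label from (text, label) pairs."""
    word_counts, doc_counts = {}, Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def predict(text, word_counts, doc_counts):
    """Return {label: probability} using naive Bayes with add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores to probabilities that sum to one.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: value / norm for label, value in exp.items()}

# Invented training examples, one per class.
training = [
    ("protection secures high wages for american labor", "pro"),
    ("the tariff is a tax that grants privilege to monopoly", "anti"),
]
word_counts, doc_counts = train(training)
probabilities = predict("protection raises wages", word_counts, doc_counts)
```

A plain word search will also catch unrelated uses of terms such as “duty” (as in “it is my duty”), which is one reason the classification step and the close reading of sampled texts matter.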

Having produced a set of pro- and anti-tariff documents, project staff members will next use the topic modeling software Mallet and/or Gensim to examine the sets of words that tariff proponents and opponents used to praise or condemn the policy in the period 1876-1896. Project staff members will identify individual pieces of tariff legislation that came to the floor of Congress for debate in this period, and separate those texts identified as discussing the tariff into sub-sets of materials specifically pertaining to each bill (for example, the Tariff of 1883, also known as the Mongrel Tariff due to its tepid reforms; the Mills Bill of 1888, which unsuccessfully proposed lower tariffs; and the McKinley Tariff of 1890, which produced dramatically increased tariffs). This will produce a division of materials reflecting the progress of tariff debates over time.

Project participants will construct several topic models for pro- and anti-tariff speeches for each bill, and analyze whether, and if so how, arguments in Congress for and against the policy changed over time. Using visualization software, they will present this data for review by historians and other interested parties who are likely to be unfamiliar with topic modeling or other text mining technologies.
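One simple way to compare topic prominence across bills is to average each topic's share of the documents attached to each bill. The Python sketch below assumes that a topic model has already assigned every document a list of topic proportions. The function name, the bill labels and the numbers are invented for illustration.

```python
from collections import defaultdict

def topic_share_by_bill(documents):
    """Average topic proportions per bill.

    documents is an iterable of (bill, proportions) pairs, where proportions
    is a list giving each topic's share of that document.
    """
    sums = {}
    counts = defaultdict(int)
    for bill, proportions in documents:
        if bill not in sums:
            sums[bill] = [0.0] * len(proportions)
        for k, share in enumerate(proportions):
            sums[bill][k] += share
        counts[bill] += 1
    return {bill: [s / counts[bill] for s in totals] for bill, totals in sums.items()}

# Invented proportions for two topics across three documents.
documents = [
    ("Tariff of 1883", [0.5, 0.5]),
    ("Tariff of 1883", [1.0, 0.0]),
    ("McKinley Tariff of 1890", [0.25, 0.75]),
]
shares = topic_share_by_bill(documents)
```

Plotting these averages in the order the bills were debated would give a first picture of whether a given topic gained or lost prominence over the period.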

More specific research questions to be explored may include:

What topics most characterized pro- and anti-tariff arguments in the period 1876-1896?

Did these topics or arguments change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Goldstein describes as the Free Labor appeal? If so, how many? Does their prominence change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Gerring describes as the labor, neo-mercantilist and statist appeals? If so, how many? Does their prominence change over time?

Of the topics produced from a review of anti-tariff texts, do any include references to special privilege? To political corruption? To the undermining of individual responsibility and self-reliance? If so, how many? Does their prominence change over time?

These results will provide an opportunity to explore how postwar members of Congress discussed the prospect of a federal activity directing the course of economic and social change in the United States as it related to a policy that historians and political scientists have identified as among the century’s most significant. Project participants will present data addressing the above questions in a series of conference presentations, publications and/or reports to an audience of historians, political scientists and digital humanities scholars. They will use visualization software to present findings and illustrate interpretive discussion, especially in work directed toward the first two groups, members of which are likely to be unfamiliar with topic modeling or other text mining technologies. 

[1] J. J. Pincus “Tariffs” Encyclopedia of American Economic History (New York: Charles Scribner’s Sons, 1980) 439; “Tariff Policies” Encyclopedia of American Political History (New York: Charles Scribner’s Sons, 1984) 1259. Other works emphasizing the tariff’s importance in nineteenth-century American politics include Charles and Mary Beard The Rise of American Civilization (New York: Macmillan, 1927); H. Wayne Morgan From Hayes to McKinley (Syracuse: Syracuse University Press, 1969); Lewis Gould “The Republican Search for a National Majority” in The Gilded Age: A Reappraisal H. Wayne Morgan, ed., (Syracuse: Syracuse University Press, 1970); Morton Keller Regulating a New Economy: Public Policy and Economic Change in America, 1900-1933 (Cambridge: Harvard University Press, 1990); Richard F. Bensel Yankee Leviathan: The Origins of Central State Authority in America, 1859-1877 (New York: Cambridge University Press, 1990) and The Political Economy of American Industrialization, 1877-1900 (New York: Cambridge University Press, 2000); Judith Goldstein Ideas, Interests and American Trade Policy (Ithaca: Cornell University Press, 1993); Joanne Reitano The Tariff Question: The Great Debate of 1888 (University Park, PA: Penn State University Press, 1994); John Gerring “Party Ideology in America: The National Republican Chapter, 1828-1924” Studies in American Political Development, 11 (Spring, 1997) 44-108; Rebecca Edwards Angels in the Machinery: Gender in American Party Politics from the Civil War to the Progressive Era (New York: Oxford University Press, 1997); Morton Keller “Trade Policy in Historical Perspective” in Taking Stock: American Government in the Twentieth Century, Morton Keller and R. Shep Melnick, eds. (New York: Cambridge University Press, 1999); Charles W. Calhoun “James G. Blaine and the Republican Party Vision” in The Human Tradition in the Gilded Age and Progressive Era, Ballard Campbell, ed., (Wilmington, DE: SR Books, 2000).

[2] Authors emphasizing the Free Labor argument for the tariff include Goldstein Ideas, Interests and American Trade Policy; George B. Mangold, “The Labor Argument in the American Protective Tariff Discussion.” Bulletin of the University of Wisconsin, no. 246 (1906): passim; Frank Taussig, The Tariff History of the United States, 8th ed. (New York, 1931), 65–6; Eric Foner, Free Soil, Free Labor, Free Men: The Ideology of the Republican Party Before the Civil War (New York, 1970), 20–1; Dorothy Ross, Origins of American Social Science (New York, 1990), 47–8; Michael Holt, The Rise and Fall of the American Whig Party (New York, 1999), 69–70, 952 (quotation at 69); Gabor Boritt, Lincoln and the Economics of the American Dream (Memphis, 1978), 99, 113, 139.

[3]  Dorothy Ross states that Free Labor ideology quickly faded from use after the Civil War in Origins of American Social Science, 48.  

[4] Gerring, "Party Ideology in America: The National Republican Chapter"

[5] Keller “Trade Policy in Historical Perspective” 19. 

[6] Matthew Jockers Macroanalysis: Digital Methods and Literary History (Urbana: University of Illinois Press, 2013) 20; Franco Moretti Distant Reading (London: Verso, 2013).

[7] Jockers Macroanalysis, 32.

[8] Jockers Macroanalysis, 30.

Phrenological View of Black Hawk

Phrenological Bust of Black Hawk, 1838 | Lincoln/Net | NIU Digital Library


This page from the American Phrenological Journal purports to discuss personality traits of the Sac and Fox Chief Black Hawk, who led his nation in the Black Hawk War of 1832. Many Americans of the time considered phrenology, which arrived at conclusions based on the measurements of the human head, to be a science. Phrenology’s founder, the German physician Franz Joseph Gall, suggested that individual brain functions took place in specific physical locations within the brain. He mixed this observation with his period’s emphasis on “faculty psychology,” which viewed the mind as a set of separate elements related to discrete personal characteristics. In the text accompanying the above illustration we can see the author naming some of these characteristics as “secretiveness,” “combativeness,” “cautiousness,” and “ideality,” as well as “intellect” and “feeling.”

Gall’s emphasis on individual mental functions’ location in specific parts of the brain remains a proposition not without scientific and medical foundation, but, as the above item shows, his followers often took it to a level of specificity that far exceeded the modest data (skull measurements) to which they had access.

Monday, September 11, 2017

"A Study in Hats": William Jennings Bryan and the Presidential Campaign of 1896

“A Study in Hats: William Jennings Bryan Campaign Event, 1896” | Illinois During the Gilded Age | NIU Digital Library
While his Republican competitor William McKinley conducted a studied “front porch campaign” bringing hand-picked groups of supporters to his Canton, Ohio residence, the 1896 Democratic and Populist presidential nominee William Jennings Bryan set out on a grueling speaking tour. This photograph depicts Bryan (standing at center of platform) at an unknown event on that tour. A powerful orator, Bryan emphasized the inflation of the national currency, principally by the monetization of silver, in his campaign. Many Americans in debt believed that this policy would benefit them because it would make the currency in which they paid what they owed less valuable than the currency they had borrowed. For the same reason, many creditors supported a currency backed by gold, which they believed would retain its value. McKinley defeated Bryan in 1896 by a margin of 271 electoral votes to 176. The Republican found his greatest support in the Northeast and Great Lakes states, while Bryan swept the South and West, with the exception of Oregon and California. Historians have often characterized the election of 1896 as one of the most pivotal in American history. McKinley’s assassination in 1901 made Vice-President Theodore Roosevelt his successor, and the Republican Party did not cede the presidency until Woodrow Wilson’s victory in 1912.