This fall I am working with Pradeep Maddipatla, a graduate student in Computer Science at Northern Illinois University, on a text mining project in my field of historical research: nineteenth-century American economic and social policymaking, specifically the protective tariff. Our project will use topic modeling to explore how American legislators discussed this policy, but we also hope to shed light on the broader question of how they characterized state involvement in the economy and society, in both positive and negative terms.
This work uses a database of text materials drawn from the Congressional Record, 1876-1896, which was organized and made ready for text mining activities by Adam Frieberg, a graduate student in Geography at Northern Illinois University who is also employed full-time as a programmer/developer.
Pradeep Maddipatla is assisted in this work by Professor Hamed Alhoori of Northern Illinois University's Department of Computer Science.
We are working with the following proposal:
“Topic Modeling Tariff Debates in the United States Congress, 1876-1896”
Drew VandeCreek, Northern Illinois University Libraries
Adam Frieberg, Northern Illinois University Department of Geography
Pradeep Maddipatla, Northern Illinois University Department of Computer Science
This project will employ
text-mining technology to explore the arguments that members of the United
States Congress used to support and promote legislation setting tariffs in the
period 1876-1896. Historians and political scientists have identified tariffs,
which set a fee or tax to be paid on imported goods, as a significant political
issue in the nineteenth-century United States.
One has called it “the most important economic policy of the
nineteenth-century federal government” and, save slavery, the most
consequential matter facing the American state in the nineteenth century
overall.
Questions of tariff policy often captured Americans’ ambitions and anxieties about
the nation’s future course of economic and political development. They also
provided an opportunity to discuss the federal government’s proper role
in society.
The United States Congress
considered major tariff bills on many occasions in the nineteenth century, but
the issue took a central place in American political discourse after the Civil
War. The Union’s need for revenue (and Southern legislators’ absence from
Congress) led Lincoln and congressional Republicans to enact high tariffs
during the conflict. Postwar Republicans took an increasingly assertive
protectionist stance, and successfully resisted Democrats’ corresponding
attempts to reduce tariffs. In this context the policy became a virtual litmus
test of party identification. Republicans repulsed reformers’ attempts to cut
tariffs in the mid-1880s, and pushed still higher duties through Congress in
1890 and, after a modest setback in 1894, again in 1897.
Although the tariff played a
prominent role in the late nineteenth century’s electoral politics, scholars
have paid relatively scant attention to protectionists’ and their opponents’
arguments. Of those considering the matter, the political scientist Judith
Goldstein has asserted that postwar tariff proponents relied on what scholars
have called Free Labor appeals, which maintained that tariff-protected
industrial workers’ high wages allowed them to save the money necessary to open
their own businesses, thus achieving social mobility, or what Abraham Lincoln
called the “right to rise.”
A leading intellectual historian has suggested that this argument became
discredited and was abandoned in this period, however.
The political scientist John Gerring has emphasized Republicans’ other appeals
to labor, as well as neo-mercantilism and statism, in defense of the policy,
providing brief lists of words associated with each argument.
Scholars analyzing tariff reformers’ attacks on the policy have mentioned their
description of it as a federal grant of special privilege to manufacturers at
the expense of other members of the national community, especially in the
postwar period’s context of industrial consolidation and increasingly public
political corruption. Some nineteenth-century tariff critics also attacked the
measure as undermining individual responsibility and encouraging workers to
expect something for nothing.
These interpretations of tariff
debates are built on a limited evidentiary base. The Congressional Record’s verbatim account of remarks on the floor of
Congress begins in 1873. It consists of well over two million individual
speeches or other utterances, totaling over 2.5 million sentences. Any scholar
trained in the traditional analysis of political texts (i.e., reading them her
or himself) would be hard-pressed to review, much less consider and evaluate,
this mass of data in the period of time traditionally devoted to a dissertation
or book project. In this light, scholars’ analyses of arguments and debates
over the protective tariff have focused on assorted individual works of tariff
boosters and opponents, including speeches in Congress and works of journalism,
as well as early works of economics and the period’s broader discourse of
social science.
The
Congressional Record is today available as a database of digital full-text
materials, and scholars of literature and humanities computing
programmer/developers have in recent years developed a methodology that can
provide a new perspective on it. Using an approach that has proved useful in
the analysis of a broad range of other very large data sets, they have turned
computing power and algorithms to the examination of digital text collections,
comprising many thousands of titles, that have recently become available from
a number of sources.
Where traditional
practitioners devoted to the close reading of a limited number of selected
texts have focused on specific, particular uses of language and shades of
meaning to produce detailed, highly nuanced accounts and interpretations of the
texts’ arguments, advocates of what Franco Moretti has called “distant reading”
and Matthew Jockers “macroanalysis” seek to discover, visualize and explore
quantifiable evidence of significant patterns within these much larger
collections.
Jockers has emphasized that the analysis of literary work at scale allows
researchers to move their studies beyond a focus on the very few works that
critics and scholars have acclaimed as classic or otherwise outstanding
examples of literary craft to include a larger cross-section of materials, “an
aggregated ecosystem or `economy’ of texts.”
He goes on to conclude that computational work often supports what many
perceive to be common knowledge about literary works, yet grounds that knowledge
in evidence rather than casual observation.
He emphasizes the prospect of using close and distant reading together, exploring
the relationships between specific expressions of belief or creativity and the
larger context in which individual authors situate their arguments or stories.
Intellectual historians have long
turned their attention to the close reading of specific texts, often focusing
especially on individuals and works for which they can demonstrate subsequent
influence. Political historians and political scientists have consistently
studied beliefs and ideologies as important aspects of the history of electoral
activity and governance, with an equal emphasis on tracing their genealogy and
influence. The proposed project will use text-mining technology to build on
these disciplines’ traditional practice in several ways.
The project will build on and use
a set of applications and scripts developed in R by Adam Frieberg, as
follows.
Congressional
Record text materials prepared by ProQuest are stored in a relational
database with an internal index system, built on Microsoft SQL Server Express
with Advanced Services. The R code is
written in modules that have already structured much of the data.
Module: Ingester – R scripts have done pattern matching
using regular expressions to recursively search the directory of files to find
all .xml files in the ProQuest data source that match peer full text PDF
files. From what we can tell, the
ProQuest XML files contain the full text of the PDFs that were generated via
OCR (Optical Character Recognition). The
R code then built an index of the files by date and focused on the entire
Congressional Record from 1876 to 1896.
These two decades were chosen because of the “full text”/verbatim nature
of the printed Congressional Record at the time, as well as their being the
zenith of tariff debates in the late nineteenth century. The R code combed each speech and identified speakers
as well as the content of their speeches.
This identification relied on speeches reliably
starting with the string “Mr. ”.
Candidates for speeches were then filtered to exclude the sections that
began with procedural words (examples: “presented”, “introduced”, “submitted”,
“a bill”, “petition”, “by unanimous”). The
separated speeches were stored in a database table called Speeches1876to1896
and indexed by date, by speaker name, and by the full
text of the speeches. The scripts were also run
as a single-threaded process so that the stored records preserve
their order within the Congressional Record.
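The speaker-identification step described above can be sketched roughly as follows. This is an illustration in Python, not the project's R code. Only the “Mr. ” anchor and the list of procedural opening words come from the description; the regular expression, the function name split_speeches, the assumption that a surname is a single capitalized token, and the sample text are all invented for the example.

```python
import re

# Procedural openers listed in the description; the real filter may be longer.
PROCEDURAL_OPENERS = ("presented", "introduced", "submitted",
                      "a bill", "petition", "by unanimous")

# Assumption: a speech candidate starts at "Mr. " followed by a one-word
# surname, optionally followed by ".", "," or ":".
SPEECH_START = re.compile(r"Mr\. (?P<speaker>[A-Z][A-Za-z'\-]+)[.,:]?\s*")

def split_speeches(page_text):
    """Return (speaker, text) pairs in the order they appear in the text."""
    matches = list(SPEECH_START.finditer(page_text))
    speeches = []
    for i, match in enumerate(matches):
        # A candidate runs until the next "Mr. " or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        body = page_text[match.end():end].strip()
        if not body:
            continue  # nothing was said
        if body.lower().startswith(PROCEDURAL_OPENERS):
            continue  # procedural entry, not a speech
        speeches.append((match.group("speaker"), body))
    return speeches

# Invented sample text, not a quotation from the Congressional Record.
SAMPLE = ("Mr. SMITH. The tariff protects American labor. "
          "Mr. JONES. It is a tax upon the consumer. "
          "Mr. BROWN presented a petition of citizens of Ohio.")

if __name__ == "__main__":
    for speaker, text in split_speeches(SAMPLE):
        print(speaker, "|", text)
```

Because every “Mr. ” opens a new candidate in this sketch, a reference to another member inside a speech would also trigger a split. The description above does not say how the R code treats that case.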
Module: Sentiment Analyzer – R scripts produced a more
granular resolution that separated every speech by sentence. The sentences were split by using the
standard period (“.”) character. The
sentences were quality controlled by filtering out abbreviations and other
places with OCR errors. The exclusionary
rules included filtering out any sentences that began with numeric characters
(for example, the fragment beginning “234” that splitting on periods leaves behind from “H.R. 234,” the typical designation for a House bill). It also excluded sentences beginning with the
standard Congressional Record headers (“CONGRESSIONAL RECORD – SENATE” and
“Also, a bill”). Sentences were then
filtered to keep only those longer than 10 characters. This subjective threshold was meant to
retain sentences such as “Mr. COGHLAN: I concur” but exclude shorter
utterances such as “Mr. Smith: Aye”. The
R script then used an external 3rd-party API (Microsoft’s Cognitive Services
API) to generate sentiment analysis scores for every sentence surviving those
filters in the 20 years of the Congressional Record. Those sentences are stored in the
SpeechFragments20Yr database table and the sentiment analysis scores are stored
in the SpeechFragments20YrSentimentAnalysis table.
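The splitting and exclusion rules in this module can be illustrated with a short Python sketch. It is not the project's R code. The function name is hypothetical, the handling of abbreviations and OCR errors is left out because the description does not specify it, and the call to the sentiment API is omitted entirely. One further assumption: length is measured after trimming whitespace from each period-delimited fragment, which places the two examples in the text on opposite sides of the 10-character threshold.

```python
# Header prefixes taken from the examples in the description.
HEADER_PREFIXES = ("CONGRESSIONAL RECORD", "Also, a bill")
MIN_LENGTH = 10  # keep only fragments strictly longer than this

def split_sentences(speech_text):
    """Split on '.', trim each fragment, and apply the exclusion rules."""
    kept = []
    for fragment in speech_text.split("."):
        sentence = fragment.strip()
        if len(sentence) <= MIN_LENGTH:
            continue  # too short, e.g. "Mr" or "Smith: Aye"
        if sentence[0].isdigit():
            continue  # begins with a number, e.g. what is left of "H.R. 234 ..."
        if sentence.startswith(HEADER_PREFIXES):
            continue  # running header or procedural formula
        kept.append(sentence)
    return kept

# Invented sample text that combines the examples mentioned in the description.
SAMPLE = ("Mr. COGHLAN: I concur. Mr. Smith: Aye. H.R. 234 was read twice. "
          "CONGRESSIONAL RECORD - SENATE. The duty on wool should be reduced.")

if __name__ == "__main__":
    for sentence in split_sentences(SAMPLE):
        print(sentence)
```

Splitting on every period also breaks abbreviations into fragments, which is presumably why the module adds its quality-control filters. This sketch only catches the fragments that happen to be short or to begin with a digit.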
Module: Index Database Views – The combination of the three
prior-mentioned database tables yields a corpus of text that is indexed by
speaker, time, and sentiment. Many of
the over two million individual speeches reflected in the speech indexes are
clearly portions of back-and-forth utterances. This module provides a way to
diagnose these speeches. The views link individual fragments of speech with
parent speech objects that are then identifiable by speaker. Records have ID
fields to keep them in the sequence they appeared within the print version of
the Congressional Record, moving
forward in time.
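A view of this kind might look something like the following sketch. It uses Python's built-in SQLite module in place of the project's Microsoft SQL Server database so that it can run anywhere. The three table names come from the description above; every column name, the view name FragmentIndex, the sentiment score scale, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the project's SQL Server database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Speeches1876to1896 (
    SpeechID   INTEGER PRIMARY KEY,  -- assumed to follow print order
    SpeechDate TEXT,
    Speaker    TEXT,
    FullText   TEXT
);
CREATE TABLE SpeechFragments20Yr (
    FragmentID INTEGER PRIMARY KEY,  -- assumed to follow print order
    SpeechID   INTEGER REFERENCES Speeches1876to1896(SpeechID),
    Sentence   TEXT
);
CREATE TABLE SpeechFragments20YrSentimentAnalysis (
    FragmentID INTEGER REFERENCES SpeechFragments20Yr(FragmentID),
    Score      REAL                  -- scale is an assumption
);
-- Link each sentence to its parent speech, speaker, date, and score.
CREATE VIEW FragmentIndex AS
SELECT f.FragmentID, s.SpeechID, s.SpeechDate, s.Speaker, f.Sentence, a.Score
FROM SpeechFragments20Yr AS f
JOIN Speeches1876to1896 AS s ON s.SpeechID = f.SpeechID
LEFT JOIN SpeechFragments20YrSentimentAnalysis AS a
       ON a.FragmentID = f.FragmentID;
""")

# Invented rows showing a back-and-forth exchange between two speakers.
conn.executemany("INSERT INTO Speeches1876to1896 VALUES (?, ?, ?, ?)", [
    (1, "1888-05-01", "SMITH", "The duty on wool should stand."),
    (2, "1888-05-01", "JONES", "I object to the duty on wool."),
    (3, "1888-05-01", "SMITH", "Then I will answer the gentleman."),
])
conn.executemany("INSERT INTO SpeechFragments20Yr VALUES (?, ?, ?)", [
    (1, 1, "The duty on wool should stand"),
    (2, 2, "I object to the duty on wool"),
    (3, 3, "Then I will answer the gentleman"),
])
conn.executemany("INSERT INTO SpeechFragments20YrSentimentAnalysis VALUES (?, ?)",
                 [(1, 0.8), (2, 0.2)])

def exchange_in_print_order(connection):
    """Return (speaker, sentence, score) rows in their original sequence."""
    query = ("SELECT Speaker, Sentence, Score FROM FragmentIndex "
             "ORDER BY FragmentID")
    return connection.execute(query).fetchall()

if __name__ == "__main__":
    for row in exchange_in_print_order(conn):
        print(row)
```

Ordering by the fragment ID reproduces the exchange as printed, and the left join keeps sentences that have no sentiment score yet.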
Module: Topic Modeler – Pradeep will investigate modern
topic modeling approaches, including Mallet and Gensim. He will consult with
Dr. VandeCreek and provide sample output. Together, they will select the
approach to be used in the final analysis. The goals of this topic modeling are to
1) inform Dr. VandeCreek’s navigation of the full corpus in further research;
2) identify prominent topics as they may correspond to existing historical and
Political Science scholarship’s description of pro- and anti-tariff arguments
in this period; 3) determine if the prominence of specific topics changes over
time; 4) use visualization applications to illustrate these changes for an
audience unfamiliar with data science.
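Goal 3 can be illustrated with a small Python sketch. It assumes that a single topic model (from Mallet, Gensim, or another tool) has already assigned each document a weight for every topic, and it simply averages those weights by year. The function name, the topic labels, and all of the numbers are invented for the example and do not come from the project's data.

```python
from collections import defaultdict

def mean_topic_weights(documents):
    """documents: iterable of (group, {topic: weight}) pairs.

    Returns {group: {topic: mean weight}}, where a group might be a year,
    a Congress, or a bill.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for group, weights in documents:
        counts[group] += 1
        for topic, weight in weights.items():
            sums[group][topic] += weight
    return {group: {topic: total / counts[group]
                    for topic, total in topics.items()}
            for group, topics in sums.items()}

# Invented weights: each document's topic weights sum to 1.
DOCUMENTS = [
    (1883, {"wages": 0.6, "revenue": 0.4}),
    (1883, {"wages": 0.2, "revenue": 0.8}),
    (1890, {"wages": 0.7, "revenue": 0.3}),
]

if __name__ == "__main__":
    for year, topics in sorted(mean_topic_weights(DOCUMENTS).items()):
        print(year, topics)
```

A table of this shape, one row per period and one column per topic, is also a convenient input for the line or area charts that goal 4 calls for.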
Using
the above techniques, the project will first address the challenge of identifying
which of the available congressional text materials discussed tariff
legislation, and whether each supported or opposed a tariff bill, by using
basic word search functionality, text classification, sentiment analysis, and a
freely available API providing information about members of Congress and their
voting histories. A machine-generated review of the
Congressional Record for the period under consideration has identified
a specific set of speeches, inserted documents and other utterances including
the word “tariff” and/or several synonymous or related terms, including
“duty/duties,” “impost(s),” “levy,” and “excise,” as well as the words
“protection” and “protective,” which scholarship in History and Political
Science shows were widely used to describe the policy.
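The word search can be sketched in Python as a single regular expression built from the terms listed above. The plural forms “tariffs,” “levies,” and “excises” are additions assumed for the example, as is the function name; the project's actual search may differ.

```python
import re

# Terms from the list above, plus assumed plural forms.
TARIFF_TERMS = re.compile(
    r"\b(tariffs?|dut(?:y|ies)|imposts?|lev(?:y|ies)|excises?"
    r"|protection|protective)\b",
    re.IGNORECASE,
)

def tariff_terms_in(text):
    """Return the distinct tariff-related terms found in the text, sorted."""
    return sorted({match.group(0).lower()
                   for match in TARIFF_TERMS.finditer(text)})

if __name__ == "__main__":
    # Invented sentence, for illustration only.
    print(tariff_terms_in("The duties on wool are a protective measure."))
```

A match only flags a candidate. Words such as “duty” and “protection” have other senses, so a search like this will also return passages that have nothing to do with the tariff.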
Project participants will next move to create
two sets of documents: those supporting the tariff and those opposing it. They
will use two approaches. In the first, Dr. VandeCreek will assemble training sets of speeches and
other documents known to express pro- and anti-tariff arguments, and then ask
text mining software (yet to be selected) to compare the words and patterns of words in
each to those found in a set of unclassified works. This will produce a result
in which the software predicts the likelihood that each unclassified document
argues for or against the tariff. In the second, Microsoft
Azure’s sentiment analysis application will measure the degree to which
speeches discussing the tariff express positive or negative sentiment, with the
working hypothesis that pro-tariff speeches will express more positive sentiment
and anti-tariff speeches more negative sentiment.
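The classification approach can be illustrated with a small Python sketch. Since the proposal does not name the software to be used, a multinomial Naive Bayes classifier with add-one smoothing stands in here purely as an assumption; it is one common way to estimate the likelihood that a document belongs to each training set. The function names and the training sentences are invented and are not quotations from the Congressional Record.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """labeled_docs: list of (label, text) pairs from hand-built training sets."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary

def predict_proba(model, text):
    """Return {label: probability} for one unclassified document."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocabulary:             # skip words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(unnormalized.values())
    return {k: v / norm for k, v in unnormalized.items()}

# Invented training sentences, for illustration only.
TRAINING = [
    ("pro", "protection secures high wages for american labor"),
    ("anti", "the tariff is a tax that grants special privilege to monopoly"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(predict_proba(model, "High wages for labor"))
```

With realistic training sets the same structure applies, although a production tool would add steps such as stop-word removal and held-out evaluation.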
Project participants will check these results
against each other and make use of the ProPublica Congress API (https://projects.propublica.org/api-docs/congress-api/)
to ascertain how the member of Congress responsible for a given speech,
utterance or other text voted on the legislation that it addressed. Dr.
VandeCreek will also make close readings of a number of randomly selected texts
in the sets produced by the above means in order to determine if they have produced
sufficiently accurate collections of pro- and anti-tariff text.
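The cross-check against voting records, and the random draw for close reading, might be sketched as follows in Python. The sketch makes no API calls and assumes the votes have already been retrieved. The record layout, the function names, and the sample records are hypothetical. The direction of each bill follows the way this proposal characterizes the Mills Bill of 1888 and the McKinley Tariff of 1890.

```python
import random

# True if the bill raised tariffs, False if it lowered them.
BILL_RAISES_TARIFFS = {
    "Mills Bill of 1888": False,
    "McKinley Tariff of 1890": True,
}

def stance_from_vote(bill, vote):
    """Translate a 'yea' or 'nay' vote into a 'pro' or 'anti' tariff stance."""
    raises = BILL_RAISES_TARIFFS[bill]
    return "pro" if (vote == "yea") == raises else "anti"

def agreement_rate(records):
    """Share of records whose predicted stance matches the speaker's vote."""
    hits = sum(stance_from_vote(r["bill"], r["vote"]) == r["predicted"]
               for r in records)
    return hits / len(records)

def sample_for_close_reading(records, k, seed=0):
    """Draw a reproducible random sample of records for manual review."""
    return random.Random(seed).sample(records, k)

# Invented records: 'predicted' is the label a classifier assigned to a speech.
RECORDS = [
    {"id": 1, "bill": "McKinley Tariff of 1890", "vote": "yea", "predicted": "pro"},
    {"id": 2, "bill": "McKinley Tariff of 1890", "vote": "nay", "predicted": "anti"},
    {"id": 3, "bill": "Mills Bill of 1888", "vote": "yea", "predicted": "anti"},
    {"id": 4, "bill": "Mills Bill of 1888", "vote": "nay", "predicted": "anti"},
]

if __name__ == "__main__":
    print(agreement_rate(RECORDS))
    print([r["id"] for r in sample_for_close_reading(RECORDS, 2)])
```

The same yea vote signals opposite stances on different bills, so the vote has to be interpreted bill by bill before it can be compared with a predicted label.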
Having
produced a set of pro- and anti-tariff documents, project staff members will
next use the topic modeling software Mallet (http://mallet.cs.umass.edu/topics.php)
and/or Gensim (https://radimrehurek.com/gensim/)
to examine the sets of words that tariff proponents and opponents used to
praise or condemn the policy in the period 1876-1896. Project staff members
will identify individual pieces of tariff legislation that came to the floor of
Congress for debate in this period, and separate those texts identified as
discussing the tariff into sub-sets of materials specifically pertaining to
each bill (for example, the Tariff of 1883, also known as the Mongrel Tariff
due to its tepid reforms; the Mills Bill of 1888, which unsuccessfully proposed
lower tariffs; and the McKinley Tariff of 1890, which produced dramatically
increased tariffs). This will produce a division of materials reflecting the
progress of tariff debates over time.
Project participants will construct
several topic models for pro- and anti-tariff speeches for each bill, and
analyze whether and, if so, how the arguments that members of Congress made for and
against the policy changed over time. Using visualization software, they will
present this data for review by historians and other interested parties who are
likely to be unfamiliar with topic modeling or other text mining technologies.
More specific research questions to be explored may include:
What topics most characterized pro- and anti-tariff arguments in the period 1876-1896?
Did these topics or arguments change over time?
Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Goldstein describes as the Free Labor appeal? If so, how many? Does their prominence change over time?
Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Gerring describes as the labor, neo-mercantilist and statist appeals? If so, how many? Does their prominence change over time?
Of the topics produced from a review of anti-tariff texts, do any include references to special privilege? To political corruption? To the undermining of individual responsibility and self-reliance? If so, how many? Does their prominence change over time?
These results will provide an
opportunity to explore how postwar members of Congress discussed the prospect
of a federal activity directing the course of economic and social change in the
United States as it related to a policy that historians and political
scientists have identified as among the century’s most significant. Project
participants will present data addressing the above questions in a series of conference
presentations, publications and/or reports to an audience of historians,
political scientists and digital humanities scholars. They will use visualization
software to present findings and illustrate interpretive discussion, especially
in work directed toward the first two groups, members of which are likely to be
unfamiliar with topic modeling or other text mining technologies.