Friday, April 9, 2021

ITHAKA Constellate: Text-Mining Product in Development

I have been invited to evaluate a beta version of ITHAKA's text-mining product, which is tentatively titled Constellate. I'm thankful for the opportunity.

I have some knowledge of other text-mining products made available by library materials vendors like ProQuest and Gale. In my experience they work well, but they only offer the use of text materials found in those portions of individual vendors' available collections to which your particular institution has a subscription. If you want access to more materials for your data set, your institution needs to subscribe to more collections. 

This type of product in general would be very helpful in teaching text data analysis at scale to non-programmers. I believe that humanities students can benefit from activities helping them to learn how to formulate hypotheses and evaluate evidence found in very large data sets. As individuals already receiving training in the critical evaluation of materials, they could make a valuable contribution to data-driven organizational activities in a number of fields. Put another way, employers of course need programmers able to build and adjust text-mining applications or sets of applications. But they also need critical thinkers to evaluate the results.

Access to a relatively limited number of text data sets is not a problem for this type of experiential learning, but it does present a large obstacle to original scholarly research. A paper making an argument based on the analysis of a data set that only contains those nineteenth-century text materials appearing in a ProQuest or Gale data set will very likely overlook a large part of the available historical record. Researchers need to be able to upload their own data sets into online text-mining services.  

It is also my impression that the code and algorithms that do the data analysis for vendor-served text-mining projects remain proprietary, which means that researchers and collaborating programmers would be unable to download the code and customize it for their own use. Since in my experience effective text mining often requires a great deal of adjustment and customization, this presents another problem.

Sales representatives for the above companies have made general statements about how their programmers were and are working on a function that would allow subscribers to upload their own data, but to my knowledge that has not happened. If any representatives of ProQuest, Gale, or other library vendors making similar products available have information to the contrary, please contact me and I will be happy to evaluate your product.

I am very interested in Constellate because the ITHAKA representative with whom I spoke emphasized that their organization plans to present the service as A) able to analyze outside data sets, and B) willing to allow outside programmers to access its Python code for the purpose of customization. They hope to build a collection or set of open-source code applications that various Constellate users have constructed. 

This would be a very promising situation for researchers, teachers and learners situated at R2 and smaller institutions lacking large financial resources.

I will spend the next few months working with Constellate and report on what I discover.

"Some Assembly Required: Low-Cost Digitization of Materials from Magnetic Tape Formats for Preservation and Access"

 Earlier this year three colleagues and I published an article discussing the digitization of sound materials from magnetic tape formats. 

Please find the abstract and a link to the journal below. It is my understanding that the individual article will be embargoed until March, 2022, so the link to the individual article itself probably will not work until then.

"Some Assembly Required: Low-Cost Digitization of Materials from Magnetic Tape Formats for Preservation and Access"

Preservation, Digital Technology, and Culture 49 (3) October, 2020, 89-98


Recent work discussing the digitization and preservation of magnetic tape materials has maintained that it should be left to expert practitioners and that the resulting digital materials should be stored in digital repositories. This article suggests that librarians and archivists lacking extensive technical skills or access to expertise can digitize these materials themselves. It provides a detailed account, including challenges faced, of how a team of practitioners without prior training or experience digitized historical audio recordings on cassette and open reel tape at Northern Illinois University Libraries. The discussion reviews the assembly of equipment and software that the team used for digitization work, discussing each element’s significance and how they came together as a functioning workflow. The authors also emphasize the fact that while the digitization of fragile and/or degraded magnetic tape materials may contribute to the preservation of their contents, this action also creates a new set of materials with their own preservation needs. Realizing that many practitioners serving medium-sized and smaller institutions lacking large financial resources may not have access to a full-fledged digital repository, they suggest the use of the National Digital Stewardship Alliance’s Levels of Digital Preservation rubric as a means by which practitioners may incrementally increase the probability that digital materials made from magnetic tapes will remain accessible.

Friday, January 10, 2020

Where Are They Now?: Curation and Preservation of Early Online Digital Humanities Materials

I am currently doing research, in collaboration with my colleague Jaime Schumacher, on the present status of sixty-five online digital humanities projects funded by the National Endowment for the Humanities' Division of Education Programs Development and Demonstration competition in the period 1993-2005. This was a major source of early funding for projects in this field, providing support to the University of Virginia's Valley of the Shadow Project, the Perseus Digital Library at Tufts University, the Women and Social Movements Project at SUNY Binghamton, and many others.

I should note that I did not receive any funding from this program during this period, nor did any of my colleagues at Northern Illinois University.

Having experience in the creation of grant-funded, online digital humanities materials during this period as well as in the investigation of digital preservation issues in libraries and archives, I became aware that many of these online resources were likely at risk of loss. I know that we struggled mightily to devise a way to keep our online projects (Lincoln/Net, Mark Twain's Mississippi, Southeast Asia Digital Library) functioning and available, so it stood to reason that other practitioners and institutions in similar situations would do so as well.

Based on our own experience, I identified major threats to the preservation and online presentation of these projects as

1) the lack of long-term funding inherent in grant-funded work. Unlike discrete research projects commonly funded and performed in colleges, universities and cultural heritage institutions, which typically produce results and publish findings as part of a mutually agreed upon timeline, these projects proposed to make materials available to the public for an indefinite period of time. Who was to pay for their support after the grant period?

2) the demands of online presentation in light of technical infrastructure's limited lifespan in a rapidly changing technical environment. It became clear rather early in this period that computers used to serve websites must be replaced every three or four years in order to provide acceptable levels of online availability. Also, software companies continued to push new versions of software (and eventually stopped supporting old versions), and produced new products very quickly, leading to rapid obsolescence of software arrangements.

3) the widespread assumption among practitioners (including myself) that digital materials were in fact more durable, and hence less subject to loss than analog materials. This turned out to be false, as research has often shown that for reasons including those listed above, digital materials are very likely to be lost in the absence of detailed preservation policies and sustained attention to their curation. Thus the support of digital projects included much more than paying for their ongoing online availability. It also included organizing, maintaining, and securing the archive of digitized or born-digital materials that the project had created. The fact that practitioners responsible for the creation of digital projects often did not realize this considerable risk of loss exposed their materials to additional risk.  

Our research will show how many of these online resources are still available online today; what technical platform (hardware and software) they employ, as well as the institutional arrangements behind this platform (i.e. is the project still presented online by the institution that received the original grant? if so, what part of the institution has assumed responsibility for it? and, is this the unit of the institution which originally received the grant?); and how many and which of these institutions have discussed and/or implemented a binding plan for their continued preservation and online presentation.

In addition, we hope to speak to representatives of the National Endowment for the Humanities to determine what their original expectations for projects' preservation and ongoing availability might have been, and if these expectations evolved or changed over the time period in question.

In the end, we hope to determine which factors have affected online availability of early digital humanities projects. At this point we believe that the dynamics mentioned above will very likely appear in this list of factors, but only the research itself will tell.

I'm sure that other research questions will occur to us as we review the data we have collected. We may or may not be able to integrate them into this discrete study.

We hope to publish the results of the study in approximately eighteen to twenty-four months, perhaps addressing the Library Science and Digital Humanities communities in separate articles that discuss our data and findings from their respective viewpoints.

Wednesday, February 21, 2018

Another Text-Mining Project: CIA Materials

This semester I am working with a team of four Northern Illinois University student interns to explore a large collection of text materials brought to us by Dr. Eric Jones of our university's Center for Southeast Asian Studies. The materials consist of the Central Intelligence Agency's President's Daily Briefings for the period 1961-1977, or roughly the period of the United States' military engagement in Vietnam, including the several years leading up to and following the war itself. These materials have been declassified and are available on the CIA's online reading room.

Two Northern Illinois University graduate students have expressed interest in working with President's Daily Briefing materials from this period in their dissertation research, but are unable to devote the time necessary to read this very large collection of documents without some knowledge of its contents.

To date, the student team has used a script to download the 5,292 individual daily briefings, and has used Optical Character Recognition to convert the documents, available in PDF format, into machine-readable text.
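As a rough illustration of the download step only, a bulk-download script might look like the Python sketch below. This is not the student team's actual script: the function names, the injectable `fetch` parameter, and the idea of naming each file after the last segment of its URL are all assumptions made for the example, and no real reading-room URLs are shown. The OCR step is not covered here.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one document over HTTP and return its raw bytes."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_bytes):
    """Save each URL into out_dir, skipping files that already exist.

    `fetch` is injectable (an assumption of this sketch) so the loop can
    be exercised without network access.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming rule: use the last path segment of the URL.
        target = out_dir / url.rsplit("/", 1)[-1]
        if not target.exists():  # lets an interrupted run resume
            target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Skipping files that already exist matters for a collection of several thousand documents, since a dropped connection partway through should not force the whole download to start over.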

Text mining technology will allow the student interns to provide Dr. Jones and his students with an overview of the materials, including topics (combinations of words that frequently appear together) therein.
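To give a sense of what "words that frequently appear together" means in practice, the Python sketch below counts how many documents each pair of words shares. It is a deliberately simplified stand-in for topic modeling, not the method the team will use; the stopword list and the sample sentences in the usage note are invented for illustration.

```python
from collections import Counter
from itertools import combinations
import re

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"the", "of", "and", "in", "to", "a", "on", "is", "are"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cooccurrence(documents):
    """Count, for every pair of words, how many documents contain both."""
    pairs = Counter()
    for doc in documents:
        vocab = sorted(set(tokenize(doc)))  # each word counted once per document
        pairs.update(combinations(vocab, 2))
    return pairs
```

Given the made-up documents "Troops moved near Saigon" and "Saigon reports troops", the pair ("saigon", "troops") receives a count of 2. A topic model goes further by inferring groups of words statistically rather than counting pairs, but the underlying intuition is similar.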

As the request for this information has come from students of twentieth-century Southeast Asian history and politics, we will especially focus on topics including the names of Southeast Asian nations, cities, geographical features, and public figures.

We will also conduct sentiment analysis of the Daily Briefings (scoring their positive or negative character) and attempt to group them into sets (or clusters) of like documents, based on the words contained therein.
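The two ideas in this paragraph can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than the project's method: the positive and negative word lists are invented, the score is a simple lexicon count, and the grouping uses Jaccard word overlap with a greedy pass instead of a full clustering algorithm.

```python
import re

# Hypothetical mini-lexicons; a real study would use an established lexicon.
POSITIVE = {"progress", "stable", "agreement"}
NEGATIVE = {"attack", "crisis", "threat"}


def words(text):
    """Return the lowercase alphabetic words in a text."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text):
    """(positive hits - negative hits) divided by the number of words."""
    tokens = words(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)


def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    sa, sb = set(words(a)), set(words(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def group_similar(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose first
    member it resembles; otherwise it starts a new group."""
    groups = []
    for i, doc in enumerate(docs):
        for group in groups:
            if jaccard(docs[group[0]], doc) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

With the invented inputs "rice exports rise", "rice exports fall", and "border attack reported", the first two documents land in one group and the third in its own. Lexicon scores of this kind are crude, since they ignore negation and context, which is one reason the results need critical review by subject specialists.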

Upon its completion, this work will provide NIU researchers with a new data set heretofore unavailable to them: the machine readable text of the President’s Daily Briefings for the period under consideration. The University Libraries may choose to make this data set available for future research via its digital repository. 

The work will also provide NIU researchers with a detailed report of 
A) topics appearing in the collection, showing how individual topics may become more or less prominent within the larger collection at different periods in time; 

B) how the reports expressed positive or negative sentiments regarding national security concerns; and

C) how the individual briefings relate to each other in terms of words used in common.

The project team will also share the machine-readable text data set and report of findings with other researchers by submitting it to an open-access Digital Humanities publication and/or data repository for the humanities and/or social sciences.

Tuesday, November 28, 2017

"Bleeding Kansas"

"Kansas" - from The United States Illustrated, 1855 | Mark Twain’s Mississippi |NIU Digital Library

This 1855 illustration depicts an idyllic scene in Kansas, most likely along the Kansas River. In this period Kansas was anything but idyllic, however. Kansas became a territory with the implementation of the Kansas-Nebraska Act on May 30, 1854. The act, which left occupants of the Kansas and Nebraska territories seeking statehood to decide if human slavery would exist in their jurisdiction, set aside the Missouri Compromise of 1820, by which Congress had sought to maintain a rough balance between slave and free states in the Union by restricting the former to territory south of the 36°30′ parallel, excluding Missouri itself. Pro- and anti-slavery settlers poured into Kansas, many with the explicit goal of establishing their preferred policy there. Political controversy ensued.

Pro-slavery settlers dominated the initial territorial legislature elected on March 30, 1855. This body would determine if Kansas would enter the Union as a slave or free state. Opponents of slavery around the Union argued that widespread voter fraud made the election’s results illegitimate, and the territorial governor invalidated results in several districts. New elections gave anti-slavery settlers greater representation, but they remained in a decided minority.

The United States Congress sent a special committee to Kansas, which concluded that the territorial legislature was an illegally constituted body without authority.  The territorial legislature convened in spite of the finding, rejected the credentials of those who had won the new elections, and passed laws paving the way for Kansas to enter the Union as a slave state. Anti-slavery Kansans rejected this government and formed their own, which in January, 1856, President Franklin Pierce declared illegal. Violence broke out between pro- and anti-slavery settlers, resulting in the shooting death of a free stater near Lawrence in December of 1855.

Political controversy produced physical violence. On May 21, 1856, pro-slavery forces stormed Lawrence, destroying a hotel and two newspaper offices, and sacking homes and businesses. On May 19 and 20, Republican Senator Charles Sumner of Massachusetts had delivered a speech on the Senate floor depicting pro-slavery views and actions as akin to the rape of a virgin. Sumner’s speech especially singled out the South Carolina Senator Andrew Butler for criticism. On May 22 Butler’s cousin, the South Carolina Congressman Preston Brooks, attacked Sumner on the Senate floor with a cane, inflicting grave injuries.

In Kansas, the anti-slavery activist John Brown led his sons and other followers to murder five pro-slavery settlers at Pottawatomie Creek on May 24, 1856. On the Fourth of July President Pierce sent U.S. Army troops to remove the Free State Legislature at Topeka. In August pro-slavery forces burned the Free State town of  Osawatomie, Kansas after driving off defenders led by Brown. The last major outbreak of violence occurred in the Marais des Cygnes massacre of 1858, in which pro-slavery forces killed five Free State men. In all, approximately 56 people died in “Bleeding Kansas” in the years before the Civil War began.

Tuesday, November 7, 2017

Text Mining at an Institution with Lesser Financial Resources, Revisited

I am presently moving forward with a research program in text-mining at Northern Illinois University Libraries, but have encountered an unexpected obstacle.

About a year ago I ordered a copy of ProQuest's American Periodicals data set for local use. Our library subscribes to ProQuest's hosted version of this product, but the product's design/technical infrastructure does not allow text-mining activities and our license for its use prohibits the downloading of anything but the most insignificant amount of materials. When I contacted ProQuest about the matter, they informed me that I would need to pay an additional $1000 for the preparation and delivery of the entire data set (approximately five terabytes) to me. I could then use the data on my local infrastructure.

For the past two years I have worked with members of my university's Computer Science Department, principally providing graduate students in Data Science with access to relatively large humanities text data sets that I have created myself and questions that they may use to inform text mining activities. Prior to my purchase of the American Periodicals data set, I secured an agreement with that department whereby they would host the materials on their high-capacity computing cluster and make them available for ongoing Data Science research.  I would take delivery of the materials from ProQuest, then transfer them to the cluster for processing and future use.

I still do not have the data. The first six months or so of delays were the result of my Library's mistake in attempting to charge the expense for the materials to the wrong account. Once we resolved that, I struggled to get ProQuest to review my university legal department's proposed (unremarkable) revisions to the contract for several months. Upon resolving that, I was able to forward payment to ProQuest in August, and looked forward to the delivery of the materials.

At this point I learned that ProQuest expected to deliver the full data set to a server of my choosing via the Internet. Since my library does not have 5 TB of extra capacity readily available, I asked for the data to be delivered on a hard drive or hard drives. ProQuest agreed.

A month passed, and I heard nothing from ProQuest. My contact with the company asked me to bear with him as he had staff members absent from the office while on holiday. Another month passed, and after another inquiry I learned that the company reserves the right to deliver the materials on hard drive any time within a period of six months after payment. I see no mention of this reservation in my contract with the company.

It seems likely to me that ProQuest is accustomed to working with institutions large enough, and possessed of enough material resources, to take delivery of such a large data set in this manner quite easily. My institution does not fit that description. After a period of two years without any state support, we recently began to receive payments from the State of Illinois again. Needless to say, our digital infrastructure is far from robust.

If I had known that the delivery of this data by hard drive would prove to be such a difficult matter, I would have made the necessary arrangements with my university's Department of Computer Science to have the data delivered directly to their cluster via the Internet. As this is an inter-divisional matter within the university, it will take some time. I initially intended to take delivery of the American Periodicals materials as quickly as possible, leaving time to work out these arrangements.

But, alas, ProQuest's representatives raised no caveats about hard-drive delivery until I actually started to inquire about the whereabouts of the materials my university had purchased.

Thus my warning: if you are attempting to do text mining research at an institution that doesn't have five terabytes of storage immediately at hand, and want to work with ProQuest data, be aware that they will take up to six months to deliver your data. 

Friday, October 27, 2017

"The City of Cairo Schottish"

This is the cover page of a piece of mid-nineteenth century sheet music entitled “The City of Cairo Schottish.” The City of Cairo was a steamboat, depicted in the illustration at the foot of the page. A schottish is a form of music popular in the nineteenth century, which musicologists identify as a country dance originating in central Europe.

In a time before phonographs or other forms of recorded music became widely available to the public, Americans typically experienced music by live performance and/or participation. Sheet music like this was widely distributed and allowed individual musicians to keep up with the latest musical trends. According to the Oxford Companion to Music, the schottish (or schottische) became popular in England in 1878 with the publication of Tom Turner’s “Dancing in the Barn Schottisch,” and Americans tended to favor a variety of the form identified as a “military schottische.”

Mark Twain mentions this form of music and dance in one of his letters to his brother Orion, and indicates the extent to which Americans of the mid-nineteenth century made use of sheet music, performed music in their own homes, and often danced to it.

“Ma was delighted with her trip, but she was disgusted with the girls for allowing me to embrace and kiss them–and she was horrified at the Schottische as performed by Miss Castle and me. She was perfectly willing for me to dance until 12 o'clock at the imminent peril of my going to sleep on the after watch–but then she would top off with a very inconsistent sermon on dancing in general; ending with a terrific broadside aimed at the heresy of heresies, the Schottische.”

       - Letter to Orion Clemens, 18 March 1861    

Thursday, October 26, 2017

Map - Early Settlement of Illinois

 Map of Illinois, 1818 | Lincoln/Net | NIU Digital Library

This map, provided by the Chicago History Museum, depicts Illinois at the time that it became the twenty-first state. In 1809 the area that became the state of Illinois was organized as the Illinois territory, with its capital at Kaskaskia. That city is visible on this map on Illinois’ southwestern border, across the Mississippi River from St. Genevieve, Missouri. Kaskaskia remained the capital of Illinois for a year, until the government removed to Vandalia, some 120 miles to the northeast. Vandalia was a very small town, not even represented on the above map, but Kaskaskia had proved unsuitable as a seat of government due to the Mississippi’s persistent threat of flooding. Vandalia also promised a more central location for a state eager to grow toward the north and east. As the map shows, much of what is now the most heavily-populated part of the state of Illinois had not even been divided into quarter sections, much less counties, at the time of statehood. 

The Illinois country was not settled by parties moving across the land from east to west. In a time of very few roads, this would have been an extremely difficult task. Instead, immigrants came to Illinois by way of the Ohio and Mississippi Rivers, moving largely from south to north. The parts of this map depicted as settled, organized territory were, and are, largely inhabited by people who came to Illinois from Virginia, Kentucky and Tennessee. The wedge of land making up Illinois’ westernmost parts was also settled by way of river travel, but it was unique in that it had been set aside by Congress for settlement by veterans of the War of 1812. Note that it is identified on the map as “Military Bounty Land.”

Settlers did not come to northern, central and eastern Illinois in large numbers until the completion of the Erie Canal in 1825 made that land more accessible by way of the Great Lakes and Chicago. Today those portions of the state retain a significant population descended from immigrants who came to the state from New England and the middle states, like New York and Pennsylvania.

Thursday, September 14, 2017

Text-Mining Project

This fall I am working with Pradeep Maddipatla, a graduate student in Computer Science at Northern Illinois University, on a text mining project involving my field of historical research - nineteenth century American economic and social policymaking, namely the protective tariff. Our project will use topic modeling to explore how American legislators discussed this policy, but we also hope to shed light on the broader question of how they characterized state involvement in the economy and society, in positive and negative terms.

This work uses a database of text materials drawn from the Congressional Record, 1876-1896, which was organized and made ready for text mining activities by Adam Frieberg, a graduate student in Geography at Northern Illinois University who is also employed full-time as a programmer/developer.

Pradeep Maddipatla is assisted in this work by Professor Hamed Alhoori of Northern Illinois University's Department of Computer Science.

We are working with the following proposal:

“Topic Modeling Tariff Debates in the United States Congress, 1876-1896”

Drew VandeCreek, Northern Illinois University Libraries
Adam Frieberg, Northern Illinois University Department of Geography
Pradeep Maddipatla, Northern Illinois University Department of Computer Science

This project will employ text-mining technology to explore the arguments that members of the United States Congress used to support and promote legislation setting tariffs in the period 1876-1896. Historians and political scientists have identified tariffs, which set a fee or tax to be paid on imported goods, as a significant political issue in the nineteenth-century United States. One has called it “the most important economic policy of the nineteenth-century federal government” and, save slavery, the most consequential matter facing the American state in the nineteenth century overall.[1] Questions of tariff policy often captured Americans’ ambitions and anxieties about the nation’s future course of economic and political development. They also provided an opportunity to discuss the federal government’s proper role in society.

The United States Congress considered major tariff bills on many occasions in the nineteenth century, but the issue took a central place in American political discourse after the Civil War. The Union’s need for revenue (and Southern legislators’ absence from Congress) led Lincoln and congressional Republicans to enact high tariffs into law during the conflict. Postwar Republicans took an increasingly assertive protectionist stance, and successfully resisted Democrats’ corresponding attempts to reduce tariffs. In this context the policy became a virtual litmus test of party identification. Republicans repulsed reformers’ attempts to cut tariffs in the mid-1880s, and pushed still higher duties through Congress in 1890 and, after a modest setback in 1894, again in 1897.

Although the tariff played a prominent role in the late nineteenth century’s electoral politics, scholars have paid relatively scant attention to protectionists’ and their opponents’ arguments. Of those considering the matter, the political scientist Judith Goldstein has asserted that postwar tariff proponents relied on what scholars have called Free Labor appeals, which maintained that tariff-protected industrial workers’ high wages allowed them to save the money necessary to open their own businesses, thus achieving social mobility, or what Abraham Lincoln called the “right to rise.”[2] A leading intellectual historian has suggested that this argument became discredited and was abandoned in this period, however.[3] The political scientist John Gerring has emphasized Republicans’ other appeals to labor, as well as neo-mercantilism and statism, in defense of the policy, providing brief lists of words associated with each argument.[4] Scholars analyzing tariff reformers’ attacks on the policy have mentioned their description of it as a federal grant of special privilege to manufacturers at the expense of other members of the national community, especially in the postwar period’s context of industrial consolidation and increasingly public political corruption. Some nineteenth-century tariff critics also attacked the measure as undermining individual responsibility and encouraging workers to expect something for nothing.[5]  

These interpretations of tariff debates are built on a limited evidentiary base. The Congressional Record’s verbatim account of remarks on the floor of Congress begins in 1873. It consists of well over two million individual speeches or other utterances, totaling over 2.5 million sentences. Any scholar trained in the traditional analysis of political texts (i.e., reading them her or himself) would be hard-pressed to review, much less consider and evaluate, this mass of data in the period of time traditionally devoted to a dissertation or book project. In this light, scholars’ analyses of arguments and debates over the protective tariff have focused on assorted individual works of tariff boosters and opponents, including speeches in Congress and works of journalism, as well as early works of economics and the period’s broader discourse of social science. 

The Congressional Record is today available as a database of digital full-text materials, and scholars of literature and humanities computing programmer/developers have in recent years developed a methodology that can provide a new perspective on it. Using an approach that has proved useful in the analysis of a broad range of other very large data sets, they have turned computing power and algorithms to the examination of digital text collections, comprised of many thousands of titles, that have recently become available from a number of sources.  Where traditional practitioners devoted to the close reading of a limited number of selected texts have focused on specific, particular uses of language and shades of meaning to produce detailed, highly nuanced accounts and interpretations of the texts’ arguments, advocates of what Franco Moretti has called “distant reading” and Matthew Jockers “macroanalysis” seek to discover, visualize and explore quantifiable evidence of significant patterns within these much larger collections.[6] Jockers has emphasized that the analysis of literary work at scale allows researchers to move their studies beyond a focus on the very few works that critics and scholars have acclaimed as classic or otherwise outstanding examples of literary craft to include a larger cross-section of materials, “an aggregated ecosystem or `economy’ of texts.”[7] He goes on to conclude that computational work often supports what many perceive to be common knowledge about literary works, yet provides evidence for it, as opposed to casual observations.[8] He emphasizes the prospect of using close and distant reading together, exploring the relationships between specific expressions of belief or creativity and the larger context in which individual authors situate their arguments or stories.

Intellectual historians have long turned their attention to the close reading of specific texts, often focusing especially on individuals and works for which they can demonstrate subsequent influence. Political historians and political scientists have consistently studied beliefs and ideologies as important aspects of the history of electoral activity and governance, with an equal emphasis on tracing their genealogy and influence. The proposed project will use text-mining technology to build on these disciplines’ traditional practice in several ways.

The project will build on and use a set of applications and scripts developed in R by Adam Frieberg, as follows.

Congressional Record text materials prepared by ProQuest are stored in a relational database with an internal index system, built on Microsoft SQL Server Express with Advanced Services.  The R code is written in modules that have already structured much of the data.
Module: Ingester – R scripts used regular-expression pattern matching to search the directory of files recursively and find all .xml files in the ProQuest data source that have a matching full-text PDF file. From what we can tell, the ProQuest XML files contain the full text of the PDFs that were generated via OCR (Optical Character Recognition). The R code then built an index of the files by date and focused on the entire Congressional Record from 1876 to 1896. These two decades were chosen because of the “full text”/verbatim nature of the printed Congressional Record at the time, as well as their being the zenith of tariff debates in the late nineteenth century. The R code combed the text and identified speakers as well as the content of their speeches. This identification relied on speeches consistently starting with the string “Mr. ”. Candidates for speeches were then filtered to exclude the sections that began with procedural words (examples: “presented”, “introduced”, “submitted”, “a bill”, “petition”, “by unanimous”). The separated speeches were stored in a database table called Speeches1876to1896 and indexed by date, by speaker name, and by full text. The scripts were run as a single-threaded process so that the stored records preserve the order in which the speeches appear in the Congressional Record.
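
The speaker detection and procedural filter described above can be illustrated with a short sketch. It is written in Python rather than the project's R, and it is not the project's code: the regular expression, the function name and the sample text are assumptions made for illustration. Only the “Mr. ” rule and the list of procedural openers come from the description above.

```python
import re

# Procedural openers listed in the text; sections starting with them are not speeches.
PROCEDURAL_OPENERS = ("presented", "introduced", "submitted",
                      "a bill", "petition", "by unanimous")

# Assumption: a speech opens with "Mr. " followed by a surname in capitals
# and optional punctuation. The pattern used in the R scripts is not given.
SPEAKER_PATTERN = re.compile(r"Mr\. ([A-Z][A-Z'\-]+)[.:,]?\s+")


def split_speeches(page_text):
    """Return (speaker, text) pairs in the order they appear in page_text."""
    matches = list(SPEAKER_PATTERN.finditer(page_text))
    speeches = []
    for i, match in enumerate(matches):
        # A speech runs until the next "Mr. NAME" marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        body = page_text[match.end():end].strip()
        # Drop candidates that open with a procedural word or phrase.
        if body.lower().startswith(PROCEDURAL_OPENERS):
            continue
        speeches.append((match.group(1), body))
    return speeches


# Invented sample text with placeholder names, for illustration only.
sample = ("Mr. SMITH. I rise to oppose this duty. "
          "Mr. JONES presented a petition of wool growers. "
          "Mr. BROWN. The gentleman is mistaken.")
for speaker, text in split_speeches(sample):
    print(speaker, "|", text)
```

Because the function walks the matches in document order, the output keeps the sequence of the printed page, which is the property the single-threaded run was meant to protect.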

Module: Sentiment Analyzer – R scripts produced a more granular resolution by separating every speech into sentences. The sentences were split on the standard period (“.”) character. The sentences were quality controlled by filtering out abbreviations and other places with OCR errors. The exclusionary rules included filtering out any sentences that began with numeric characters (“H.R. 234,” for example, was the typical designation for a House bill, and splitting on periods leaves a fragment beginning with “234”). The rules also excluded sentences beginning with the standard Congressional Record headers (“CONGRESSIONAL RECORD – SENATE” and “Also, a bill”). Only sentences longer than 10 characters were then retained. This was a subjective threshold intended to retain sentences such as “Mr. COGHLAN: I concur” but to exclude shorter utterances such as “Mr. Smith: Aye”. The R script then used an external third-party API (Microsoft’s Cognitive Services API) to generate sentiment analysis scores for every sentence surviving those filters in the 20 years of the Congressional Record. Those sentences are stored in the SpeechFragments20Yr database table and the sentiment analysis scores are stored in the SpeechFragments20YrSentimentAnalysis table.
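
As a rough illustration of the sentence filters described above, the following Python sketch splits a speech on periods and applies the three exclusion rules. It is an assumption-laden stand-in for the R scripts: the function names and sample text are invented, and the header check uses the shorter prefix “CONGRESSIONAL RECORD” so that it would also catch a House header. The call to the sentiment API is left out.

```python
import re

# Header prefixes from the text. Using the shorter prefix
# "CONGRESSIONAL RECORD" is an assumption so that any chamber is caught.
HEADER_PREFIXES = ("CONGRESSIONAL RECORD", "Also, a bill")
MIN_LENGTH = 10  # characters; the threshold named in the text


def split_sentences(speech_text):
    """Split on the period character, keeping the period with each piece."""
    pieces = re.findall(r"[^.]+\.?", speech_text)
    return [p.strip() for p in pieces if p.strip()]


def keep_sentence(sentence):
    """Apply the exclusion rules: numeric start, header start, minimum length."""
    if sentence[0].isdigit():
        return False
    if sentence.startswith(HEADER_PREFIXES):
        return False
    return len(sentence) > MIN_LENGTH


def filter_sentences(speech_text):
    return [s for s in split_sentences(speech_text) if keep_sentence(s)]


# Invented sample text, for illustration only.
sample = ("The duty on wool protects the American farmer. "
          "H.R. 234 was referred to the committee. Aye. "
          "CONGRESSIONAL RECORD - SENATE. I oppose this measure entirely.")
for sentence in filter_sentences(sample):
    print(sentence)
```

Splitting on every period breaks “H.R. 234” into the pieces “H.”, “R.” and a fragment that starts with “234”, which is why the numeric-start and minimum-length rules remove it together.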

Module: Index Database Views - The combination of the three prior-mentioned database tables yields a corpus of text that is indexed by speaker, time, and sentiment.  Many of the over two million individual speeches reflected in the speech indexes are clearly portions of back-and-forth utterances. This module provides a way to diagnose these speeches. The views link individual fragments of speech with parent speech objects that are then identifiable by speaker. Records have ID fields to keep them in the sequence they appeared within the print version of the Congressional Record, moving forward in time. 
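
A view of the kind described above might look like the following sketch. It uses Python's built-in sqlite3 module in place of Microsoft SQL Server so that it can run anywhere. The three table names come from the text, but every column name, the view name FragmentIndex and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Table names are from the text; all column names and the view name
# are hypothetical, as the real schema is not described.
SCHEMA = """
CREATE TABLE Speeches1876to1896 (
    SpeechID INTEGER PRIMARY KEY, SpeechDate TEXT, Speaker TEXT, FullText TEXT);
CREATE TABLE SpeechFragments20Yr (
    FragmentID INTEGER PRIMARY KEY, SpeechID INTEGER, Sentence TEXT);
CREATE TABLE SpeechFragments20YrSentimentAnalysis (
    FragmentID INTEGER PRIMARY KEY, Score REAL);
CREATE VIEW FragmentIndex AS
SELECT f.FragmentID, s.SpeechID, s.SpeechDate, s.Speaker, f.Sentence, a.Score
FROM SpeechFragments20Yr AS f
JOIN Speeches1876to1896 AS s ON s.SpeechID = f.SpeechID
LEFT JOIN SpeechFragments20YrSentimentAnalysis AS a
       ON a.FragmentID = f.FragmentID;
"""


def fragments_in_order(conn):
    """Return fragments with their speaker and score, in printed order."""
    query = ("SELECT Speaker, Sentence, Score FROM FragmentIndex "
             "ORDER BY SpeechID, FragmentID")
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented rows with placeholder names, for illustration only.
conn.executemany("INSERT INTO Speeches1876to1896 VALUES (?, ?, ?, ?)", [
    (1, "1888-05-01", "SMITH", "I support the duty. It protects labor."),
    (2, "1888-05-01", "JONES", "I oppose it."),
])
conn.executemany("INSERT INTO SpeechFragments20Yr VALUES (?, ?, ?)", [
    (1, 1, "I support the duty."),
    (2, 1, "It protects labor."),
    (3, 2, "I oppose it."),
])
conn.executemany(
    "INSERT INTO SpeechFragments20YrSentimentAnalysis VALUES (?, ?)",
    [(1, 0.9), (2, 0.8)])
rows = fragments_in_order(conn)
```

The LEFT JOIN keeps fragments that have no sentiment score, so a sentence filtered out before scoring still appears in sequence with an empty score.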

Module: Topic Modeler – Pradeep will investigate modern topic modeling approaches, including Mallet and Gensim. He will consult with Dr. VandeCreek and provide sample output. Together, they will select the approach to be used in the final analysis. The goals of this topic modeling are to: 1) inform Dr. VandeCreek’s navigation of the full corpus in further research; 2) identify prominent topics as they may correspond to existing historical and Political Science scholarship’s description of pro- and anti-tariff arguments in this period; 3) determine whether the prominence of specific topics changes over time; 4) use visualization applications to illustrate these changes for an audience unfamiliar with data science.

Using the above techniques, the project will first address the challenge of identifying which of the available congressional text materials discussed tariff legislation, and whether each supported or opposed a tariff bill, by using basic word search functionality, text classification, sentiment analysis, and a freely available API providing information about members of Congress and their voting histories. A machine-generated review of the Congressional Record for the period under consideration has identified a specific set of speeches, inserted documents and other utterances including the word “tariff” and/or several synonymous or related terms, including “duty/duties,” “impost(s),” “levy,” and “excise,” as well as the words “protection” and “protective,” which scholarship in History and Political Science shows were widely used to describe the policy. Project participants will next move to create two sets of documents: those supporting the tariff and those opposing it. They will do so in two ways. First, Dr. VandeCreek will assemble training sets of speeches and other documents known to express pro- and anti-tariff arguments, and then ask text mining software (which?) to compare the words and patterns of words in each to those found in a set of unclassified works. This will produce a result in which the software predicts the likelihood that each unclassified document argues for or against the tariff. Second, Microsoft Azure’s sentiment analysis application will measure the degree to which speeches discussing the tariff express positive or negative sentiment, with the working hypothesis that pro-tariff speeches will express more positive sentiment and anti-tariff speeches more negative sentiment. Project participants will check these results against each other and make use of the ProPublica Congress API to ascertain how the member of Congress responsible for a given speech, utterance or other text voted on the legislation that it addressed.
Dr. VandeCreek will also make close readings of a number of randomly selected texts in the sets produced by the above means in order to determine if they have produced sufficiently accurate collections of pro- and anti-tariff text. 
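
The keyword review and the training-set classification described above can be sketched in a few lines of Python. The proposal does not name the classification software, so the multinomial naive Bayes model below is only an assumed stand-in, and the function names and training sentences are invented for illustration. The search terms are those listed in the text.

```python
import math
import re
from collections import Counter

# Search terms listed in the text.
TARIFF_TERMS = {"tariff", "duty", "duties", "impost", "imposts",
                "levy", "excise", "protection", "protective"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def mentions_tariff(text):
    """Basic word search: does the text use any of the tariff terms?"""
    return any(token in TARIFF_TERMS for token in tokenize(text))


def train(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts, doc_counts = {}, Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict_proba(model, text):
    """Return the estimated probability of each label for an unclassified text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Invented training sentences, for illustration only.
model = train([
    ("protection secures high wages for american labor", "pro"),
    ("the duty defends our home industry and labor", "pro"),
    ("the tariff is a tax that robs the consumer", "anti"),
    ("this duty grants special privilege to monopoly", "anti"),
])
probabilities = predict_proba(model, "protection means wages for labor")
```

A real training set would need many more documents per class, and the close readings of randomly selected texts would serve as the check on whatever classifier is finally chosen.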

Having produced a set of pro- and anti-tariff documents, project staff members will next use the topic modeling software Mallet and/or Gensim to examine the sets of words that tariff proponents and opponents used to praise or condemn the policy in the period 1876-1896. Project staff members will identify individual pieces of tariff legislation that came to the floor of Congress for debate in this period, and separate those texts identified as discussing the tariff into sub-sets of materials specifically pertaining to each bill (for example, the Tariff of 1883, also known as the Mongrel Tariff due to its tepid reforms; the Mills Bill of 1888, which unsuccessfully proposed lower tariffs; and the McKinley Tariff of 1890, which produced dramatically increased tariffs). This will produce a division of materials reflecting the progress of tariff debates over time.

Project participants will construct several topic models for pro- and anti-tariff speeches for each bill, and analyze whether and, if so, how the arguments that members of Congress made for and against the policy changed over time. Using visualization software, they will present this data for review by historians and other interested parties who are likely to be unfamiliar with topic modeling or other text mining technologies.
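
One simple way to compare topic prominence across bills is to average each topic's weight over the documents attached to a bill. The Python sketch below assumes that a topic model (from Mallet, Gensim or another tool) has already assigned every document a list of topic weights. The function name, the input format and the numbers are hypothetical and are not output from the project.

```python
from collections import defaultdict


def topic_prominence_by_bill(doc_topics):
    """Average topic weights per bill.

    doc_topics is a list of (bill, weights) pairs, where weights is the
    list of topic proportions a topic model assigned to one document.
    """
    sums = {}
    counts = defaultdict(int)
    for bill, weights in doc_topics:
        if bill not in sums:
            sums[bill] = [0.0] * len(weights)
        for i, weight in enumerate(weights):
            sums[bill][i] += weight
        counts[bill] += 1
    return {bill: [total / counts[bill] for total in totals]
            for bill, totals in sums.items()}


# Invented weights for a two-topic model, for illustration only.
doc_topics = [
    ("Tariff of 1883", [0.5, 0.5]),
    ("Tariff of 1883", [1.0, 0.0]),
    ("McKinley Tariff of 1890", [0.25, 0.75]),
]
prominence = topic_prominence_by_bill(doc_topics)
```

Plotting these averages bill by bill, in chronological order, would give the kind of over-time view of topic prominence that the visualizations are meant to provide.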

More specific research questions to be explored may include:

What topics most characterized pro- and anti-tariff arguments in the period 1876-1896?

Did these topics or arguments change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Goldstein describes as the Free Labor appeal? If so, how many? Does their prominence change over time?

Of the topics produced from a review of pro-tariff texts, do any reflect the influence of what Gerring describes as the labor, neo-mercantilist and statist appeals? If so, how many? Does their prominence change over time?

Of the topics produced from a review of anti-tariff texts, do any include references to special privilege? To political corruption? To the undermining of individual responsibility and self-reliance? If so, how many? Does their prominence change over time?

These results will provide an opportunity to explore how postwar members of Congress discussed the prospect of a federal activity directing the course of economic and social change in the United States as it related to a policy that historians and political scientists have identified as among the century’s most significant. Project participants will present data addressing the above questions in a series of conference presentations, publications and/or reports to an audience of historians, political scientists and digital humanities scholars. They will use visualization software to present findings and illustrate interpretive discussion, especially in work directed toward the first two groups, members of which are likely to be unfamiliar with topic modeling or other text mining technologies. 

[1] J. J. Pincus “Tariffs” Encyclopedia of American Economic History (New York: Charles Scribner’s Sons, 1980) 439; “Tariff Policies” Encyclopedia of American Political History (New York: Charles Scribner’s Sons, 1984) 1259. Other works emphasizing the tariff’s importance in nineteenth-century American politics include Charles and Mary Beard The Rise of American Civilization (New York: Macmillan, 1927); H. Wayne Morgan From Hayes to McKinley (Syracuse: Syracuse University Press, 1969); Lewis Gould “The Republican Search for a National Majority” in The Gilded Age: A Reappraisal, H. Wayne Morgan, ed. (Syracuse: Syracuse University Press, 1970); Morton Keller Regulating a New Economy: Public Policy and Economic Change in America, 1900-1933 (Cambridge: Harvard University Press, 1990); Richard F. Bensel Yankee Leviathan: The Origins of Central State Authority in America, 1859-1877 (New York: Cambridge University Press, 1990) and The Political Economy of American Industrialization, 1877-1900 (New York: Cambridge University Press, 2000); Judith Goldstein Ideas, Interests and American Trade Policy (Ithaca: Cornell University Press, 1993); Joanne Reitano The Tariff Question: The Great Debate of 1888 (University Park, PA: Penn State University Press, 1994); John Gerring “Party Ideology in America: The National Republican Chapter, 1828-1924” Studies in American Political Development, 11 (Spring, 1997) 44-108; Rebecca Edwards Angels in the Machinery: Gender in American Party Politics from the Civil War to the Progressive Era (New York: Oxford University Press, 1997); Morton Keller “Trade Policy in Historical Perspective” in Taking Stock: American Government in the Twentieth Century, Morton Keller and R. Shep Melnick, eds. (New York: Cambridge University Press, 1999); Charles W. Calhoun “James G. Blaine and the Republican Party Vision” in The Human Tradition in the Gilded Age and Progressive Era, Ballard Campbell, ed. (Wilmington, DE: SR Books, 2000).

[2] Authors emphasizing the Free Labor argument for the tariff include Goldstein Ideas, Interests and American Trade Policy; George B. Mangold, “The Labor Argument in the American Protective Tariff Discussion,” Bulletin of the University of Wisconsin, no. 246 (1906): passim; Frank Taussig, The Tariff History of the United States, 8th ed. (New York, 1931), 65–6; Eric Foner, Free Soil, Free Labor, Free Men: The Ideology of the Republican Party Before the Civil War (New York, 1970), 20–1; Dorothy Ross, Origins of American Social Science (New York, 1990), 47–8; Michael Holt, The Rise and Fall of the American Whig Party (New York, 1999), 69–70, 952 (quotation at 69); Gabor Boritt, Lincoln and the Economics of the American Dream (Memphis, 1978), 99, 113, 139.

[3]  Dorothy Ross states that Free Labor ideology quickly faded from use after the Civil War in Origins of American Social Science, 48.  

[4] Gerring, “Party Ideology in America: The National Republican Chapter.”

[5] Keller “Trade Policy in Historical Perspective” 19. 

[6] Matthew Jockers Macroanalysis: Digital Methods and Literary History (Urbana: University of Illinois Press, 2013) 20; Franco Moretti Distant Reading (London: Verso, 2013).

[7] Jockers Macroanalysis, 32.

[8] Jockers Macroanalysis, 30.