Friday, May 12, 2017

Stephen Douglas, the Little Giant

The Little Giant in the character of a Gladiator | Abraham Lincoln Historical Digitization Project | NIU Digital Library

Image appears in Lincoln/Net, courtesy of the Chicago History Museum

This cartoon, dating from the late 1850s or 1860, depicts Illinois Senator Stephen Douglas, popularly known as the Little Giant, as a Roman gladiator armed with his doctrine of Popular Sovereignty. In Douglas’ usage, Popular Sovereignty suggested that citizens of territories seeking to become states should determine for themselves if slavery would be permitted there. This proved very controversial in the northern states because the Missouri Compromise of 1820 had forbidden slavery in territory acquired in the Louisiana Purchase located north of the 36°30′ parallel (with the exception of Missouri). Douglas’ proposal potentially threw the entire West open to slavery, and served to intensify the sectional crisis that led to the Civil War.

See in NIU Digital Library

The Haymarket Riot

“The Haymarket Riot. The Explosion and the Conflict” by W. Ottman, 1889 | Illinois During the Gilded Age | NIU Digital Library

The above image is a contemporary artist’s imagining of the moment of the bomb’s explosion, found in Anarchy and Anarchists: A History of the Red Terror in America and Europe by Michael Schaack (Chicago: F.J. Schulte and Co., 1889). It appears in Illinois During the Gilded Age.

On the evening of May 4, 1886, an unknown individual lobbed a dynamite bomb into a formation of Chicago police officers sent to disperse an anarchist meeting in Chicago’s Haymarket Square. The panicked police responded with a hail of gunfire directed into the crowd attending the meeting. When order once again prevailed, seven police officers and at least that many private citizens lay dead, with many more wounded. These events touched off a wave of civic upheaval as Americans discussed the Haymarket bomb in light of the period’s rapidly changing economic and social conditions. It also led to a celebrated trial of eight avowed anarchists, the execution or death in prison of five of them, and Illinois Governor John Peter Altgeld’s bold pardon of the remaining three.

See in NIU Digital Library 

Owen Lovejoy

Owen Lovejoy | Abraham Lincoln Historical Digitization Project | NIU Digital Library

Image: Northern Illinois University Libraries

Owen Lovejoy (January 6, 1811 – March 25, 1864) was a Congregationalist minister and abolitionist who won election to the United States Congress in 1856. In an 1859 speech to the House of Representatives, he declared his opposition to the Fugitive Slave Act, a federal law that required all Americans to assist in the capture of escaped bondsmen, in the following terms:
“Proclaim it upon the house-tops! Write it upon every leaf that trembles in the forest! Make it blaze from the sun at high noon and shine forth in the radiance of every star that bedecks the firmament of God. Let it echo through all the arches of heaven, and reverberate and bellow through all the deep gorges of hell, where slave catchers will be very likely to hear it. Owen Lovejoy lives at Princeton, Illinois, three-quarters of a mile east of the village, and he aids every fugitive that comes to his door and asks it. Thou invisible demon of slavery! Dost thou think to cross my humble threshold, and forbid me to give bread to the hungry and shelter to the houseless? I bid you defiance in the name of my God.”

See NIU Digital Library

"The Last Refuge" by Thomas Cole

Image from Lincoln/Net, courtesy of Newberry Library

This 1855 engraving of Thomas Cole’s “The Last Refuge” depicts a Native American man pursued to the top of a single pillar of rock in the wilderness, his “last refuge” from the encroachment of American settlement. Although all American citizens contributed to this dynamic to some degree, a significant number, especially Whigs in the urban North, regretted its impact on Native Americans. Hoping that their country would devote its energies to the more intensive development of territory east of the Mississippi River, or even east of the Appalachian Mountains, they associated rapid western settlement with the spread of cotton agriculture, slavery, and an American future as an agricultural nation dependent upon industrial Britain to buy its raw materials. They also feared it would undermine Christianity’s influence on Americans’ lives, especially those living on the frontier.

See NIU Digital Library

US Gunboat Cairo

U.S. Gunboat Cairo - Courtesy of Tulane University Libraries Robert M. Jones Steamboat Collection | Mark Twain’s Mississippi Project | NIU Digital Library

Image appears in NIU's Mark Twain's Mississippi Project, courtesy of Tulane University Libraries' Robert M. Jones Steamboat Collection

“Cairo, an ironclad river gunboat, was built in 1861 by James Eads and Co., Mound City, Ill., under an Army contract; and commissioned as an Army ship 25 January 1862, naval Lieutenant James M. Prichett in command.”
“Cairo served with the Army’s Western Gunboat Fleet, commanded by Flag Officer A. H. Foote, on the Mississippi and Ohio Rivers and their tributaries until transferred to the Navy 1 October 1862 with the other river gunboats. Active in the occupation of Clarksville, Tenn., 17 February 1862, and of Nashville, Tenn., 25 February, Cairo stood down the river 12 April escorting mortar boats to begin the lengthy operations against Fort Pillow, Tenn. An engagement with Confederate gunboats at Plum Point Bend on 11 May marked a series of blockading and bombardment activities which culminated in the abandonment of the Fort by its defenders on 4 June.”
“Two days later, 6 June 1862, Cairo joined in the triumph of seven Union ships and a tug over eight Confederate gunboats off Memphis, Tenn., an action in which five of the opposing gunboats were sunk or run ashore, two seriously damaged, and only one managed to escape. That night Union forces occupied the city. Cairo returned to patrol on the Mississippi until 21 November when she joined the Yazoo Expedition. On 12 December 1862, while clearing mines from the river preparatory to the attack on Haines Bluff, Miss., Cairo struck a torpedo and sank.” – Dictionary of American Naval Fighting Ships.

See NIU Digital Library

A New Type of Post

This spring I have begun posting individual images from Northern Illinois University Libraries' Digital Library, principally from digital projects exploring American history and culture that I have developed, to the Digital Library's Tumblr account. I typically offer a few words of explanation or analysis to accompany the image.

I post materials for one week out of every month, every day of that week. My colleagues and I have agreed to re-post materials via our own blogs, so here goes...

Tuesday, January 24, 2017

The Fragility of Digital History (at least as I practiced it)

My recent turn to projects emphasizing the curation and preservation of digital data, like that contained in the historically oriented web sites we have developed at Northern Illinois University Libraries, has led me to recognize the many ways in which these materials can become compromised or otherwise lost to use. Backing up materials is of course very important, but is not a cure-all. If we back up materials in formats that eventually become so obsolete that no available software can open them, the data is still lost.

I have also become aware of other, less obvious, factors that have compromised Lincoln/Net, Mark Twain's Mississippi, and several of our other web sites. We built these in the early 2000s, with available open-source technology (Linux/Apache/MySQL/PHP - commonly called a LAMP set-up). It allowed our sites to combine searchable archives of primary sources (mostly text and images, but also latter-day versions of primary source materials in different media) with original interpretive materials in an effective manner. We simply laid out the web sites on two perpendicular axes, with a bar presenting links to primary source materials running horizontally near the top of the page, and a bar presenting links to interpretive materials running vertically along the left edge.

Of course this approach had its drawbacks. As we built a series of websites, we found we had no way to manage them systematically, together. If we wanted to make changes to our sites, which generally had a similar look and feel, someone had to edit their code, individually, by hand. We also had no way of monitoring our data in order to verify its continuing viability, nor did we have a way of pushing our data, as a block, or a series of blocks, into a backup device. This became increasingly time-consuming.

We decided to migrate all of our data to a new platform made up of Fedora Commons repository software, a Drupal web interface, and Islandora, a Drupal module that allowed them to interact. This proved very difficult in our context of limited resources, and today we run that combination of applications on a shoestring thanks to the efforts of one talented and dedicated librarian.

The move to the new platform made data management and curation much easier, but it also cost us something.

The Fedora/Drupal/Islandora stack functions on the assumption that those implementing it intend to make digital objects available online by search or browsing, and to manage their collections in a coordinated manner. It allows the users and providers of data to do these things very well, much better than a LAMP set-up would. But it leaves little room for interpretive materials. Put another way, it reduces our interpretive materials from a place of considerable importance on our websites, where they were presented as equally important as the primary sources, to a sidelight. Links to them appear in the toolbar running horizontally near the top of the page, but from my perspective they become just another type of available data. Uninitiated users have little reason to perceive that buttons labeled "Essays" or "Videos" lead to interpretive materials. The "Lesson Plans" button is certainly effective, however.

Why could we not adapt the technology to preserve our two-axis presentation? To be brief, because a more sophisticated and manipulable search interface occupies the entire left edge (approximately one-quarter of the width) of the page. This is in many ways a good thing, as we provide increasingly knowledgeable and experienced user groups in educational institutions with the features they have come to expect. It also makes it impossible to put anything else along the left edge of the page.

Why could we not simply invert our approach and present primary sources on the vertical axis and interpretive materials on the horizontal? Again, to be brief, because the more powerful search apparatus that we now use provides a preliminary ("faceted") level of access to the data it retrieves there, occupying the remainder of the page (below the horizontal toolbar) with access to individual resources. All other functions, including "browse," "home," and "about," reside on the horizontal bar, along with access to essays, videos, maps, and lesson plans.
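The faceted access I describe can be sketched in a few lines. This is an illustrative toy, not our actual Islandora configuration: the records and field names below are invented, and a real repository computes these counts inside its search engine (typically Solr) rather than in application code.

```python
from collections import Counter

# Toy records standing in for digitized-collection metadata.
records = [
    {"title": "Lincoln speech", "type": "text", "decade": "1850s"},
    {"title": "Douglas cartoon", "type": "image", "decade": "1850s"},
    {"title": "Gunboat photograph", "type": "image", "decade": "1860s"},
]

def facet_counts(results, fields):
    """Tally the values each facet field takes across a result set,
    as a faceted interface does before the user drills down."""
    facets = {field: Counter() for field in fields}
    for record in results:
        for field in fields:
            facets[field][record[field]] += 1
    return facets

print(facet_counts(records, ["type", "decade"]))
```

The user then narrows the result set by clicking a facet value, and the counts are recomputed over the smaller set.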

To be clear, I am not complaining that my library forced me and my colleagues to use software that we don't like. I originally led the push to make the change to a new software stack. If we had retained our original interface, now nearly twenty years old, our web sites would have taken on the appearance of obsolescence. Despite the apparent superficiality, almost triviality, of this concern, I believe that experienced web users immediately assess a site's usefulness and legitimacy by its  appearance - the first impression it makes. I know I certainly do. If we had retained our original interface, we not only would have continued to limp along with web sites that remained difficult to administer and data unsuited to modern curation techniques, we also would have produced a first impression marked by obviously outdated technology.

So what am I saying? This: the technical platforms necessary to present a web site including searchable access to primary source materials in a sophisticated and credible manner today reflect the assumptions and priorities of the library and archives community. They emphasize providing access to data, period. Technological developments like those we have employed (faceted search, for example) make that increasingly easy and powerful.

These assumptions and priorities give little notice to matters of outreach and interpretative assistance. They are aimed at users who want to search data in order to reach their own interpretations, in an essay, a research paper, or a book.  These users presumably already have access to information helping them to understand the primary source materials via classroom instruction or interpretive works available elsewhere (other web sites, books, articles, etc.). These users especially exist in schools and on college/university campuses.

Our LAMP-based web sites tried to provide a user group that we presumed did not have ready access to these forms of interpretive material - members of the general public - with a chance to build an interpretive framework to inform their searches. Perhaps these individuals could have gone to a library and read interpretive works, but we attempted to use the web's immense reach and flexibility to make interpretive materials more readily accessible - online, right next to the primary sources, in text and video formats.

We still do this, but the recent improvements in online indexing and search technology have made it increasingly difficult and, I suspect, ineffective.

My colleagues and I developed our sites in the web's early days, before information professionals had had a chance to assess it and refine it for their purposes by the development of progressively more effective search and retrieval technology. They have given us a great deal. But something has been lost, too.

I do not blame librarians and archivists for this loss. They are not being shortsighted. They are simply doing their jobs, as they are defined by the conventions of their profession. These conventions are  worthwhile and to be applauded. I suspect that attempts to devise an interface accommodating my preferred two-axis approach would likely compromise the efficacy of the available search and retrieval technology in some way. Were it possible to design such an interface without negative trade-offs, I suspect that it would require a considerable amount of financial resources and technical expertise, which are seldom available in the present political climate.

Our present web sites do make interpretive materials available for use, albeit not in the precise manner I originally envisioned.

As a historian, I have come to understand the versions of Lincoln/Net, Mark Twain's Mississippi, and other web sites that we developed with LAMP technology as artifacts, expressions of their time, especially the available technology. You can still see them on the Wayback Machine.

Thursday, May 26, 2016

Text Mining and Library Cataloging

During the spring semester of 2016 I supervised a team of students (Marcos Quezada, a graduate student in Operations Management and Information Systems; Fredrik Stark, a PhD candidate in NIU's English Department; and Mitchell Zaretsky, a junior Computer Science major) as they explored text mining in the context of Northern Illinois University Libraries' large online collection of late nineteenth and early twentieth century dime novels. We worked in the format of an experiential learning activity, meaning that we addressed a problem brought to us by a client. In this case Matthew Short, NIU Libraries Metadata Librarian and Cataloger, served as the client.

In the experiential learning format, the client presents the student team with a set of goals. Mr. Short asked the team to develop a text classification application or tool to help library catalogers determine the genre of the approximately 1,900 digitized texts in the collection. In traditional cataloging activities, the cataloger inspects a work manually in order to derive the basic information necessary to catalog it accurately. This can be a lengthy process. Perhaps text-mining technology could help catalogers improve the speed and efficiency with which they catalog a very large collection.

Mr. Short's goals also included compiling a list of genres and related subject terms for possible use in reclassifying online digitized collections, and investigating text-mining tools for the future development of the prototype classifier application and for future studies of the collections.

The team began work by using Weka, an open-source data and text-mining application. Mr. Short selected it because it enables users to acquaint themselves with the separate activities that make up text mining and construct original applications using blocks of existing Java code.

Mr. Short introduced the students to a typical text-mining work flow. He had been working to achieve his goals prior to engaging with this group, and for all intents and purposes led the team's activities. As the team's official coach, I attempted to facilitate discussion, scheduled activities, and completed paperwork.

The students began by gathering text files of digitized dime novels cataloged as belonging to the collection's better-represented genres. These genres included detective and mystery stories; western stories; sea stories; historical fiction; adventure stories; and bildungsroman (coming-of-age) stories.

The team next engaged in pre-processing activities in order to produce the most accurate text possible. NIU Libraries staff members originally produced the digital texts in the digital dime novel collection by the use of Optical Character Recognition (OCR) software and did not attempt to correct any mistakes within them. Pre-processing began with the removal of stop words (such as the, an, and, etc.) and also included tokenization (identifying groups of characters as words) and stemming (reducing different inflections of a word to their root form). We also used Weka to render the text materials as a bag of words (i.e., to set aside grammar and word order) and to transform words into vectors, or numerical representations.
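The steps above can be sketched in a few lines of Python. This is an illustrative toy, not the team's Weka workflow: the stop-word list is deliberately tiny, and the stemmer is a crude suffix-stripper standing in for a real one such as Snowball.

```python
import re
from collections import Counter

# A small illustrative stop-word list; a real workflow would use a
# much fuller list, such as Matthew Jockers's nineteenth-century set.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix-stripping stemmer, a stand-in for Snowball."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count: grammar and word
    order are discarded, leaving a word-frequency vector."""
    tokens = [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens)

sample = "The detective was riding westward, chasing the riders."
print(bag_of_words(sample))
```

The resulting counts are the "vectors" the classifier works with: each distinct stemmed word becomes a dimension, and each document a point in that space.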

The team then moved on to text classification. They began by using a set of already-cataloged works to train Weka to identify specific words or sets of words with the individual genres mentioned above. Of the algorithms available in Weka, Naive Bayes proved most effective. They found that in 65% of the works examined, Weka's classification agreed with that of a human cataloger. Investigating this discrepancy, the team found that additional filtering techniques improved accuracy (i.e., the rate at which Weka agreed with a human cataloger's genre classification) to 75%. These techniques included TF-IDF (a measure of how important a word is to a document in a collection or corpus); a better stemmer (the open-source product Snowball); a list of nineteenth-century stop words compiled by Matthew Jockers, a scholar of the period's literature; rendering all letters in lower case; and capping the number of words analyzed in each text at 500. They also discovered that a number of texts in the training set had been cataloged as belonging to two different genres. Removing these works improved accuracy to 83%.
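The team's work ran through Weka, but the idea behind Naive Bayes text classification can be sketched compactly. The genre labels and token lists below are invented toy data; with add-one smoothing, the classifier scores each genre by its log-prior plus the log-likelihood of the document's words under that genre.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Learn log-priors and per-genre word log-likelihoods
    (with add-one smoothing) from (tokens, genre) pairs."""
    genre_docs = defaultdict(int)
    genre_words = defaultdict(Counter)
    vocab = set()
    for tokens, genre in labeled_docs:
        genre_docs[genre] += 1
        genre_words[genre].update(tokens)
        vocab.update(tokens)
    total_docs = sum(genre_docs.values())
    model = {}
    for genre, counts in genre_words.items():
        total = sum(counts.values())
        model[genre] = {
            "prior": math.log(genre_docs[genre] / total_docs),
            "likelihood": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                           for w in vocab},
            "unseen": math.log(1 / (total + len(vocab))),
        }
    return model

def classify(model, tokens):
    """Return genres ranked by posterior log-probability."""
    scores = {}
    for genre, params in model.items():
        score = params["prior"]
        for w in tokens:
            score += params["likelihood"].get(w, params["unseen"])
        scores[genre] = score
    return sorted(scores, key=scores.get, reverse=True)

training = [
    (["detective", "clue", "murder", "clue"], "detective"),
    (["ranch", "rider", "frontier", "gun"], "western"),
    (["detective", "murder", "witness"], "detective"),
    (["frontier", "cattle", "rider"], "western"),
]
model = train_naive_bayes(training)
print(classify(model, ["clue", "detective", "witness"])[0])  # -> detective
```

The "naive" assumption is that words occur independently given the genre; it is false for real prose, but the method still ranks genres usefully, which is why it performed well for the team.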

With the information above, Mitchell Zaretsky used Weka's Java API to construct an original classifier application. It reported the probability of a work falling into each of the several genres. Working with a new test corpus of 214 digitized dime novels, the team found that their classifier agreed with human catalogers 71% of the time.

On the basis of this test, the team determined that their application can help catalogers determine a dime novel's genre. It can also serve as an effective tool for evaluating the genre determinations of catalogers not using the application in their work. They also suggested that text-mining activities uncovered details about the form and content of works in NIU's digitized dime novel collection that invite further research.

Friday, May 6, 2016

Text Mining at an Institution with Lesser Financial Resources


I have periodically described my experiences with text mining in this blog. Today I want to raise a significant point that has only recently become clear to me. It happened in the wake of my participation in the University of Michigan's "Beyond Ctrl+F" workshop on February 1st of this year.

I want to begin by thanking the University of Michigan Libraries for organizing and hosting the event. It must have taken a great deal of work.

When I first found out about the workshop, I noticed that participants could attend at no charge. This seemed too good to be true. Working at a state university in the bankrupt State of Illinois, I of course had access to no financial support for professional development activities. I happily drove to Ann Arbor and stayed overnight at my own expense, then took part in the workshop. Without the free-admission policy, I might have passed on the event.

The workshop began with a session devoted to "finding your corpus." Fair enough. No one can do text mining until they have some text. The session featured representatives of several vendors of subscription products providing access to large amounts of textual materials: ProQuest, JSTOR, Gale, Alexander Street Press (full disclosure - I edited an online product for Alexander Street Press and have cashed their checks). It dawned on me that the no-charge policy of course resulted from these vendors' sponsorship of the event. As sponsors, they enjoyed the opportunity to pitch their products to members of a captive audience who had expressed an interest in text mining.

Vendor representatives described how scholars and students might use their products for text-mining projects. One upshot: vendors do not want text miners to attempt to download very large amounts of text through their subscription portal. They want text miners to submit a request for a specific corpus, which they will then prepare and deliver for an extra fee in the range of $500-$1000.

This made something very apparent: text mining is in many cases only practicable at its intended scale at institutions commanding the financial resources necessary to 1) subscribe to these products, and 2) go on to pay the additional fee.

Now to my situation: I am interested in working with the Congressional Record from the nineteenth century. My university does not subscribe to the portion of ProQuest's Congressional product that includes the Record for that period, nor to any other product that does. Working through my library's acquisitions department, I was able to secure a quote for the use of the above-mentioned ProQuest text materials: it amounted to the annual subscription fee for the database product in which the Congressional Record resided.

This sum was a complete non-starter at my financially strapped university. I want to emphasize that it would have been a non-starter even if we had a budget, in more prosperous times (NB - ten months into our fiscal year, Northern Illinois University had not received any funds from the State of Illinois. Last month the legislature voted to provide stopgap funding designed to tide institutions over until the larger state budget impasse is resolved).

I understand that library vendors are private concerns and need to make a profit. Their representatives sell that product in order to earn a living.

Nevertheless, my experience suggested that vendors' current pricing structures effectively rule scholars at medium-sized and smaller institutions with modest financial resources out of the text mining game - or perhaps more accurately, rule them out of the opportunity to do text mining without spending a great deal of time and effort producing a corpus at the scale that makes text mining effective.

I can only speak to my own experience in the study of nineteenth-century American history.
Text preparation often involves the digitization of analog materials. It can also include the use of digitized text gathered from existing online collections. If a student or scholar wants to work with relatively clean text - i.e., text with significantly fewer OCR errors - they must figure out how to correct errors, either by using scripts or by hand.
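A correction script of the kind mentioned above can be as simple as an ordered list of substitution rules. The rules below are hypothetical examples of common OCR confusions ('h' read as 'b', the digit '1' read for 'l'); a real list must be derived by inspecting the errors that actually occur in a given corpus.

```python
import re

# Hypothetical substitution rules: a real list must be built by
# inspecting the errors common in a particular corpus.
OCR_FIXES = [
    (re.compile(r"\btbe\b"), "the"),             # 'h' misread as 'b'
    (re.compile(r"\bTbe\b"), "The"),
    (re.compile(r"\bwbich\b"), "which"),
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # digit '1' between letters
]

def correct_ocr(text):
    """Apply each substitution rule in turn to a page of OCR text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text

page = "Tbe Senate resumed consideration of tbe bi1l."
print(correct_ocr(page))  # -> The Senate resumed consideration of the bill.
```

Word-boundary anchors and lookaround assertions keep the rules from mangling legitimate words, but every rule still trades some false corrections for the true ones, so rules should be tested against sample pages before being run over a whole corpus.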

Vendors often produce clean(er) text by having materials double hand-keyed by another vendor, which costs a great deal of money.

When vendors charge fees for materials in the public domain, which the Congressional Record certainly is, they in effect charge for access to this cleaned-up, digitized text.

So, here's how I resolved the problem. I asked vendors if they would sell me my preferred chunk of data itself at a more reasonable price.

ProQuest declined to negotiate, but Hein Online (another vendor of digitized government documents) agreed, so I bought, at my own expense, the text of the Congressional Record for the period 1873-1896 for a price I could accept. I now have it available for research. Upon completing this transaction, I discovered that the University of North Texas Libraries, which present a digitized version of the Congressional Record online, would provide me with their data at no charge. There is only one catch: they use uncorrected OCR text, so I will have to spend some time finding ways to correct common errors within the corpus.

I thank the University of North Texas Libraries for the use of their data, and recommend them to other students and scholars. Their collections include a large number of digitized Texas newspapers, as well as records of the Federal Communications Commission.

My experience with Hein Online led me to draw a parallel to another experience I have had with a vendor. In the past several years I have taken part in the activities of the Digital POWRR Project, which produced a study of digital preservation challenges and potential solutions at medium-sized and smaller colleges and universities lacking large financial resources. The study included the review of a number of applications or tools available for use in digital preservation activities. Among them we found a comprehensive, all-in-one product called Preservica. They made no pricing information available online. We had to call for a quote.

When we contacted their sales representative to ask if they might make the product available for testing at little or no cost, they immediately rejected us, explaining that they targeted very large institutions with suitable budgets, ranging from universities to state and national governments. Preservica is a version of a digital preservation product that the company originally sold to large corporations like banks. They sought out the deep pockets. 

Our white paper recommended that institutions unable to afford a product like Preservica adopt a one-step-at-a-time approach to digital preservation activities using sets of open-source tools in combinations suited to their particular needs.

One other thing happened in the process of doing the study, however. Through a frank and open exchange of views with members of the Digital POWRR team, Preservica executives became aware  that they were leaving money on the table by adopting a call-for-quote stance and pricing their product at a level that put it well out of reach of smaller, less prosperous institutions. We urged them to adopt a more transparent pricing policy and become aware of this other market, which is vast. There are only so many institutions with the resources necessary to buy Preservica at their initial price level. What happens when they all have acquired or constructed a digital preservation application? Where is the growth then?

Preservica executives changed their tune. They made their product available for testing at no charge. They have instituted a transparent, online pricing policy, and have devised versions of their product priced to suit more modest budgets.

I want to suggest that vendors of large sets of text materials do the same.

As more scholars and students, including those at institutions like my own, seek to take part in text-mining activities, they will confront a shortage of digitized, clean text because their institutions cannot afford subscriptions to many online text products, much less an additional fee for preparation and delivery.

Vendors can reach these customers by making text materials available at more reasonable prices on an a la carte basis, as Hein Online did for me.

If they do not, I fear that they will make a powerful contribution to the perpetuation of the existing situation: students and scholars at the wealthiest colleges and universities can do text mining with access to very large collections of suitable materials, while others may never find their corpus.

There is money on the table.