Thursday, May 26, 2016

Text Mining and Library Cataloging

During the spring semester of 2016 I supervised a team of students (Marcos Quezada, a graduate student in Operations Management and Information Systems; Fredrik Stark, a PhD candidate in NIU's English Department; and Mitchell Zaretsky, a junior Computer Science major) as they explored text mining in the context of Northern Illinois University Libraries' large online collection of late nineteenth- and early twentieth-century dime novels. We worked in the format of an experiential learning activity, meaning that we addressed a problem brought to us by a client. In this case Matthew Short, NIU Libraries Metadata Librarian and Cataloger, served as the client.

In the experiential learning format, the client presents the student team with a set of goals. Mr. Short asked the team to develop a text classification application or tool to help library catalogers determine the genre of the approximately 1,900 digitized texts in the collection. In traditional cataloging activities, the cataloger inspects a work manually in order to derive the basic information necessary to catalog it accurately. This can be a lengthy process. Perhaps text-mining technology could help catalogers improve the speed and efficiency with which they catalog a very large collection.

Mr. Short's goals also included compiling a list of genres and related subject terms for possible use in reclassifying online digitized collections, and investigating text-mining tools for the future development of the prototype classifier application and for future studies of the collections.

The team began work by using Weka, an open-source data and text-mining application. Mr. Short selected it because it enables users to acquaint themselves with the separate activities that make up text mining and to construct original applications using blocks of existing Java code.

Mr. Short introduced the students to a typical text-mining workflow. He had been working to achieve his goals prior to engaging with this group, and for all intents and purposes led the team's activities. As the team's official coach, I facilitated discussion, scheduled activities, and completed paperwork.

The students began by gathering text files of digitized dime novels cataloged as belonging to the collection's better-represented genres. These genres included detective and mystery stories; western stories; sea stories; historical fiction; adventure stories; and bildungsromans (coming-of-age stories).

The team next engaged in pre-processing activities in order to produce the most accurate text possible. NIU Libraries staff members originally produced the digital texts in the digital dime novel collection using Optical Character Recognition (OCR) software and did not attempt to correct any mistakes within them. Pre-processing began with the removal of stop words (such as the, an, and) and also included tokenization (identifying groups of characters as words) and stemming (reducing different inflections of a word to their root form). We also used Weka to render the text materials as a bag of words (i.e., to set aside grammar and word order) and to transform words into vectors, or numerical representations.
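As an illustration (not the team's actual Weka configuration), the pre-processing steps above can be sketched in a few lines of Python. The stop-word list and sample sentence here are invented for the example:

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real project would use a much
# larger one (the team later adopted a nineteenth-century list).
STOP_WORDS = {"the", "an", "a", "and", "of", "to", "in"}

def tokenize(text):
    """Tokenization: identify groups of characters as lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    """Remove stop words from the token stream."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

def bag_of_words(tokens):
    """Bag of words: represent a document as word counts,
    setting aside grammar and word order."""
    return Counter(tokens)

doc = "The detective followed the outlaw to the western town."
vector = bag_of_words(preprocess(doc))
```

Stemming (e.g., with Snowball) would normally follow tokenization; it is omitted here to keep the sketch dependency-free.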

The team then moved on to text classification. They began by using a set of already-cataloged works to train Weka to associate specific words or sets of words with the individual genres mentioned above. Of the algorithms available in Weka, Naive Bayes proved most effective: in 65% of the works examined, Weka's classification agreed with that of a human cataloger. To improve on this result, the team applied additional filtering techniques: TF-IDF (a measure of how important a word is to a document in a collection or corpus); a better stemmer (the open-source Snowball); a list of nineteenth-century stop words compiled by Matthew Jockers, a scholar of the period's literature; rendering all letters in lower case; and setting the number of words analyzed in each text to 500. These changes improved accuracy, i.e., the rate at which Weka agreed with a human cataloger's genre classification, to 75%. They also discovered that a number of texts in the training set had been cataloged as belonging to two different genres. Removing these works improved accuracy to 83%.
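The TF-IDF weighting mentioned above can be sketched as follows. This is an illustrative stdlib-Python version of one common smoothed variant of the formula, not the filter Weka applies, and the tiny three-document corpus is invented:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF: a term's frequency within one document, discounted by
    how many documents in the corpus contain the term at all.
    Uses add-one smoothing in the IDF denominator."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf

# A toy corpus: one document per genre.
corpus = [
    ["sheriff", "revolver", "saloon"],   # western
    ["sea", "storm", "captain"],         # sea story
    ["detective", "clue", "revolver"],   # detective
]
```

A word confined to one document ("saloon") scores higher than a word spread across several ("revolver"), which is exactly why TF-IDF helps separate genres.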

With the information above, Mitchell Zaretsky used Weka's Java API to construct an original classifier application. It reported the probability of a work fitting into one of the several genres. Working with a new test corpus of 214 digitized dime novels, the team found that their classifier agreed with human catalogers 71% of the time.
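The idea behind a classifier that reports per-genre probabilities can be sketched with a minimal multinomial Naive Bayes. This is a toy illustration of the algorithm, not Mr. Zaretsky's Weka-based Java application, and the two-document training set is invented:

```python
import math
from collections import Counter

class NaiveBayesGenreClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features."""

    def fit(self, docs, labels):
        self.genres = set(labels)
        self.genre_counts = Counter(labels)
        self.word_counts = {g: Counter() for g in self.genres}
        self.vocab = set()
        for tokens, g in zip(docs, labels):
            self.word_counts[g].update(tokens)
            self.vocab.update(tokens)

    def probabilities(self, tokens):
        """Return P(genre | document), Laplace-smoothed and normalized."""
        log_probs = {}
        n_docs = sum(self.genre_counts.values())
        for g in self.genres:
            total = sum(self.word_counts[g].values())
            lp = math.log(self.genre_counts[g] / n_docs)  # prior
            for tok in tokens:
                lp += math.log((self.word_counts[g][tok] + 1) /
                               (total + len(self.vocab)))
            log_probs[g] = lp
        # Convert from log space to probabilities that sum to 1.
        m = max(log_probs.values())
        exp = {g: math.exp(lp - m) for g, lp in log_probs.items()}
        z = sum(exp.values())
        return {g: v / z for g, v in exp.items()}

train_docs = [["sheriff", "revolver", "cattle"],
              ["detective", "clue", "murder"]]
labels = ["western", "detective"]
clf = NaiveBayesGenreClassifier()
clf.fit(train_docs, labels)
p = clf.probabilities(["sheriff", "revolver"])
```

The output is a probability per genre rather than a bare label, which lets a cataloger see how confident the classifier is before accepting its suggestion.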

On the basis of this test, the team determined that their application can help catalogers determine a dime novel's genre. It can also serve as an effective tool for evaluating the genre determinations of catalogers not using the application in their work. They also suggested that text-mining activities uncovered details about the form and content of works in NIU's digitized dime novel collection that invite further research.

Friday, May 6, 2016

Text Mining at an Institution with Lesser Financial Resources

I have periodically described my experiences with text mining in this blog. Today I want to raise a significant point that has only recently become clear to me. It happened in the wake of my participation in the University of Michigan's "Beyond Ctrl+F" workshop on February 1st of this year.

I want to begin by thanking the University of Michigan Libraries for organizing and hosting the event. It must have taken a great deal of work.

When I first found out about the workshop, I noticed that participants could attend at no charge. This seemed too good to be true. Working at a state university in the bankrupt State of Illinois, I of course had no access to financial support for professional development activities. I happily drove to Ann Arbor and stayed overnight at my own expense, then took part in the workshop. Without the free-admission policy, I might have passed on the event.

The workshop began with a session devoted to "finding your corpus." Fair enough. No one can do text mining until they have some text. The session featured representatives of several vendors of subscription products providing access to large amounts of textual materials: ProQuest, JSTOR, Gale, Alexander Street Press (full disclosure - I edited an online product for Alexander Street Press and have cashed their checks). It dawned on me that the no-charge policy of course resulted from these vendors' sponsorship of the event. As sponsors, they enjoyed the opportunity to pitch their products to members of a captive audience who had expressed an interest in text mining.

Vendor representatives described how scholars and students might use their products for text-mining projects. One upshot: vendors do not want text miners to attempt to download very large amounts of text through their subscription portal. They want text miners to submit a request for a specific corpus, which they will then prepare and deliver for an extra fee in the range of $500-$1000.

This made something very apparent: text mining is in many cases only practicable at its intended scale at institutions commanding the financial resources necessary to 1) subscribe to these products, and 2) go on to pay the additional fee.

Now to my situation: I am interested in working with the Congressional Record from the nineteenth century. My university does not subscribe to the portion of ProQuest's Congressional product that includes the Record for that period, nor to any comparable product. Working through my library's acquisitions department, I was able to secure a quote for the use of the ProQuest text materials mentioned above: it amounted to the annual subscription fee for the database product in which the Congressional Record resided.

This sum was a complete non-starter at my financially strapped university. I want to emphasize that it would have been a non-starter even in more prosperous times, when we had a budget (NB - ten months into our fiscal year, Northern Illinois University had not received any funds from the State of Illinois. Last month the legislature voted to provide stopgap funding designed to tide institutions over until the larger state budget impasse is resolved).

I understand that library vendors are private concerns and need to make a profit. Their representatives sell these products in order to earn a living.

Nevertheless, my experience suggested that vendors' current pricing structures effectively rule scholars at medium-sized and smaller institutions with modest financial resources out of the text mining game - or perhaps more accurately, rule them out of the opportunity to do text mining without spending a great deal of time and effort producing a corpus at the scale that makes text mining effective.

I can only speak to my own experience in the study of nineteenth-century American history.
Text preparation often involves the digitization of analog materials. It can also include the use of digitized text gathered from existing online collections. If a student or scholar wants to work with relatively clean text - i.e., text with significantly fewer OCR errors - they must figure out how to correct errors, either by using scripts or by hand.
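A script-based approach to correcting OCR errors can be as simple as a lookup table of known misreadings. The corrections below are hypothetical examples of the kinds of errors common in OCR'd nineteenth-century type; a real project would build the table by sampling errors from the corpus itself:

```python
import re

# Hypothetical table of common OCR misreadings -> corrections.
CORRECTIONS = {
    "tbe": "the",       # broken 'h' read as 'b'
    "tlie": "the",      # 'h' split into 'li'
    "aud": "and",       # 'n' read as 'u'
    "ofthe": "of the",  # lost inter-word space
}

def correct_ocr(text):
    """Replace known OCR misreadings, matching whole words only so that
    legitimate words containing these letter sequences are untouched."""
    pattern = re.compile(r"\b(" + "|".join(CORRECTIONS) + r")\b")
    return pattern.sub(lambda m: CORRECTIONS[m.group(1)], text)
```

This handles only errors one has already identified; systematic correction of an entire corpus usually combines such rules with dictionary lookups and frequency analysis.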

Vendors often produce clean(er) text by having materials double hand-keyed by another vendor. This costs a lot of money.

When vendors charge fees for materials in the public domain, which the Congressional Record certainly is, they in effect charge for access to this cleaned-up, digitized text.

So, here's how I resolved the problem. I asked vendors if they would sell me my preferred chunk of data itself at a more reasonable price.

ProQuest declined to negotiate, but Hein Online (another vendor of digitized government documents) agreed, so I bought, at my own expense, the text of the Congressional Record for the period 1873-1896 at a price I could accept. I now have it available for research. (Upon completing this transaction, I discovered that the University of North Texas Libraries, which present a digitized version of the Congressional Record online, would provide me with their data at no charge. There is only one catch: they use uncorrected OCR text, so I will have to spend some time finding ways to correct common errors within the corpus.)

I thank the University of North Texas Libraries for the use of their data, and recommend them to other students and scholars. Their collections include a large number of digitized Texas newspapers, as well as records of the Federal Communications Commission.

My experience with Hein Online led me to draw a parallel to another experience I have had with a vendor. In the past several years I have taken part in the activities of the Digital POWRR Project, which produced a study of digital preservation challenges and potential solutions at medium-sized and smaller colleges and universities lacking large financial resources. The study included the review of a number of applications or tools available for use in digital preservation activities. Among them we found a comprehensive, all-in-one product called Preservica. The company made no pricing information available online; we had to call for a quote.

When we contacted their sales representative to ask if they might make the product available for testing at little or no cost, they immediately declined, explaining that they targeted very large institutions with suitable budgets, ranging from universities to state and national governments. Preservica is a version of a digital preservation product that the company originally sold to large corporations such as banks. They sought out the deep pockets.

Our white paper recommended that institutions unable to afford a product like Preservica adopt a one-step-at-a-time approach to digital preservation activities using sets of open-source tools in combinations suited to their particular needs.

One other thing happened in the process of doing the study, however. Through a frank and open exchange of views with members of the Digital POWRR team, Preservica executives became aware that they were leaving money on the table by adopting a call-for-quote stance and pricing their product at a level that put it well out of reach of smaller, less prosperous institutions. We urged them to adopt a more transparent pricing policy and to recognize this other market, which is vast. There are only so many institutions with the resources necessary to buy Preservica at its initial price level. What happens when they have all acquired or constructed a digital preservation application? Where is the growth then?

Preservica executives changed their tune. They made their product available for testing at no charge. They have instituted a transparent, online pricing policy, and have devised versions of their product priced to suit more modest budgets.

I want to suggest that vendors of large sets of text materials do the same.

As more scholars and students, including those at institutions like my own, seek to take part in text-mining activities, they will confront a shortage of digitized, clean text, because their institutions cannot afford subscriptions to many online text products, much less an additional fee for preparation and delivery.

Vendors can reach these customers by making text materials available at more reasonable prices on an a la carte basis, as Hein Online did for me.

If they do not, I fear that they will make a powerful contribution to the perpetuation of the existing situation: students and scholars at the wealthiest colleges and universities can do text mining with access to very large collections of suitable materials, while others may never find their corpus.

There is money on the table.