Friday, May 6, 2016

Text Mining at an Institution with Lesser Financial Resources


I have periodically described my experiences with text mining in this blog. Today I want to raise a significant point that has only recently become clear to me. It happened in the wake of my participation in the University of Michigan's "Beyond Ctrl+F" workshop on February 1st of this year.

I want to begin by thanking the University of Michigan Libraries for organizing and hosting the event. It must have taken a great deal of work.

When I first found out about the workshop, I noticed that participants could attend at no charge. This was too good to be true. Working at a state university in the bankrupt State of Illinois, I of course had access to no financial support for professional development activities. I happily drove to Ann Arbor and stayed overnight at my own expense, then took part in the workshop. Without the free-admission policy, I might have passed on the event.

The workshop began with a session devoted to "finding your corpus." Fair enough. No one can do text mining until they have some text. The session featured representatives of several vendors of subscription products providing access to large amounts of textual materials: ProQuest, JSTOR, Gale, Alexander Street Press (full disclosure - I edited an online product for Alexander Street Press and have cashed their checks). It dawned on me that the no-charge policy of course resulted from these vendors' sponsorship of the event. As sponsors, they enjoyed the opportunity to pitch their products to members of a captive audience who had expressed an interest in text mining.

Vendor representatives described how scholars and students might use their products for text-mining projects. One upshot: vendors do not want text miners to attempt to download very large amounts of text through their subscription portal. They want text miners to submit a request for a specific corpus, which they will then prepare and deliver for an extra fee in the range of $500-$1000.

This made something very apparent: text mining is in many cases only practicable at its intended scale at institutions commanding the financial resources necessary to 1) subscribe to these products, and 2) go on to pay the additional fee.

Now to my situation: I am interested in working with the Congressional Record from the nineteenth century. My university does not subscribe to the portion of ProQuest's Congressional product that includes the Record for that period, nor to any other product that does. Working through my library's acquisitions department, I was able to secure a quote for the use of the above-mentioned ProQuest text materials: it amounted to the annual subscription fee for the database product in which the Congressional Record resided.

This sum was a complete non-starter at my financially strapped university. I want to emphasize that it would have been a non-starter even in more prosperous times, when we had a budget (NB - ten months into our fiscal year, Northern Illinois University had not received any funds from the State of Illinois. Last month the legislature voted to provide stopgap funding designed to tide institutions over until the larger state budget impasse is resolved).

I understand that library vendors are private concerns and need to make a profit. Their representatives sell those products in order to earn a living.

Nevertheless, my experience suggested that vendors' current pricing structures effectively rule scholars at medium-sized and smaller institutions with modest financial resources out of the text mining game - or perhaps more accurately, rule them out of the opportunity to do text mining without spending a great deal of time and effort producing a corpus at the scale that makes text mining effective.

I can only speak to my own experience in the study of nineteenth-century American history.
Text preparation often involves the digitization of analog materials. It can also include the use of digitized text gathered from existing online collections. If a student or scholar wants to work with relatively clean text - i.e., text with significantly fewer OCR errors - they must figure out how to correct errors, either by using scripts or by hand.
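
A script-based approach to this kind of cleanup might look something like the following minimal Python sketch. Everything in it is an assumption made for illustration: the function name clean_ocr is hypothetical, and the entries in the substitution table are invented examples of plausible OCR misreadings, not errors catalogued from any actual corpus. A real project would build its own table by inspecting its own text.

```python
import re

# Hypothetical substitution table: (compiled pattern, replacement).
# These entries are illustrative guesses, not taken from a real corpus.
COMMON_FIXES = [
    (re.compile(r"\btbe\b"), "the"),
    (re.compile(r"\bTbe\b"), "The"),
    (re.compile(r"\bCongrefs\b"), "Congress"),
]


def clean_ocr(text: str) -> str:
    """Apply a few simple, rule-based corrections to OCR output."""
    # Rejoin words split by a hyphen at a line break ("legis-\nlation").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Apply each whole-word substitution from the table.
    for pattern, replacement in COMMON_FIXES:
        text = pattern.sub(replacement, text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Tbe Senate  passed tbe legis-\nlation."
    print(clean_ocr(sample))
```

A rule-based pass like this only catches errors someone has already noticed, and the hyphen rule will also fuse genuine compounds that happen to break across lines, so some checking by hand remains necessary.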

Vendors often produce clean(er) text by having materials double hand-keyed by another vendor. It costs a lot of money.

When vendors charge fees for materials in the public domain, which the Congressional Record certainly is, they in effect charge for access to this cleaned-up, digitized text.

So, here's how I resolved the problem. I asked vendors if they would sell me my preferred chunk of data itself at a more reasonable price.

ProQuest declined to negotiate, but Hein Online (another vendor of digitized government documents) agreed, so I bought, at my own expense, the text of the Congressional Record for the period 1873-1896 for a price I could accept. I now have it available for research. (Upon completing this transaction, I discovered that the University of North Texas Libraries, which present a digitized version of the Congressional Record online (http://digital.library.unt.edu/explore/collections/CONGR/), would provide me with their data at no charge. There is only one catch: they use uncorrected OCR text, so I will have to spend some time finding ways to correct common errors within the corpus.)


I thank the University of North Texas Libraries for the use of their data, and recommend them to other students and scholars. Their collections include a large number of digitized Texas newspapers, as well as records of the Federal Communications Commission.

My experience with Hein Online led me to draw a parallel to another experience I have had with a vendor. In the past several years I have taken part in the activities of the Digital POWRR Project, which produced a study of digital preservation challenges and potential solutions at medium-sized and smaller colleges and universities lacking large financial resources. The study included the review of a number of applications or tools available for use in digital preservation activities. Among them we found a comprehensive, all-in-one product called Preservica. They made no pricing information available online. We had to call for a quote.

When we contacted their sales representative to ask if they might make the product available for testing at little or no cost, they immediately rejected us, explaining that they targeted very large institutions with suitable budgets, ranging from universities to state and national governments. Preservica is a version of a digital preservation product that the company originally sold to large corporations like banks. They sought out the deep pockets. 

Our white paper recommended that institutions unable to afford a product like Preservica adopt a one-step-at-a-time approach to digital preservation activities using sets of open-source tools in combinations suited to their particular needs.

One other thing happened in the process of doing the study, however. Through a frank and open exchange of views with members of the Digital POWRR team, Preservica executives became aware that they were leaving money on the table by adopting a call-for-quote stance and pricing their product at a level that put it well out of reach of smaller, less prosperous institutions. We urged them to adopt a more transparent pricing policy and become aware of this other market, which is vast. There are only so many institutions with the resources necessary to buy Preservica at their initial price level. What happens when they all have acquired or constructed a digital preservation application? Where is the growth then?

Preservica executives changed their tune. They made their product available for testing at no charge. They have instituted a transparent, online pricing policy, and have devised versions of their product priced to suit more modest budgets.

I want to suggest that vendors of large sets of text materials do the same.

As more scholars and students, including those at institutions like my own, seek to take part in text mining activities, they will confront a shortage of digitized, clean text because their institutions cannot afford subscriptions to many online text products, much less an additional fee for preparation and delivery.

Vendors can reach these customers by making text materials available at more reasonable prices on an a la carte basis, as Hein Online did for me.

If they do not, I fear that they will make a powerful contribution to the perpetuation of the existing situation: students and scholars at the wealthiest colleges and universities can do text mining with access to very large collections of suitable materials, while others may never find their corpus.

There is money on the table.