Tuesday, November 7, 2017

Text Mining at an Institution with Lesser Financial Resources, Revisited

I am presently moving forward with a research program in text-mining at Northern Illinois University Libraries, but have encountered an unexpected obstacle.

About a year ago I ordered a copy of ProQuest's American Periodicals data set for local use. Our library subscribes to ProQuest's hosted version of this product, but the product's design/technical infrastructure does not allow text-mining activities and our license for its use prohibits the downloading of anything but the most insignificant amount of materials. When I contacted ProQuest about the matter, they informed me that I would need to pay an additional $1000 for the preparation and delivery of the entire data set (approximately five terabytes) to me. I could then use the data on my local infrastructure.

For the past two years I have worked with members of my university's Computer Science Department, principally providing graduate students in Data Science with access to relatively large humanities text data sets that I have created myself, along with research questions that they may use to guide their text mining activities. Prior to my purchase of the American Periodicals data set, I secured an agreement with that department whereby they would host the materials on their high-capacity computing cluster and make them available for ongoing Data Science research. I would take delivery of the materials from ProQuest, then transfer them to the cluster for processing and future use.

I still do not have the data. The first six months or so of delays were the result of my Library's mistake in attempting to charge the expense for the materials to the wrong account. Once we resolved that, I struggled for several months to get ProQuest to review my university legal department's proposed (unremarkable) revisions to the contract. With that settled, I was able to forward payment to ProQuest in August, and looked forward to the delivery of the materials.

At this point I learned that ProQuest expected to deliver the full data set to a server of my choosing via the Internet. Since my library does not have 5 TB of extra capacity readily available, I asked for the data to be delivered on a hard drive or hard drives. ProQuest agreed.

A month passed, and I heard nothing from ProQuest. My contact at the company asked me to bear with him, as he had staff members absent from the office on holiday. Another month passed, and after another inquiry I learned that the company reserves the right to deliver the materials on hard drives at any time within six months of payment. I see no mention of this reservation in my contract with the company.

It seems likely to me that ProQuest is accustomed to working with institutions large enough, and possessed of enough material resources, to take delivery of such a large data set in this manner quite easily. My institution does not fit that description. After a period of two years without any state support, we recently began to receive payments from the State of Illinois again. Needless to say, our digital infrastructure is far from robust.

If I had known that the delivery of this data by hard drive would prove to be such a difficult matter, I would have made the necessary arrangements with my university's Department of Computer Science to have the data delivered directly to their cluster via the Internet. As this is an inter-divisional matter within the university, it will take some time. I initially intended to take delivery of the American Periodicals materials as quickly as possible, leaving time to work out these arrangements.

But, alas, ProQuest's representatives raised no caveats about hard-drive delivery until I actually started to inquire about the whereabouts of the materials my university had purchased.

Thus my warning: if you are attempting to do text mining research at an institution that doesn't have five terabytes of storage immediately at hand, and want to work with ProQuest data, be aware that they may take up to six months after payment to deliver your data on hard drives.
