Friday, April 9, 2021

ITHAKA Constellate: Text-Mining Product in Development

I have been invited to evaluate a beta version of ITHAKA's text-mining product, which is tentatively titled Constellate. I'm thankful for the opportunity.

I have some knowledge of other text-mining products made available by library materials vendors like ProQuest and Gale. In my experience they work well, but they only offer the use of text materials found in those portions of individual vendors' available collections to which your particular institution has a subscription. If you want access to more materials for your data set, your institution needs to subscribe to more collections. 

This type of product in general would be very helpful in teaching text data analysis at scale to non-programmers. I believe that humanities students can benefit from activities that help them learn how to formulate hypotheses and evaluate evidence found in very large data sets. As individuals already receiving training in the critical evaluation of materials, they could make a valuable contribution to data-driven organizational activities in a number of fields. Put another way, employers of course need programmers able to build and adjust text-mining applications or sets of applications. But they also need critical thinkers to evaluate the results.

Access to a relatively limited number of text data sets is not a problem for this type of experiential learning, but it does present a large obstacle to original scholarly research. A paper making an argument based on the analysis of a data set that only contains those nineteenth-century text materials appearing in a ProQuest or Gale data set will very likely overlook a large part of the available historical record. Researchers need to be able to upload their own data sets into online text-mining services.  

It is also my impression that the code and algorithms that do the data analysis for vendor-served text-mining projects remain proprietary, which means that researchers and collaborating programmers would be unable to download the code and customize it for their own use. Since in my experience effective text mining often requires a great deal of adjustment and customization, this presents another problem.

Sales representatives for the above companies have made general statements about how their programmers were and are working on a function that would allow subscribers to upload their own data, but to my knowledge that has not happened. If any representatives of ProQuest, Gale, or other library vendors making similar products available have information to the contrary, please contact me and I will be happy to evaluate your product.

I am very interested in Constellate because the ITHAKA representative with whom I spoke emphasized that their organization plans to present the service as A) able to analyze outside data sets, and B) willing to allow outside programmers to access its Python code for the purpose of customization. They hope to build a collection or set of open-source code applications that various Constellate users have constructed. 

This would be a very promising situation for researchers, teachers and learners situated at R2 and smaller institutions lacking large financial resources.

I will spend the next few months working with Constellate and report on what I discover.

"Some Assembly Required: Low-Cost Digitization of Materials from Magnetic Tape Formats for Preservation and Access"

Earlier this year three colleagues and I published an article discussing the digitization of sound materials from magnetic tape formats.

Please find the abstract and a link to the journal below. It is my understanding that the individual article will be embargoed until March, 2022, so the link to the individual article itself probably will not work until then.

"Some Assembly Required: Low-Cost Digitization of Materials from Magnetic Tape Formats for Preservation and Access"

Preservation, Digital Technology, and Culture 49 (3) October, 2020, 89-98


Recent work discussing the digitization and preservation of magnetic tape materials has maintained that it should be left to expert practitioners and that the resulting digital materials should be stored in digital repositories. This article suggests that librarians and archivists lacking extensive technical skills or access to expertise can digitize these materials themselves. It provides a detailed account, including challenges faced, of how a team of practitioners without prior training or experience digitized historical audio recordings on cassette and open reel tape at Northern Illinois University Libraries. The discussion reviews the assembly of equipment and software that the team used for digitization work, discussing each element’s significance and how they came together as a functioning workflow. The authors also emphasize the fact that while the digitization of fragile and/or degraded magnetic tape materials may contribute to the preservation of their contents, this action also creates a new set of materials with their own preservation needs. Realizing that many practitioners serving medium-sized and smaller institutions lacking large financial resources may not have access to a full-fledged digital repository, they suggest the use of the National Digital Stewardship Alliance’s Levels of Digital Preservation rubric as a means by which practitioners may incrementally increase the probability that digital materials made from magnetic tapes will remain accessible.