I have been invited to evaluate a beta version of ITHAKA's text-mining product, which is tentatively titled Constellate. I'm thankful for the opportunity.
I have some knowledge of other text-mining products made available by library materials vendors like ProQuest and Gale. In my experience they work well, but they only offer the use of text materials found in those portions of individual vendors' available collections to which your particular institution has a subscription. If you want access to more materials for your data set, your institution needs to subscribe to more collections.
This type of product in general would be very helpful in teaching text data analysis at scale to non-programmers. I believe that humanities students can benefit from activities helping them to learn how to formulate hypotheses and evaluate evidence found in very large data sets. As individuals already receiving training in the critical evaluation of materials, they could make a valuable contribution to data-driven organizational activities in a number of fields. Put another way, employers of course need programmers able to build and adjust text-mining applications or sets of applications. But they also need critical thinkers to evaluate the results.
Access to a relatively limited number of text data sets is not a problem for this type of experiential learning, but it does present a large obstacle to original scholarly research. A paper making an argument based on the analysis of a data set that only contains those nineteenth-century text materials appearing in a ProQuest or Gale data set will very likely overlook a large part of the available historical record. Researchers need to be able to upload their own data sets into online text-mining services.
It is also my impression that the code and algorithms that do the data analysis for vendor-served text-mining projects remain proprietary, which means that researchers and collaborating programmers would be unable to download the code and customize it for their own use. Since in my experience effective text-mining often requires a great deal of adjustment and customization, this presents another problem.
Sales representatives for the above companies have made general statements about how their programmers were and are working on a function that would allow subscribers to upload their own data, but to my knowledge that has not happened. If any representatives of ProQuest, Gale, or other library vendors making similar products available have information to the contrary, please contact me and I will be happy to evaluate your product.
I am very interested in Constellate because the ITHAKA representative with whom I spoke emphasized that their organization plans for the service to A) analyze outside data sets, and B) allow outside programmers to access its Python code for the purpose of customization. They hope to build a collection of open-source code applications constructed by various Constellate users.
This would be a very promising situation for researchers, teachers and learners situated at R2 and smaller institutions lacking large financial resources.
I will spend the next few months working with Constellate and report on what I discover.