Thursday, December 4, 2014

Text Mining for Beginners, redux

This week a team of Northern Illinois University students and their faculty coach presented their findings after a semester devoted to investigating text-mining from the perspective of a novice.

They have produced a report in which they provide a basic description of text mining itself, including a review of some of the types of procedures used to detect patterns within a very large body of text; the types of text available for text-mining work; a discussion of structured, unstructured, and semi-structured data as they pertain to text-mining work; the importance of preparing digital texts (especially those created by Optical Character Recognition software) for mining activities; and reviews of three well-known text-mining applications: Mallet, Weka, and RapidMiner. Of these, the first two are freely available open-source software; the third is available in a free demonstration version but requires purchase in order to make use of its most powerful capabilities. These reviews include brief discussions of the types of text-mining activities (e.g., topic modeling, document clustering, and sentiment analysis) that each makes possible. Finally, the report describes the team's activities in using the three applications to perform analyses on sample bodies of text, and the results produced.
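To give a rough sense of what "preparing" an OCR-derived text can involve, here is a minimal sketch in Python. It is my own illustration, not something taken from the students' report or from any of the three applications, and the function names (clean_ocr_text, tokenize, word_counts) and the sample sentence are hypothetical. It assumes only two common problems, words hyphenated across line breaks and irregular spacing, and then counts word frequencies as a very simple first look at the text.

```python
import re
from collections import Counter


def clean_ocr_text(raw):
    """Apply two illustrative fixes to raw OCR output (hypothetical helper)."""
    # Rejoin words split by a hyphen at a line break: "digi-\ntal" -> "digital".
    # Caveat: this also merges genuine compounds that happen to break at a
    # line end ("text-\nmining" becomes "textmining").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(raw):
    """Clean the raw text, tokenize it, and count each word."""
    return Counter(tokenize(clean_ocr_text(raw)))


# Invented sample standing in for a page of OCR output.
sample = "The digi-\ntal  text was\nscanned.   The text is noisy."
counts = word_counts(sample)
```

Real OCR output usually needs far more attention than this (misread characters, running headers, page numbers), so the sketch should be read as a starting point only.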

I hope to work with the team and their coach to round the report out into a resource that I can distribute to interested members of the NIU community. We also hope to make it available via Huskie Commons, the university's institutional repository (http://commons.lib.niu.edu).


Monday, November 3, 2014

Text mining for beginners

I am now Director of Digital Scholarship at Northern Illinois University Libraries. This means that it is now my job to work with faculty members seeking to employ technologies like Geographic Information Systems, text-mining, and data visualization, and to help those with little experience in such work find a way to put the technology to use.

As I really do not know very much at all about these activities, my job is now an exercise in learning something new. To this end, I have sought some help.

This fall I am working with a team of three Northern Illinois University students and their faculty coach, who will provide me with an evaluation of several open-source text-mining utilities, as well as a more general review of resources available for a scholar or other practitioner who might want to take up text-mining but lacks any experience in the work.

I spent last summer trying to identify and prepare text materials for their use in the evaluation of the utilities, and found very little available information explaining how to begin a text-mining project (i.e., finding digitized texts, selecting texts for research, and working them into a format suitable for use with the software).

I am looking forward to the students' report, and will try to bring their findings to the attention of historians and other humanities scholars who might be interested in text-mining.

Thursday, January 23, 2014

Digital/Online Materials and their Place in Historical Scholarship

At the recent meeting of the American Historical Association in Washington, D.C., I made a presentation as part of a discussion session (i.e., not a regular panel; we sat in a circle and talked after very short presentations by several members of the circle) exploring digital materials, ranging from blogs and web sites to social media, and the questions that they raise as scholars begin to make use of them as primary sources. Other presenters talked about the future of MOOCs and crowd-sourcing the search for elusive information about a relatively obscure historical figure. I discussed the work of the Digital POWRR project and the challenges presented by the fact that digital objects are generally subject to loss in the relatively short term for a number of reasons, including hardware and software incompatibility and the degradation of storage media.

One major question that emerged in the discussion was the status of social media materials and other online, digital sources in light of the fact that they are so prone to loss. One presenter at the preceding panel (our discussion group was part of a linked set of two events) described how she had based her work on Pakistani women in part on a web site that no longer existed, apparently because of hacking activities undertaken by parties believing that Pakistani women should not express themselves in this format. The presenter said that she had printed out the site's pages for her own record and thus could document her use of the source. But this made me wonder about the future practice of history.

So, what of digital sources like blogs, web sites, and social media objects like tweets? Digital objects' intrinsic frailty and the complex, easily disrupted nature of the internet used to present them make them fundamentally unreliable as primary sources, at least by the standards developed for the use of analog/paper media materials.  

It seems to me that although history is certainly not a science in any way, historians are similar to scientists in at least one regard. Much as a scientific discovery can be accepted and confirmed only when other practitioners are able to repeat the experiment and obtain the same result, historians are accustomed to being able to lay their hands on a paper source cited in a footnote. Manuscripts are usually unique items, but if one travels to the archive and looks in the box and folder number cited, the item will be there. There may be a very small number of copies of a book, but if one is willing to make the trip to the right library, the book will be there. Historians will of course debate a scholar's reading of a source, but the existence of the source itself is fundamental to the discipline. If the item is not there, practitioners may rightly begin to ask questions about the legitimacy of a work citing it.

Many of the participants in the AHA discussion emphasized the need to preserve online digital materials as fully as possible. I certainly concur. But a whole host of problems, not the least of which is the considerable expense involved in the curation/preservation of digital materials, make this impossible. We will have to face the fact that a considerable number of online digital objects that future historians may want to use as evidence will simply disappear.

In this situation, several questions occur to me: How will we evaluate work citing online materials that no longer exist? What if scholars relying on such missing evidence can produce a print-out or other facsimile of the materials? Can we distinguish cases of vanished evidence in which legitimate facsimiles exist from cases of academic fraud?

Wednesday, August 21, 2013

Learning About Digital Scholarship

After fifteen years devoted to the digitization and presentation of historical materials, I have been asked to investigate the emerging field of digital scholarship with an eye toward supporting at least several of its constituent activities on the Northern Illinois University campus. As a historian, I will begin with the digital humanities and attempt to work my way toward understanding digital scholarship in other disciplines.

This will certainly involve familiarizing myself with the considerable literature discussing digital scholarship in general, as well as the bodies of work discussing major subdivisions in it, like digital publishing, data/text mining, Geographic Information Systems and other forms of data visualization, and the retrieval of digital data via Application Programming Interfaces.

My task presents a challenge very much like that confronted by our present IMLS-funded study of how medium-sized and smaller institutions lacking large financial resources might achieve increasingly high levels of preservation for digital objects. From my present perspective, without the benefit of great familiarity with the field, it appears that successful digital scholarship and/or digital humanities programs at universities and colleges require significant amounts of resources. Proprietary software requires the payment of purchase/subscription fees. Open-source software requires the contributions of skilled programmers and developers. Both require the contributions of other skilled professionals familiar with their use in the different specialties making up digital scholarship and/or digital humanities, as well as their relevance to existing, more traditional scholarly discourses. These are luxuries that I have reason to believe my university, dependent for funding upon the worst-governed state in the nation, cannot presently afford.

Thus I will undertake my new work with an eye toward discovering ways in which members of the university community might produce digital scholarship with the least possible outlay of financial and other institutional resources. In these early days, I am planning to attempt to discover those faculty members on the NIU campus already doing digital scholarship of one type or another, in the hope that I might learn from them and enable them to learn from and support each other.

Monday, June 24, 2013

Overview of digital preservation tools

A chart containing brief descriptions of fifty-five digital preservation tools, including ingest and storage functions, can be found at http://digitalpowrr.niu.edu/tool-grid/

Monday, May 13, 2013

Information about Digital Preservation Tools

This week the Digital POWRR project staff has posted a large amount of information describing fifty-seven tools used in digital preservation activities. See http://digitalpowrr.niu.edu/tool-grid/. The tools include back-end storage providers and ingest/processing ("front end") utilities.

While a relatively small number of general, integrated front end applications like Archivematica and Curator's Workbench are currently available, individuals and institutions pondering a digital preservation initiative can also bring a number of ingest/processing tools together to assemble an ingest workflow suited to their specific needs.

As we found in the considerable amount of time required to review each of these tools and its capabilities, accumulating the knowledge necessary to make informed decisions in this matter can be quite a challenge. Hopefully, our list of available tools can help to reduce the time and effort required.

In the coming year the project collaborators will test and review two back-end solutions, DuraSpace and Meta-Archive, and front-end utilities Archivematica and Curator's Workbench. 

Friday, September 21, 2012

Web site/app review: Historypin

Historypin is an online resource (www.historypin.com) presenting a wide array of user-submitted photographs, videos, audio clips, and stories in a geo-spatial format using Google Maps. Founded in 2010, the Historypin web site and mobile phone apps present materials principally organized by their relationship to specific locations. For example, photographs of an event taking place in my home town of DeKalb, Illinois would be available via a "pin" (link) appearing at DeKalb's location on a Google Map.

Photographs appear to be much more common than other types of materials.

A variety of individuals and institutions have submitted materials to Historypin. These include individuals posting items relating to weddings, birthday parties, reunions, etc., as well as museums and archives submitting materials from their collections.

The Historypin interface allows users to search for materials by date and keyword via the main map interface. For example, a search using the keyword "soccer" revealed a photograph collection from St. Louis University, dated 1965, relating to that institution's soccer team, as well as 1975 photographs of Los Angeles Mayor Tom Bradley and the Brazilian soccer player Pele.

Many materials in Historypin may be of interest to historians, but the site will likely prove frustrating. All of the resources that I found in my brief review presented very limited metadata/descriptive information. Materials submitted by archives, museums, and other professionally staffed institutions generally included more information, but still less than one would hope to find in a visit to the archive, museum, or library itself. Resources submitted by individuals or non-professional groups generally included very little or no metadata, aside from a title and perhaps a date.


The archive/museum/library materials that I examined often were accompanied by contact information for their home institution, which could enable an interested scholar to track down additional information via email. But my review of individual submissions revealed no such contact information for individual contributors. These materials are thus presented in geographical and (usually) temporal context, but lack virtually all other types of helpful information.

I also wondered about the long-term status of materials appearing on Historypin. Might an institution or individual adding a single item or collection to Historypin withdraw those materials at some future date? If a scholar or other researcher wishes to refer to those materials in a publication, blog post, or in-person presentation, how can s/he be sure that they will still be available in the future? What if Historypin goes out of business or experiences a catastrophic technical failure?

These questions affect a scholar's willingness to use/cite materials found via Historypin in a publication or other type of presentation requiring formal documentation. 

In the end Historypin is a positive development in that it brings a wealth of historical materials to the attention of the vast public using the web. In this regard it is sure to stimulate historical thinking and discussion on a number of levels. But its emergence raises a number of problems and questions for users interested in more than casual browsing of available resources.