Friday, July 31, 2015

About that black hole...

This is funny.

In the course of preparing an article with my colleague Jaime Schumacher, I came across Ross Harvey's "So Where's the Black Hole in our Collective Memory?: A Provocative Position Paper" (January, 2008), which suggests that the digital preservation community has been overly alarmist in contending that digital materials are succumbing to a variety of risk factors, rendering them unavailable for future use.

Harvey maintained that - at least in 2008 - researchers had not presented enough evidence to demonstrate that digital materials loss was taking place on a meaningful scale, and asked for further data. Our article provided such data, so I decided to include Harvey's request in the text.

This meant that I needed to provide a citation for his paper, of course. I had previously found it available online via the Digital Preservation Europe web site, but on July 29, 2015 I could not find a copy of it online - at all. I tried again yesterday and today. No luck.

Just to be clear, I was unable to find a copy of a 2008 paper arguing that digital preservation advocates had overstated the threat of digital data loss, including that presented on the web. How ironic.

Remembering a certain pop singer's misuse of the word "ironic" in a hit song some twenty years ago, I turned to the Oxford English Dictionary for a definition of "irony."

I found, as the third meaning of the noun: "a state of affairs or an event that seems deliberately contrary to what was or might be expected; an outcome cruelly, humorously, or strangely at odds with assumptions or expectations."

I would note that this occurrence certainly seems contrary to Mr. Harvey's expectations - at least those of 2008 - but it is not contrary to my own.

Our paper on digital data loss among university faculty will be published shortly by the International Journal of Digital Curation. It corroborates digital preservation advocates' familiar contention that data loss is indeed taking place.

Feel free to mention our findings in presentations to campus stakeholders and conversations with individuals unaware of the threat of digital data loss.

You might also use the Ross Harvey story for an icebreaker or a laugh midway through a talk.

Ultimately I provided a citation for Harvey's paper from the Internet Archive's Wayback Machine, which seeks to address situations like this by providing access to an archive of web pages, organized chronologically. In effect, it seeks to provide snapshots of the web at given dates.
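
For readers curious about what "snapshots of the web at given dates" looks like in practice, here is a minimal Python sketch. It assumes the address pattern commonly seen on the Wayback Machine, in which a 14-digit timestamp is followed by the original address. The function name `wayback_url` and the example address are hypothetical, chosen only for illustration. The sketch merely assembles an address; it does not check whether a snapshot actually exists for that page or date.

```python
from datetime import datetime

# Assumed prefix for Wayback Machine snapshot addresses.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Assemble a snapshot address for `original_url` near the date `when`.

    The timestamp is written as YYYYMMDDhhmmss (14 digits).
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical address, used only to show the shape of the result.
    print(wayback_url("http://example.org/paper.pdf", datetime(2008, 1, 31)))
```

In my experience the service generally redirects a request like this to the nearest capture it holds, so the date in a citation built this way should be checked against the snapshot that actually loads.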

Mr. Harvey would certainly contend that his paper's existence on the Wayback Machine proves his point - that digital data disappearing from its original place of online presentation can very often be retrieved elsewhere. And so it was.

The Wayback Machine is far from comprehensive, however. It is also little-known among those outside the library and information science community.

Harvey also may have retreated from his intentionally provocative 2008 proclamation. Even if he has, this situation creates a potentially useful anecdote in the ongoing effort to convince those outside the community of practitioners that the threat of digital data loss is real.

Thursday, December 4, 2014

Text Mining for Beginners, redux

This week a team of Northern Illinois University students and their faculty coach presented their findings after a semester devoted to investigating text-mining from the perspective of a novice.

They have produced a report in which they provide a basic description of text mining itself, including a review of some of the types of procedures used to detect patterns within a very large body of text; the types of text available for text-mining work; a discussion of structured, unstructured, and semi-structured data as they pertain to text-mining work; the importance of preparing digital texts (especially those created by Optical Character Recognition software) for mining activities; and reviews of three well-known text-mining applications: Mallet, Weka, and RapidMiner. Of these, the first two are freely available open-source software; the third is available in a free demonstration version but requires purchase in order to make use of its most powerful capabilities. These reviews include brief discussions of the types of text-mining activities (e.g., topic modeling, document clustering, and sentiment analysis) that each makes possible. Finally, the report describes the team's activities in using the three applications to perform analyses on sample bodies of text, and the results produced.
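
To give a novice a sense of what preparing OCR text for mining can involve, here is a minimal Python sketch. It is not taken from the students' report, and it does not use Mallet, Weka, or RapidMiner. The function names `clean_ocr_text` and `word_counts` are hypothetical, and the cleanup rules (rejoining words split across line breaks, collapsing whitespace, lowercasing) are assumptions about a reasonable first pass rather than a complete procedure.

```python
import re
from collections import Counter


def clean_ocr_text(raw: str) -> str:
    """Apply a simple first-pass cleanup to OCR output."""
    # Rejoin words hyphenated across a line break ("preser-\nvation").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of spaces, tabs, and line breaks into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def word_counts(text: str) -> Counter:
    """Count unaccented alphabetic words in already-cleaned text."""
    return Counter(re.findall(r"[a-z]+", text))


if __name__ == "__main__":
    sample = "Digital preser-\nvation   of\ntexts and more texts"
    cleaned = clean_ocr_text(sample)
    print(cleaned)
    print(word_counts(cleaned).most_common(3))
```

One caution: the hyphen rule also merges genuinely hyphenated compounds that happen to fall at a line break, so a real project would likely check rejoined words against a word list.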

I hope to work with the team and their coaches to round the report out into a resource that I can distribute to interested members of the NIU community. We also hope to make it available via Huskie Commons, the university's institutional repository.

Monday, November 3, 2014

Text mining for beginners

I am now Director of Digital Scholarship at Northern Illinois University Libraries. This means that it is now my job to work with faculty members seeking to employ technologies like Geographic Information Systems, text-mining and data visualization - helping those with little experience in such work find a way to put the technology to work.

As I really do not know very much at all about these activities, my job is now an exercise in learning something new. To this end, I have sought some help.

This fall I am working with a team of three Northern Illinois University students and their faculty coach, who will provide me with an evaluation of several open-source text-mining utilities, as well as a more general review of resources available for a scholar or other practitioner who might want to take up text-mining but lacks any experience in the work.

I spent last summer trying to identify and prepare text materials for their use in the evaluation of the utilities, and found very little available information explaining how to begin a text-mining project - i.e., finding digitized texts, selecting texts for research, and working them into a format suitable for use with the software.

I am looking forward to the students' report, and will try to bring their findings to the attention of historians and other humanities scholars who might be interested in text-mining.

Thursday, January 23, 2014

Digital/Online Materials and their Place in Historical Scholarship

At the recent meeting of the American Historical Association in Washington, D.C., I made a presentation as part of a discussion session (i.e., not a regular panel - we sat in a circle and talked after very short presentations made by people sitting as part of the circle) exploring digital materials, ranging from blogs and web sites to social media, and the questions that they raise as scholars begin to make use of them as primary sources. Other presenters talked about the future of MOOCs and crowd-sourcing the search for elusive information about a relatively obscure historical figure. I discussed the work of the Digital POWRR project and the challenges presented by the fact that digital objects are generally subject to loss in the relatively short term due to a number of reasons, including hardware and software incompatibility and the degradation of storage media.

One major question that emerged in the discussion was the status of social media materials and other online, digital sources in light of the fact that they are so prone to loss. One presenter at the preceding panel (our discussion group was part of a linked set of two events) described how she had based her work on Pakistani women in part on a web site that no longer existed, apparently because of hacking activities undertaken by parties believing that Pakistani women should not express themselves in this format. The presenter said that she had printed out the site's pages for her own record and thus could document her use of the source. But this made me wonder about the future practice of history.

So, what of digital sources like blogs, web sites, and social media objects like tweets? Digital objects' intrinsic frailty and the complex, easily disrupted nature of the internet used to present them make them fundamentally unreliable as primary sources, at least by the standards developed for the use of analog/paper media materials.  

It seems to me that although history is certainly not a science in any way, historians are similar to scientists in at least one regard. Much as a scientific discovery can be accepted and confirmed only when other practitioners are able to repeat the experiment and obtain the same result, historians are accustomed to being able to lay their hands on a paper source cited in a footnote. Manuscripts are usually unique items, but if one travels to the archive and looks in the box and folder number cited, the item will be there. There may be a very small number of copies of a book, but if one is willing to make the trip to the right library, the book will be there. Historians will of course debate a scholar's reading of a source, but the existence of the source itself is fundamental to the discipline. If the item is not there, practitioners may rightly begin to ask questions about the legitimacy of a work citing it.

Many of the participants in the AHA discussion emphasized the need to preserve online digital materials as fully as possible. I certainly concur. But a whole host of problems, not the least of which is the considerable expense involved in the curation/preservation of digital materials, make this impossible. We will have to face the fact that a considerable number of online digital objects that future historians may want to use as evidence will simply disappear.

In this situation, several questions occur to me: How will we evaluate work citing online materials that are no longer existent? What if scholars relying on such missing evidence can produce a print-out or other facsimile of the materials? Can we distinguish cases of vanished evidence in which legitimate facsimiles exist from cases of academic fraud?

Wednesday, August 21, 2013

Learning About Digital Scholarship

After fifteen years devoted to the digitization and presentation of historical materials, I have been asked to investigate the emerging field of digital scholarship with an eye toward supporting at least several of its constituent activities on the Northern Illinois University campus. As a historian, I will begin with the digital humanities and attempt to work my way toward understanding digital scholarship in other disciplines.

This will certainly involve familiarizing myself with the considerable literature discussing digital scholarship in general, as well as the bodies of work discussing major subdivisions in it, like digital publishing, data/text mining, Geographic Information Systems and other forms of data visualization, and the retrieval of digital data via Application Programming Interfaces.

My task presents a challenge very much like that confronted by our present IMLS-funded study of how medium-sized and smaller institutions lacking large financial resources might achieve increasingly high levels of preservation for digital objects. From my present perspective, without the benefit of great familiarity with the field, it appears that successful digital scholarship and/or digital humanities programs at universities and colleges require significant amounts of resources. Proprietary software requires the payment of purchase/subscription fees. Open-source software requires the contributions of skilled programmers and developers. Both require the contributions of other skilled professionals familiar with their use in the different specialties making up digital scholarship and/or digital humanities, as well as their relevance to existing, more traditional scholarly discourses. These are luxuries that I have reason to believe my university, dependent for funding upon the worst-governed state in the nation, cannot presently afford.

Thus I will undertake my new work with an eye toward discovering ways in which members of the university community might produce digital scholarship with the least possible outlay of financial and other institutional resources. In these early days, I am planning to attempt to discover those faculty members on the NIU campus already doing digital scholarship of one type or another, in the hope that I might learn from them and enable them to learn from and support each other.

Monday, June 24, 2013

Overview of digital preservation tools

A chart containing brief descriptions of fifty-five digital preservation tools, including ingest and storage functions, can be found at

Monday, May 13, 2013

Information about Digital Preservation Tools

This week the Digital POWRR project staff has posted a large amount of information describing fifty-seven tools used in digital preservation activities. They include back-end storage providers and ingest/processing ("front end") utilities.

While a relatively small number of general, integrated front end applications like Archivematica and Curator's Workbench are currently available, individuals and institutions pondering a digital preservation initiative can also bring a number of ingest/processing tools together to assemble an ingest workflow suited to their specific needs.

As we found in the considerable amount of time required to review each of these tools and its capabilities, accumulating the knowledge necessary to make informed decisions in this matter can be quite a challenge. Hopefully, our list of available tools can help to shorten the amount of time and effort required.

In the coming year the project collaborators will test and review two back-end solutions, DuraSpace and Meta-Archive, and front-end utilities Archivematica and Curator's Workbench.