Friday, January 15, 2016

Internships for Humanities Students

Douglas Baker, the president of Northern Illinois University, has recently urged university faculty and staff members to help students find internship opportunities. His announced goal is an internship for every student.

Dr. Baker has a background in business education, where internships have been shown to help students trained in specific business skills to find jobs. I have observed that engineering and computer science students can also often find internship opportunities in the private sector.

But what of humanities students?

They do not develop the kinds of specific skills (e.g., accounting, computer programming) that would allow them to make an immediate contribution to a business or organization offering an internship. In many cases educators have traditionally thought of humanities majors as training for executive work, because they teach the general critical-thinking skills needed to think strategically. That idea seems to be very much under siege now.

I presently work with NIU's Digital Convergence Lab to provide students with opportunities to explore how digital scholarship technology like text-mining and Geographic Information Systems (GIS) can facilitate new types of humanities work. I am about to start on my second such experiential learning activity this semester.

To date we have had trouble attracting interested humanities students, while computer science majors have been more interested in taking part.

I intend to spend this semester introducing humanities faculty members and administrators at NIU to the idea of digital humanities as internships for humanities majors.

I certainly do not intend to claim that participation in a single experiential learning activity devoted to text-mining or GIS will enable a Philosophy major to go out and get a job using that technology.

I do intend to introduce humanities faculty and students to the idea of working with source materials at scale, however.  

This type of work increasingly makes up a very important, even crucial, part of any relatively large business or organization's administrative activities.

At present most humanities majors or graduate students do something like this: read a specific number of texts very closely, then write a paper identifying and discussing a theme within them.

This may prepare individuals for law school, but in an age of big data, it may seem positively archaic to most employers.

If humanities students can become acquainted with how to work with data - any data - at scale, they will have benefited from such an internship. Even if they cannot master the technology in a semester, humanities students can begin to understand what types of questions the technology can help them to ask.

History Harvest at NIU

This semester I will also be working with several colleagues to plan how the University Libraries might enable NIU's History Department to include History Harvest activities in one of its class offerings.

A History Harvest is a collaborative activity in which teams of students and faculty coaches produce digital facsimiles of historical artifacts in a community and present them on the web via an online exhibit.

The idea of a history harvest has been around for a while. I recall that when I was a part-time student worker for the University of Virginia's Valley of the Shadow Project, some of my colleagues organized one in order to find local historical materials in the two counties featured in the project web site, digitize them, and add them to the project archive.

The harvest usually takes place on a single day, at a single place, where students and faculty members have assembled scanners and other digital technology to ingest materials. Publicizing this event is always an important part of the larger activity, as the number and type of historical artifacts brought in for digitization determine the scope and nature of the students' future work.

In recent years the University of Nebraska-Lincoln has made the History Harvest a part of its curriculum. UNL teams have published examples of the materials they created online.

UNL students have also used History Harvests to produce additional multimedia materials, which are available online.

At Nebraska a History Harvest takes the form of a class in which students spend time early in the semester learning about the subject and period they will explore, then move on to organizing the harvest and producing the collections and exhibits drawn from it. 

While we do not intend to produce a stand-alone class around the idea of a History Harvest, we do look forward to working with Northern Illinois University's University Archives and Regional History Center, as well as Dr. Stanley Arnold of the university's history department, to integrate the above activities into one of their present class offerings. 

More text mining

This semester I will be coordinating the work of an experiential learning activity supported by NIU's Digital Convergence Lab. In it I will collaborate with Matthew Short, metadata librarian at Northern Illinois University Libraries, and a team of four NIU students to explore how text-mining technology might help Mr. Short to catalog our library's very large digital collection of dime novel materials.

Library catalogers describe books in a number of ways in order to help  users to find and enjoy them. One type of description involves a book's subject matter. Catalogers typically determine a book's subject matter by examining it themselves - not reading the whole thing, but reading enough to be able to describe it in very basic terms.  In the case of a collection that includes thousands of titles - some 14 million words - this is an impossible goal for a single cataloger. Hence, Mr. Short would like to look into how text mining technology might be able to help him to determine a book's content - in broad outline - with an eye toward streamlining the cataloging process.
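As a rough illustration of how software might suggest a book's content in broad outline, the sketch below counts the most frequent words in a text after discarding common function words. It is only a minimal example written in Python with the standard library, not the Weka workflow the team will use; the stopword list, the function name `top_terms`, and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "at",
             "was", "his", "her", "he", "she", "it", "with", "by", "for"}

def top_terms(text, n=5):
    """Return the n most frequent words in text that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

# Invented sample text standing in for a digitized dime novel.
sample = ("The detective followed the outlaw to the ranch. "
          "At the ranch the outlaw drew his revolver, and the detective fired.")
print(top_terms(sample, 3))
```

A cataloger would still need to judge whether the suggested terms correspond to meaningful subject headings; frequent words are only a starting point.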

Catalogers also try to identify a work's author. Scholars of nineteenth and early twentieth century dime novels know that in many cases these materials were published as the work of a fictitious author - much as the Hardy Boys books were later presented as the work of "Franklin W. Dixon" - but were really written by unknown individuals. Scholars have identified some of these anonymous authors who wrote under different names, but they would like to be able to match up the authors with their works. One way to do this is to use some text known to be the work of an individual to train a text mining application to identify that author's style, and then compare it to other works of unknown authorship. Mr. Short is also interested in using this type of author attribution function to help him catalog dime novels.
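The general idea of comparing an author's known style with a text of unknown authorship can be sketched in a few lines of Python. This is an assumption-laden toy, not the method the team will apply in Weka: it represents each text by the relative frequency of a short, hypothetical list of function words and picks the known author whose profile is closest by cosine similarity. The names `profile`, `cosine`, and `most_likely_author`, and the sample texts, are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical short list of function words; stylometric studies use many more.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in",
                  "that", "it", "was", "but", "he", "with"]

def profile(text):
    """Relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def most_likely_author(known, unknown_text):
    """known maps an author's name to a sample of text known to be that author's."""
    target = profile(unknown_text)
    return max(known, key=lambda name: cosine(profile(known[name]), target))

# Invented toy samples; real training texts would run to thousands of words.
known = {
    "Author A": "the the the of of the and the of the",
    "Author B": "but it was that he but it was with he",
}
print(most_likely_author(known, "the of the of and the"))
```

With texts this short the result means little; the approach only becomes informative when each author is represented by a substantial body of writing.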

We are interested in devising ways that text mining technology, in this case the open-source software application Weka, can make the type of determinations Matt needs.

Because this is new to me, I do not know how quickly the students (three computer science majors and a graduate student in English) can accomplish Matt's goals, so Matt and I are at work developing additional tasks for them should they complete his original inquiries well before the end of the semester.

I will describe the group's work in posts throughout the semester. 

Friday, July 31, 2015

About that black hole...

This is funny.

In the course of preparing an article with my colleague Jaime Schumacher I came across Ross Harvey's "So Where's the Black Hole in our Collective Memory?: A Provocative Position Paper" (January, 2008), which suggests that the digital preservation community has been overly alarmist in contending that digital materials are succumbing to a variety of risk factors, rendering them unavailable for future use.

Harvey maintained that - at least in 2008 - researchers had not presented enough evidence to demonstrate that digital materials loss was taking place on a meaningful scale, and asked for further data. Our article provided such data, so I decided to include Harvey's request in the text.

This meant that I needed to provide a citation for his paper, of course. I had previously found it available online via the Digital Preservation Europe web site, but on July 29, 2015 I could not find a copy of it online - at all. I tried again yesterday and today. No luck.

Just to be clear, I was unable to find a copy of a 2008 paper arguing that digital preservation advocates had overstated the threat of digital data loss, including that presented on the web. How ironic.

Remembering a certain pop singer's misuse of the word "ironic" in a hit song some twenty years ago, I turned to the Oxford English Dictionary for a definition of "irony."

I found, as the third meaning of the noun - "a state of affairs or an event that seems deliberately contrary to what was or might be expected; an outcome cruelly, humorously, or strangely at odds with assumptions or expectations."

I would note that this occurrence certainly seems contrary to Mr. Harvey's expectations - at least those of 2008 - but it is not contrary to my own.

Our paper on digital data loss among university faculty will be published shortly by the International Journal of Digital Curation. It corroborates digital preservation advocates' familiar contention that data loss is indeed taking place.

Feel free to mention our findings in presentations to campus stakeholders and conversations with individuals unaware of the threat of digital data loss.

You might also use the Ross Harvey story for an icebreaker or a laugh midway through a talk.

Ultimately I provided a citation for Harvey's paper from the Internet Archive's Wayback Machine, which seeks to address situations like this by providing access to an archive of web pages, organized chronologically. In effect, it seeks to provide snapshots of the web at given dates.
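For readers who have not used the Wayback Machine, the short Python sketch below shows how an archived address can be put together: the original URL is appended to a date stamp under web.archive.org, and the Internet Archive also offers an availability lookup that reports the snapshot closest to a given date. The example address is a placeholder, not the actual location of Harvey's paper, and the function names are invented for illustration. The code only builds the addresses; it does not contact the archive.

```python
from urllib.parse import urlencode

def wayback_snapshot_url(original_url, timestamp):
    """Address of the archived copy closest to timestamp (YYYYMMDD or YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

def availability_query(original_url, timestamp):
    """Address of the availability lookup for the snapshot closest to timestamp."""
    query = urlencode({"url": original_url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + query

# Placeholder address, used here only for illustration.
page = "http://example.org/paper.pdf"
print(wayback_snapshot_url(page, "20080115"))
print(availability_query(page, "20080115"))
```

If no snapshot was ever captured, neither address will lead anywhere, which is the limitation discussed below.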

Mr. Harvey would certainly contend that his paper's existence on the Wayback Machine proves his point - that digital data disappearing from its original place of online presentation can very often be retrieved elsewhere. And so it was.

The Wayback Machine is far from comprehensive, however. It is also little-known among those outside the library and information science community.

Harvey also may have retreated from his intentionally provocative 2008 proclamation. Even if he has, this situation creates  a potentially useful anecdote in the ongoing effort to convince those outside the community of practitioners that the threat of digital data loss is real.

Thursday, December 4, 2014

Text Mining for Beginners, redux

This week a team of Northern Illinois University students and their faculty coach presented their findings after a semester devoted to investigating text-mining from the perspective of a novice.

They have produced a report in which they provide a basic description of text mining itself, including a review of some of the types of procedures used to detect patterns within a very large body of text; the types of text available for text-mining work; a discussion of structured, unstructured, and semi-structured data as they pertain to text-mining work; the importance of preparing digital texts (especially those created by optical character recognition software) for mining activities; and reviews of three well-known text-mining applications: Mallet, Weka, and RapidMiner. Of these, the first two are freely available open-source software; the third is available in a free demonstration version but requires purchase in order to make use of its most powerful capabilities. These reviews include brief discussions of the types of text-mining activities (e.g., topic modeling, document clustering, sentiment analysis) that each makes possible. Finally, the report describes the team's activities in using the three applications to perform analyses on sample bodies of text, and the results produced.
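To give a sense of what preparing OCR-generated text can involve, here is a minimal Python sketch of two common cleanup steps: rejoining words that were hyphenated across line breaks and collapsing stray whitespace. It is an illustration under simple assumptions, not the procedure described in the team's report, and the function name `clean_ocr_text` is invented. Note that it will also merge genuine hyphenated compounds that happen to fall at a line end.

```python
import re

def clean_ocr_text(raw):
    """Rejoin line-break hyphenation and normalize whitespace in OCR output."""
    # "detec-\ntive" becomes "detective"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Runs of spaces, tabs, and newlines become a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_ocr_text("The detec-\ntive  rode\n\nwest."))
```

Real OCR output usually needs more than this, such as correcting misread characters, but even these two steps make word counts more reliable.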

I hope to work with the team and their coaches to round the report out into a resource that I can distribute to interested members of the NIU community. We also hope to make it available via Huskie Commons, the university's institutional repository.

Monday, November 3, 2014

Text mining for beginners

I am now Director of Digital Scholarship at Northern Illinois University Libraries. This means that it is now my job to work with faculty members seeking to employ technologies like Geographic Information Systems, text-mining and data visualization - helping those with little experience in such work find a way to put the technology to work.

As I really do not know very much at all about these activities, my job is now an exercise in learning something new. To this end, I have sought some help.

This fall I am working with a team of three Northern Illinois University students and their faculty coach, who will provide me with an evaluation of several open-source text-mining utilities, as well as a more general review of resources available for a scholar or other practitioner who might want to take up text-mining but lacks any experience in the work.

I spent last summer trying to identify and prepare text materials for use in evaluating the utilities, and found very little information available explaining how to begin a text mining project - i.e., finding digitized texts, selecting texts for research, and working them into a format suitable for use with the software.

I am looking forward to the students' report, and will try to bring their findings to the attention of historians and other humanities scholars who might be interested in text-mining.

Thursday, January 23, 2014

Digital/Online Materials and their Place in Historical Scholarship

At the recent meeting of the American Historical Association in Washington, D.C., I made a presentation as part of a discussion session (i.e., not a regular panel - we sat in a circle and talked after very short presentations made by people sitting as part of the circle) exploring digital materials, ranging from blogs and web sites to social media, and the questions that they raise as scholars begin to make use of them as primary sources. Other presenters talked about the future of MOOCs and crowd-sourcing the search for elusive information about a relatively obscure historical figure. I discussed the work of the Digital POWRR project and the challenges presented by the fact that digital objects are generally subject to loss in the relatively short term due to a number of reasons, including hardware and software incompatibility and the degradation of storage media.

One major question that emerged in the discussion was the status of social media materials and other online, digital sources in light of the fact that they are so prone to loss. One presenter at the preceding panel (our discussion group was part of a linked set of two events) described how she had based her work on Pakistani women in part on a web site that no longer existed, apparently because of hacking activities undertaken by parties believing that Pakistani women should not express themselves in this format. The presenter said that she had printed out the site's pages for her own record and thus could document her use of the source. But this made me wonder about the future practice of history.

So, what of digital sources like blogs, web sites, and social media objects like tweets? Digital objects' intrinsic frailty and the complex, easily disrupted nature of the internet used to present them make them fundamentally unreliable as primary sources, at least by the standards developed for the use of analog/paper media materials.  

It seems to me that although history is certainly not a science in any way, historians are similar to scientists in at least one regard. Much as a scientific discovery can only be accepted and confirmed when other practitioners are able to repeat the experiment and obtain the same result, historians are accustomed to being able to lay their hands on a paper source cited in a footnote. Manuscripts are usually unique items, but if one travels to the archive and looks in the box and folder number cited, the item will be there. There may be a very small number of copies of a book, but if one is willing to make the trip to the right library, the book will be there. Historians will of course debate a scholar's reading of a source, but the existence of the source itself is fundamental to the discipline. If the item is not there, practitioners may rightly begin to ask questions about the legitimacy of a work citing it.

Many of the participants in the AHA discussion emphasized the need to preserve online digital materials as fully as possible. I certainly concur. But a whole host of problems, not the least of which is the considerable expense involved in the curation/preservation of digital materials, make this impossible. We will have to face the fact that a considerable number of online digital objects that future historians may want to use as evidence will simply disappear.

In this situation, several questions occur to me: How will we evaluate work citing online materials that are no longer existent? What if scholars relying on such missing evidence can produce a print-out or other facsimile of the materials? Can we distinguish cases of vanished evidence in which legitimate facsimiles exist from cases of academic fraud?