Friday, January 15, 2016

More text mining

This semester I will be coordinating the work of an experiential learning activity supported by NIU's Digital Convergence Lab. In it I will collaborate with Matthew Short, metadata librarian at Northern Illinois University Libraries, and a team of four NIU students to explore how text-mining technology might help Mr. Short to catalog our library's very large digital collection of dime novel materials.

Library catalogers describe books in a number of ways to help users find and enjoy them. One type of description involves a book's subject matter. Catalogers typically determine a book's subject matter by examining it themselves: not reading the whole thing, but reading enough to describe it in very basic terms. For a collection that includes thousands of titles, some 14 million words, this is an impossible task for a single cataloger. Hence, Mr. Short would like to look into how text-mining technology might help him determine a book's content, in broad outline, with an eye toward streamlining the cataloging process.

Catalogers also try to identify a work's author. Scholars of nineteenth- and early-twentieth-century dime novels know that in many cases these materials were published as the work of a fictitious author, much as the Hardy Boys books were later presented as the work of "Franklin W. Dixon," but were really written by unknown individuals. Scholars have identified some of these anonymous authors who wrote under different names, but they would like to be able to match the authors with their works. One way to do this is to use text known to be the work of an individual to train a text-mining application to identify that author's style, and then compare that style to other works of unknown authorship. Mr. Short is also interested in using this type of author attribution function to help him catalog dime novels.
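
To make the idea of comparing styles a little more concrete, here is a minimal Python sketch of one simple approach: count how often each text uses a handful of common "function words," then attribute the unknown text to the known author whose word-use profile is most similar. This is an illustration only. It is not Weka's interface, and it is not necessarily the approach our group will take. The word list, the function names, and the sample texts are all hypothetical, and a real study would use much longer texts and many more features.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words.
# Real stylometric studies typically use a much longer list.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in",
                  "that", "it", "was", "he", "but", "with"]


def style_profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with none of the function words matches nothing
    return dot / (norm_a * norm_b)


def attribute(unknown_text, known_texts):
    """Compare an unknown text with texts of known authorship.

    known_texts maps an author's name to a sample of that author's writing.
    Returns the best-matching author and the similarity score for every author.
    """
    unknown = style_profile(unknown_text)
    scores = {author: cosine_similarity(unknown, style_profile(sample))
              for author, sample in known_texts.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Placeholder samples, far too short to be meaningful in practice.
    samples = {
        "Author A": "The hero rode to the edge of the canyon in the dark.",
        "Author B": "He was tired, but it was late, and he was with friends.",
    }
    best, scores = attribute("The rider came to the mouth of the cave.", samples)
    print(best, scores)
```

The highest score only says which known author is the closest match among those supplied; it does not show that any of them actually wrote the unknown text.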

We are interested in devising ways for text-mining technology, in this case the open-source software application Weka, to make the kinds of determinations Matt needs.

Because this is new to me, I do not know how quickly the students (three computer science majors and a graduate student in English) can accomplish Matt's goals, so Matt and I are at work developing additional tasks for them should they complete his original inquiries well before the end of the semester.

I will describe the group's work in posts throughout the semester. 

