During the spring semester of 2016 I supervised a team of students (Marcos Quezada, a graduate student in Operations Management and Information Systems; Fredrik Stark, a PhD candidate in NIU's English Department, and Mitchell Zaretsky, a junior Computer Science major) as they explored text mining in the context of Northern Illinois University Libraries' large online collection of late nineteenth and early twentieth century dime novels (http://dimenovels.lib.niu.edu). We worked in the format of an experiential learning activity, meaning that we addressed a problem brought to us by a client. In this case Matthew Short, NIU Libraries Metadata Librarian and Cataloger, served as the client.
In the experiential learning format, the client presents the student team with a set of goals. Mr. Short asked the team to develop a text classification application or tool to help library catalogers to determine the genre of the approximately 1,900 digitized text in the collection. In traditional cataloging activities, the cataloger inspects a work manually in order to derive basic information necessary to catalog it accurately. This can be a lengthy process. Perhaps text-mining technology could help catalogers to improve the speed and efficiency with which they cataloged a very large collection.
Mr. Short's goals also included the compilation of a list of genres and related subject terms for possible use in reclassifying online digitized collections; investigating text-mining tools for the future development of the prototype classifier application and future studies of the collections.
The team began work by using Weka, an open-source data and text-mining application. Mr. Short selected it because it enables users to acquaint themselves with the separate activities that make up text mining and construct original applications using blocks of existing Java code.
Mr. Short introduced the students to a typical text-mining work flow. He had been working to achieve his goals prior to engaging with this group, and for all intents and purposes led the team's activities. As the team's official coach, I attempted to facilitate discussion, scheduled activities, and completed paperwork.
The students began by gathering text files of digitized dime novels cataloged as belonging to the collection's better-represented genres. These genres included detective and mystery stories; western stories; sea stories; historical fiction; adventure stories; and bildungsromans (coming of age) stories.
The team next engaged in pre-processing activities in order to produce the most accurate text possible. NIU Libraries staff members originally produced the digital texts in the digital dime novel collection by the use of Optical Character Recognition software and did not attempt to correct any mistakes within them. Pre-processing began with the removal of stop words (such as the, an, and, etc.) and also included tokenization (identifying groups of characters as words) and stemming (reducing different inflections of a word to their root form) of words. We also used Weka to render the text materials as a bag of words (i.e., set aside grammar and word order) and transform words into vectors, or numerical representations.
The team then moved on to text classification. They began by using a set of already-cataloged works to train Weka to identify specific words or sets of words with the individual genres mentioned above. Of the algorithms available in Weka, Naive Bayes proved most effective. They found that in 65% of works examined Weka's classification agreed with that of a human cataloger. Investigating this discrepancy, the team found that the use of additional filtering techniques, including the use of TF-IDF (a process to determine how important a word is to a document in a collection or corpus); a better stemmer (the open-source product Snowball); a list of nineteenth-century stop words composed by Matthew Jockers, a scholar of the period's literature; rendering all letters in lower-case; and setting the number of words in each text to be analyzed to 500 improved accuracy, i.e., Weka agreeing with a human cataloger's genre classification, to 75 %. They also discovered that a number of texts in the training set had been cataloged as belonging in two different genres. Removal of these works improved accuracy to 83%.
With the information above, Mitchell Zaretsky used Weka's Java API to construct an original classifier application. It reported the probability of a work fitting in one of the several genres. Working with a new test corpus of 214 digitized dime novels, the team found that their classifier agreed with human catalogers 71% of the time.
On the basis of this test, the team determined that their application can help catalogers to determine a dime novel's genre. It can also serve as an effective tool for evaluating the genre determinations of catalogers not using the application in their work. They also suggested that text-mining activities uncovered details about the form and content of works in NIU's digitized dime novel collection that invites further research.
In the experiential learning format, the client presents the student team with a set of goals. Mr. Short asked the team to develop a text classification application or tool to help library catalogers to determine the genre of the approximately 1,900 digitized text in the collection. In traditional cataloging activities, the cataloger inspects a work manually in order to derive basic information necessary to catalog it accurately. This can be a lengthy process. Perhaps text-mining technology could help catalogers to improve the speed and efficiency with which they cataloged a very large collection.
Mr. Short's goals also included the compilation of a list of genres and related subject terms for possible use in reclassifying online digitized collections; investigating text-mining tools for the future development of the prototype classifier application and future studies of the collections.
The team began work by using Weka, an open-source data and text-mining application. Mr. Short selected it because it enables users to acquaint themselves with the separate activities that make up text mining and construct original applications using blocks of existing Java code.
Mr. Short introduced the students to a typical text-mining work flow. He had been working to achieve his goals prior to engaging with this group, and for all intents and purposes led the team's activities. As the team's official coach, I attempted to facilitate discussion, scheduled activities, and completed paperwork.
The students began by gathering text files of digitized dime novels cataloged as belonging to the collection's better-represented genres. These genres included detective and mystery stories; western stories; sea stories; historical fiction; adventure stories; and bildungsromans (coming of age) stories.
The team next engaged in pre-processing activities in order to produce the most accurate text possible. NIU Libraries staff members originally produced the digital texts in the digital dime novel collection by the use of Optical Character Recognition software and did not attempt to correct any mistakes within them. Pre-processing began with the removal of stop words (such as the, an, and, etc.) and also included tokenization (identifying groups of characters as words) and stemming (reducing different inflections of a word to their root form) of words. We also used Weka to render the text materials as a bag of words (i.e., set aside grammar and word order) and transform words into vectors, or numerical representations.
The team then moved on to text classification. They began by using a set of already-cataloged works to train Weka to identify specific words or sets of words with the individual genres mentioned above. Of the algorithms available in Weka, Naive Bayes proved most effective. They found that in 65% of works examined Weka's classification agreed with that of a human cataloger. Investigating this discrepancy, the team found that the use of additional filtering techniques, including the use of TF-IDF (a process to determine how important a word is to a document in a collection or corpus); a better stemmer (the open-source product Snowball); a list of nineteenth-century stop words composed by Matthew Jockers, a scholar of the period's literature; rendering all letters in lower-case; and setting the number of words in each text to be analyzed to 500 improved accuracy, i.e., Weka agreeing with a human cataloger's genre classification, to 75 %. They also discovered that a number of texts in the training set had been cataloged as belonging in two different genres. Removal of these works improved accuracy to 83%.
With the information above, Mitchell Zaretsky used Weka's Java API to construct an original classifier application. It reported the probability of a work fitting in one of the several genres. Working with a new test corpus of 214 digitized dime novels, the team found that their classifier agreed with human catalogers 71% of the time.
On the basis of this test, the team determined that their application can help catalogers to determine a dime novel's genre. It can also serve as an effective tool for evaluating the genre determinations of catalogers not using the application in their work. They also suggested that text-mining activities uncovered details about the form and content of works in NIU's digitized dime novel collection that invites further research.