Thursday, December 4, 2014

Text Mining for Beginners, redux

This week a team of Northern Illinois University students and their faculty coach presented their findings after a semester devoted to investigating text mining from the perspective of a novice.

They have produced a report in which they provide a basic description of text mining itself, including a review of some of the types of procedures used to detect patterns within a very large body of text; the types of text available for text-mining work; a discussion of structured, unstructured, and semi-structured data as they pertain to text-mining work; the importance of preparing digital texts (especially those created by optical character recognition software) for mining activities; and reviews of three well-known text-mining applications: Mallet, Weka, and RapidMiner. Of these, the first two are freely available open-source software; the third is available in a free demonstration version but requires purchase in order to make use of its most powerful capabilities. These reviews include brief discussions of the types of text-mining activities (e.g., topic modeling, document clustering, and sentiment analysis) that each application makes possible. Finally, the report describes how the team used the three applications to analyze sample bodies of text, and the results those analyses produced.

I hope to work with the team and their coaches to round the report out into a resource that I can distribute to interested members of the NIU community. We also hope to make it available via Huskie Commons, the university's institutional repository (http://commons.lib.niu.edu).