Wednesday, February 21, 2018

Another Text-Mining Project: CIA Materials

This semester I am working with a team of four Northern Illinois University student interns to explore a large collection of text materials brought to us by Dr. Eric Jones of our university's Center for Southeast Asian Studies. The materials consist of the Central Intelligence Agency's President's Daily Briefings for the period 1961-1977, or roughly the period of the United States' military engagement in Vietnam, including the several years leading up to and following the war itself. These materials have been declassified and are available in the CIA's online reading room.

Two Northern Illinois University graduate students have expressed interest in working with President's Daily Briefing materials from this period in their dissertation research, but are unable to devote the time necessary to read this very large collection of documents without some knowledge of its contents.

To date, the student team has used a script to download the 5,292 individual daily briefings and has applied Optical Character Recognition to convert the documents, which are available in PDF format, into machine-readable text.

Text mining technology will allow the student interns to provide Dr. Jones and his students with an overview of the materials, including the topics (combinations of words that frequently appear together) found therein.
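To make the idea of "words that frequently appear together" concrete, here is a minimal sketch that counts how many documents each pair of words shares. It is a deliberately simplified stand-in for a topic model, not the method the team has settled on, and the sample sentences and function names are hypothetical placeholders rather than actual briefing text.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stand-in documents; the real briefing text is not reproduced here.
docs = [
    "Hanoi talks stalled as Saigon forces moved north",
    "Saigon forces report new movement near Hanoi",
    "Soviet grain harvest falls short again",
]

def tokenize(text):
    """Lowercase the text and return its set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def cooccurrence_counts(documents):
    """For each unordered word pair, count the documents containing both words."""
    counts = Counter()
    for doc in documents:
        # Sorting gives every pair a single canonical (alphabetical) order.
        words = sorted(tokenize(doc))
        counts.update(combinations(words, 2))
    return counts

pair_counts = cooccurrence_counts(docs)
# Pairs seen in more than one document hint at a shared theme.
shared_pairs = [pair for pair, n in pair_counts.items() if n > 1]
print(shared_pairs)
```

A real topic model would also remove common function words and weigh word frequencies, but the underlying signal is the same: words that keep turning up in the same documents.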

As the request for this information has come from students of twentieth-century Southeast Asian history and politics, we will focus especially on topics that include the names of Southeast Asian nations, cities, geographical features, and public figures.


We will also provide a sentiment analysis of the Daily Briefings (scoring their positive or negative character) and attempt to group them into sets (or clusters) of like documents, based on the words they contain.
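As a rough illustration of what scoring a document's positive or negative character can mean, the sketch below counts words from two small word lists and returns a score between -1 and 1. The word lists and the function name are assumptions made up for this example; an actual analysis would rely on an established sentiment lexicon or tool.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"progress", "agreement", "stable", "improved"}
NEGATIVE = {"attack", "crisis", "threat", "stalled"}

def sentiment_score(text):
    """Return (positive - negative) / (positive + negative) word counts.

    The score ranges from -1.0 (only negative words) to 1.0 (only
    positive words); text with no lexicon words scores 0.0.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("Talks stalled amid a new crisis"))
```

A score like this ignores negation and context, so it is best read as a coarse signal across thousands of documents rather than a verdict on any single briefing.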

Upon its completion, this work will provide NIU researchers with a new data set heretofore unavailable to them: the machine-readable text of the President’s Daily Briefings for the period under consideration. The University Libraries may choose to make this data set available for future research via its digital repository.

The work will also provide NIU researchers with a detailed report on:
A) topics appearing in the collection, showing how individual topics may become more or less prominent within the larger collection at different periods in time; 

B) how the reports expressed positive or negative sentiments regarding national security concerns; and

C) how the individual briefings relate to each other in terms of words used in common.
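One simple way to express how two briefings relate "in terms of words used in common" is the Jaccard similarity: the number of distinct words the two share, divided by the number of distinct words in either. The sketch below is only an illustration under that assumption; the function names are hypothetical, and the project may well use a different similarity measure.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(text_a, text_b):
    """Shared distinct words divided by all distinct words in either text."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    union = words_a | words_b
    # Two empty texts have no words to compare; treat them as dissimilar.
    return len(words_a & words_b) / len(union) if union else 0.0

print(jaccard("Saigon forces advance", "Saigon forces retreat"))
```

Pairwise scores of this kind can then be handed to a clustering method to group like documents together.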

The project team will also share the machine-readable text data set and report of findings with other researchers by submitting it to an open-access Digital Humanities publication and/or data repository for the humanities and/or social sciences.