Tuesday, March 15, 2016

Some Assembly Required - Micro-services and Digital Preservation

The following is a brief discussion of the findings of the Digital POWRR Project's recent white paper, bringing out themes that the white paper's length restriction obliged its authors to downplay.

The information science and cultural heritage communities have developed a variety of guidelines and standards that can enable an organization to achieve high levels of digital preservation.[1] Many practitioners have struggled to fashion the technical infrastructure necessary to realize them, however. Information professionals serving medium-sized and smaller institutions lacking large financial resources have found it especially difficult to study and comprehend the complex challenges presented by the need to curate and preserve digital materials in a programmatic manner. Unsure that they understand the issue well, many have hesitated to address it.
Many larger institutions have also not yet devised the means to curate and preserve their digital materials in a programmatic manner, but in one instance a state university system has given rise to a coherent and practicable way forward. Representatives of the California Digital Library have addressed the University of California’s extensive and diverse digital curation and preservation needs by devising and coordinating the operation of a set of free-standing but interoperable applications, each performing a single or limited number of tasks in the larger curation and preservation process. They described their method as a micro-services approach. While they conceived this arrangement with an eye toward producing an infrastructure able to function in a large institution’s wide variety of work environments in a context of rapid and pervasive technological change, a micro-services approach can also help medium-sized and smaller institutions to identify and meet a very different set of digital curation and preservation goals.

A Micro-services approach
            In 2009 California Digital Library staff members began development of a new approach to the curation and preservation of digital objects produced by the ten-campus, $25 billion, 238,000 student, nearly 20,000 faculty member University of California system.[2] Seeking to dispense with the assumption that the curation and preservation of digital objects required the installation and operation of a single, long-lived application combining the necessary functions behind one user interface, they proposed to employ a set of twelve independent but compatible applications, or micro-services, each responsible for performing a single function within the digital curation and preservation process.  They described utilities devoted to data preservation as identity, storage, fixity, replication, inventory, and characterization. Utilities devoted to data curation included ingest, index, search, transformation, notification and annotation. The developers of the micro-services approach to digital curation and preservation made a number of compelling arguments for it. Small, relatively simple utilities would pose fewer challenges in their development, deployment, maintenance and enhancement than a large, integrated system, especially in the context of constant technological change. In addition, users could easily adapt a set of distributed services to local conditions in different divisions and departments of the university, and easily replace each of them upon their obsolescence.[3]
The authors introducing a micro-services approach to the information science community seemed to worry that their colleagues and users would look askance at a technical arrangement that they viewed as somehow incomplete or piecemeal. At the same time that they praised its simplicity and flexibility, the authors emphasized that librarians and developers could put a set of interoperable utilities together to provide a very large university community producing digital materials in increasing numbers, size and type with the ability to curate and preserve them in a fully comprehensive manner.[4] In 2010 the California Digital Library introduced Merritt, a new repository developed using the micro-services approach.[5] Available to the entire University of California community today, Merritt provides long-term preservation of digital materials and also enables users to share data. It makes its functions available via a user interface and an Application Programming Interface enabling machine-to-machine communication. As of March 2015 Merritt contained approximately 1.5 million digital objects, some containing over 10,000 individual files or occupying over 100 GB of storage capacity.[6]
The Challenge: Digital Curation and Preservation at Medium-Sized and Smaller Institutions
In a 2011 report Merritt’s developers recommended a micro-services approach to digital curation and preservation to other members of the Association of Research Libraries, emphasizing how a strategic combination of individual services could produce “the complex global function needed for effective curation” at large institutions.[7] At about the same time, the Digital POWRR (Preserving digital Objects With Restricted Resources) Project (http://digitalpowrr.niu.edu/), a team of librarians, archivists and other stakeholders from medium-sized and smaller Illinois institutions lacking large financial resources, used financial support provided by an Institute of Museum and Library Services National Leadership Grant (LG-05-11-0156-11) to begin a study of the problems that the curation and preservation of digital objects presented to their organizations, and how they might address them.
            The POWRR institutions were, and remain, members of CARLI, the Consortium of Academic and Research Libraries in Illinois, but its leaders informed the study team that while they recognized the digital curation and preservation challenges their members faced, they also lacked the financial resources necessary to begin to address them. 
            Early in the three-year investigation period, POWRR team members prepared a case study of each of the participating institutions, describing it in broad terms while paying particular attention to the composition of its digital collections and the unique challenges they presented; the details of its pertinent technical infrastructure including content management systems and repository software in use; and a self-assessment summary and review of current digital curation and preservation activities, if any. At each of the five campuses, team members identified digital content known to be vulnerable to loss, but realized that they had not developed programmatic solutions to mitigate that risk. Upon gathering and reviewing the relevant information, team members identified the practices and policies that they believed could most help their institutions to improve their curation and preservation of digital objects. They then performed a gap analysis by identifying the obstacles that had prevented their organizations from achieving the desired outcomes. Common factors impeding digital preservation efforts emerged from the gap analyses, including a lack of available financial resources; limited or nonexistent staff time dedicated to digital preservation activities; and insufficient levels of appropriate technical expertise.[8]
The case studies allowed the research team to perceive a state of affairs more complex than the raw data described, however. Lacking the time necessary to stay abreast of frequent developments in the field of digital preservation, the expertise or technical infrastructure necessary to install and maintain complex software solutions, and/or the funds to pay for ready-to-use products that may exist, librarians and archivists at POWRR institutions often felt overwhelmed. Faced with what seemed to be an enormous undertaking, and hesitant to commit scarce resources to the adoption of an integrated digital preservation utility, many found themselves unable to take even the first steps toward curating and preserving their digital materials.[9] 

Micro-services and Smaller Institutions
Soon after beginning their investigation, members of the Digital POWRR team discovered a basic misunderstanding thwarting their progress towards the development of more effective digital curation and preservation programs. They had understood digital curation and preservation as a black and white issue: either an institution had implemented successful digital curation and preservation measures or it had not. Behind this assumption lay another: that successful digital preservation activities required the use of a single application integrating a number of necessary tasks behind a single user interface. Their research soon led them to realize that recent scholarship has emphasized that practitioners should rather understand digital curation and preservation activities as an ongoing, iterative set of actions, reactions, workflows, and policies.[10] Such an approach means that practitioners do not have to begin digital curation activities by creating or selecting a comprehensive technical solution appropriate for large sets of digital materials over a long period of time. Instead, they can start by taking small first steps to triage digital collections and identify means by which they might cumulatively enhance the curation and preservation of those found to require immediate attention, while working to bring the issue to the attention of colleagues and administrators and, ultimately, advocate for resources necessary for the implementation of additional capacity. Put another way, information professionals at institutions lacking effective digital curation and preservation measures can benefit by focusing their efforts on the activities by which they can curate and preserve digital content in the next six to twenty-four months rather than waiting a decade to devise an ideal solution. Professionals in the cultural heritage sector accustomed to thinking in terms of centuries may find this to be a curious approach, but to wait is to risk the loss of unique materials.
            The National Digital Stewardship Alliance’s (NDSA) Levels of Digital Preservation can provide practitioners seeking to employ this approach with a yardstick by which they might measure their progress toward improved digital curation and preservation capacity. The NDSA describes the Levels as a work in progress, intended to provide a readily-usable set of guidelines useful for professionals and/or institutions in various situations, ranging from those just beginning to think about how to curate and preserve their digital assets more effectively to those planning the next steps in enhancing existing systems and workflows. The guidelines speak to five functional areas that represent the core of digital preservation systems: storage and geographic location, file fixity and data integrity, information security, metadata, and file formats. The Levels do not help to assess the efficacy of digital preservation programs as a whole since they do not consider important matters such as policies, staffing, or organizational support.[11]
            In addition to providing information professionals at smaller institutions with an understanding of their present digital curation and preservation capacities, reference to the NDSA Levels of Digital Preservation can help them to recognize discrete steps by which they can improve their curation and preservation of digital objects. These may include activities such as improving storage practices from reliance on stand-alone media (e.g. CDs and portable hard drives) to the use of networked or geographically distributed servers, which the Levels mark as moving a single square to the right in one of their functional categories. Researchers have made resources discussing fundamental activities like these freely available.[12]  
Institutions can begin to move closer to a goal of stabilizing and preserving digital materials by taking a number of the small steps forward described in the Levels. They should not hesitate to take Level 1 actions that they could readily achieve today while they devote their energies to deliberations considering how to move from Level 2 to Level 3. Thinking of digital curation and preservation as an ongoing activity, and of the NDSA Levels of Digital Preservation as a measure of progress, resonates with two fundamental understandings at the heart of a micro-services approach: first, that digital curation and preservation is an uncertain process in which continuous, rapid technological change often renders monolithic, integrated applications cumbersome and outdated; and second, that simple tools focused on a specific aspect or aspects of the process, often available at no charge, can prove more helpful.
            Practitioners can only make use of individual micro-services tools if they understand which roles they play in the larger digital curation and preservation process, and how they might contribute to progress through the Levels. To this end, the Digital POWRR Project depicted the several stages of a prospective workflow as a pathway to digital preservation.[13] See below.

POWRR’s Overview of the Path to Digital Preservation

             Micro-services tools typically perform only a single or limited number of the tasks or functions that make up each stage of this process.  For example, functions within the ingest portion of a digital preservation workflow employing micro-services might include file copying; fixity checking; virus scanning; file de-duplication; and unique identifier generation.[14] The figure below provides a list of micro-services functions making up the more general pathway to digital preservation. 

Functionality within POWRR's Path to Digital Preservation 
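
Several of the ingest functions named above (file copying, fixity checking, file de-duplication and unique identifier generation) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a tool described by the POWRR study: the function names `sha256_of` and `ingest` and the inventory fields are hypothetical, SHA-256 and UUIDs are just one possible choice of checksum and identifier, and virus scanning is left out because it depends on external software.

```python
import hashlib
import shutil
import uuid
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(source_dir: Path, target_dir: Path) -> list[dict]:
    """Copy files into target_dir and return a simple inventory.

    Hypothetical sketch: each step (checksum, de-duplication, identifier,
    copy, fixity check) stands in for a separate micro-service.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    seen: dict[str, str] = {}  # checksum -> identifier of the first copy
    inventory: list[dict] = []
    # Build the file list before copying so new copies are never re-ingested.
    sources = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for source in sources:
        checksum = sha256_of(source)
        if checksum in seen:
            # De-duplication: record the file but do not store it again.
            inventory.append({"source": str(source), "sha256": checksum,
                              "status": "duplicate",
                              "duplicate_of": seen[checksum]})
            continue
        identifier = str(uuid.uuid4())  # unique identifier generation
        stored = target_dir / f"{identifier}{source.suffix}"
        shutil.copy2(source, stored)  # file copying, keeping timestamps
        if sha256_of(stored) != checksum:  # fixity check on the new copy
            raise IOError(f"fixity check failed for {source}")
        seen[checksum] = identifier
        inventory.append({"source": str(source), "sha256": checksum,
                          "status": "copied", "identifier": identifier,
                          "stored_as": str(stored)})
    return inventory
```

Calling `ingest(Path("transfer_media"), Path("accessions"))` (both directory names are placeholders) returns one record per file found, which a practitioner could paste into an accession spreadsheet. Because each step is a separate call, any one of them could be swapped for a dedicated tool without touching the others, which is the point of the micro-services arrangement.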

            An understanding of the discrete tasks and functions that comprise each stage of a digital curation and preservation workflow can provide practitioners at medium-sized and smaller institutions lacking large financial resources with an opportunity to assess how adding specific functions to their existing practices can benefit them. Likewise, a reliable registry of available digital curation and preservation tools, including those providing only a single or limited number of the functions depicted in Figure 3, can help information professionals to determine how they can perform those functions most readily in local circumstances. The COPTR (Community Owned Digital Preservation Tool Registry) web site (coptr.digitpres.org) provides information about hundreds of tools and services that address some aspect of digital curation and preservation, ranging from single-function applications to more robust and complex utilities. The costs of the tools and services noted in COPTR vary from those that are freely available via open-source communities to those that are cost-prohibitive for many smaller institutions. Practitioners should realize that open-source applications may require programming expertise. COPTR’s description of an individual tool includes a brief description of its general usability and the technical expertise it requires, as well as a record of recent development activity. The Digital POWRR Project study found that in many cases active user groups supporting a particular tool already exist online. Many tool developers also make themselves available to individuals and groups wishing to learn to use their application.[15]
            Understanding digital curation and preservation activities as incremental, and using tools lending themselves to this approach, information professionals at medium-sized and smaller institutions lacking large financial resources can make the small first steps needed to begin to make progress through the NDSA Levels of Digital Preservation. For example, practitioners may accession and inventory a collection of digital materials using a free, simple ingest tool called Data Accessioner and a common spreadsheet application. To learn more about how to do this, visit the POWRR website.         
            Meg Miner of Illinois Wesleyan University, a member of the Digital POWRR Project team, described Data Accessioner’s usefulness in performing the functions described above in Figure 3 as Auto Metadata Harvest and Package Metadata thus:
I use Data Accessioner (DA) to capture technical metadata as I move files from transfer media to my as-yet non-bit-level storage device. I use DA-MT (Data Accessioner Metadata Transformer) to aggregate the file information from xml to something I can understand: file types, quantities and sizes by type. I store the aggregate information in my regular accession files (currently a spreadsheet). My accession information and an Access copy are in a different hard drive from the Master copy and XML. Someday I will move the accessions with content I think is most at-risk (due to format or other unique attribute) into a bit-checking storage environment…. this workflow costs me no money, no technical expertise (beyond downloading Java and two processing files via ZIP) and very little extra time. With DA, I am capturing all the recommended technical information for use by a back-end preservation system. With DA-MT I can track growth rate of digital content overall, make a case for purchasing better storage, and keep an eye on where all the at-risk file types are in the interim.[16]

In this case, use of a single, open-source tool requiring no programming skills or other technical know-how has helped a single archivist, working alone, to gather important information about her institution’s digital collections which, as she notes, will prove indispensable in the longer-term construction of a larger digital curation and preservation system. The discrete, specialized tools that characterize a micro-services approach also can ultimately help this practitioner and others like her to build such a system, adding new functions and capacity to an emerging digital curation and preservation workflow customized to address local needs and idiosyncrasies, including the particular characteristics of individual collections.
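
The kind of aggregate the archivist describes (file types, quantities and sizes by type, kept in a spreadsheet) can be approximated with a short Python sketch. It is an illustration only: it does not read Data Accessioner's XML output or reproduce how DA-MT works, it simply walks a directory and groups files by extension, and the names `summarize_by_type` and `write_summary_csv` are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_by_type(root: Path) -> dict[str, dict[str, int]]:
    """Count files and total bytes under root, grouped by file extension."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for path in root.rglob("*"):
        if path.is_file():
            # Lower-case the extension so ".TXT" and ".txt" are one type.
            file_type = path.suffix.lower() or "(none)"
            summary[file_type]["files"] += 1
            summary[file_type]["bytes"] += path.stat().st_size
    return dict(summary)


def write_summary_csv(summary: dict[str, dict[str, int]], out_path: Path) -> None:
    """Write the summary as a CSV file that a spreadsheet application can open."""
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file_type", "files", "bytes"])
        for file_type in sorted(summary):
            row = summary[file_type]
            writer.writerow([file_type, row["files"], row["bytes"]])
```

Running the two functions on each new accession and keeping the resulting CSV files would give a rough record of growth by file type over time, the sort of evidence the quotation mentions using to make a case for better storage. Grouping by extension is a simplification; a format identification tool would be more reliable than file names.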
Micro-services tools can also help practitioners adopting more robust tools. They should not assume that an application bundling many digital curation and preservation functions together with a single user interface will necessarily provide an entirely comprehensive and worry-free experience. In many cases developers of more extensive applications have assumed that their users have already performed NDSA Level 1 triage activities (e.g. moving data from disparate storage media to more secure locations and performing minimal inventory and/or simple metadata creation). For example, POWRR researchers found that Curator’s Workbench provided mechanisms for importing MODS records that its developers assumed already existed, but not an intuitive interface for creating MODS records from scratch. This presents information professionals at many institutions with a significant gap between their digital materials’ present condition and that state necessary even to begin curation and preservation activities with their new utility. While individual micro-services tools can enable practitioners to build a local digital curation and preservation system in an incremental fashion, some of the most basic tools can also prepare digital objects for ingestion into and storage within more elaborate applications.

Research libraries and archives presently face the difficult prospect of devising technical means by which they may curate and preserve the large volume of digital objects that their stakeholders and staff members have created, and will create. Guidelines and standards describing effective practices presently exist, but many information professionals have encountered great difficulty in providing the infrastructure necessary to follow them and provide enhanced levels of curation and preservation. The California Digital Library (CDL) has developed and implemented what it calls a micro-services approach to the problem. Setting aside the familiar means by which information professionals have addressed other large challenges in their field, that of selecting and installing a large, integrated application, CDL staff members have advocated the development and strategic deployment of independent but interoperable utilities devoted to performing specific, finite functions contributing to the larger curation and preservation process. They have argued that administrators and developers responsible for digital curation and preservation functions may devise, deploy, maintain, upgrade, adapt and replace discrete, simple utilities far more easily than a single application combining many functions behind a single user interface. In addition, a distributed system allows units of a large university system, varying widely in their practice, to adapt it to their use. Functioning together, twelve micro-services utilities have provided the university’s ten campuses with a comprehensive digital curation and preservation infrastructure known as Merritt.
A review of digital curation and preservation challenges facing smaller institutions lacking large financial resources suggests that the micro-services approach can benefit them as well. The same benefits that recommended it to the University of California apply to smaller institutions. Information professionals serving them may deploy, maintain, upgrade, adapt and replace simple applications performing specific functions more readily than they might install, manage, improve and, in a worst case, abandon and replace, a comprehensive digital curation and preservation system. Due to its modular nature and the number of open-source applications available for use in a micro-services arrangement, units within these institutions stand a better chance of adapting one to their needs than an integrated, more comprehensive system. Information professionals employed at medium-sized and smaller institutions lacking large financial resources can also benefit from a micro-services approach to digital curation and preservation for another reason. Many have held the same misconception that the designers of the micro-services approach originally noted: that an all-or-nothing approach, taking years to select and implement, represented the only approach to digital curation and preservation services. Believing that they lacked knowledge of the available applications, information professionals at smaller institutions most often declined to seek the significant financial resources necessary to make a selection, and took no other constructive steps toward enhancing their digital curation and preservation practices. A micro-services approach provides them with an opportunity to break this pattern of inaction and begin curation and preservation activities.
In this context the Digital POWRR Project study encouraged information professionals at medium-sized and smaller institutions to take basic, simple measures mitigating their materials’ risk of loss, no matter how limited their scope and effectiveness might seem. Measuring their progress by reference to the National Digital Stewardship Alliance’s Levels of Digital Preservation, smaller institutions might begin to improve their curation and preservation of digital collections through a series of small steps. The use of individual tools performing discrete functions can allow practitioners beginning curation and preservation activities where none have previously existed to perform initial triage on their digital collections and improve their practice over time, in part by expanding and customizing their technical infrastructure incrementally, building a set of applications best suited to local needs. Tools performing a single function or a limited number of specific functions can prove especially useful in this context. Aware that many practitioners who might benefit from these tools have little or no awareness of their existence or functions, the Digital POWRR Project made information describing the several stages of a digital curation and preservation workflow, the discrete activities making up each, and tools performing these functions freely available online via the COPTR web site.


