The Imperative for Data Curation
The processes of creation and expression of our scientific, social, and humanistic inspirations are culminating in a vast corpus of stunning and even life-changing documents, films, recordings, Web sites, and other media, including software. Advances in technology have enabled new kinds of scholarship—the most obvious and profound impact is occurring in the realm of science.
Science is an interwoven system of experimentation, observation, verification, and replication that demands access to durable research results. Investigations into scholarly communication and research practices have brought attention to the evolving conduct of science. Service opportunities have been revealed for supporting the research process, sustaining and capturing the non-published conversations of science, and curating the resulting data.1 An ARL workshop in 2006 elicited many salient issues regarding data curation, and a new body of literature has been building.2 Recent developments offer compelling reasons to engage in the future of data.
Since 2008, the National Institutes of Health (NIH) have mandated that researchers deposit their peer-reviewed, NIH-funded research articles in PubMed Central.3 The NIH already had in place a requirement for researchers to deposit their data with NIH and to do so in prescribed formats.4 Calls for stronger data management plans by other federal granting agencies are growing. Spurred by the National Science Foundation's (NSF) initiative to build a supportive infrastructure for science,5 campuses are forming committees and formulating Datanet proposals that involve many segments of the institution, including libraries.6 In June 2009, Senators Cornyn and Lieberman reintroduced the Federal Research Public Access Act that would direct other federal agencies to require the deposit of articles in a certified repository.7
These public investments in science are predicated on the idea that the sharing of research data and publishable results stimulates additional innovation and discoveries. Such an open system of knowledge demands an infrastructure that will endure well into the future. Leaving digitally based information to languish in personal electronic filing drawers amid a jumble of unrelated information and with no plans for its survival guarantees its disappearance. Unlike the upkeep of our academic buildings, deferred maintenance is not an acceptable strategy for preserving data.
Libraries can make the case for sustaining a role in the future of scientific research beyond the acquisition of published research results. We have been collecting social science and census data in paper and electronic formats for some time. Other data that have found their way into libraries through various channels (in faculty papers, corporate archives, family collections, and such) have gotten to us as much by happenstance as design. Because of our long existence and mandate to manage university historical material, however, quite a bit of scientific information may already reside in our archives and special collections.
Traditional library acquisition and preservation processes and methods were adequate when information was primarily in a tangible form and the responsibility for its stewardship was relatively clear. Digital information, as we know, presents a different challenge; its collection, stewardship, readability, and long-term access cannot be taken for granted, and the responsibility for its care is up for grabs. By the time knowledge in digital form makes its way to a safe and sustainable repository, it may be unreadable, corrupted, erased, or otherwise impossible to recover and use. Scientific data files may be especially endangered due to their sheer size, computational elements, reliance on and integration with software, associated visualizations, few or competing standards, distributed ownership, dispersed storage, inaccessibility, lack of documented provenance, complex and dynamic nature, and the concomitant need for a specialized knowledge base—and experience—to handle data.
Data also may be endangered by the practices of scholars who regard their data as having little value beyond the confines of a small group, a specific project, or a specified period. Data loss may occur due to a lack of planning to maintain the research that was shaped by or derived from scientific or engineering programs. Research information may be tossed at the completion of a project, may reside in file cabinets that are eventually emptied by retirees, or—if we are lucky—may sit in boxes at a...