May 18, 2012 • The rapid growth in science journals has produced an avalanche of literature that keeps researchers scrambling to keep up. Underneath, there’s an even larger buildup of supporting data. Specialists worry that this ocean of information is increasingly untraceable and prone to loss—a tsunami that could turn into a black hole.
Experts from across data centers, research libraries and organizations, and federal agencies put their heads together at an April workshop on data citation. The meeting, which drew more than 75 participants, was hosted by UCAR and organized by Mary Marlino, Karon Kelly, and Matt Mayernik, all from the NCAR Library. Speakers at the workshop addressed the full life cycle of data—from creation to preservation—as well as the fragmented state of current data-citation practices and some potential paths toward improvement.
Tim Killeen, assistant director of NSF’s Directorate for Geosciences, issued a "dear colleague" letter on March 29 and delivered a keynote speech at the UCAR workshop. “Now is the time for geoscientists to begin to meet the challenges of data citation,” he wrote. “While many policy and practical challenges remain to be resolved and implemented, the Directorate for Geosciences encourages members of the community to lead an evolutionary transformation to establish data citation within the geosciences as the rule rather than the exception.”
Right now, those who do attempt to cite their data face a Wild West of options, several of which were outlined in the workshop’s opening talk by Mark Parsons (National Snow and Ice Data Center, or NSIDC). An author might refer to supplemental material on a publisher’s website, provide a URL or digital object identifier (DOI) that leads to an archive at her or his own institution, include a footnote that calls out related papers or an archived data set, or simply mention in the narrative that supporting data exist.
According to Parsons, “We need more precise and consistent approaches.” Working with the Federation of Earth Science Information Partners, NSIDC has developed guidelines based on the use of persistent identifiers, as defined within a set of interagency guidelines.
“The ideal model would be a unique ID for every subset of data,” Parsons said. He noted the difficulties of referring to a location inside a Web archive. You can’t use a page number as you would for a print document, for example. One option might be to develop a consistent way to refer to structural indices, akin to a chapter or verse heading in a traditional book.
Bill Cook outlined the issues facing science publishers, including the American Geophysical Union, where Cook recently served as director of publications. “We didn’t expect to be a long-term preservation facility for a large number of data sets,” he said. As software and data formats change, it may be beyond the scope of any single publisher to keep scientists’ data accessible. In a soon-to-be-released policy statement, the society will strongly encourage authors to archive their own data in approved data centers and cite them clearly.
Joan Starr (California Digital Library) laid out the possible identifiers authors can now use in pointing to their data, plus a new approach being fostered by her organization along with ten University of California campuses as well as peer institutions. EZID (http://n2t.net/ezid) allows users to create unique, long-term identifiers, either DataCite DOIs or archival resource keys (ARKs), and then provide metadata for virtually any unit of information.
Starr hopes that contributors will make particular use of DataCite’s “related identifier” field in EZID, which tells users, in essence, “I’m a dataset, and I’m related to this other dataset.” UCAR is an EZID member, and several UCAR/NCAR groups are working to assign unique identifiers to data sets and other resources, such as software packages.
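To make the idea of a “related identifier” concrete, here is a minimal sketch in Python of what a record linking one data set to another might look like. It is an illustration only: the class names, field names, relation wording, and identifier strings are made-up placeholders, not EZID’s or DataCite’s actual schema or API, and the citation layout is just one plausible arrangement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelatedIdentifier:
    """Pointer from one record to another resource (hypothetical structure)."""
    identifier: str  # persistent identifier of the other resource
    relation: str    # free-text description of the relationship


@dataclass
class DatasetRecord:
    """Descriptive metadata for one data set (hypothetical field names)."""
    identifier: str  # e.g. a DOI or ARK string
    creator: str
    title: str
    publisher: str
    year: int
    related: List[RelatedIdentifier] = field(default_factory=list)

    def citation(self) -> str:
        # One plausible layout; real citation guidelines may differ.
        return (f"{self.creator} ({self.year}). {self.title}. "
                f"{self.publisher}. {self.identifier}")


# Placeholder values throughout; none of these identifiers are real.
record = DatasetRecord(
    identifier="doi:10.5072/EXAMPLE1",
    creator="Doe, J.",
    title="Example Snow Depth Data",
    publisher="Example Data Center",
    year=2012,
)

# "I'm a dataset, and I'm related to this other dataset."
record.related.append(
    RelatedIdentifier(identifier="ark:/99999/example2", relation="is related to")
)

print(record.citation())
for link in record.related:
    print(f"{record.identifier} {link.relation} {link.identifier}")
```

The point of the sketch is that the relationship lives in the metadata alongside the identifier, so a reader who finds one data set can be led to the other without relying on the text of a paper.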
Organizers hope the workshop and its follow-ups will inspire scientists to make a more concerted effort at citing data. In his closing talk, Mayernik encouraged data archivists to start with clear-cut cases and consider long-term needs for tracking usage and maintaining data. As for researchers, Parsons offered this advice: “Keep it simple. It’s just a citation, not a solution to all data management issues.”
Funding for this workshop was provided by NOAA through UCAR’s Joint Office for Science Support. Presentations from the meeting are available online.