Don't forget the data

Workshop explores best practices for data citation

May 18, 2012  •  The rapid growth in science journals has produced an avalanche of literature that keeps researchers scrambling to keep up. Underneath, there’s an even larger buildup of supporting data. Specialists worry that this ocean of information is increasingly untraceable and prone to loss—a tsunami that could turn into a black hole.

Tim Killeen and Joan Starr at the podium
Speakers at the workshop “Bridging Data Lifecycles: Tracking Data Use via Data Citations,” held at UCAR on April 5–6, included Joan Starr, manager of strategic and project planning at the California Digital Library (top), and Tim Killeen, assistant director of NSF’s Directorate for Geosciences. (©UCAR. Photos by Carlye Calvin. These images are freely available for media & nonprofit use.)

Experts from across data centers, research libraries and organizations, and federal agencies put their heads together at an April workshop on data citation. The meeting, which drew more than 75 participants, was hosted by UCAR and organized by Mary Marlino, Karon Kelly, and Matt Mayernik, all from the NCAR Library. Speakers at the workshop addressed the full life cycle of data—from creation to preservation—as well as the fragmented state of current data-citation practices and some potential paths toward improvement.

Tim Killeen, assistant director of NSF’s Directorate for Geosciences, issued a "dear colleague" letter on March 29 and delivered a keynote speech at the UCAR workshop. “Now is the time for geoscientists to begin to meet the challenges of data citation,” he wrote. “While many policy and practical challenges remain to be resolved and implemented, the Directorate for Geosciences encourages members of the community to lead an evolutionary transformation to establish data citation within the geosciences as the rule rather than the exception.”

Right now, those who do attempt to cite their data face a Wild West of options, several of which were outlined in the workshop’s opening talk by Mark Parsons (National Snow and Ice Data Center, or NSIDC). An author might refer to supplemental material on a publisher’s website, provide a URL or digital object identifier (DOI) that leads to an archive at her or his own institution, include a footnote that calls out related papers or an archived data set, or simply mention in the narrative that supporting data exist.

According to Parsons, "We need more precise and consistent approaches." Working with the Federation of Earth Science Information Partners, NSIDC has developed guidelines based on the use of persistent identifiers, as defined in a set of interagency guidelines.

“The ideal model would be a unique ID for every subset of data,” Parsons said. He noted the difficulties of referring to a location inside a Web archive. You can’t use a page number as you would for a print document, for example. One option might be to develop a consistent way to refer to structural indices, akin to a chapter or verse heading in a traditional book.

Bill Cook outlined the issues facing science publishers, including the American Geophysical Union, where Cook recently served as director of publications. “We didn’t expect to be a long-term preservation facility for a large number of data sets,” he said. As software and data formats change, it may be beyond the scope of any single publisher to keep scientists’ data accessible. In a soon-to-be-released policy statement, the society will strongly encourage authors to archive their own data in approved data centers and cite them clearly.

Joan Starr (California Digital Library) laid out the possible identifiers authors can now use in pointing to their data, plus a new approach being fostered by her organization along with ten University of California campuses as well as peer institutions. EZID allows users to create unique, long-term identifiers, either DataCite DOIs or archival resource keys (ARKs), and then provide metadata for virtually any unit of information.

Starr hopes that contributors will make particular use of DataCite’s “related identifier” field in EZID, which tells users, in essence, “I’m a dataset, and I’m related to this other dataset.” UCAR is an EZID member, and several UCAR/NCAR groups are working to assign unique identifiers to data sets and other resources, such as software packages.
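As a rough illustration of the "related identifier" idea, the sketch below models a data set record that carries its own persistent identifier plus a pointer to another data set. It is not EZID's or DataCite's actual interface: the class and field names are hypothetical, the identifiers are placeholders, and the relation label "IsDerivedFrom" is borrowed from the DataCite metadata vocabulary purely as an example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelatedIdentifier:
    """Pointer from one resource to another (hypothetical structure)."""
    identifier: str       # identifier of the other resource
    identifier_type: str  # e.g. "DOI" or "ARK"
    relation_type: str    # e.g. "IsDerivedFrom"


@dataclass
class DatasetRecord:
    """Minimal citation metadata for a data set (illustrative only)."""
    identifier: str
    creator: str
    title: str
    publisher: str
    year: int
    related: List[RelatedIdentifier] = field(default_factory=list)

    def citation(self) -> str:
        # Assemble a simple human-readable citation string.
        return (f"{self.creator} ({self.year}). {self.title}. "
                f"{self.publisher}. {self.identifier}")

    def relations(self) -> List[str]:
        # "I'm a dataset, and I'm related to this other dataset."
        return [f"{self.identifier} {r.relation_type} {r.identifier}"
                for r in self.related]


# Placeholder values; none of these identify real data sets.
record = DatasetRecord(
    identifier="doi:10.5072/FK2EXAMPLE",
    creator="Doe, J.",
    title="Example Snow Cover Subset",
    publisher="Example Data Center",
    year=2012,
    related=[RelatedIdentifier("ark:/99999/fk4example", "ARK", "IsDerivedFrom")],
)

print(record.citation())
for line in record.relations():
    print(line)
```

Keeping the relation alongside the identifier is what lets a reader, or a usage-tracking tool, follow the chain from a cited subset back to the archive it came from.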

Organizers hope the workshop and its follow-ups will inspire scientists to make a more concerted effort at citing data. In his closing talk, Mayernik encouraged data archivists to start with clear-cut cases and consider long-term needs for tracking usage and maintaining data. As for researchers, Parsons offered this advice: “Keep it simple. It’s just a citation, not a solution to all data management issues.”

Funding for this workshop was provided by NOAA through UCAR’s Joint Office for Science Support. Presentations from the meeting are available online.


*Media & nonprofit use of images: Except where otherwise indicated, media and nonprofit use permitted with credit as indicated above and compliance with UCAR's terms of use. Find more images in the NCAR|UCAR Multimedia & Image Gallery.

The University Corporation for Atmospheric Research manages the National Center for Atmospheric Research under sponsorship by the National Science Foundation. Any opinions, findings and conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.