Publishing and Sharing Data – Guide to Research Data Management

Your research is important, and so is the data that it is based on. Making your data available is an important part of your work as both a researcher and a scholar. When publishing your data, you will need to:

Describe your data using a standard metadata schema for your field
Describe your research process and methodology
Get a DOI for your data and code
Select a repository where your materials can be made available to other researchers

Here are some common ways to publish and share data:

Data Journals

The “data journal” is an emerging alternative. In a data journal, the data is the focus and the article describes the dataset. This allows the data to be cited in a familiar form.

F1000Research – F1000Research is an open research publishing platform for researchers in all subject areas.
GigaScience – GigaScience is an open access, open data, open peer-review journal focusing on ‘big data’ research from the life and biomedical sciences.
Scientific Data – Scientific Data is a peer-reviewed, open-access journal for descriptions of datasets and for research that advances the sharing and reuse of scientific data.

Disciplinary Repository

Disciplinary repositories offer high visibility within a particular field. Not all repositories are committed to long-term preservation of data, and their mission and focus may change over time. Some are only available to subscribers.

Not all of the repositories listed here accept researcher-produced datasets, and not all of them can ensure long-term preservation of your data; contact each one for more details.

Cambridge Structural Database – Established in 1965, the CSD is the world’s repository for small-molecule organic and metal-organic crystal structures. Containing the results of over half a million X-ray and neutron diffraction analyses, this unique database of accurate 3D structures has become an essential resource to scientists around the world.
DataCite – A not-for-profit organization which aims to establish easier access to research data on the Internet; increase acceptance of research data as legitimate, citable contributions to the scholarly record; and support data archiving that will permit results to be verified and re-purposed for future study. DataCite makes research more effective by connecting research outputs and resources, from data and preprints to images and samples. DataCite supports the creation and management of DOIs and metadata records, enhances research workflows with service integration, and enables the discovery and reuse of research outputs and resources.
DataONE – A community-driven program providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data. DataONE promotes best practices in data management through responsive educational resources and materials. DataONE envisions researchers, educators, and the public using DataONE to better understand and conserve life on Earth and the environment that sustains it.
DRYAD – Dryad is an open data publishing platform and a community committed to the open availability and routine re-use of all research data.
GenBank – GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
ICPSR – The Inter-university Consortium for Political and Social Research is an international consortium of more than 810 academic institutions and research organizations. ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. The ICPSR maintains a data archive of more than 350,000 files of research in the social and behavioral sciences. It hosts 23 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
NSSDCA – The NASA Space Science Data Coordinated Archive serves as the permanent archive for NASA space science mission data. “Space science” means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As permanent archive, NSSDCA teams with NASA’s discipline-specific space science “active archives”, which provide access to data to researchers and, in some cases, to the general public. NSSDCA also serves as NASA’s permanent archive for space physics mission data. It provides access to several geophysical models and to data from some non-NASA missions.
NIH-Supported Data Sharing Resources – This page shows a list of NIH-supported data repositories that accept submissions of appropriate data from NIH-funded investigators.
re3data.org – The Registry of Research Data Repositories is a global registry of research data repositories from different academic disciplines.
Scientific Data’s list of recommended repositories – Scientific Data mandates the release of datasets but does not itself host data. Instead, it asks authors to submit datasets to an appropriate public data repository. This is its list of recommended data repositories.

Institutional Repository

The mission of an institutional repository is to permanently preserve the scholarly output of the institution. Here at the Missouri University of Science and Technology, our institutional repository is known as Scholars’ Mine. Scholars’ Mine serves this function, and preserves text, audio, video, data and more. Scholars’ Mine is designed to meet the needs of scholars in all disciplines, and operates according to widely accepted standards for preservation and access.

Journals

Some journals publish data associated with their published articles. This will provide good visibility, but is often tied to a journal subscription, limiting access. Compliance with documentation standards and long-term preservation vary considerably from journal to journal.

Self-publishing

Self-publishing occurs through individual, institutional, or third-party websites. The researcher assumes the responsibility for vetting their own data for quality and documentation, as well as preserving an accessible version of the data as file formats change in the future. Tools are emerging which focus on the broad sharing of data, while allowing individual researchers or research centers to manage their own data on a remote server. The long-term implications are uncertain at this point.

It is not necessary to choose only one of these options. In fact, there are advantages to using multiple publishing options. Most of these options do not require an exclusive granting of rights, making it possible to deposit data in multiple locations, which maximizes both current visibility and long-term preservation.

Citing Data

Citing data is highly recommended, both to provide reliable access to specific datasets and to give credit to the producers of useful data. Data citation standards are just beginning to emerge in many disciplines. In the absence of a specific standard, a data citation should include the following:

Author or Responsible Party (such as: study PI, sample collector, government agency)
Name of the Data Element used (e.g., a specific Table/Map/dataset with any applicable unique IDs)
Name of the Database
Name of the Publication (if applicable)
Name of the Repository (if applicable)
Version identifier (Study number, edition, year, version, etc.)
Date accessed
URL used

If specific steps were required to subset, analyze, or access the data, the citation should also include:

parameters selected
software used
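
The elements listed above can be gathered into a small record before a citation is written out. The following is a minimal sketch in Python, not a formal citation style: the class name `DataCitation`, its field names, and the order in which the parts are joined are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Elements of a data citation, following the lists above (names are hypothetical)."""
    author: str                        # author or responsible party
    data_element: str                  # specific table, map, or dataset used
    database: str                      # name of the database
    version: str                       # study number, edition, year, version, etc.
    date_accessed: str                 # date the data was accessed
    url: str                           # URL used
    publication: Optional[str] = None  # if applicable
    repository: Optional[str] = None   # if applicable
    parameters: Optional[str] = None   # parameters selected, if the data was subset
    software: Optional[str] = None     # software used to access or analyze the data

    def format(self) -> str:
        """Join the available elements in a generic order (not a formal style)."""
        parts = [self.author, self.data_element, self.database]
        parts += [p for p in (self.publication, self.repository) if p]
        parts.append(f"Version {self.version}")
        if self.parameters:
            parts.append(f"Parameters: {self.parameters}")
        if self.software:
            parts.append(f"Software: {self.software}")
        parts.append(f"Accessed {self.date_accessed}")
        parts.append(self.url)
        return ". ".join(parts)
```

Keeping the optional elements as separate fields makes it easy to see at a glance which parts of a citation are still missing. The output should still be checked against the style guide required by your discipline or journal.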

If you have a DOI, you can use the CrossCite DOI data citation formatter or the DataCite citation formatter to create citations corresponding to a variety of citation styles.
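
Formatted citations can also be requested programmatically. The sketch below assumes the DOI's registration agency supports DOI content negotiation with the `text/x-bibliography` media type, which is the mechanism these formatters are understood to rely on; the function names are hypothetical, and the style name (for example `apa`) is assumed to be a valid Citation Style Language style identifier.

```python
import urllib.request
from urllib.parse import quote


def citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build a request asking the DOI resolver for a formatted citation."""
    url = "https://doi.org/" + quote(doi, safe="/")
    # The Accept header selects a formatted bibliography entry in the given style.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi: str, style: str = "apa") -> str:
    """Send the request and return the citation text (requires network access)."""
    with urllib.request.urlopen(citation_request(doi, style), timeout=30) as resp:
        return resp.read().decode("utf-8").strip()
```

For example, `fetch_citation("10.5334/dsj-2020-005")` would request an APA-style citation for the article cited later in this guide. As with the web-based formatters, treat the result as a starting point and check it against the style guide.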

Most citation style guides and manuals now include data as a resource type. The citation formatters above produce output that only approximates each style’s requirements, so confirm that any generated citation fully follows the relevant style guide.

Here are some additional examples of guidelines:

American Geophysical Union (AGU) author guidelines for citing data sets
Federation of Earth Science Information Partners (ESIP) Interagency Data Stewardship/Citations
Citing and linking to the Gene Expression Omnibus (NCBI) database
The Inter-university Consortium for Political and Social Research (ICPSR) provides recommended citation procedures
DataCite citation examples

Citing Code

Citing code is as important as citing data, and for similar reasons: you’re providing appropriate credit, facilitating reproducibility, and ensuring future researchers can find and use the code.

A code citation should include:

Creator (i.e., authors or organization who developed the software)
Title
Identifier (e.g., DOI or other persistent link)
Date of publication
Version
Publisher (e.g., repository name)
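
A quick way to use this list is as a checklist before writing the citation. The sketch below is illustrative only: the dictionary keys and function names are hypothetical, and the output layout loosely resembles an APA-style software reference without having been checked against any style guide.

```python
# Field names mirror the list above; the key names themselves are assumptions.
REQUIRED_FIELDS = ("creator", "title", "identifier", "date", "version", "publisher")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS
            if not str(record.get(name) or "").strip()]


def format_code_citation(record: dict) -> str:
    """Format a code citation, refusing to proceed if any element is missing."""
    missing = missing_fields(record)
    if missing:
        raise ValueError("missing citation fields: " + ", ".join(missing))
    return (
        "{creator} ({date}). {title} (Version {version}) "
        "[Computer software]. {publisher}. {identifier}"
    ).format(**record)
```

Raising an error on missing elements, rather than silently omitting them, makes gaps such as an unrecorded version number visible before the citation is published.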

The Force11 Software Citation Implementation Working Group has developed principles for software citation. Their GitHub page has examples of citing software in both APA and Chicago Style.

Data Availability Statements

When publishing an article using research data, the journal may require a data availability statement that briefly describes if and how readers can access the data that informs the research. This chart shows some sample language you might use for a data availability statement.

Data Availability	Sample Language
Data openly available in a public repository that issues datasets with DOIs	The datasets generated during and/or analyzed during the current study are available in the [repository name, e.g. “Iowa Research Online”] at [http://doi.org/[doi]]
Data available on request due to privacy/ethical restrictions	The datasets generated during and/or analyzed during the current study are not publicly available due to [explanation of restrictions, e.g. “their containing private information”] but are available from the corresponding author on reasonable request.
Data available on request from the authors	The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Data sharing not applicable – no new data generated	Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Data available within the article or its supplementary materials	All data generated or analyzed during this study are included in this published article [and/or] its supplementary information files.
Data subject to third party restrictions	The data that support the findings of this study are available from [third party name] but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of [third party name].

This chart is adapted from the article cited below and licensed under a Creative Commons Attribution license (CC-BY):
Hrynaszkiewicz, I, Simons, N, Hussain, A, Grant, R and Goudie, S. 2020. “Developing a Research Data Policy Framework for All Journals and Publishers.” Data Science Journal, DOI: http://doi.org/10.5334/dsj-2020-005
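
If you prepare statements for several articles, the sample language in the chart can be kept as fill-in templates. This is a minimal sketch: the template keys and placeholder names are hypothetical, and the wording is copied from the chart above rather than from any particular journal's requirements.

```python
# Template keys and placeholder names are assumptions; wording follows the chart.
STATEMENTS = {
    "public_repository": (
        "The datasets generated during and/or analyzed during the current "
        "study are available in the {repository} at {doi_url}"
    ),
    "restricted": (
        "The datasets generated during and/or analyzed during the current "
        "study are not publicly available due to {reason} but are available "
        "from the corresponding author on reasonable request."
    ),
    "on_request": (
        "The datasets generated during and/or analyzed during the current "
        "study are available from the corresponding author on reasonable request."
    ),
    "not_applicable": (
        "Data sharing not applicable to this article as no datasets were "
        "generated or analyzed during the current study."
    ),
    "third_party": (
        "The data that support the findings of this study are available from "
        "{third_party} but restrictions apply to the availability of these "
        "data, which were used under license for the current study, and so "
        "are not publicly available. Data are, however, available from the "
        "authors upon reasonable request and with permission of {third_party}."
    ),
}


def availability_statement(kind: str, **details: str) -> str:
    """Fill in the template for the given kind of data availability."""
    return STATEMENTS[kind].format(**details)
```

A call such as `availability_statement("restricted", reason="their containing private information")` fills in the bracketed explanation from the chart. Always check the target journal's own instructions, since required wording varies.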

Additional examples of data availability statements from publishers: