Documenting your data means providing enough descriptive information about it that it can be used properly by you, your colleagues, and other researchers in the future. Well-documented data is identifiable, understandable, and usable well into the future. Document your data at each stage of the research process rather than attempting to reconstruct the information later.
How to Document Data?
The term metadata is used for this documentation, since you are providing data about data. Researchers can choose among various metadata standards, often tailored to a particular file format or discipline. One such standard is DDI, designed to document numeric data files. Additional standards are listed under Metadata Standards and Schemas below.
The table below lists general aspects of your project and data that you should document, regardless of your discipline. At a minimum, store this documentation in a readme.txt file (or the equivalent) together with the data.
| Element | Description |
| --- | --- |
| Title | Name of the dataset or the research project that produced it |
| Creator | Names and addresses of the organizations or people who created the data |
| Identifier | Number used to identify the data, even if it is just an internal project reference number |
| Subject | Keywords or phrases describing the subject or content of the data |
| Funders | Organizations or agencies that funded the research |
| Rights | Any known intellectual property rights held for the data |
| Access information | Where and how the data can be accessed by other researchers |
| Language | Language(s) of the intellectual content of the resource, when applicable |
| Dates | Key dates associated with the data, including project start and end dates, release date, the time period covered by the data, and other dates in the data's lifespan (e.g., maintenance cycle, update schedule) |
| Location | For data that relates to a physical location, information about its spatial coverage |
| Methodology | How the data was generated, including the equipment or software used, the experimental protocol, and anything else you might record in a lab notebook |
| Data processing | How the data has been altered or processed, recorded as the work proceeds |
| Sources | Citations to material from which the data was derived, including details of where the source data is held and how it was accessed |
| List of file names | All data files associated with the project, with their names and file extensions (e.g., 'NWPalaceTR.WRL', 'stone.mov') |
| File formats | Format(s) of the data (e.g., FITS, SPSS, HTML, JPEG) and any software required to read it |
| File structure | Organization of the data file(s) and the layout of the variables, when applicable |
| Variable list | List of variables in the data files, when applicable |
| Code lists | Explanation of codes or abbreviations used in file names or variables (e.g., '999 indicates a missing value in the data') |
| Versions | Date/time stamp for each file, with a separate identifier for each version |
| Checksums | Checksum values so you can verify whether a file has changed over time (see the sketch below this table) |
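Several of the rows above (file lists, versions, and checksums) can be generated automatically rather than maintained by hand. The following is a minimal sketch, using only Python's standard library, that writes a tab-separated manifest recording each file's name, last-modified timestamp, and SHA-256 checksum; the data directory and the MANIFEST.txt file name are hypothetical placeholders.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Hash a file in chunks so large data files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_dir = Path("data")  # hypothetical project data directory
with open("MANIFEST.txt", "w") as manifest:
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            # Record each file's name, last-modified time, and checksum.
            stamp = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            manifest.write(f"{path}\t{stamp.isoformat()}\t{sha256sum(path)}\n")
```

Regenerating the manifest later and comparing the checksums shows whether any file has changed, been added, or gone missing since the last snapshot.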
Metadata Standards and Schemas
Selecting a standard or schema does not obligate you to use it to its fullest extent. You can use as much (or as little) as you need.
General Purpose Schemas
- Dublin Core – A general schema that can be adapted for specific disciplines (e.g., Dryad for the biosciences uses Dublin Core); a sketch of a simple Dublin Core record follows this list.
- FGDC – The Federal Geographic Data Committee schema is used for geospatial data. It is officially the Content Standard for Digital Geospatial Metadata (CSDGM) but is more commonly referred to as FGDC. Numerous editor tools are available.
- MODS – The Metadata Object Description Schema is “a schema for a bibliographic element set that may be used for a variety of purposes”.
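As a concrete illustration of how little a general-purpose schema demands, here is a minimal sketch that emits a simple Dublin Core record using Python's standard library. The element names follow the Dublin Core 1.1 element set; the values are hypothetical placeholders, and a real record would normally be produced by your repository's deposit tools.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("metadata")
for element, value in [
    ("title", "Example Survey Dataset, 2020-2023"),  # hypothetical values
    ("creator", "Example Research Group"),
    ("subject", "research data; documentation"),
    ("identifier", "PROJ-001"),
    ("date", "2023-06-30"),
    ("rights", "CC BY 4.0"),
]:
    # Qualified names like {namespace}title serialize with the dc: prefix.
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```

Note that the record uses only a handful of elements; as noted above, you can use as much (or as little) of a schema as you need.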
Humanities Schemas
- TEI – The Text Encoding Initiative is a consortium that collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines specifying encoding methods for machine-readable texts, chiefly in the humanities, social sciences, and linguistics.
- VRA Core – The VRA Core is a data standard for the description of works of visual culture as well as the images that document them.
Science Schemas
- Darwin Core – This schema is an adaptation of Dublin Core and is used primarily to describe natural history specimen collections and species observation databases.
- DIF – The Directory Interchange Format schema is used to describe Earth science data.
- EML – The Ecological Metadata Language schema is used to describe ecological data. The Morpho data management software, available for download, is recommended for creating and editing EML metadata and managing data collections.
- ITIS – The Integrated Taxonomic Information System is a standard used to describe data relating to the taxonomies of plants, animals, fungi, and microbes.
Social Science Schemas
- DDI – The Data Documentation Initiative is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.
- OLAC – The Open Language Archives Community schema was developed for the Open Archives Initiative and is based on Dublin Core.
Ensuring Future Usability
An equally important part of documentation is providing the information necessary to fully understand and interpret the data. At a minimum, this should include:
- a file manifest
- a short text describing the dataset, including any information that is not adequately captured in the structured metadata
- codebooks (a minimal sketch follows this list)
- variable descriptions
- documentation of experimental methods
- the software code used in analysis
- a discussion of the file structure and the relationships among files
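As a concrete example of the codebook and variable-description items above, the following minimal sketch writes a small machine-readable codebook as a CSV file; the variable names, descriptions, and codes are hypothetical.

```python
import csv

# Each entry is (variable name, description, code list); all hypothetical.
variables = [
    ("age", "Respondent age in years", "999 = missing"),
    ("region", "Region of residence", "1 = north; 2 = south; 999 = missing"),
    ("income", "Annual household income in USD", "999999 = refused"),
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["variable", "description", "codes"])
    writer.writerows(variables)
```

A codebook in this form can live alongside the readme.txt file and be read back by analysis scripts, keeping the documentation and the data in sync.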
Remember, it is easier to collect this information as the data is created than to reconstruct it after the fact.
Most data repositories and archives allow the submission of supporting documentation. And even if you have no plans to publish or distribute your data, keeping good records of the data as it evolves will pay dividends by helping you and your research team work easily with the data over time.