Versioning

What is versioning?

Versioning, or version control, means saving changes to (data) files and keeping a record of those changes. Whenever a file changes, a new copy with a new version number should be generated. This makes it possible to return to older versions at any time and to reconstruct how a data file developed. Reconstructing the development of a data file is often referred to as data provenance documentation and is an essential property of transparent, reproducible science. Versioning should follow a systematic scheme that specifies, for example, under which circumstances a new version is created.
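The rule "a new copy with a new version number whenever a file changes" can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed tool: the function name save_new_version, the archive folder, and the naming pattern name_v001.ext are assumptions made for the example.

```python
import re
import shutil
from pathlib import Path


def save_new_version(data_file, archive_dir):
    """Copy data_file into archive_dir under the next free version number.

    Hypothetical helper: the naming scheme <stem>_v<NNN><suffix> is an
    assumption for this sketch, not a standard.
    """
    data_file = Path(data_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Find version numbers already used for this file,
    # e.g. "survey_v003.csv" yields 3.
    pattern = re.compile(
        re.escape(data_file.stem) + r"_v(\d+)" + re.escape(data_file.suffix)
    )
    used = []
    for existing in archive_dir.iterdir():
        match = pattern.fullmatch(existing.name)
        if match:
            used.append(int(match.group(1)))
    next_version = max(used, default=0) + 1

    target = archive_dir / f"{data_file.stem}_v{next_version:03d}{data_file.suffix}"
    shutil.copy2(data_file, target)  # copy2 also keeps file timestamps
    return target
```

Calling save_new_version("survey.csv", "versions") after each change leaves the working file in place and adds survey_v001.csv, survey_v002.csv, and so on to the archive folder, so every earlier state remains available.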

How to do versioning?

Depending on the complexity of your research data, you can apply the following procedures for data versioning:

  • Defining milestones. When a predefined milestone is reached (e.g. all collected data have been entered into the data file), a separate milestone version of the file (master-file) is created. Copies of this master-file should be generated in different formats (e.g. csv, xml, sav) and archived. Generating a checksum can be an additional safety measure, see data integrity.
  • Using sub-versions. Sub-versions denote small changes made within a single working day, while major versions are milestone versions or particularly important updates. Sub-versions do not need to be saved in multiple formats or with checksums.
  • Providing instructions. Document (e.g. in a readme file) which changes are required in other files as a consequence of changing or updating one file.
  • Setting validation dates. Determine specific dates on which to validate and, if needed, harmonize the data files. Such a date could be, for example, shortly before a milestone is reached.
  • Adding a change log. Add a change log to each data file that describes the changes made in the latest version.
  • Using collaborative working environments. Specialized software or built-in program functions can be very handy for working collaboratively on documents, managing versions, or synchronizing folder contents. The most prominent example of versioning software is Git, often used together with a hosting platform such as GitHub, which is widely used in software development.
  • Creating safety copies regularly. This also includes controlling access to these safety copies.
  • Publication-related data storage. In Psychology, primary data should never be altered (i.e., transformed, aggregated, recoded). If you publish an article, you should be able to publish the raw data along with syntax files that reproduce your final results (Schönbrodt, Gollwitzer & Abele-Brehm, 2016).
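Two of the steps above, generating a checksum for a milestone version and adding a change log entry, can be combined in a short script. The following is a sketch under stated assumptions: the function names, the one-line log format, and the file name CHANGELOG.txt are invented for illustration, and SHA-256 is just one common choice of checksum.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_milestone(master_file, version, note, changelog="CHANGELOG.txt"):
    """Append a change log line for a milestone version (hypothetical format)."""
    master_file = Path(master_file)
    checksum = sha256_of(master_file)
    entry = (
        f"{date.today().isoformat()}  v{version}  {master_file.name}  "
        f"sha256={checksum}  {note}\n"
    )
    # The log is kept next to the master-file; this location is an assumption.
    with open(master_file.parent / changelog, "a", encoding="utf-8") as log:
        log.write(entry)
    return checksum


def is_unchanged(master_file, expected_checksum):
    """Check that an archived file still matches its recorded checksum."""
    return sha256_of(master_file) == expected_checksum
```

For example, record_milestone("survey_master.csv", "1.0", "all collected data entered") writes one dated line to the change log and returns the checksum. Running is_unchanged with that checksum at a later date shows whether the archived master-file has been modified or corrupted in the meantime.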

Further Resources

  • Further information on data versioning is provided by the Australian National Data Service and the UK Data Archive.
  • The DANS guide on data preparation for data sharing also provides a comprehensive introduction to issues related to data versioning.

References