Data Integrity

The term data integrity is used in different contexts: it may refer to the consistency of the information recorded within a digital data object or to the consistency of the digital data object itself.

Data Integrity and Data Cleaning (consistency of the information represented by the data)

Common flaws that can impair the quality and consistency of the information recorded in a dataset are wild codes (e.g. a third value appearing for the variable sex when only two codes are defined), out-of-range values (e.g. the value 9 for an item with a range from 1 to 5), inconsistent (illogical) values, and implausible values.
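
As an illustration, the following minimal Python sketch (using the pandas library, with hypothetical column names and code lists) flags wild codes and out-of-range values in a small example table:

    import pandas as pd

    # Hypothetical survey data: "sex" is coded 1/2, "item1" has a valid range of 1-5.
    df = pd.DataFrame({
        "sex":   [1, 2, 2, 3, 1],       # 3 is a wild code (not in the code list)
        "item1": [4, 9, 2, 5, 1],       # 9 is out of range
        "age":   [25, 31, 17, 40, 29],
    })

    # Wild codes: values that are not part of the defined code list
    wild_sex = df[~df["sex"].isin([1, 2])]

    # Out-of-range values for an item with range 1-5
    out_of_range = df[~df["item1"].between(1, 5)]

    print(wild_sex)
    print(out_of_range)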

Data cleaning comprises all measures taken to ensure data integrity and to detect and correct the flaws mentioned above. Data cleaning procedures should be outlined in advance (see the knowledge base's section on data cleaning for more information).

Data Integrity and Checksums (consistency of the data file itself)

Data integrity may also refer to the consistency of a dataset, meaning that no changes to the data file have occurred accidentally or due to transmission errors. Checksums can be used to create fingerprints of digital objects and to ensure data integrity, because a dataset's checksum changes whenever the dataset is modified. Accidental changes, and changes caused by software or hardware faults, thus become detectable. You should therefore use checksums when generating copies of master files, creating back-up copies, or downloading files from repositories, to verify that the copy and the original file are identical. Examples of checksum algorithms are SHA and MD5. Checksum generators are freely available as web applications or freeware.
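
A minimal Python sketch of this verification step, using the standard-library hashlib module and hypothetical file names, computes a SHA-256 checksum for the master file and its copy and compares the two:

    import hashlib

    def sha256_checksum(path, chunk_size=65536):
        """Compute the SHA-256 checksum of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical file names: verify that a copy matches the master file.
    if sha256_checksum("master_file.csv") == sha256_checksum("backup_copy.csv"):
        print("Checksums match: the copy is identical to the original.")
    else:
        print("Checksums differ: the copy has been altered or corrupted.")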

Further References

  • Chapman, A. D. (2005). Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Copenhagen: Report for the Global Biodiversity Information Facility.
  • UK Data Archive. Managing and Sharing Data: Best Practice for Researchers. Retrieved from https://www.ukdataservice.ac.uk/manage-data/store/checksums