Data Cleaning

What is data cleaning?

Data cleaning refers to the process of identifying and removing invalid data points from a dataset. This involves examining the data for extreme outliers or erroneous data points that might bias the results of your research. To ensure that no data cooking occurs, data cleaning procedures have to be completed before the statistical analysis of study outcomes begins.

Data Cleaning and Data Entry Procedures

Common flaws that can impair the quality and consistency of the information recorded in a dataset are wild codes (e.g. three different values assigned to the variable sex), values that are out of range (e.g. the value 9 for items with a range from 1 to 5), inconsistent values, and implausible or illogical values. Since preventing errors beforehand is more economical, efficient and less strenuous than detecting and correcting them afterwards, you should always try to minimize errors in the data entry phase (see below). However, it is almost impossible to completely avoid mistakes during data entry. Therefore, cleaning the collected data is important to achieve a high-quality, reliable dataset that allows valid statistical analyses. Data cleaning can also help detect the causes of errors, and this information can be used to prevent future errors. As a result of your data cleaning procedures, the dataset should be consistent with the information given in the codebook (e.g. ranges and missing values of variables).
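As a minimal sketch of such a codebook check, assuming the data are available as a pandas DataFrame and using hypothetical variable names, ranges and file name, the following Python snippet reports every value that falls outside its documented range:

    import pandas as pd

    # Hypothetical codebook: allowed value range per variable,
    # as documented in the study's codebook.
    CODEBOOK_RANGES = {
        "sex": (1, 2),      # hypothetical coding: 1 = female, 2 = male
        "item_1": (1, 5),   # 5-point Likert item
        "age": (18, 99),
    }

    def report_range_violations(df):
        """List every cell whose value lies outside its codebook range."""
        violations = []
        for variable, (low, high) in CODEBOOK_RANGES.items():
            out_of_range = df.loc[(df[variable] < low) | (df[variable] > high), variable]
            for row, value in out_of_range.items():
                violations.append({"row": row, "variable": variable, "value": value})
        return pd.DataFrame(violations)

    data = pd.read_csv("study_data.csv")   # hypothetical file name
    print(report_range_violations(data))

The resulting table can then be compared against the raw records to decide whether a flagged value should be corrected or treated as missing.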

Three different kinds of problematic data points can be distinguished:

  • Wild codes or out-of-range values are most likely due to erroneous data entry in paper-and-pencil studies. Thus, they can easily be avoided by using online assessment platforms such as UNIPARK, which require no manual data entry but export data directly into statistical software. However, if manual data entry is necessary, specialized data entry software like EpiData or REDCap can be used to incorporate automatic checks for wild codes and out-of-range values, as well as double data entry. If no such software is used, quality control (comparing entered data and raw values) should still be performed for a small proportion of entries. In any case, the frequencies and distributions of all variables should be checked to identify problematic data points, which should then be double-checked. Additionally, some statistical packages allow you to define ranges for missing values. This can be a handy function for excluding wild codes from analyses (e.g. by stating that all values above 5 should be treated as missing for a 5-point Likert scale), as sketched in the first example after this list.
  • Inconsistent values (e.g. differing values for sex recorded for the same case) can be prevented by using electronic data capture or data entry software with appropriate checks. If cases in your data contain contradictory information, it has to be clarified which information is correct. If this is not possible, the corresponding cases should be excluded from all analyses involving the variables concerned. A simple check for such contradictions is sketched in the second example after this list.
  • Implausible or illogical values can also occur when data are captured electronically. In general, implausible values (e.g. constant values for biological measures that normally vary) or illogical values (e.g. decreases in parameters that cannot decrease) are harder to detect than simple wild codes because they require deep knowledge of what was measured. Nowadays, statistical software such as SPSS often incorporates elaborate procedures for checking for anomalous data points (which are unfortunately seldom used). You should include enough time for these checks in your project schedule, before statistical analyses start, because changes to the dataset that are made after analyses have begun always have a negative connotation, even if they are justified (e.g. residual plots indicating implausible values that can be explained by columns swapped during data entry). Even if no explanation for the occurrence of implausible or illogical values can be found, this issue should be discussed. In this case, analyses on the related outcomes should be conducted (and reported) twice, with and without the cases concerned. Basic checks of this kind are also included in the second example after this list.
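As a first, minimal sketch of the frequency checks and missing-value recoding mentioned above, assuming pandas and hypothetical variable and file names:

    import numpy as np
    import pandas as pd

    data = pd.read_csv("study_data.csv")   # hypothetical file name

    # Inspect the frequency distribution of every variable to spot
    # wild codes and out-of-range values (missing values included)
    for column in data.columns:
        print(data[column].value_counts(dropna=False))

    # Treat out-of-range values of a hypothetical 5-point Likert item as missing
    data["item_1"] = data["item_1"].where(data["item_1"].between(1, 5), np.nan)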
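The second sketch illustrates simple consistency and plausibility checks; the participant identifier, the repeated sex variable and the measures age_baseline, age_followup and heart_rate are all hypothetical names chosen for illustration:

    import pandas as pd

    data = pd.read_csv("study_data.csv")   # hypothetical file name

    # Inconsistent values: cases whose recorded sex differs between rows
    sex_codes = data.groupby("participant_id")["sex"].nunique()
    print(sex_codes[sex_codes > 1])

    # Implausible values: a biological measure that normally varies but shows
    # no variation at all across the sample
    if data["heart_rate"].nunique(dropna=True) <= 1:
        print("heart_rate shows no variation - check data entry")

    # Illogical values: a parameter that cannot decrease between measurements
    shrinking = data[data["age_followup"] < data["age_baseline"]]
    print(shrinking[["participant_id", "age_baseline", "age_followup"]])

Flagged cases should be checked against the raw records before any correction or exclusion is made and documented.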

Cleaning data can take up a lot of time, especially when your dataset is large. Specialized software solutions, such as OpenRefine, can help detect common mistakes in a dataset, and most statistical programs also include some functionality of their own. In clinical trials, ethical codes stipulate that the database be locked before any statistical analyses are performed and that clear criteria and workflows be defined for reopening it. More information on database locks can be found on the Clinical Safety Geek site.

Further Resources