Data Selection

What is data selection?

Data selection means choosing which data should be stored during data collection, and which data should be shared or archived after the project is completed.

In some contexts, data selection also refers to the process by which a data archive chooses datasets that it considers worth long-term preservation (e.g. the selection criteria of the UK Data Archive). This aspect is not covered here.

Data Selection Decisions during a Research Project

During data collection, researchers have to define under which circumstances collected data should be stored or discarded. Typically, the principal investigator defines the criteria for this purpose. Because data selection procedures affect the resulting research data and, thus, the results of your research, they need to be thoroughly documented. Examples of data that may be considered irrelevant (and can be discarded) are data from incomplete runs or data produced with flawed code. Additionally, personal data should be deleted as soon as possible to comply with legal requirements, unless specific consent to keep this personal data was obtained (see the knowledge base’s section on data privacy).
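As an illustration of documenting selection decisions, the following minimal sketch applies two discard criteria and records the reason for every discarded run. The names Run, completed, code_ok and select_runs are hypothetical and chosen only for this example. Real criteria would be defined by the project.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    completed: bool  # assumption: False marks an incomplete run
    code_ok: bool    # assumption: False marks a run produced with flawed code


def select_runs(runs):
    """Split runs into kept and discarded, recording a reason for each discard."""
    kept, discarded = [], []
    for run in runs:
        if not run.completed:
            discarded.append((run.run_id, "incomplete run"))
        elif not run.code_ok:
            discarded.append((run.run_id, "flawed code"))
        else:
            kept.append(run)
    return kept, discarded


runs = [
    Run("r1", completed=True, code_ok=True),
    Run("r2", completed=False, code_ok=True),
    Run("r3", completed=True, code_ok=False),
]
kept, discarded = select_runs(runs)
```

Storing the discarded list alongside the kept data is one way to keep the selection procedure traceable.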

Although data can, naturally, only be shared after data collection is finished, it is important to consider your plans for data sharing before data collection starts. For example, if you obtained explicit consent to share personal data for only a subset of subjects, you will have to prepare different workflows for storing and anonymizing the data.
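The consent-dependent workflows mentioned above could be planned roughly as in the sketch below. It assumes that consent is recorded per subject as a boolean and that data from subjects without consent must be anonymized before sharing. The function name plan_sharing and the workflow labels are hypothetical.

```python
def plan_sharing(consent_by_subject):
    """Assign each subject ID to a workflow based on recorded consent."""
    plan = {"share_personal_data": [], "anonymize_before_sharing": []}
    for subject_id, consented in sorted(consent_by_subject.items()):
        if consented:
            plan["share_personal_data"].append(subject_id)
        else:
            plan["anonymize_before_sharing"].append(subject_id)
    return plan


# Hypothetical consent records: subject ID -> explicit consent to share
consent = {"s01": True, "s02": False, "s03": True}
plan = plan_sharing(consent)
```

Setting up such a split before data collection starts means that each subject's data can be stored under the right workflow from the outset.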

More information on data selection when sharing research data, i.e. how to define a subset of data that will be published, can be found in the knowledge base’s section on data sharing.