Data Selection

What is data selection?

Data selection means choosing which data should be stored during data collection and which data should be shared or archived after the project is completed.

In some contexts, data selection also refers to the process by which a data archive chooses datasets that are considered worth long-term preservation (e.g. the selection criteria of the UK Data Archive). This aspect will not be considered here.

Data Selection Decisions during a Research Project

During data collection, researchers have to define under which circumstances collected data should be stored or discarded. Typically, the principal investigator defines the criteria for this purpose, but other bodies (e.g. the research institution) can also be responsible for defining them. As data selection procedures affect the resulting research data, they need to be thoroughly documented. Examples of data that may be considered irrelevant (and can be discarded) are data based on incomplete runs or flawed codes. Additionally, to comply with legal requirements, personal data should be deleted as soon as possible if no specific consent to keep them was obtained (see the knowledge base’s section on data privacy).
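As an illustration of documenting selection decisions, the following minimal Python sketch applies two discard criteria and records the reason for every decision. The `Run` class, its fields and the criteria themselves are hypothetical and would have to be replaced by the criteria defined for your project.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # All fields are hypothetical placeholders for project-specific criteria.
    run_id: str
    completed: bool      # False if the run was aborted or is incomplete
    flawed_code: bool    # True if the run was produced with flawed code


def select_runs(runs):
    """Return the runs to keep and a log documenting every decision."""
    kept, log = [], []
    for run in runs:
        if not run.completed:
            log.append((run.run_id, "discarded", "incomplete run"))
        elif run.flawed_code:
            log.append((run.run_id, "discarded", "flawed code"))
        else:
            kept.append(run)
            log.append((run.run_id, "kept", ""))
    return kept, log


runs = [
    Run("r1", completed=True, flawed_code=False),
    Run("r2", completed=False, flawed_code=False),
    Run("r3", completed=True, flawed_code=True),
]
kept, log = select_runs(runs)
for entry in log:
    print(entry)
```

Storing the log alongside the data is one way to make the selection procedure traceable later on.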

Although data can, naturally, only be shared after data collection is finished, it is important to consider your plans for data sharing before data collection starts. For example, if you obtained explicit consent to share personal data only for a subset of subjects, you will have to prepare different workflows for storing and anonymizing the data.
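A consent-dependent workflow of this kind could be sketched as follows. This is only an illustration: the field names (`name`, `email`, `consent_to_share_personal_data`) are assumptions, and removing a few direct identifiers is not in itself sufficient anonymization, since indirectly identifying characteristics may remain.

```python
# Hypothetical set of fields that hold personal data in this example.
PERSONAL_FIELDS = {"name", "email"}


def prepare_for_sharing(record):
    """Return a copy of the record that matches the subject's consent."""
    if record.get("consent_to_share_personal_data", False):
        # Explicit consent was given: personal data may stay in the shared copy.
        return dict(record)
    # No consent: drop the personal fields before sharing.
    return {key: value for key, value in record.items()
            if key not in PERSONAL_FIELDS}


subjects = [
    {"id": 1, "name": "A", "email": "a@example.org", "score": 12,
     "consent_to_share_personal_data": True},
    {"id": 2, "name": "B", "email": "b@example.org", "score": 9,
     "consent_to_share_personal_data": False},
]
shared = [prepare_for_sharing(subject) for subject in subjects]
```

A missing consent flag is treated like a refusal here, so that personal data are only shared when consent is explicitly recorded.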

More information on data selection when sharing research data, i.e. how to define a subset of data that will be published, can be found in the knowledge base’s section on data sharing.

Practical Guideline

Heiko Tjalsma and Jeroen Rombouts created practical guidelines for appraising and selecting research data, on which the following information is based (see also the webpage of Research Data Netherlands for a more condensed checklist on this topic). The following points should be considered in the selection process.

Selection criteria

  • primary vs. secondary data
    • primary data are data in their original, unedited form (often these are also the raw data that have not yet been changed by the researcher). Publishing primary data is not (yet) common, but they are needed for verification purposes, e.g. when performed analyses have to be reconstructed.
    • data become secondary data when researchers process or change the primary data (e.g. transform values, create specific values, etc.). These are often the data that are shared with others.
  • who makes the selection decision?
    • institute: The data policy of a research institute may contain information regarding the goals, resources and legal obligations and may also offer information about which data to select for preservation/sharing.
    • data repository: The data repository often has its own collection criteria, which state which research data it preserves and under which conditions.
    • community: The members of the community who are interested in the data can also influence the data selection process. Important factors at this level of data selection are standardisation, legal or cultural aspects, and specific properties of the data, such as open and permanent access and data format.
  • technical aspects
    • which data formats, software or hardware are used?
  • metadata
    • are the metadata sufficient and available? Which information do they contain (e.g. technical information, codebooks, information on the data structure and on intellectual property rights)?
  • which infrastructure is available to preserve the data?
    • data archive
    • institutional or thematic repository
    • other?
  • costs of data selection
    • how are the costs for selecting, converting, preserving and making the data available to be covered?

Which data must or should be kept?

The DCC (Digital Curation Centre) offers Five steps to decide what data to keep. They distinguish between data that must be kept and data that should be kept.

Data that must be kept

Are there Research Data Policy reasons to keep it?

  • Relevant funder or institutional policies can determine which data must be retained. In most cases they require that data with an acknowledged long-term value be retained. The respective policies can provide information about this. Journals may also have research data policies that require the data for a published article to be made public.

Do regulations require the data to be available?

  • Are there disciplinary rules requiring the retention of data in research records, e.g. for health or safety reasons?

Are there other legal or contractual reasons?

  • Are the data of commercial value? Are they used in a patent application?
  • Are there contractual terms and conditions which imply that data must be retained?

Does it contain personal data relevant to the reuse purpose?

  • Concerning personal data, laws such as the Data Protection Act define how and how long personal data should be kept as well as requirements for disposal of data. See also the knowledge base’s site on privacy for more information about personal data.
  • Data that contain directly or indirectly identifying characteristics should never be published, but, depending on your research institution’s ethics approval, they could be retained and used for further research under specific conditions. In that case it is also important to check whether the consent agreement signed by the participant allows (personal) data to be retained or archived. You will also have to check whether the data can be stored securely in line with information security standards.

Data that should be kept

Is it good enough?

  • Is there enough descriptive information about the data (what data is it, how and why was it collected and how has it been processed), e.g. from an up-to-date data management plan? When using the standardized forms of DataWiz, this can be ensured.
  • Is the data quality good enough (e.g. the completeness of the data, adequate sample size, accuracy, validity, reliability or other relevant criteria)?

Is there likely to be a demand?

  • Are users already waiting for this data or is there evidence for a known demand?
  • Does the funder, the professional society or another equivalent body in the research field recommend sharing data of this type?
  • Does the data have potential for integration? Does it describe things that can fit into standardized terms or vocabularies also in other research domains?
  • Have the data been compiled by a renowned researcher / research group? Or will making the data available significantly enhance a group’s or project’s reputation?
  • Could the data be of broad significance or broad appeal? E.g., a landmark discovery or international policy and social concerns?

How difficult is it to replicate?

  • Is the data non-replicable (e.g. based on unrepeatable observations)? This is also the case if reproducing the data would be very difficult and / or costly.

Do any barriers to further use exist?

  • Is everything cleared regarding data sensitivity or privacy / ethical, contractual, license or copyright terms and conditions which would restrict public access and reuse?
  • Is the data in an open format that does not require license fees or proprietary software or hardware to use?
  • If any specialist software / hardware is needed to process and use the data, is it widely used in the field of study?

Is it the only copy?

  • Is the data the only and most complete copy?
  • Is the data currently held somewhere that cannot guarantee long-term storage?
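
To document the appraisal of a dataset, the questions listed above can be kept in a simple checklist. The following Python sketch is only an illustration and not part of the DCC guidance: it stores a free-text note per question and reports which questions have not been addressed yet. The function name and the example notes are hypothetical.

```python
# Question headings as used in this section.
MUST_KEEP_QUESTIONS = [
    "Are there Research Data Policy reasons to keep it?",
    "Do regulations require the data to be available?",
    "Are there other legal or contractual reasons?",
    "Does it contain personal data relevant to the reuse purpose?",
]
SHOULD_KEEP_QUESTIONS = [
    "Is it good enough?",
    "Is there likely to be a demand?",
    "How difficult is it to replicate?",
    "Do any barriers to further use exist?",
    "Is it the only copy?",
]


def unanswered(notes):
    """Return all questions that have no (non-empty) note yet."""
    questions = MUST_KEEP_QUESTIONS + SHOULD_KEEP_QUESTIONS
    return [q for q in questions if not notes.get(q, "").strip()]


# Hypothetical appraisal notes for one dataset.
notes = {
    "Are there Research Data Policy reasons to keep it?":
        "Funder policy requires retention.",
    "Is it the only copy?": "Yes, held on a project server only.",
}
for question in unanswered(notes):
    print("Still open:", question)
```

The sketch deliberately does not derive a keep/discard decision from the answers; that judgment remains with the people responsible for the data.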

References