Codebooks

What are Codebooks?

A well-documented codebook “contains information intended to be complete and self-explanatory for each variable in a data file” (Center for Human Resource Research, 2003, p. 66).

Codebooks are an essential component of data documentation and data sharing in the social sciences. As the DGPs Recommendations on Data Management in Psychological Science (Schönbrodt, Gollwitzer, & Abele-Brehm, 2016) state:

In addition to the technical accessibility the readability with regards to content has to be ensured for the data. All variables must be documented in a digital codebook. (p. 11)

Core Information

Following the ICPSR’s information on codebooks (n.d.), core components of a codebook are:

  • Variable name. The name of a variable should consist only of letters, digits and underscores. Note that programs differ in the permitted length, the symbols they support, and whether they distinguish between upper- and lower-case letters. There are different ways to name variables. To facilitate reuse of the data, you should use a system of prefix, root and suffix, for example [Abbreviation_of_Measurement_Instrument]_[Item_Number]_[Measurement_Occasion], as in BDI_Q1_T1. You should provide a ReadMe that documents the naming conventions used. An example of an elaborate naming convention is that of the GESIS Panel: its assignment rules ensure that every variable name is unique, easily identifiable and meets archive standards (a maximum length of 8 characters (digits or letters) and no mixing of upper- and lower-case letters).
  • Variable label. A short description or the full name of a variable. For example, if the variable name is BDI_Q1_T1, the full name could be Beck Depression Inventory, Question 1, Baseline.
  • Variable type. There is no fixed scheme for describing the variable type. At a minimum, you should distinguish between (a) numeric variables (e.g. 5-point rating scale, height, intelligence), (b) strings (any open text item) and (c) dates.
  • Valid values. The set of valid values that were used to code categories for nominal and ordinal variables. For continuous variables, the range of valid values should be defined (e.g. by assigning value labels to the minimum and maximum). To make clear that no value label was omitted by accident, we recommend assigning value labels to all valid values that are listed.
  • Value labels provide information on how to interpret valid values for nominal and ordinal categorical variables, as well as information on how to interpret missing values for all types of variables. We recommend the following assignment rules (the coding schemes shown are examples only):
    • As mentioned above, you should assign value labels to all valid values that are listed.
    • For numeric variables with distinct values (e.g. a 5-point Likert rating scale with values “1”, “2”, “3”, “4”, “5”):
      • 1 = “Does not correspond at all”, 2 = “Corresponds a little”, 3 = “Corresponds moderately”, 4 = “Corresponds a lot”, 5 = “Corresponds exactly”
    • In some cases, labels for valid values will only exist for the upper and lower limit of a scale. In this case, we recommend defining empty value labels in order to indicate all possible answer categories on the scale:
      • 1 = “Does not correspond at all”, 2 = “”, 3 = “”, 4 = “”, 5 = “Corresponds exactly” or 1 = “Does not correspond at all”, 2 = “2”, 3 = “3”, 4 = “4”, 5 = “Corresponds exactly”
    • For continuous numeric variables:
      • If there are lower or upper limits for values of this variable, you should indicate this, even if these values do not occur in the actual dataset, in order to make your data interpretable:
        • 1 = “Minimum score”, 100 = “Maximum score”
    • For string variables:
      • If there is a predefined set of values that the string can take, assign value labels to all these values:
        • “TG” = “treatment group”, “CG” = “control group”
    • For all types of variables: Do not forget to assign value labels to codes that indicate missing values. Consider the following examples for each type of variable:
      • for numeric variables: -77 = “not applicable”, -88 = “unknown”, -99 = “missing by design”
      • for string variables: “NA” = “not applicable”, “UK” = “unknown”
      • for date variables: 01/01/1800 = “missing”
  • Missing values. The set of values that were used to code missing data. “Blanks” or “sysmis” values should not be used as missing values, because they make it impossible to distinguish fields that were deliberately left blank (items that were not answered or are missing by design) from fields that were simply skipped during data entry. Different kinds of missing values should be distinguished, e.g. missing by design (because some questions were only asked in the control group), not applicable (e.g. pregnancy for male participants), and not answered. You should therefore assign a separate code to each kind of missing value and then assign a value label to each code. It is important to standardize missing values, i.e. to use exactly one code for each kind of missing value consistently throughout your dataset. In some cases, it may be useful to define a range of missing values. Such a range (e.g. 6-99 for a 5-point Likert scale) makes it easier to exclude wild codes (e.g. 55 instead of 5 because of a typing error) or measurement errors (e.g. heart rates above 220 beats per minute) from analyses.
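
The core components above can also be kept in machine-readable form, which makes simple checks possible. The following Python sketch is only an illustration under stated assumptions: the regular expression, the class CodebookEntry, its field names and the variable ABC_Q1_T1 are hypothetical and not part of any standard or of the cited recommendations. It checks a variable name against a prefix_root_suffix pattern and classifies raw values as valid, missing or wild, using the example coding schemes from the list.

```python
import re
from dataclasses import dataclass, field

# Assumed pattern for [Instrument]_[Item]_[Occasion] names such as BDI_Q1_T1:
# letters, digits and underscores only. Adapt it to your own convention.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*_[A-Za-z0-9]+_[A-Za-z0-9]+")


def is_valid_name(name: str) -> bool:
    """Return True if the whole name follows the prefix_root_suffix pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


@dataclass
class CodebookEntry:
    """Core codebook information for one variable (hypothetical structure)."""
    name: str
    label: str
    var_type: str  # "numeric", "string" or "date"
    value_labels: dict = field(default_factory=dict)    # valid value -> label
    missing_labels: dict = field(default_factory=dict)  # missing code -> label

    def classify(self, value) -> str:
        """Classify a raw value as 'valid', 'missing' or 'wild'."""
        if value in self.missing_labels:
            return "missing"
        if value in self.value_labels:
            return "valid"
        # Neither a labeled valid value nor a declared missing code.
        return "wild"


# Hypothetical item on a 5-point Likert scale.
item = CodebookEntry(
    name="ABC_Q1_T1",
    label="Hypothetical Instrument ABC, Question 1, Baseline",
    var_type="numeric",
    value_labels={
        1: "Does not correspond at all",
        2: "Corresponds a little",
        3: "Corresponds moderately",
        4: "Corresponds a lot",
        5: "Corresponds exactly",
    },
    missing_labels={
        -77: "not applicable",
        -88: "unknown",
        -99: "missing by design",
    },
)

print(is_valid_name(item.name))
for raw in (5, -99, 55):
    print(raw, item.classify(raw))
```

Because every valid value carries a label, any value that is neither labeled nor declared as a missing code can be flagged as a wild code. For continuous variables, the classify method would have to compare against the documented minimum and maximum instead of looking up individual values.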

Extended Information

The following information should be included in either the variable label or a separate attribute field if it enhances data intelligibility:

  • Variable item text/instruction. The exact wording of the questionnaire item, software instruction, etc. corresponding to the variable (in consideration of third party rights).
  • Measurement occasion. The measurement occasion for the variable (e.g. wave 1, pre-treatment).
  • Instrument. The measurement instrument to which the variable belongs.
  • Construct. The theoretical construct that is measured by a variable.
  • Unit of measurement. The unit of measurement for continuous variables (e.g. meters, seconds).
  • Response unit. The entity that provided information.
  • Analysis unit. The unit that is analyzed in the variable.
  • Filter variable. Is this variable a filter variable? Depending on participants’ responses to a filter variable, a set of subsequent items/questions will be presented or not. For example, the variable “marital status” is a filter variable if a set of questions is only presented to subjects who stated that they were married.
  • Imputation. If any kind of imputation took place this should be documented.

Note that response unit and analysis unit are not necessarily the same (e.g. parents providing information on their child’s behavior).
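
As an illustration of the "separate attribute field" option, the extended information could be stored alongside each variable and exported as JSON. The sketch below is a minimal example under assumptions: the key names, the variable PRQ_Q3_T1, its item text and the instrument are invented for illustration and do not follow any particular metadata standard.

```python
import json

# Extended attributes from the list above; the key names are assumptions.
EXTENDED_FIELDS = (
    "item_text", "measurement_occasion", "instrument", "construct",
    "unit_of_measurement", "response_unit", "analysis_unit",
    "is_filter_variable", "imputation",
)

# Hypothetical variable from an invented parent-report questionnaire.
entry = {
    "name": "PRQ_Q3_T1",
    "label": "Parent Report Questionnaire, Question 3, Wave 1",
    "item_text": "My child has trouble falling asleep.",
    "measurement_occasion": "wave 1",
    "instrument": "Parent Report Questionnaire (hypothetical)",
    "construct": "sleep problems",
    "unit_of_measurement": None,  # not a continuous variable
    "response_unit": "parent",    # who provided the information
    "analysis_unit": "child",     # whom the variable describes
    "is_filter_variable": False,
    "imputation": "none",
}


def missing_extended_fields(e: dict) -> list:
    """Return the extended attributes that are absent from an entry."""
    return [name for name in EXTENDED_FIELDS if name not in e]


def units_differ(e: dict) -> bool:
    """Return True if the responding entity is not the unit being analyzed."""
    return e["response_unit"] != e["analysis_unit"]


# Serialize the codebook (here a single entry) for sharing with the data.
codebook_json = json.dumps([entry], indent=2, ensure_ascii=False)
print(codebook_json)
```

A check such as missing_extended_fields can be run over all entries before sharing a dataset, and units_differ highlights variables like the parent-report example, where the respondent and the analyzed unit are not the same.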

Requirements for reuse of datasets

When writing your codebook, you should consider the intended audience (e.g. only you, members of your project, all researchers in your field, researchers in other fields). The best way to test if your codebook provides sufficient information for this audience is to send a draft of the codebook to a colleague (who belongs to your target audience).

Further Resources

References