File Formats

Human-Readable Formats and Binary Formats

Human-Readable Format/Text Files

Human-readable formats can be interpreted without third-party software: any text editor can open and correctly display these files, so the information they encode can be read directly. This holds for most syntax files generated by statistical programs (e.g. SPSS, R or MATLAB). Additionally, simple tabular data can be saved as comma-separated files (or files using another delimiter), while complex information can be stored in XML files (Extensible Markup Language) or JSON files (JavaScript Object Notation). The main advantage of human-readable formats is that they increase the long-term accessibility of the data. Thus, you should convert your data to these formats when archiving them.
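As a minimal sketch of this recommendation, the following Python snippet serializes a small table as comma-separated text and its variable documentation as JSON. The variable names, values and codebook layout are hypothetical and only serve as an illustration; they are not a format prescribed by any archive.

```python
import csv
import io
import json

# Hypothetical example data: three participants, three variables.
rows = [
    {"id": 1, "sex": 1, "score": 12.5},
    {"id": 2, "sex": 2, "score": 9.0},
    {"id": 3, "sex": 1, "score": 14.25},
]

# Hypothetical codebook that documents each variable.
codebook = {
    "id": {"label": "Participant ID"},
    "sex": {"label": "Sex", "values": {"1": "female", "2": "male"}},
    "score": {"label": "Test score (points)"},
}


def to_csv_text(records):
    """Serialize a list of dicts as comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
codebook_text = json.dumps(codebook, indent=2, ensure_ascii=False)

# Both results are plain text and can be read in any text editor.
print(csv_text)
print(codebook_text)
```

In practice, the two strings would be written to files (for example with open(..., "w", encoding="utf-8")) and archived together, so that the data remain interpretable without the software that produced them.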

Binary Formats/Non-Text Files

Binary files must be interpreted by a program before the information they encode can be understood. The format of binary files can be proprietary, which means that the format belongs to specific copyright-protected software (like SPSS’s .sav files), or non-proprietary, which means that the specification is openly published (e.g. the .Rdata files used by R). Another example of a widely used binary format with openly available documentation is PDF.

When it comes to handling research data, an advantage of binary formats is that variable documentation and tabular data can easily be combined (e.g. variable view and data view in the .sav format). Additionally, researchers who currently work in the same field probably use the same software as the data depositor and will thus be able to open the data without any problems. Furthermore, for many commonly used proprietary formats, freely available third-party software exists that can read them (e.g. for .sav files: PSPP, R packages, DataWiz, …).

Encoding

Character encoding determines how the characters in your file are represented internally. Common encodings are UTF-8 (used by DataWiz), ASCII and Latin-1. Characters are often displayed incorrectly after a file import because the encoding expected by the importing software differs from the actual encoding of the uploaded file. Many software providers offer guides on how to detect the encoding of your file or how to specify the encoding on file export (e.g. UTF-8 encoding when exporting Excel or SPSS files). Additionally, freely available software such as Notepad++ allows you to convert files from one encoding to another.
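The mismatch described above can be reproduced in a few lines. The following Python sketch uses a hypothetical sample string; it shows what happens when Latin-1 bytes are read as UTF-8, what happens in the reverse case, and how a conversion to UTF-8 works when the source encoding is known.

```python
# Hypothetical sample text containing non-ASCII characters.
text = "Größe"

# The same text stored in two different encodings.
latin1_bytes = text.encode("latin-1")
utf8_bytes_original = text.encode("utf-8")

# Case 1: Latin-1 bytes are read as UTF-8. The byte sequence is not
# valid UTF-8, so strict decoding fails.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Case 2: UTF-8 bytes are read as Latin-1. Decoding succeeds, but every
# non-ASCII character is turned into two wrong characters.
garbled = utf8_bytes_original.decode("latin-1")

# Conversion: decode with the actual encoding, then encode as UTF-8.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(decoded_ok)
print(garbled)
print(utf8_bytes.decode("utf-8"))
```

Note that the conversion only works if the actual encoding of the source file is known; decoding with a guessed encoding can silently produce garbled text, as in the second case.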

Long-term preservation

Additional requirements on file formats exist for data publications that aim to preserve research data in the long term. In this case, researchers should seek to improve the technical availability of their data by converting them to durable formats that are expected to remain accessible 5, 10 or 20 years from now. For simple data matrices, text files fulfill this requirement (if they are accompanied by a codebook that makes the data interpretable). Since such formats are not available for all kinds of data, some repositories ensure that data are migrated to new formats once the old format becomes outdated. However, only a minority of repositories deploy such procedures, and the data types that are converted differ between repositories.

Set-Up-Files

Set-up files consist of data files that contain the data in delimited text format as well as syntax files that contain the metadata and are used to import both into statistical software.

Thus, set-up files can be used to store data and metadata in non-proprietary formats, while allowing researchers to import the data directly into the statistical software when needed. The Inter-university Consortium for Political and Social Research (ICPSR) offers extensive information on set-up files for SPSS, Stata and SAS. Tutorials are available via the ICPSR’s YouTube channel. Unfortunately, there is still no simple way to automate the generation of set-up files.
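To illustrate what the syntax part of such a set-up might look like, the following Python sketch builds SPSS syntax from a small, hand-written codebook. It is not an ICPSR tool and not a general solution: the codebook layout, the file name data.csv and the function build_spss_setup are hypothetical, the data file is assumed to be comma-delimited with a header row, and labels containing quotation marks are not handled. The generated commands (GET DATA, VARIABLE LABELS, VALUE LABELS) should be checked against the SPSS documentation before use.

```python
# Hypothetical codebook: variable name, SPSS format, label, optional value labels.
variables = [
    {"name": "id", "format": "F8.0", "label": "Participant ID"},
    {"name": "sex", "format": "F1.0", "label": "Sex",
     "values": {1: "female", 2: "male"}},
    {"name": "score", "format": "F8.2", "label": "Test score"},
]


def build_spss_setup(data_file, variables):
    """Return SPSS syntax (as text) that reads a delimited file and labels it."""
    lines = [
        "GET DATA",
        "  /TYPE=TXT",
        f'  /FILE="{data_file}"',
        '  /DELIMITERS=","',
        "  /ARRANGEMENT=DELIMITED",
        "  /FIRSTCASE=2",  # assumption: the first row holds the variable names
        "  /VARIABLES=",
    ]
    var_lines = [f"    {var['name']} {var['format']}" for var in variables]
    var_lines[-1] += "."  # the command terminator follows the last variable
    lines.extend(var_lines)

    # One labeling command per variable keeps the generated syntax simple.
    for var in variables:
        lines.append(f'VARIABLE LABELS {var["name"]} "{var["label"]}".')
        if "values" in var:
            pairs = " ".join(
                f'{code} "{text}"' for code, text in var["values"].items()
            )
            lines.append(f'VALUE LABELS {var["name"]} {pairs}.')
    lines.append("EXECUTE.")
    return "\n".join(lines) + "\n"


setup = build_spss_setup("data.csv", variables)
print(setup)
```

The resulting text can be saved as an .sps file next to the delimited data file. Because both files are plain text, the pair stays readable even without SPSS.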

R-Files

  • .R: R syntax files can be saved as .R files. These files can be opened with simple text editors. Hence, R syntax files can be stored just the way they are.
  • .Rda/.Rdata: .Rdata or .Rda files can contain several R objects (data files, outputs, functions, lists, etc.). Working with these files can be very handy. However, you need an R installation to access them, since they are saved in a binary format. These formats should therefore be avoided for long-term storage (although the documentation of these binary formats is open, so they should remain accessible, with some effort, in the future).
  • .Rds: .Rds files can contain only one R object. The recommendations for .Rda/.Rdata files also hold for .Rds files.
  • .Rmd: RMarkdown files are a great way to combine data documentation, data visualization and data analysis in one single file.

In conclusion: if you are working with R, you should provide a .csv file that contains your data and a separate .R or .Rmd file that contains your syntax to ensure long-term availability. Additionally, you may add .Rdata files as an alternative format.

Mplus-Files

Mplus input (.inp) and output (.out) files can be stored just the way they are, since Mplus employs human-readable formats. Additionally, Mplus data files must always be simple delimited text files and can thus be stored as they are. However, as outlined above, you will generally need extensive documentation of these files, e.g. in the form of a codebook.

SPSS-Files

The free software PSPP, which describes itself as a free replacement for SPSS, incorporates many of SPSS’s options and supports SPSS data files and syntax files. Moreover, several R packages (e.g. foreign) exist that can read SPSS files.

  • .sav data files: .sav files are the default format for tabular data in SPSS. Therefore, most SPSS data files are saved as .sav files. A disadvantage of .sav files is that they cannot be interpreted without corresponding software. On the other hand, .sav files can store tabular data and the corresponding metadata in one single file, which makes them easy to interpret.
  • .por data files: Some data archives, such as the UK Data Archive, recommend using .por files (portable files) instead of .sav files for the distribution and storage of data files (portable because they can be transferred more easily between systems and software versions). Since exporting a portable file only requires you to choose .por in SPSS’s Save As dialog, you can easily export an additional copy of your data as a portable file for storage. However, as IBM (n.d.) states, this option is no longer necessary for most applications:

SPSS Statistics Portable (*.por). Portable format that can be read by other versions of SPSS Statistics and versions on other operating systems. Variable names are limited to eight bytes and are automatically converted to unique eight-byte names if necessary. In most cases, saving data in portable format is no longer necessary, since SPSS Statistics data files should be platform/operating system independent. You cannot save data files in portable file in Unicode mode. See the topic General options for more information.

  • Syntax .sps: SPSS syntax files can be saved as .sps files. These files can be opened with simple text editors. Thus, SPSS syntax files can be stored just the way they are.
  • Output .spv/.spo: SPSS output produced with SPSS version 16 or later is saved as an .spv file and can be accessed using the free IBM SPSS Smartreader. For .spo files, i.e. output files produced with SPSS version 15 or earlier, the SPSS Legacy Viewer has to be used. This software is distributed on several freeware platforms. For more information, see the Readme. However, you should consider exporting important output files to a non-proprietary output format like PDF for storage, which can easily be done using the export option in SPSS.

A Note on Data Exports based on our Experiences with SPSS

Guidance on data management and data preservation (including our own guidance) often recommends converting data to non-proprietary formats. However, we want to emphasize that researchers should check the converted data files before submitting or archiving the data. Furthermore, the original files should be kept.

We do not want to blame SPSS’s export procedures here. Instead, we want to illustrate that researchers should not blindly trust software to do what they expect. For example, when date variables are exported from SPSS to csv, SPSS automatically converts some date formats with a loss of information:

HH:mm:ss.SSS -> HH:mm:ss

1st quarter -> 01.01.

MM/dd/yyyy, HH:mm:ss.SSS -> MM/dd/yyyy

Thus, in order to ensure that your export worked without loss of information, you should inspect generated files, rerun your analysis scripts on these files (if possible) and compare their results to the results reported in corresponding publications.
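Part of this inspection can be scripted. The following Python sketch compares the values of an original data set with the values found in its csv export and lists every cell that differs. The data, the variable name reaction_time and the function find_mismatches are hypothetical; the example merely imitates the loss of fractional seconds described above.

```python
import csv
import io

# Hypothetical original values, as displayed in the statistical software.
original = [
    {"id": "1", "reaction_time": "00:00:01.250"},
    {"id": "2", "reaction_time": "00:00:02.000"},
]

# Hypothetical csv export in which the fractional seconds were dropped.
exported_csv = "id,reaction_time\n1,00:00:01\n2,00:00:02\n"


def find_mismatches(original_rows, csv_text):
    """Return (row, column, original value, exported value) for differing cells."""
    exported_rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(exported_rows) != len(original_rows):
        raise ValueError("Original and export differ in the number of rows.")
    mismatches = []
    for row_number, (before, after) in enumerate(
        zip(original_rows, exported_rows), start=1
    ):
        for column, value in before.items():
            exported_value = after.get(column)
            # Plain string comparison: flags any change in representation.
            if exported_value != value:
                mismatches.append((row_number, column, value, exported_value))
    return mismatches


mismatches = find_mismatches(original, exported_csv)
for row_number, column, before, after in mismatches:
    print(f"Row {row_number}, {column}: {before!r} became {after!r}")
```

Because the comparison is a plain string comparison, it also reports harmless changes in formatting (such as the dropped ".000" in the second row), so each reported difference still has to be judged by a person.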

Recommendations and Best Practices for Specific Data Types

The following standards exist for biopsychological data formats:

Further Resources

Formats recommended for other data types with regard to their suitability for long-term preservation can be retrieved from the UK Data Archive.

The Digital Preservation Handbook, 2nd Edition (http://handbook.dpconline.org/) provides a comprehensible introduction to this topic.

References