
Research Data Management & Sharing: Organizing Data

This guide provides resources on research data management and sharing.

Why is Organizing Data Important?

It is important to organize your data so that others can easily use and understand it. Key practices for organizing your data include keeping a consistent folder and file structure, using recommended file formats, and using a file naming convention (FNC). Tools such as LabArchives (an Electronic Lab Notebook) can help you keep your data organized and managed. Document and describe your folder structure in your readme.txt file.

LabArchives (Electronic Lab Notebook)

LabArchives

LabArchives is the world’s leading ELN (Electronic Laboratory Notebook) with 700,000 scientists and more than 80,000 students using the LabArchives Platform this year. The University of Rochester is currently in the process of acquiring institutional access to LabArchives.

Please go to the River Campus Libraries Guide on LabArchives for more information. 

Qualitative Research Tools

Taguette

Taguette is a free and open-source tool for qualitative research. You can import your research materials, highlight and tag quotes, and export the results. Users can:

  • Import PDFs, Word Docs (.docx), Text files (.txt), HTML, EPUB, MOBI, Open Documents (.odt), and Rich Text Files (.rtf).
  • Highlight words, sentences, or paragraphs and tag them with codes they create.
  • Work collaboratively with other users (if self-hosting or using app.taguette.org).
  • Export everything, including the project, highlights, documents, and codes, so the data stays their own.

 

Recommended File Formats

It is imperative that you think carefully about the file formats you use to manage, share, and preserve your data, as technology is always changing and software can become obsolete.

According to the DMPTool, formats likely to be accessible in the future are: 

  • Non-proprietary
  • Open, with documented standards
  • In common usage by the research community
  • Using standard character encodings (e.g., ASCII, UTF-8)
  • Uncompressed (space permitting)

Examples of preferred format choices include: 

  • Image: JPEG, JPEG 2000, PNG, TIFF
  • Text: plain text (TXT), HTML, XML, PDF/A
  • Audio: AIFF, WAVE
  • Containers: TAR, GZIP, ZIP
  • Databases: prefer XML or CSV to native binary formats

Another good resource for learning more about file formats is the UK Data Service Guidance on Recommended Formats.
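As an illustration of preferring CSV to a native binary database format, the following is a minimal sketch that exports one table from a SQLite database to a UTF-8 encoded CSV file. SQLite is only an assumed example of a binary format, and the function name `export_table_to_csv` is hypothetical; adapt the idea to whatever database your project uses.

```python
import csv
import sqlite3
from pathlib import Path


def export_table_to_csv(db_path: str, table: str, out_path: Path) -> int:
    """Write every row of `table` to a UTF-8 CSV file and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()

    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Keeping the header row in the CSV file preserves the column names, which you can then describe in your readme.txt file.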


Structuring Files

  • It is important to use a consistent file structure in order to ensure all of your files can be found.
  • This file structure should be recorded in your readme.txt file and in your data documentation. The readme.txt file should be located at the top of the file structure hierarchy so it is easy to find.
  • Try to keep raw data, processed data, code, and outputs in separate folders in order to avoid confusion.
  • File and folder names should follow a file naming convention (see Naming Files below).
  • The exact file structure can differ according to the needs of the researcher.

Example 1: Created by Lane Medical Library at Stanford Medicine with reference to TIER Protocol.

[Figure: Folder Structure Example]


Example 2: A more complicated file structure, which can be generated and auto-populated with the Reproducible Science template for CookieCutter.

.
├── AUTHORS.md
├── LICENSE
├── README.md
├── bin                <- Your compiled model code can be stored here (not tracked by git)
├── config             <- Configuration files, e.g., for doxygen or for your model if needed
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks          <- Ipython or R notebooks
├── reports            <- For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures        <- Figures for the manuscript or reports
└── src                <- Source code for this project
    ├── data           <- scripts and programs to process data
    ├── external       <- Any external source code, e.g., pull other git projects, or external libraries
    ├── models         <- Source code for your own model
    ├── tools          <- Any helper scripts go here
    └── visualization  <- Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
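
For a simpler project, the separation recommended earlier (raw data, processed data, code, and outputs in separate folders, with a readme.txt file at the top of the hierarchy) can be sketched in a few lines of Python. The folder names and descriptions below are hypothetical placeholders, not a prescribed layout.

```python
from pathlib import Path

# Hypothetical folder names and descriptions; adapt them to your project.
FOLDERS = {
    "data/raw": "Original, unmodified data.",
    "data/processed": "Cleaned or transformed data.",
    "code": "Scripts used to process and analyze the data.",
    "outputs": "Figures, tables, and reports.",
}


def create_project_skeleton(root: Path) -> Path:
    """Create the folders and a top-level readme.txt that documents them."""
    root.mkdir(parents=True, exist_ok=True)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)

    # Record the structure in readme.txt at the top of the hierarchy.
    lines = ["Folder structure", ""]
    lines += [f"{folder}/ - {description}" for folder, description in FOLDERS.items()]
    readme = root / "readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling `create_project_skeleton(Path("my_project"))` would create the folders under `my_project` and return the path of the generated readme.txt file.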

Naming Files

A file naming convention (FNC) is a framework for naming your files in a way that describes what they are and their relationship to other files. It is important to create the FNC at the very beginning of the project. Make sure everyone involved in the research project is aware of the FNC and that all members use it consistently. Record the FNC in your readme.txt file and in the data documentation section of your research data management and sharing plan.

General rules to follow include:

  • Be consistent.
  • Keep names as short as possible (try to stay under 32 characters).
  • Reserve the three-letter file extension for the file format, such as .csv.
  • Avoid using special characters.
  • Do not use spaces, as they are not recognized by some software. Use underscores (file_name), camel case (fileName), or dashes (file-name) instead.
  • Use the ISO 8601 basic date format: YYYYMMDD.
  • Consider the sort order so that files are listed sequentially.
    • Use leading zeroes in numbers so that files sort numerically: 07 sorts before 10, but 7 sorts after 10. Consider how many files you will have, and use that many digits (e.g., for fewer than 100 files, use 01-99; for 100 to 999 files, use 001-999).
    • Consider the hierarchy of the terms in the FNC. If you want files to be organized first by date, then the date should be first. If you want to organize first by interviewee name, then the interviewee name should be first.
  • Always include a version number in the file name, as it can otherwise be difficult to find the "correct" version of a file.
  • Avoid generic file names.
  • Avoid acronyms that cannot be easily understood or are not explained in the readme.txt file.
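
Several of these rules can be checked automatically. The sketch below is one possible interpretation, not an official validator: the allowed character set, the helper names `check_file_name` and `padded`, and the sample file names are all assumptions made for illustration. It also shows why leading zeroes matter for sort order.

```python
import re

# Assumed rule set: letters, digits, underscores, and dashes, followed by a
# three-character extension. Names should be shorter than 32 characters.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]{3}")
LENGTH_LIMIT = 32


def check_file_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if len(name) >= LENGTH_LIMIT:
        problems.append(f"{LENGTH_LIMIT} characters or longer")
    if " " in name:
        problems.append("contains spaces")
    if not SAFE_NAME.fullmatch(name):
        problems.append("has special characters or no three-letter extension")
    return problems


def padded(number: int, width: int) -> str:
    """Format a number with leading zeroes, e.g. padded(7, 2) gives '07'."""
    return f"{number:0{width}d}"


# Without padding, a plain alphabetical sort places 10 before 2.
unpadded_names = sorted(f"sample_{n}.csv" for n in (2, 10, 1))
# With padding, alphabetical order matches numeric order.
padded_names = sorted(f"sample_{padded(n, 2)}.csv" for n in (2, 10, 1))
```

Here `unpadded_names` comes out as sample_1.csv, sample_10.csv, sample_2.csv, while `padded_names` comes out as sample_01.csv, sample_02.csv, sample_10.csv.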

Information to consider including in your FNC:

  • Project name, experiment name or acronym 
  • Initials or name of researcher
  • Date or range of dates when data was collected
  • Location or spatial information
  • Type of data
  • Type of analysis
  • Conditions
  • Description of experiment
  • Unique identifier
  • Language
  • Name or pseudonym of interviewee
  • Sample name
  • Version number of file (with leading zeroes)
  • Three-letter file extension for the file format

Include the formula for the FNC in your readme.txt file, including the meanings of any acronyms used in the FNC.

Example 1

FNC: [Date]_[Interviewee]_[DocumentType].pdf

Date: The date the interview was taken, in YYYYMMDD format.

Interviewee: Pseudonym of the interviewee.

Document Type: Which document type this is:

  • Notes - Raw notes taken by the interviewer during the interview process.
  • Transcript - Transcript created from the audio file of the interview.

Example: 20220818_Noelle_Transcript.pdf
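
A convention like the one in Example 1 can be parsed back into its parts, which is a quick way to confirm that file names follow it. The following is a minimal sketch; the pattern, the function name `parse_interview_name`, and the assumption that pseudonyms contain only letters, digits, and dashes are illustrative choices rather than part of the convention itself.

```python
import re
from datetime import date, datetime

# Pattern for [Date]_[Interviewee]_[DocumentType].pdf.
# Assumption: pseudonyms use only letters, digits, and dashes.
INTERVIEW_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<interviewee>[A-Za-z0-9-]+)_(?P<doc_type>Notes|Transcript)\.pdf"
)


def parse_interview_name(name: str) -> dict:
    """Split a file name into its FNC parts, or raise ValueError."""
    match = INTERVIEW_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the FNC")
    parts = match.groupdict()
    # strptime raises ValueError for impossible dates such as 20221340.
    parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    return parts
```

For instance, parsing the example file name above yields the date 2022-08-18, the interviewee Noelle, and the document type Transcript.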

Example 2

FNC: [SampleLocation]_[Date]_[VersionNumber].csv

Sample Location: The location where the sample was taken.

  • ERI - Lake Erie
  • ONT - Lake Ontario

Date: The date the sample was taken, in YYYYMMDD format.

Version Number: The version number of the table. Record as vXX.

Example: ONT_20220818_v03.csv
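
File names for Example 2 can also be generated rather than typed by hand, which helps keep the date format and the zero-padded version number consistent. This is a rough sketch under stated assumptions: the function names `build_sample_name` and `next_version_name` are hypothetical, and version numbers are assumed to run from 00 to 99 to fit the vXX form.

```python
import re
from datetime import date

# Location codes taken from the convention in Example 2.
LOCATIONS = {"ERI": "Lake Erie", "ONT": "Lake Ontario"}


def build_sample_name(location: str, sampled_on: date, version: int) -> str:
    """Build [SampleLocation]_[Date]_[VersionNumber].csv."""
    if location not in LOCATIONS:
        raise ValueError(f"unknown location code: {location!r}")
    if not 0 <= version <= 99:
        raise ValueError("version must fit the two-digit vXX form")
    return f"{location}_{sampled_on:%Y%m%d}_v{version:02d}.csv"


def next_version_name(existing: list[str], location: str, sampled_on: date) -> str:
    """Return the name for the next version, given the existing file names."""
    pattern = re.compile(
        re.escape(f"{location}_{sampled_on:%Y%m%d}_v") + r"(\d{2})\.csv"
    )
    # Collect the version numbers of files for the same location and date.
    versions = [
        int(match.group(1))
        for name in existing
        if (match := pattern.fullmatch(name))
    ]
    return build_sample_name(location, sampled_on, max(versions, default=0) + 1)
```

Under these assumptions, `build_sample_name("ONT", date(2022, 8, 18), 3)` reproduces the example file name ONT_20220818_v03.csv.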