Most scientists collect and organize at least some data in spreadsheets, usually Excel or Google Sheets, despite the potential pitfalls of using such products (there are even archives of spreadsheet horror stories). The most commonly bemoaned problem in Biology, that of Excel converting some gene names to dates, even caused the HGNC (HUGO Gene Nomenclature Committee) to change the names of at least 27 gene this year to avoid this issue. No matter your feelings about spreadsheets, they are generally the first program students learn to use for creating a database of samples, recording data, or doing simple calculations. Furthermore, for people without extensive coding or experience, spreadsheets are the default. Fortunately, by following some simple guidelines, we can avoid most of the hassles as well as countless hours re-formatting data tables for analysis and endless confusion trying to decipher color-codes from 10 years ago.
This paper by Broman & Wu is from 2018, but it came to my attention this week and I have now added it to my canon of “Must read” literature for future students.
Many of these tips seem obvious, but I’m guessing if you think back, you will recall an instance(s) where you (or a co-author) violated each of these tips and in retrospect knew you had erred. These days you are wiser but could probably use a refresher. This paper prevents the re-invention of the wheel during every PhD. I urge you to read the full paper, but here I’m providing the lightly edited (I combined some tips and re-arranged them a bit) cliffs notes. These guidelines, if implemented across the lab, also allow for easy hand-off and transfer of data between students and colleagues.
Tip 1 – Be consistent. In categorical variable codes, missing values, variable names, subject identifiers, dates, data layouts, and files names, both within and across spreadsheets. E.g., don’t use both “M” and “male”, don’t list the day first in some files and the month first in others. This one hits home – I once inherited a database of samples from a former French student who sometimes used the European date format and sometimes the American on both the sample label and within the database (they also labeled all variable names in French, but that’s another story!).
Tip 2 – Choose wisely. When choosing names or codes for variables, think about how your choice or a file format conversion will affect the analyses. E.g., don’t choose names with special characters and use underscores or hyphens instead of spaces. Think about how easy it will be to type out the variable name repeatedly in R code. It’s best to do this before you start collecting data. Also, choose wisely when it comes to how you represent any date variables.
Tip 3 – No empties allowed. Have a code that indicates a value is missing rather than the cell being intentionally left blank. This is especially important if you are continuing to collect data and are leaving cells blank to fill them in later! It’s also important for sorting data later. If you’re really fancy, you may have one missing code for data that wasn’t collected and another for data that is yet to be collected!
Tip 4 – One cell = One item. Each cell should contain only one piece of data, no more. The example given in the paper is position on a 96 well plate (e.g., A11 or B02), but I’ve also run into trouble with coding an individual as “adult_male” or “juvenile_female”. My solution is to keep the column with the “group” designation so I can easily visualize each group, but to add two columns, one for age and one for sex, for ease of sorting. And put ‘extra’ information, like units, into the header, a Notes column, or your ReadMe file (see Tip 6).
Tip 5 – Rectangles with one header row are gold. This honestly is pretty self-explanatory. See the figures below from the paper and imagine trying to analyze them.
Additionally, if you have bits and pieces of data scattered around, put them in separate files for ease of analysis later on. I corrected this very mistake today for a project I was just starting.
Tip 6 – Create a Data Dictionary (And Data ReadMe – For more information about ReadMe files, see here and here). Have a separate document of metadata that explains the overarching goal of the project, the data being collected with brief notes about the methods, and an explanation of what each variable in the spreadsheet is. These notes should include the variable name in the spreadsheet, a longer explanation of what the variable means, the measurement units if any, potential categories, etc. The article suggests separating the ReadMe and the data dictionary, but I advocate for having the information about variables both your data dictionary and your ReadMe file.
Tip 7 – Keep a raw version and back-up your data often. This tip feels obvious, but needs to be said. You should always keep a raw, protected version of your data that has no calculations included in the spreadsheet and contains all of the data. Save a copy and work within the copy. If you then exclude values or do calculations, you can save edited versions and even keep an explanation of the different versions in your ReadMe file, but always keep a ‘clean’ raw version that you don’t touch in case you need to go back. Similarly, save back-ups regularly and in different locations. If you don’t already do this one, stop reading and go do it, then come back.
Tip 8 – Do not color-code. I made this mistake a lot early on. Don’t. You will not remember what these highlighted cells represent or why some of the values are blue versus black when you re-open this file a year from now. Also, you can’t sort colored text or highlighted cells and these visualizing aids will usually be lost if you save in a different format or import the data into a different program. Instead, add Notes or a new variable to convey the information.
Now, you are empowered to use (and not abuse) spreadsheets for data collection! Go forth and collect all the data!
Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989