Wherein I try to save me from myself
Let’s imagine a young scientist, bursting at the seams with enthusiasm and schemes to uncover the secrets of the biological world. Everything is new and she learns as she goes! Let’s call her… Kathryn.
Imagine past-Kathryn. She’s busy, she has things to do. She’s setting up a major experiment or planning out a collecting trip spanning thousands of miles. She has a crew of undergrads and precious volunteers to manage. When a sample comes into her hand, she names it in an expeditious fashion and moves on.
Now imagine current-Kathryn. She has been BURNED. Who was this capricious imp of chaos that decided on this awkward and error-prone sample naming system? How is she supposed to tell one person’s O7_4L from another person’s 01-Al? Can she even trust her own handwriting? At every turn a different data handling system/program/manipulator has choked on some aspect or another – how many different iterations of these names exist, reflecting into infinity, like one mirror facing another?
Let’s learn from her struggles, shall we?
The platonic ideal of a sample name does several things:
- It uniquely identifies the source of the sample (the location, the biological individual), or at least can be traced back to this information.
- It is clear and cannot be confused with any other sample in the same project, lab, department, university, universe.
- It is impossible to mis-write or mis-read.
- It is short.
If you have figured out how to do all of these things in a single sample naming scheme, please, PLEASE, tell me all about it in the comments.
I suspect it may be impossible to come up with a single perfect system. Once you pick a sample naming scheme, you are kinda stuck with it, for however long you use that material – conservatively, years. Until we have managed to capture this platonic ideal, here are some things to consider when designing the sample name schemes of future projects:
- Probably at some point a human being will have to write this sample name with their own hand.
- Handwriting varies regionally.
- Even within a region, there can be dramatic differences between humans. How many humans will touch your samples?
- Sample names have to work in many different contexts, from a hurried scrawl on a paper shopping bag, to coin envelopes, to Eppendorf tubes, to the little space above a well in a gel image. Oh yes, and don’t forget file names (file naming tips apply here too, since your file names might include your sample names). Which leads us to…
- Consider the life of these sample names. How will you want to use them in the future? They will likely need to be transferable from the field to the lab to the computer. Will you want to be able to sort or search them in a particular way? Does every name need to be the same length? AT ANY TIME will your data pass through Excel?
Not new but warrants repeating: Never name a sample something that excel could possibly construe as a number or a date. Damn you 21st sample of Helianthus decapetalus (Dec21).
— Gregory Owens (@Greg_Owens) March 22, 2018
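One way to catch a Dec21 before it bites is to screen a list of proposed names ahead of time. Here is a minimal sketch in Python. The patterns and the function name `looks_risky_for_spreadsheet` are my own assumptions, a rough heuristic and not Excel’s actual conversion rules, so treat a "safe" verdict with suspicion and test a few names in a real spreadsheet too.

```python
import re

# Month names and abbreviations that a spreadsheet might read as part of a date.
_MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)

# Hypothetical heuristic patterns; these are NOT Excel's actual parsing rules.
_RISKY_PATTERNS = [
    # Plain or scientific-notation numbers, e.g. 007 or 1E5
    re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:e[+-]?\d+)?", re.IGNORECASE),
    # Month followed by digits, e.g. Dec21 or Mar-1
    re.compile(_MONTH + r"[-/ ]?\d{1,4}", re.IGNORECASE),
    # Digits followed by month, e.g. 21Dec
    re.compile(r"\d{1,4}[-/ ]?" + _MONTH, re.IGNORECASE),
    # Purely numeric dates, e.g. 3-4 or 2018/03/22
    re.compile(r"\d{1,4}[-/]\d{1,2}(?:[-/]\d{1,4})?"),
]


def looks_risky_for_spreadsheet(name: str) -> bool:
    """Return True if the whole name matches one of the heuristic patterns."""
    candidate = name.strip()
    return any(p.fullmatch(candidate) for p in _RISKY_PATTERNS)


if __name__ == "__main__":
    for name in ["Dec21", "1E5", "007", "US002", "sample-001"]:
        print(name, looks_risky_for_spreadsheet(name))
```

Anything flagged here deserves a second look: a number-like name such as 007 may lose its leading zeros, and a date-like name may be rewritten entirely.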
Sample names can be a real Goldilocks situation. You want something just right, and that can be hard to guess in advance. Some things to avoid:
- The letter ‘O’ – too easy to confuse with ‘0’. That second one was a zero, could you even tell? Do you really need this level of confusion? In general, think about letters that can be confused with numbers (and vice versa) and use these troublemakers only in positions that will make it easy to interpret. For example, if your scheme is always <two letters><two numbers>, then you can interpret “ZA90” correctly even if it looks like “24go” in handwriting.
- The underscore – can be easily obscured by the underline of a blank or a cell border in a datasheet filled in by hand.
- Commas. They can be easily confused with decimal points in handwriting. People from different parts of the world may use commas (and decimal points) differently than you intend. Also, they are special characters and should be avoided in file naming.
- Unnecessary digits. They just add length. For example, I once used the scheme of <two letter country code><three digit population number> (e.g. US002). Since the maximum number of populations I managed to collect from any given country was 30, this optimistic choice forced me to write/type about a gazillion extra digits for no added gain.
- Too few digits. On the other hand, if you want to be able to sort your sample data sequentially, you will need leading zeros for things to come out right. With them, (sample-001, sample-002, sample-100) sorts in the correct order. Without them, not so much (e.g. sample-1, sample-100, sample-2).
- Too much information. Look, it would be nice if we could read a short alphanumeric code and our brain immediately pulled up all the information. “Oh yes, that is the diffuse knapweed individual that I grew under drought stress in 2010, produced from a cross of two individuals grown in the control treatment in 2009, which were themselves grown from seed collected from a population in western Washington in 2008.” Yes, I have tried to squeeze that much information into a sample name. LET ME TELL YOU, THAT DIDN’T END WELL. The sample name became absurdly long and confusing, and in the end, I had to come up with a new naming scheme post hoc… more than once.
- Too little information may be a problem as well. It was once suggested to me that the best strategy was just to give every sample you ever produced a string of randomly generated numbers. This unique code would in fact contain no information of its own, but be the key to looking up all of that metadata in a database. This is often how large databases do it (for example, a nine-digit number is linked to each occurrence record in GBIF, such as 918842875). However, I’m not certain this is the best naming scheme if humans are involved – non-intuitive numeric strings are prone to memory and transcription errors. If individual 918842875 needed to be given a treatment, but not individual 918428875 right next to it, are you confident you would dispense the treatment correctly? Every time?
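Two of the points above, the fixed <two letters><two numbers> layout and the leading zeros, are easy to try out in code. The sketch below is only an illustration under my own assumptions: the look-alike tables and the helper names `interpret` and `make_name` are hypothetical, and the look-alikes in your lab’s handwriting may well be different.

```python
# Hypothetical look-alike tables; adjust them to the handwriting you actually see.
TO_LETTER = {"0": "O", "1": "I", "2": "Z", "4": "A", "5": "S", "8": "B"}
TO_DIGIT = {"O": "0", "o": "0", "I": "1", "l": "1", "Z": "2", "z": "2",
            "S": "5", "s": "5", "B": "8", "g": "9", "q": "9"}


def interpret(raw: str, n_letters: int = 2, n_digits: int = 2) -> str:
    """Read a transcribed name using position to resolve look-alike characters."""
    s = raw.strip()
    if len(s) != n_letters + n_digits:
        raise ValueError(f"expected {n_letters + n_digits} characters, got {s!r}")
    # Letter positions: turn digit look-alikes into letters, then uppercase.
    letters = "".join(TO_LETTER.get(c, c) for c in s[:n_letters]).upper()
    # Digit positions: turn letter look-alikes into digits.
    digits = "".join(TO_DIGIT.get(c, c) for c in s[n_letters:])
    if not (letters.isascii() and letters.isalpha()):
        raise ValueError(f"cannot read letters in {s!r}")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"cannot read digits in {s!r}")
    return letters + digits


def make_name(prefix: str, number: int, width: int = 3) -> str:
    """Build a name with a zero-padded number so plain text sorting works."""
    return f"{prefix}-{number:0{width}d}"


if __name__ == "__main__":
    print(interpret("24go"))
    print(sorted(["sample-1", "sample-100", "sample-2"]))
    print(sorted(make_name("sample", n) for n in (100, 2, 1)))
```

The position trick only works because the scheme never varies. The moment one name breaks the pattern, a 2 in the first slot could be either a 2 or a Z again.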
I bet there are more stories of sample naming regret out there. Leave your tips and tricks in the comments, and we can all learn from our burning mistakes together.