I love when nostalgia for a project, place, or species intersects with a current interest, as happened this week for me with a paper by Cordes et al. 2020, about the contrasting effects of climate change on the seasonal survival of yellow-bellied marmots in the Colorado Rocky Mountains.
Most scientists collect and organize at least some data in spreadsheets, usually Excel or Google Sheets, despite the potential pitfalls of using such products (there are even archives of spreadsheet horror stories). The most commonly bemoaned problem in biology, that of Excel converting some gene names to dates, even caused the HGNC (HUGO Gene Nomenclature Committee) to change the names of at least 27 genes this year to avoid this issue. No matter your feelings about spreadsheets, they are generally the first program students learn to use for creating a database of samples, recording data, or doing simple calculations. Furthermore, for people without extensive coding experience, spreadsheets are the default. Fortunately, by following some simple guidelines, we can avoid most of the hassles, as well as countless hours re-formatting data tables for analysis and endless confusion trying to decipher color-codes from 10 years ago.
This paper by Broman & Woo is from 2018, but it came to my attention this week and I have now added it to my canon of “Must read” literature for future students.
Many of these tips seem obvious, but I’m guessing if you think back, you will recall at least one instance where you (or a co-author) violated each of these tips and, in retrospect, knew you had erred. These days you are wiser but could probably use a refresher. This paper prevents the re-invention of the wheel during every PhD. I urge you to read the full paper, but here I’m providing the lightly edited (I combined some tips and re-arranged them a bit) CliffsNotes version. These guidelines, if implemented across the lab, also allow for easy hand-off and transfer of data between students and colleagues.
Tip 1 – Be consistent. Use the same categorical variable codes, missing values, variable names, subject identifiers, dates, data layouts, and file names, both within and across spreadsheets. E.g., don’t use both “M” and “male”, and don’t list the day first in some files and the month first in others. This one hits home – I once inherited a database of samples from a former French student who sometimes used the European date format and sometimes the American on both the sample label and within the database (they also labeled all variable names in French, but that’s another story!).
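To see why mixed date formats are so painful, here is a minimal Python sketch that reads a slash-separated date both day-first and month-first and reports whether the two readings disagree. The function names are my own, not from the paper, and the sketch assumes four-digit years and `/` as the separator.

```python
from datetime import datetime


def parse_both_ways(text):
    """Return every ISO date that a slash-separated date string could mean."""
    candidates = set()
    # Try the European (day first) and American (month first) readings.
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            candidates.add(datetime.strptime(text, fmt).date().isoformat())
        except ValueError:
            # This reading is impossible (e.g., month 25), so skip it.
            pass
    return candidates


def is_ambiguous(text):
    """True when the two readings give different calendar dates."""
    return len(parse_both_ways(text)) > 1


for label in ("03/04/2019", "25/04/2019"):
    print(label, sorted(parse_both_ways(label)), is_ambiguous(label))
```

Any date where the day and the month are both 12 or less, and differ from each other, cannot be untangled after the fact, which is a good argument for picking one unambiguous format (such as YYYY-MM-DD) before collecting data.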
Tip 2 – Choose wisely. When choosing names or codes for variables, think about how your choice or a file format conversion will affect the analyses. E.g., don’t choose names with special characters and use underscores or hyphens instead of spaces. Think about how easy it will be to type out the variable name repeatedly in R code. It’s best to do this before you start collecting data. Also, choose wisely when it comes to how you represent any date variables.
Tip 3 – No empties allowed. Have a code that indicates a value is missing rather than the cell being intentionally left blank. This is especially important if you are continuing to collect data and are leaving cells blank to fill them in later! It’s also important for sorting data later. If you’re really fancy, you may have one missing code for data that wasn’t collected and another for data that is yet to be collected!
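As a rough illustration of this tip, the Python sketch below scans a CSV export of a spreadsheet and lists every cell that was left blank. The codes `NA` and `TBD`, the column names, and the sample data are all hypothetical, and the sketch assumes a rectangular table with a single header row in which no row has more cells than the header.

```python
import csv
import io

# Hypothetical missing-value codes; pick your own and record them in the ReadMe.
MISSING_CODES = {"NA": "not collected", "TBD": "not yet collected"}


def find_blank_cells(csv_text):
    """Return (spreadsheet row number, column name) for every blank cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blanks = []
    # Row 1 is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if value is None or value.strip() == "":
                blanks.append((row_number, column))
    return blanks


example = "sample_id,mass_g,sex\nA1,12.5,F\nA2,,M\nA3,NA,TBD\n"
print(find_blank_cells(example))
```

Only the empty `mass_g` cell for sample A2 is reported; the cells holding `NA` and `TBD` are explicit codes, so there is no guessing about whether they were left blank on purpose.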
Tip 4 – One cell = One item. Each cell should contain only one piece of data, no more. The example given in the paper is position on a 96 well plate (e.g., A11 or B02), but I’ve also run into trouble with coding an individual as “adult_male” or “juvenile_female”. My solution is to keep the column with the “group” designation so I can easily visualize each group, but to add two columns, one for age and one for sex, for ease of sorting. And put ‘extra’ information, like units, into the header, a Notes column, or your ReadMe file (see Tip 6).
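If you already have combined cells, splitting them is usually a one-liner. The Python sketch below is a minimal example with function and column names of my own choosing; it assumes group codes use a single underscore between age class and sex, and that plate positions are one row letter followed by a column number.

```python
def split_group(group):
    """Split a combined code such as 'adult_male' into age class and sex."""
    age_class, sex = group.split("_", 1)
    # Keep the original group code alongside the two new columns.
    return {"group": group, "age_class": age_class, "sex": sex}


def split_well(well):
    """Split a plate position such as 'B02' into row letter and column number."""
    return {"plate_row": well[0], "plate_column": int(well[1:])}


print(split_group("juvenile_female"))
print(split_well("B02"))
```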
Tip 5 – Rectangles with one header row are gold. This honestly is pretty self-explanatory. See the figures below from the paper and imagine trying to analyze them.
Additionally, if you have bits and pieces of data scattered around, put them in separate files for ease of analysis later on. I corrected this very mistake today for a project I was just starting.
Tip 6 – Create a Data Dictionary (And Data ReadMe – For more information about ReadMe files, see here and here). Have a separate document of metadata that explains the overarching goal of the project, the data being collected with brief notes about the methods, and an explanation of what each variable in the spreadsheet is. These notes should include the variable name in the spreadsheet, a longer explanation of what the variable means, the measurement units if any, potential categories, etc. The article suggests separating the ReadMe and the data dictionary, but I advocate for having the information about variables in both your data dictionary and your ReadMe file.
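One way to keep a data dictionary honest is to check it against the spreadsheet's header row. The Python sketch below is only an illustration: the dictionary layout, the variable names, and the function name are all assumptions of mine rather than anything prescribed by the paper.

```python
# Hypothetical data dictionary: one entry per spreadsheet column.
DATA_DICTIONARY = {
    "sample_id": {"description": "Unique sample identifier", "units": None},
    "mass_g": {"description": "Body mass at capture", "units": "grams"},
    "sex": {"description": "Sex of the individual", "units": None,
            "categories": ["F", "M", "NA"]},
}


def check_data_dictionary(column_names, data_dictionary):
    """Compare spreadsheet columns with the variables the dictionary documents."""
    columns = set(column_names)
    documented = set(data_dictionary)
    return {
        # Columns in the spreadsheet that nobody has explained yet.
        "undocumented": sorted(columns - documented),
        # Dictionary entries with no matching column (renamed or dropped?).
        "unused": sorted(documented - columns),
    }


header = ["sample_id", "mass_g", "capture_date"]
print(check_data_dictionary(header, DATA_DICTIONARY))
```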
Tip 7 – Keep a raw version and back-up your data often. This tip feels obvious, but needs to be said. You should always keep a raw, protected version of your data that has no calculations included in the spreadsheet and contains all of the data. Save a copy and work within the copy. If you then exclude values or do calculations, you can save edited versions and even keep an explanation of the different versions in your ReadMe file, but always keep a ‘clean’ raw version that you don’t touch in case you need to go back. Similarly, save back-ups regularly and in different locations. If you don’t already do this one, stop reading and go do it, then come back.
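For those who like to script this step, here is a minimal Python sketch that copies a raw file into a working folder and then marks the raw file read-only. The function name and the `_working` suffix are my own conventions, and file permissions behave somewhat differently on Windows, so treat this as a starting point rather than a guarantee of protection.

```python
import shutil
import stat
from pathlib import Path


def protect_raw_and_copy(raw_path, working_dir):
    """Copy the raw file into working_dir, then make the raw file read-only."""
    raw_path = Path(raw_path)
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)

    # Name the copy so it can never be confused with the raw file.
    working_copy = working_dir / f"{raw_path.stem}_working{raw_path.suffix}"
    shutil.copy2(raw_path, working_copy)

    # Remove write permission from the raw file for everyone.
    raw_path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    # Make sure the working copy stays editable by its owner.
    working_copy.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return working_copy
```

Calling `protect_raw_and_copy("samples_raw.csv", "working")` would leave `samples_raw.csv` untouched and read-only, with an editable `working/samples_raw_working.csv` beside it. This does not replace backups in separate locations.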
Tip 8 – Do not color-code. I made this mistake a lot early on. Don’t. You will not remember what these highlighted cells represent or why some of the values are blue versus black when you re-open this file a year from now. Also, you can’t sort colored text or highlighted cells and these visualizing aids will usually be lost if you save in a different format or import the data into a different program. Instead, add Notes or a new variable to convey the information.
Now, you are empowered to use (and not abuse) spreadsheets for data collection! Go forth and collect all the data!
Experience with genome assemblies would also be advantageous.
Nominations and personal applications are welcome, and whilst scientific qualifications are paramount, we would particularly appreciate nominations and applications from suitably qualified researchers in underrepresented groups, including women, ethnic minority scientists, and scientists with disabilities, among others. Please email nominations/applications by October 15th, 2020 to email@example.com with the following items:
Cover letter stating the reasons for your nomination, or if applying for yourself, your interest in the role and familiarity with the journals,
Abbreviated CV (Education, Publications, Outreach) if you have it.
As a PhD student studying the effects of genetic diversity overall and immunogenetic diversity specifically on survival and reproductive success in an endangered primate in captive and wild populations, I thought a lot about the potential effects of inbreeding and outbreeding depression. I read literally 100s of papers on the topic. Inbreeding depression describes the negative fitness effects that can occur in small populations when relatives breed with each other for multiple generations, so that genetic diversity is lost through genetic drift and deleterious alleles are expressed. Outbreeding depression, by contrast, is the negative consequence of breeding two genetically distinct populations, leading to a loss of local adaptation. Concerns about outbreeding depression are one of the major theoretical limitations to re-introductions and attempts at ‘genetic rescues’ when small populations and/or endangered species might be suffering from inbreeding depression. For the most part, however, evidence of outbreeding depression has been limited to plants and captive or laboratory studies. Earlier this year, Dr. Sarah Fitzpatrick and her co-authors documented an extremely cool example of genetic rescue in populations of wild Trinidadian guppies, contradicting the hypothesis about the potential for maladaptive gene flow in population introductions (Fitzpatrick et al. 2020).
After repeatedly sampling two isolated guppy ‘recipient’ populations (Figure 1A, dark blue circles, N < 100 individuals per population) in the Caigual and Taylor rivers in Trinidad, the authors introduced populations of guppies upstream (dashed red circles) of these recipient populations, in previously guppy-free areas. These trans-located guppies, from downstream populations (solid red circles), occasionally (or frequently!) migrated downstream into the recipient populations located either ~5m or ~700m from the introduction location. For ~8-10 guppy generations after the trans-location, the recipient populations have been monitored with mark-recapture to assess population size as well as individual overall genetic diversity, hybrid ancestry, lifespan, and reproductive success. Following the onset of immigration and subsequent gene flow, both recipient populations experienced nearly a 10-fold increase in population size, from less than 100 individuals to an estimated 1,000 individuals each (Figure 1B). Based on the hybrid index, which ranges from 0 (entirely native ancestry) to 1 (entirely immigrant ancestry), it’s clear that 10 generations after the first wave of immigration, the populations consist almost entirely of admixed individuals (Figure 1C).
Contradicting the predictions of outbreeding depression, individuals with intermediate to high (0.5-0.75) hybrid indices had the highest longevity and reproductive success in both locations and across sexes (Figure 2). Interestingly, although hybrids and pure immigrants had similar levels of genetic heterozygosity, hybrids had higher fitness, suggesting that increased genomic diversity alone does not explain the increased fitness and pointing towards a potential maintenance of locally adapted alleles.
Pre-introduction, 95% and 96% of >12,000 genotyped SNPs were monomorphic in the Caigual and Taylor populations respectively, and average nucleotide diversity was 0.01 in both populations (Figure 4b). Eight to ten generations later, only 22% and 24% of SNPs are monomorphic and nucleotide diversity has increased to 0.21 and 0.22. Genome-wide average Fst between source and recipient populations also decreased from 0.29-0.31 to 0.01.
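For readers who want some intuition for what a drop in Fst means, here is a toy Python calculation for a single biallelic SNP in two equally sized populations, using the textbook definition Fst = (Ht - Hs) / Ht. This is a deliberate simplification and an assumption on my part: the estimator used in the paper may well differ, and the allele frequencies below are invented for illustration.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Textbook Fst for one SNP in two equally sized populations."""
    # Hs: average heterozygosity within the two populations.
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    # Ht: heterozygosity of the pooled population.
    h_t = expected_heterozygosity((p1 + p2) / 2.0)
    if h_t == 0.0:
        # The SNP is fixed for the same allele everywhere, so Fst is undefined.
        return None
    return (h_t - h_s) / h_t


# Invented frequencies: strongly diverged versus nearly homogenized.
print(fst_two_populations(0.9, 0.1))
print(fst_two_populations(0.55, 0.45))
```

The more similar the allele frequencies of the two populations become, the closer this value gets to zero, which is the direction of change reported after gene flow began.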
To determine if gene flow swamped locally adaptive variants, the authors identified 146 loci with allele frequencies in the pre-immigrant recipient populations that might indicate candidacy for locally adapted alleles. Post-immigration, although overall genome homogenization increased between immigrant and recipient populations, the authors found evidence for selective maintenance of some of the candidate alleles in the recipient populations in the form of an excess of pre-immigrant ancestry at these loci (Fig 4). Unfortunately, none of these candidate loci matched previously identified loci under selection nor were any gene ontology terms enriched, but they provide interesting potential targets for future investigation.
This study documents the phenomenon of genetic rescue in two multi-generational wild populations, showing that contrary to expectations, gene flow does not necessarily swamp local adaptation, and actually can significantly increase fitness in the form of longevity and reproductive success, subsequently substantially increasing population size. Further, at least some locally adapted loci appear to have been maintained in both Caigual and Taylor, despite a 10-fold difference in the number of immigrants to each population, suggesting a range of gene flow rates might still allow the maintenance of local adaptation, with extremely important and interesting implications for future conservation-based introduction efforts.
Fitzpatrick, S.W., G.S. Bradburd, C.K. Kremer, P.E. Salerno, L.M. Angeloni, W.C. Funk (2020) Genomic and fitness consequences of genetic rescue in wild populations. Current Biology 30: 517-522.e5.
A new episode of The Molecular Ecologist Podcast is now out on Anchor.fm. In this episode, we turn to a question that every academic scientist has to answer at some point: How do you choose a scientific journal to receive your paper? Kelle Freel, Shawn Abrahams, Katie Grogan and Jeremy Yoder chat about what they like in a journal, what they consider when picking a publication venue for a new paper, and the various meanings of an “impact factor.”
One of the major goals of evolutionary biology is to link phenotypic variation with specific genetic variation, yet for behavioral phenotypes in non-model species, this task remains daunting and generally elusive. Although behaviors are heritable and clearly acted upon by evolutionary forces, they are generally polygenic, flexibly expressed, and context-dependent. Two recent papers, however, accomplished this very thing, in white-throated sparrows (Zonotrichia albicollis; Merritt et al. 2020) and in a species of jumping spider from southeastern Asia (Portia labiata; Chang et al. 2020)!
It’s undeniable that penguins are a marine representative of the charismatic megafauna group. I have an affinity for stuff we need microscopes to see, BUT I agree that penguins are cute (just LOOK at these National Geographic photos…they’re even in comics). I’m guessing that many of us have also watched “March of the Penguins”, although maybe you also were today years old when you learned the original French version was narrated in first-penguin by the stars of the show themselves in “La Marche de l’Empereur”.
Our hearts all melt a tiny bit when we see a fluffy baby chick waddle around on the ice. But. Have you ever contemplated how many different penguin species there are, where exactly they’re found on the globe and how they ended up where they currently reside? If you’re like me, (and don’t work on anything remotely related to penguins), you might not be well versed in the diversity of these flightless diving birds.
Occasionally, while reading the literature, you stumble across a paper that is so eloquent and beautiful that you are awestruck. Since that happened to me this weekend, today’s post is a call to you to go read the incredible synthesis and call to action written by Schell et al. in Science (2020) – The ecological and evolutionary consequences of systemic racism in urban environments. In this paper, the authors affirm that biologists working in urban environments must consider how racial oppression affects the biological change they study.
Evolutionary biologists have increasingly become interested in how the environmental change due to urbanization leads to changes in the phenotypic, genetic, and species make-up of urban ecosystems. Indeed, between 1965 and 1989, only 124 papers with the words “Urban ecology” in the abstract were published according to a quick non-exhaustive search of Web of Science (mean = 5.0 papers per year; performed 8-31-2020). However, from 1990 until 2019, the rate of publication increased exponentially to over 1,000 papers in 2019 alone.
I recently took a look through the “Archives by month” drop-down in our right-hand sidebar and discovered that it goes all the way back to July 2010. Which means The Molecular Ecologist had its tenth anniversary this very month — specifically back on July 11, an even decade since Brant Faircloth kicked off the blog with a rundown of essential (Python-centric) bioinformatic tools.
Given that it snuck up on us, and in the middle of the summer, and in the middle of this summer, we don’t have any kind of big event planned. But I didn’t want to let the month close out without marking the occasion. So here’s a rundown of some major events in the history of this fine blog:
It’s been over 100 years since the Dutch microbiologist Martinus Willem Beijerinck theorized that microbes could oxidize manganese to generate energy for growth. Last week, the first evidence for this theory was published, and you might be surprised by where these fascinating microbes hail from.