The Earth BioGenome Project aims to sequence all ~1.5 million currently described eukaryotic species on Earth (Lewin et al., 2018; Figure 1). The scale and scope are enormous, and it is hard to imagine a more ambitious or more exciting goal.
Last month, I attended the launch of the Earth BioGenome Project, held at the Wellcome Trust in London. From the first session you could sense the buzz and anticipation. Harris Lewin opened the meeting with his vision for the project. He sees Earth BioGenome as biology’s ‘moonshot’, as transformative for science as placing a man on the moon. The projected cost of $4.7bn is similar to that of the Human Genome Project ($2.7bn, equivalent to $5bn today), and the two projects are broadly comparable in their need for collaborative effort from different research groups. The need for global collaboration is clear: to sequence Earth’s diversity we need to use samples held in museum, zoo and botanic garden collections from across the globe; we need extensive new field collections (particularly in biodiversity hotspots); we need to develop new sequencing infrastructure and bioinformatic pipelines; and we need scientists to use these data for research, biodiversity monitoring and conservation. Lewin reminded us that not all the uses of the human genome were clear when the project was launched, and the same applies to Earth BioGenome data. But the obvious uses are in benefitting human welfare (e.g. drug discovery and crop improvement), protecting biodiversity, and understanding ecosystems.
After this inspiring introduction most of the audience were invigorated. My initial doubts were quickly dealt with. I came in questioning whether this goal is really possible. But I hadn’t realised how much had already been achieved. As a plant biologist I’ve been following the progress of the 10,000 Plant Genome Project in detail (Twyford, 2018). But many other ‘big genome’ projects had largely passed me by. There’s the Vertebrate Genomes Project (aim: 66,000 error-free vertebrate genomes), Bat 1K (1,300 bat genomes), 1000 Fungal Genomes, The i5K Initiative (5,000 insects and other arthropods), and 10,000 Bird Genomes (B10K), with the list going on and on. Biologists studying seemingly every major organismal lineage have initiated their own genome project. And what’s exciting is that these projects have made substantial progress, with many genome sequences published or soon to be released. Earth BioGenome unites these ongoing projects and builds on this experience. By setting data standards, recommending pipelines, providing infrastructure, and offering re-usable templates and agreements for sample sharing, Earth BioGenome makes new genomic-scale science more attainable.
How will Earth BioGenome work? What became clear at the meeting was that Earth BioGenome will be an aggregate of smaller projects, each with its own governance. Each project will find its own funding and proceed separately, but Earth BioGenome will provide the template for how to proceed and may also provide some centralised funding for specific goals. In particular, centralised funding may help developing countries build their own sequencing infrastructure and biobanks to support genomic research. This will also help train the hundreds of next-generation scientists needed to make this research happen.
A key message from the meeting was that if we are sequencing representative genomes from all of life, we need to do it well. There is little point in assembling fragmented genome sequences from Illumina short-read data if they are to be replaced by contiguous genomes from long-read data in the near future. The route to good genomes will differ depending on the organism, but is likely to include a combination of long-read (Pacific Biosciences and/or Oxford Nanopore Technologies) and short-read Illumina technologies, often paired with inexpensive synthetic read data (e.g. 10X Genomics) and scaffolded with Hi-C or BioNano Genomics (see summary here). There was a remarkable consensus that, given major innovations in genomic technologies, the sequencing is one of the easy parts of the project, and that the greater challenge is in sourcing material (particularly from the tropics), putting a name to each sample, and curating voucher specimens.
At the same meeting, Mike Stratton introduced a second major new sequencing initiative, the Darwin Tree of Life Project. This ‘place-based’ rather than ‘taxon-based’ project aims to sequence a representative from all 66,000 eukaryotic species present in the United Kingdom. Why the UK? Its small size and limited diversity, its existing detailed collections, the presence of related datasets, and the existence of immediate funding for sequencing, all make it a good first choice. This project gets me excited (disclaimer: I’m hoping to be involved with the project by sequencing British plants, along with colleagues at the University of Edinburgh, Royal Botanic Garden Edinburgh, and Royal Botanic Garden Kew) as I see this as a superb opportunity for comparative genomic analyses that incorporate the large existing data sets of ecological attributes and species’ traits.
Where next? I think one important goal is for researchers to launch new comparative genomic projects, and for scientists to lobby funding agencies and governments to support new genomic research. If many new and diverse sequencing projects are started, this will build the momentum for broadening the sequencing effort to global diversity. One initial aim should be to produce genome sequences representative of each organismal family, before moving to genus-level references and then species (the ‘phylogenetic wave’). Another aim should be to sequence diverse genomes from multiple areas to develop tools for place-based projects. Personally, I can’t wait to see the next stage of the genomics revolution take place.
Lewin, H. A. et al. (2018) Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences 115, 4325-4333.
Twyford, A. D. (2018) The road to 10,000 plant genomes. Nature Plants 4, 312-313, doi:10.1038/s41477-018-0165-2.