The Quest to Sequence the Genomes of Everything

A gibbous moon hangs over a lonely mountain trail in the Italian Alps, above the village of Malles Venosta, whose lights dot the valley below. Benjamin Wiesmair stands next to a moth trap as tall as he is, his face, bushy beard, and hair bun lit by its purple glow. He's wearing a headlamp, a dusty and battered smartwatch, cargo shorts, and a blue zip sweater with the sleeves pulled up. Countless moths beat frenetically around the trap's white, diaphanous panels, which are swaying with ghostly ripples in a gentle breeze. Wiesmair squints at his smartphone, which is logged on to a database of European moth species.
"Chersotis multangula," he says.
"Yes, we need that," comes the crisp reply from Clara Spilker, consulting a laptop.
This article is part of The Scale Issue.
Wiesmair, an entomologist at the Tyrolean State Museums, in Innsbruck, Austria, and Spilker, a technical assistant at the Senckenberg German Entomological Institute, in Müncheberg, are taking part in one of the most far-reaching biological initiatives ever: obtaining a genome sequence for nearly every named species of eukaryotic organism on the planet. All 1.8 million of them. The researchers are part of an expedition for Project Psyche, which is sampling European butterflies and moths and will feed its data into the global initiative, called the Earth BioGenome Project (EBP).

Eukaryotes are organisms whose cells contain a nucleus. From protozoa to human beings, all have the same basic biological mechanism for building, maintaining, and propagating their form of life: a genome. It's the sum total of the genes carried by the creature.
Twenty-two years ago, researchers announced that for the first time they had mapped, or "sequenced," nearly all of the genes in a human genome. The project cost more than US $3 billion and took 13 years, but it eventually transformed medical practice. In the new era of genomic medicine, doctors can take a patient's specific genetic makeup into consideration during diagnosis and treatment.

The EBP aims to reach its monumental goal by 2035. As of July 2024, its tally of genomes sequenced stood at about 4,200. Success will undoubtedly depend on researchers' ability to scale several biotechnologies.
"We need to scale, from where we're at, more than a hundredfold in terms of the number of genomes per year that we're producing worldwide," says Harris Lewin, who leads the EBP and is a professor and genetics researcher at Arizona State University.
One of the most crucial technologies that must be scaled is a technique called long-read genome sequencing. Specialists on the front lines of the genomic revolution in biology are confident that such scaling will be possible, their conviction coming in part from past experience. "Compared to 2001," when the Human Genome Project was nearing completion, "it is now approximately 500,000 times cheaper to sequence DNA," says Steven Salzberg, a Bloomberg Distinguished Professor at Johns Hopkins University and director of the school's Center for Computational Biology. "And it is also about 500,000 times faster to sequence," he adds. "That is the scale, over the past 25 years, a scale of acceleration that has vastly outstripped any improvements in computational technology, either in memory or speed of processors."

There are many reasons to cheer on the EBP and the technological advances that will underpin it. Having established a genome for every eukaryotic creature, researchers will gain deep new insights into the connections among the threads in Earth's web of life, and into how evolution proceeded for its myriad life forms. That knowledge will become increasingly important as climate change alters the ecosystems on which all of those creatures, including us, depend.
And although the project is a scientific collaboration, it could spin off sizable financial windfalls. Many drugs, enzymes, catalysts, and other chemicals of incalculable value were first identified in natural samples. Researchers expect many more to be discovered in the process of identifying, in effect, each of the billions of eukaryotic genes on Earth, many of which encode a protein of some kind.
"One idea is that by looking at plants, which have all sorts of chemicals, often which they make in order to fight off insects or pests, we might find new molecules that are going to be important drugs," says Richard Durbin, professor of genetics at the University of Cambridge and a veteran of several genome sequencing initiatives. The immunosuppressant and cancer drug rapamycin, to cite just one of countless examples, came from a microbe genome.
Your Genes Are a Big Reason Why You're You

The EBP is an umbrella organization for some 60 projects (and counting) that are sequencing species either in a region or in a particular taxonomic group. The overachiever is the Darwin Tree of Life Project, which is sequencing all species in Britain and Ireland, and has contributed about half of all of the genomes recorded by the EBP so far. Project Psyche was spun out of the Darwin Tree of Life initiative, and both have received generous support from the Wellcome Trust.
To get an idea of the magnitude of the overall EBP, consider what it takes to sequence a species. First, an organism must be found or captured and sampled, of course. That's what brought Wiesmair, Spilker, and 41 other lepidopterists to the Italian Alps for the Project Psyche expedition this past July. Over five days, they collected more than 200 new species for sequencing, which will augment the 1,000 lepidoptera genome sequences already completed and the roughly 2,000 samples awaiting sequencing. There's still plenty of work to be done; there are around 11,000 species of moths and butterflies across Europe and Britain.
After sampling, genetic material (the creature's DNA) is collected from cells and then broken up into fragments that are short enough to be read by the sequencing machines. After sequencing, the genome data is analyzed to determine where the genes are and, if possible, what they do.
DNA is a molecule whose structure is the famous double helix. It resides in the nucleus of every cell in the body of every living thing. If you think of the molecule as a twisted ladder, the rungs of the ladder are formed by pairs of chemical units called bases. There are four different bases: adenine (A), guanine (G), cytosine (C), and thymine (T). Adenine always pairs with thymine, and guanine always pairs with cytosine. So a "rung" can be any of four things: A-T, T-A, C-G, or G-C.
Those four base-pair permutations are the symbols that comprise the code of life. Strings of them make up the genome as segments of various lengths called genes. Your genes at least partially control most of your physical and many of your mental traits: not only what color your eyes are and how tall you are but also what diseases you are susceptible to, how difficult it is for you to build muscle or lose weight, and even whether you're prone to motion sickness.
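The pairing rule is simple enough to capture in a few lines of code. The Python sketch below is purely illustrative: the names `PAIR`, `complement`, and `rungs` are invented for this example and are not taken from any sequencing software.

```python
# Base-pairing rule from the text: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the bases on the opposite side of the ladder, in the same order."""
    return "".join(PAIR[base] for base in strand)


def rungs(strand: str) -> list[str]:
    """List each rung of the ladder as a base pair such as 'A-T'."""
    return [f"{base}-{PAIR[base]}" for base in strand]


example = "GATTACA"
print(complement(example))  # CTAATGT
print(rungs(example))
```

Because each base determines its partner, knowing one side of the ladder is enough to reconstruct the other.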
How Long-Read Genome Sequencing Works

Long-read sequencing starts by breaking up a sample of genetic material into pieces that are often about 20,000 base pairs long. Then the sequencing technology reads the sequence of base pairs on those DNA strands to produce random segments, called "reads," of DNA that are at least 10,000 pairs in length. Once those long reads are obtained, powerful bioinformatics software is used to build longer stretches of contiguous sequence by overlapping reads that share the same sequence of bases.
To understand the process, think of a genome as a novel, and each of its separate chromosomes as a chapter in the novel. Imagine shredding the novel into pieces of paper, each about 5 square centimeters. Your job is to reassemble them into the original novel (unfortunately for you, the pages aren't numbered). What makes this task possible is overlap: you shredded multiple copies of the novel, and the pieces overlap, making it easier to see where one leaves off and another begins.
Making it much harder, however, are the many sections of the book filled with repetitive nonsense: the same word repeated hundreds or even thousands of times. At least half of a typical mammalian genome consists of these repetitive sequences, some of which have regulatory functions and others regarded as "junk" DNA that's descended from ancient genes or viral infections and no longer functional. Long-read technology is adept at handling these repetitive sequences. Going back to the novel-shredding analogy, imagine trying to reassemble the book after it was shredded into pieces only 1 centimeter square rather than 5. That's analogous to the challenge that researchers formerly faced trying to assemble million-base-pair DNA sequences using older, "short-read" sequencing technology.
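The idea of stitching reads together by their overlaps can be sketched in miniature. The Python below is a toy greedy merger, not the method used by any real assembler: the function names, the 16-base "genome," and the three 8-base reads are all invented for illustration, and it assumes error-free reads from a single strand. Production software has to cope with sequencing errors, both DNA strands, and reads thousands of times longer.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; keep the pieces separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Toy data: three overlapping "reads" cut from a made-up 16-base sequence.
toy_reads = ["CCTTAGGA", "ATGGCGTA", "CGTACCTT"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTACCTTAGGA']
```

The reads are supplied out of order, and the shared stretches of bases are the only clue used to put them back in sequence.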
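A small experiment shows why read length matters when a genome contains repeats. The Python sketch below uses an invented 64-base sequence in which the same 20-base repetitive block appears twice. It measures what fraction of possible reads of a given length are indistinguishable from a read taken somewhere else. The sequence, the function name `ambiguous_fraction`, and the read lengths are assumptions chosen for illustration only.

```python
from collections import Counter


def ambiguous_fraction(genome: str, read_len: int) -> float:
    """Fraction of all possible reads of `read_len` that occur at more than one position."""
    windows = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(windows)
    return sum(1 for w in windows if counts[w] > 1) / len(windows)


# Made-up sequence: unique stretches separated by two copies of a 20-base repeat.
repeat = "AT" * 10
genome = "ACGTTGCA" + repeat + "GGCCTTAA" + repeat + "CCGGAATT"

short = ambiguous_fraction(genome, 6)   # shorter than the repeat
long_ = ambiguous_fraction(genome, 30)  # longer than the repeat
print(f"6-base reads that could belong in more than one place:  {short:.0%}")
print(f"30-base reads that could belong in more than one place: {long_:.0%}")
```

Reads shorter than the repeat often cannot be placed with certainty, while reads long enough to span the repeat carry some unique flanking sequence that pins them down.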
The Two Approaches to Long-Read Sequencing

The long-read sequencing market has two leading companies, Oxford Nanopore Technologies (ONT) and Pacific Biosciences of California (PacBio), which compete intensely. The two companies have developed utterly different systems.
The heart of ONT's system is a flow cell that contains 2,000 or more extremely tiny apertures called, appropriately enough, nanopores. The nanopores are anchored in an electrically resistant membrane, which is integrated onto a sensor chip. In operation, each end of a segment of DNA is attached to a molecule called an adapter that contains a helicase enzyme. A voltage is applied across the nanopore to create an electric field, and the field captures the DNA with the attached adapter. The helicase begins to unzip the double-stranded DNA, with one of the DNA strands passing through the nanopore, base by base, and the other released into the medium.
Figure: Optical sequencing (Pacific Biosciences). Each DNA strand is immobilized at the bottom of a well. A polymerase enzyme replicates the strand, matching and connecting each base to a specially engineered, complementary nucleotide, which flashes light in a color characteristic of that base. The sequence of light flashes indicates the sequence of bases.
What propels the strand through the nanopore is that voltage. It's only about 0.2 volts, but the nanopore is only 5 nanometers wide, so the electric field is several hundred thousand volts per centimeter. "It's like a flash of lightning going through the pore," says David Deamer, one of the inventors of the technology. "At first, we were afraid we would fry the DNA, but it turned out that the surrounding water absorbed the heat."
That kind of field strength would ordinarily propel the DNA-based molecule through the pore at speeds far too fast for analysis. But the helicase acts like a brake, causing the molecule to go through with a ratcheting motion, one base at a time, at a still-lively rate of about 400 bases per second. Meanwhile, the electric field also propels a flow of ions across the nanopore. This current flow is decreased by the presence of a base in the nanopore, and, crucially, the amount of the decrease depends on which of the four bases, A, T, G, or C, is entering the pore. The result is an electrical signal that can be rapidly translated into a sequence of bases.
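To make the translation step concrete, here is a deliberately simplified Python sketch. It assumes each base produces one distinct drop in current and picks the closest match for every measurement. The current values in `ASSUMED_DROP` are invented placeholders in arbitrary units, not measurements from any ONT device, and the function `call_bases` is hypothetical. Real base-calling software is far more sophisticated than a nearest-value lookup.

```python
# Hypothetical current drops per base, in arbitrary units.
# These numbers are placeholders for illustration, not real measurements.
ASSUMED_DROP = {"A": 10.0, "C": 20.0, "G": 30.0, "T": 40.0}


def call_bases(drops):
    """Map each measured current drop to the base whose assumed drop is closest."""
    return "".join(
        min(ASSUMED_DROP, key=lambda base: abs(ASSUMED_DROP[base] - drop))
        for drop in drops
    )


# A made-up, slightly noisy signal: one measurement per base passing through the pore.
signal = [11.2, 38.5, 29.1, 19.4, 9.7]
print(call_bases(signal))  # ATGCA
```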
Figure: Nanopore sequencing (Oxford Nanopore). The helicase enzyme unzips and unravels the double-stranded DNA, and one strand enters the nanopore. The enzyme feeds the strand through the nanopore with a ratcheting motion, base by base. The ionic current is reduced by a characteristic amount, depending on the base, so the current signal indicates the sequence of bases.
PacBio's machines rely on an optical rather than an electronic means of identifying the bases. PacBio's latest process, which it calls HiFi, begins by capping both ends of the DNA segment and untwisting it to create a single-stranded loop. Each loop is then placed in an infinitesimally tiny well in a microchip, which can have 25 million of those wells. Attached to each loop is a polymerase enzyme, which serves a critical function every time a cell divides. It attaches to single-stranded DNA and adds the complementary bases, making each rung of the ladder whole again. PacBio uses special versions of the four bases that have been engineered to fluoresce in a characteristic color when exposed to ultraviolet light.
A UV laser shines through the bottom of the tiny well, and a photosensor at the top detects the faint flashes of light as the polymerase goes around the DNA sample loop, base by base. The upshot is that there is a sequence of light flashes, at a rate of about three per second, that reveals the sequence of base pairs in the DNA sample.
Because the DNA sample has been converted into a loop, the whole process can be repeated, to achieve higher accuracy, by simply going around the loop another time. PacBio's flagship Revio machine typically makes five to 10 passes, achieving median accuracy rates as high as 99.9 percent, according to Aaron Wenger, senior director of product marketing at the company.
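Why repeated passes improve accuracy can be illustrated with a simple majority vote. The Python sketch below is not PacBio's actual consensus algorithm. It assumes the passes are already aligned and of equal length, and the five example passes, two of which contain a single wrong base, are invented.

```python
from collections import Counter


def consensus(passes):
    """Take the most common base at each position across aligned, equal-length passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*passes)
    )


# Five made-up passes around the same loop; the third and fourth each contain one error.
passes = [
    "GATTACA",
    "GATTACA",
    "GCTTACA",
    "GATTAGA",
    "GATTACA",
]
print(consensus(passes))  # GATTACA
```

As long as errors fall at random and different passes err in different places, the correct base tends to win the vote at each position.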
How Researchers Will Scale Up Long-Read Sequencing

That kind of accuracy doesn't come cheap. A Revio system, which has four chips, each with 25 million wells, costs around $600,000, according to Wenger. It weighs 465 kilograms and is about the size of a large household refrigerator. PacBio says a single Revio can sequence about four entire human genomes in a 24-hour period for less than $1,000 per genome.
ONT claims accuracy above 99 percent for its flagship machine, called PromethION 24. It costs around $300,000, according to Rosemary Sinclair Dokos, chief product and marketing officer at ONT. Another advantage of the ONT PromethION system is its ability to process fragments of DNA with as many as a million base pairs. ONT also offers an entry-level system, called MinION Mk1D, for just $3,000. It's about the size of two smartphones stacked on top of each other, and it plugs into a laptop, offering researchers a setup that can easily be toted into the field.

Although researchers often have strong preferences, it's not uncommon for a state-of-the-art genetics laboratory to be equipped with machines from both companies. At Barcelona's Centro Nacional de Análisis Genómico, for example, researchers have access to PacBio Revio machines as well as PromethION 24 and GridION machines from ONT.
Durbin, at Cambridge University, sees lots of upside in the current situation. "It's very good to have two companies," he declares. "They're in competition with each other for the market." And that competition will undoubtedly fuel the tech advances that the EBP's backers are counting on to get the project across the finish line.

PacBio's Wenger notes that the 25-million-well chips that underpin its Revio system are still being fabricated on 200-millimeter semiconductor wafers. A move to 300-mm wafers and more advanced lithographic techniques, he says, would enable them to get many more chips per wafer and put hundreds of millions of wells on each of those chips, if the market demands it.
At ONT, Dokos describes similar math. A single flow cell now consists of more than 2,000 nanopores, and a state-of-the-art PromethION 24 system can have 24 flow cells (or upward of 48,000 nanopores) running in parallel. But a future system could have hundreds of thousands of nanopores, she says, again if the market demands it.
The EBP will need all of those advances, and more. EBP director Lewin notes that after seven years, the three-phase initiative is wrapping up phase one and preparing for phase two. The goal for phase two is to sequence 150,000 genomes between 2026 and 2030. For phase two, "We've got to get to 37,500 genomes per year," Lewin says. "Right now, we're getting close to 3,000 per year." In phase two, the cost per genome sequenced will also have to decline from roughly $26,000 per genome in phase one to $6,100, according to the EBP's official road map. That $6,100 figure includes all costs: not just sequencing but also sampling and the other stages needed to produce a finished genome, with all of the genes identified and assigned to chromosomes.
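The nanopore arithmetic can be worked through in a few lines of Python. The sketch below combines the pore counts quoted here with the rate of about 400 bases per second per pore cited earlier. It assumes, unrealistically, that every pore is busy all the time, so the result is an idealized ceiling rather than a performance claim. The 300,000-pore future system is a hypothetical stand-in for "hundreds of thousands."

```python
# Figures quoted in the article.
NANOPORES_PER_FLOW_CELL = 2_000    # "more than 2,000", so treat as a lower bound
FLOW_CELLS_PER_SYSTEM = 24
BASES_PER_SECOND_PER_PORE = 400    # approximate ratcheting rate


def total_nanopores(per_cell: int = NANOPORES_PER_FLOW_CELL,
                    cells: int = FLOW_CELLS_PER_SYSTEM) -> int:
    """Number of nanopores running in parallel in one system."""
    return per_cell * cells


def peak_bases_per_second(pores: int, rate: int = BASES_PER_SECOND_PER_PORE) -> int:
    """Idealized ceiling: assumes every pore reads continuously at the quoted rate."""
    return pores * rate


today = total_nanopores()
print(today)                          # 48000
print(peak_bases_per_second(today))   # 19200000

# Hypothetical future system with 300,000 pores (an assumed value).
print(peak_bases_per_second(300_000))  # 120000000
```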
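Lewin's target follows directly from the phase-two goal, as a quick calculation shows. The Python sketch below assumes that "between 2026 and 2030" is counted as four years, which is the reading that reproduces his figure of 37,500 genomes per year. The variable names are invented for this example.

```python
PHASE_TWO_GENOMES = 150_000
PHASE_TWO_YEARS = 4            # assumption: 2026 to 2030 counted as four years
CURRENT_RATE_PER_YEAR = 3_000  # "getting close to 3,000 per year"

required_rate = PHASE_TWO_GENOMES / PHASE_TWO_YEARS
scale_up = required_rate / CURRENT_RATE_PER_YEAR

print(required_rate)  # 37500.0
print(scale_up)       # 12.5
```

In other words, phase two alone calls for output to rise to roughly 12.5 times the current annual rate.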

Phase three will up the ante even higher. The road map calls for more than 1.65 million genome sequences between 2030 and 2035 at a cost of $1,900 per genome. If they can pull it off, the entire project will have cost roughly $4.7 billion, considerably less in real terms than what it cost to do just the human genome 22 years ago. All of the data collected, the genome sequences for all named species on Earth, will occupy a little over 1 exabyte (1 billion gigabytes) of digital storage.
It will arguably be the most valuable exabyte in all of science. "With this genomic data, we can get to one of the questions that Darwin asked a long time ago, which is, How does a species arise? What is the origin of species? That's his famous book where he never actually answered the question," says Mark Blaxter, who leads the Darwin Tree of Life Project at the Wellcome Sanger Institute near Cambridge and who also conceived and started Project Psyche. "We'll get a much, much better idea about what it is that makes a species and how species are distinct from each other."
A portion of that knowledge will come from the many moths collected on those summer nights in the Italian Alps. "Lepidoptera go back around 300 million years," says Charlotte Wright, a co-leader, along with Blaxter, of Project Psyche. Analyzing the genomes of huge numbers of species will help explain why some branches of the lepidoptera have evolved far more species than others, she says.
And that kind of knowledge should eventually accumulate into answers to some of biology's most profound questions about evolution and the mechanisms by which it acts. "The amazing thing is that by doing this for all of the lepidoptera of Europe, we aren't just learning about individual cases," says Wright. "We've learned across all of it."
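The budget and storage figures can be cross-checked with simple arithmetic. The Python sketch below multiplies the genome counts and per-genome costs quoted for phases two and three, then subtracts the result from the $4.7 billion total to see what that leaves for phase one. It also divides 1 exabyte by the 1.8 million named species mentioned at the start of the article. Treating the quoted figures as exact is an assumption; the road map describes them as approximate.

```python
# Figures quoted in the article, treated as exact for the sake of the estimate.
PHASE_TWO_GENOMES, PHASE_TWO_COST = 150_000, 6_100
PHASE_THREE_GENOMES, PHASE_THREE_COST = 1_650_000, 1_900
TOTAL_PROJECT_COST = 4_700_000_000

phase_two_total = PHASE_TWO_GENOMES * PHASE_TWO_COST
phase_three_total = PHASE_THREE_GENOMES * PHASE_THREE_COST
implied_phase_one = TOTAL_PROJECT_COST - phase_two_total - phase_three_total

print(phase_two_total)     # 915000000
print(phase_three_total)   # 3135000000
print(implied_phase_one)   # 650000000

# Average storage per species if 1 exabyte covers all 1.8 million named species.
EXABYTE_IN_GIGABYTES = 1_000_000_000
NAMED_SPECIES = 1_800_000
gb_per_species = EXABYTE_IN_GIGABYTES / NAMED_SPECIES
print(round(gb_per_species))  # 556
```

On these numbers, phase three accounts for about two-thirds of the total cost, and the average species takes up a bit more than half a terabyte.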