Large genome model: Open source AI trained on trillions of bases
Late in 2025, we covered the development of an AI system called Evo that was trained on massive numbers of bacterial genomes, so many that, when prompted with sequences from a cluster of related genes, it could correctly identify the next one or suggest a completely novel protein.
That system worked because bacteria tend to cluster related genes together, something that's not true in organisms with complex cells, which tend to have equally complex genome structures. Given that, our coverage noted, "It's not clear that this approach will work with more complex genomes."
Apparently, the team behind Evo viewed that as a challenge, because today it is describing Evo 2, an open source AI that has been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes). After training on trillions of base pairs of DNA, Evo 2 developed internal representations of key features in even complex genomes like ours, including things like regulatory DNA and splice sites, which can be challenging for humans to spot.