Genome modelling and design across all domains of life with Evo 2 – Nature

Lead

Evo 2 is a genome-scale foundation model introduced in March 2026 that learns DNA sequence patterns across bacteria, archaea, eukarya and phage to enable prediction and design at multiple biological scales. The team trained two model sizes, a 7 billion parameter variant on 2.4 trillion tokens and a 40 billion parameter variant on 9.3 trillion tokens, using the OpenGenome2 corpus of more than 8.8 trillion nucleotides. Training combined a short-context pretraining phase with a midtraining phase that extends context to one million base pairs, and the developers excluded eukaryote-infecting viral genomes for biosafety. The result is a generalist model that yields zero-shot functional predictions, interpretable latent features, and guided genome-scale sequence generation.

Key takeaways

  • Evo 2 consists of two released checkpoints: a 7B model trained on 2.4 trillion tokens and a 40B model trained on 9.3 trillion tokens, both using the OpenGenome2 dataset with >8.8 trillion nucleotides.
  • Training used two phases: pretraining at an 8,192 token context focused on genic windows, then midtraining extending context to 1,000,000 base pairs to learn long-range genomic relationships.
  • The architecture, StripedHyena 2, mixes three convolutional operator variants with attention and improves throughput and loss scaling relative to transformer baselines at long context lengths.
  • Zero-shot likelihoods from Evo 2 correlate with experimental mutational scans across proteins and noncoding RNAs and can predict gene essentiality in prokaryotes and phage using premature stop insertions.
  • Evo 2 achieves leading unsupervised performance across human variant categories, notably excelling on non-SNV variants such as indels and duplications, and its embeddings serve as inputs for strong supervised classifiers.
  • Sparse autoencoders trained on Evo 2 embeddings reveal interpretable features that map to prophages, tRNAs, rRNAs, exon boundaries and transcription factor motifs, and these features transfer across species.
  • As a generator, Evo 2 produces organelle-, prokaryote- and eukaryote-scale sequences that resemble natural genomes on in silico metrics, and inference-time guidance with Enformer and Borzoi enabled experimental control of multi-kilobase chromatin accessibility patterns.
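The sparse-autoencoder idea in the takeaways above can be sketched in a few lines: a ReLU encoder with an L1 sparsity penalty, trained to reconstruct embeddings. This is a generic SAE, not the authors' implementation; the sizes, hyperparameters and the synthetic stand-in data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64            # toy sizes; real SAEs are much wider
X = rng.normal(size=(256, d_model))   # stand-in for model-layer embeddings

W_e = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_e = np.zeros(d_hidden)
W_d = W_e.T.copy()                    # decoder initialized as encoder transpose
b_d = np.zeros(d_model)
lr, l1 = 1e-2, 1e-3                   # learning rate and sparsity penalty

def encode(X):
    return np.maximum(X @ W_e + b_e, 0.0)   # sparse ReLU code

mse0 = ((encode(X) @ W_d + b_d - X) ** 2).mean()

for _ in range(500):
    H = encode(X)
    err = H @ W_d + b_d - X                           # reconstruction error
    dH = (err @ W_d.T + l1 * np.sign(H)) * (H > 0)    # backprop through ReLU + L1
    W_d -= lr * H.T @ err / len(X)
    b_d -= lr * err.mean(axis=0)
    W_e -= lr * X.T @ dH / len(X)
    b_e -= lr * dH.mean(axis=0)

H = encode(X)
mse = ((H @ W_d + b_d - X) ** 2).mean()
sparsity = (H > 0).mean()             # fraction of active latent units
```

In the paper's setting, individual latent units of such an autoencoder are the "features" inspected for alignment with annotations like exon boundaries or tRNA genes.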

Background

Biological design spans levels from single molecules to whole genomes, and building models that generalize across this range requires training on massively diverse sequence data. Prior work showed that models trained on prokaryotic genomes can learn functional signal for DNA, RNA and proteins, but eukaryotic genomes introduce orders of magnitude more length and regulatory complexity. Evo 2 was developed to address that gap by assembling a curated, non-redundant training corpus that represents bacteria, archaea, eukarya and bacteriophage and by extending sequence modeling paradigms to million-base-pair contexts.

Scaling models to genome-length context requires innovations in data curation, training strategy and model architecture to remain computationally tractable. The Evo 2 team prioritized a generalist representation rather than task-specific fine-tuning during pretraining, adopting a two-phase approach that first learns short-range functional motifs and coding grammar, then learns long-range relationships such as chromatin domain structure. The team also applied safety-driven data exclusions, for example omitting genomes of viruses that infect eukaryotic hosts, and validated that those exclusions reduce model competence in the excluded domains.

Main event

The Evo 2 project released two principal models, a 7 billion and a 40 billion parameter variant, trained on the OpenGenome2 dataset assembled from curated nucleotide sequences totalling over 8.8 trillion bases. Initial pretraining used an 8,192 token window with data weighting that emphasized genic windows to teach functional elements. Midtraining then extended the context window in stages up to one million base pairs to capture interactions across kilobase to megabase distances. This staged schedule follows best practice in long-context language modelling and improved efficiency and final loss.
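The staged extension described above can be sketched as a simple training schedule that doubles sequence length each stage while holding tokens per batch roughly constant. The stage lengths and the tokens-per-batch figure here are illustrative assumptions, not Evo 2's published schedule.

```python
def extension_schedule(start=8_192, target=1_048_576, tokens_per_batch=2**23):
    """Staged context extension: double the context length each stage while
    keeping tokens per batch constant by shrinking the batch size.
    Numbers are illustrative, not the published Evo 2 schedule."""
    ctx, stages = start, []
    while ctx <= target:
        stages.append({"context_len": ctx, "batch_size": tokens_per_batch // ctx})
        ctx *= 2
    return stages

schedule = extension_schedule()
```

Keeping tokens per batch fixed is a common way to hold optimizer dynamics steady as the window grows from short genic contexts toward megabase scale.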

Architecturally, Evo 2 uses StripedHyena 2, a multi-hybrid convolutional design that combines short explicit, medium regularized and long implicit operators with attention, which the authors report delivers higher throughput and better loss scaling than transformer baselines at large context lengths. The model was validated with synthetic tasks that test long-context recall, and with perplexity and needle-in-a-haystack evaluations demonstrating that it can retrieve a 100 bp signal within a 1 million bp context.
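A needle-in-a-haystack test case of the kind described can be constructed as follows. The probe convention used here, re-showing the needle's first half at the end of the context and scoring whether the model's continuation reproduces the second half, is one common recipe, not necessarily the authors' exact protocol.

```python
import random

random.seed(0)
BASES = "ACGT"

def make_niah_example(context_len=1_000_000, needle_len=100):
    """Build a needle-in-a-haystack case: a random background with a known
    100 bp needle inserted once, plus the prompt used to probe recall."""
    needle = "".join(random.choices(BASES, k=needle_len))
    background = "".join(random.choices(BASES, k=context_len - needle_len))
    pos = random.randrange(len(background))
    haystack = background[:pos] + needle + background[pos:]
    # probe: repeat the needle's first half at the end of the context and
    # check whether the model's continuation reproduces the second half
    prompt = haystack + needle[: needle_len // 2]
    target = needle[needle_len // 2 :]
    return haystack, prompt, target, pos

haystack, prompt, target, pos = make_niah_example()
```

A model with working million-base recall should assign high likelihood to `target` as the continuation of `prompt`; a model without it can do no better than chance on the random background.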

Functionally, Evo 2 likelihoods respond as expected to sequence perturbations: mutations in start and stop codons, non-synonymous substitutions, frameshifts and deletions of tRNA and rRNA genes produce larger drops in likelihood than synonymous changes or intergenic deletions. Evo 2 also learned species-specific stop-codon usage and responds to recoding experiments in a manner consistent with context-dependent inference of the genetic code. Zero-shot likelihoods correlate with deep mutational scanning assays for diverse proteins and structured RNAs, and likelihood-based scoring of premature stop insertions predicts prokaryotic gene essentiality.
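The likelihood-based scoring logic can be illustrated with a toy stand-in for the model. Here `toy_loglik` is a fabricated scorer that only checks ORF integrity (start codon, no internal stop, terminal stop); a real analysis would query the model's log-likelihood instead. The delta convention, variant score minus reference score, matches the zero-shot setup described above.

```python
STOPS = {"TAA", "TAG", "TGA"}

def toy_loglik(seq):
    """Fabricated stand-in for a model log-likelihood: rewards an intact ORF
    and penalizes internal stop codons. Illustrative only."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    score = 0.0
    if codons and codons[0] == "ATG":
        score += 5.0                      # intact start codon
    for c in codons[1:-1]:
        if c in STOPS:
            score -= 10.0                 # premature internal stop
    if codons and codons[-1] in STOPS:
        score += 5.0                      # in-frame terminal stop
    return score

def delta_ll(ref, alt, loglik=toy_loglik):
    """Zero-shot effect score: variant minus reference log-likelihood.
    More negative = predicted more disruptive."""
    return loglik(alt) - loglik(ref)

ref = "ATG" + "GCT" * 10 + "TAA"          # intact toy ORF
nonsense = ref[:9] + "TAA" + ref[12:]     # premature stop at codon 4
synonymous = ref[:3] + "GCC" + ref[6:]    # GCT -> GCC, same amino acid
```

Applied gene by gene, the same delta for a premature stop insertion is the essentiality signal described above: essential genes show large likelihood drops when truncated.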

Analysis and implications

Evo 2 demonstrates that a single DNA-centric foundation model can capture multilayered biological signals spanning coding grammar, structural protein signatures and noncoding regulatory motifs. By training across domains of life and scaling context length, Evo 2 acquires representations that generalize to tasks commonly reserved for specialized models, narrowing the gap between generalist unsupervised models and task-tailored predictors, especially for variant classes that are hard to handle with alignment-based approaches, such as indels.

The model enables both zero-shot scoring and feature extraction for supervised downstream work. For clinical variant interpretation, zero-shot Evo 2 performs strongly on non-SNV categories and competitively on SNVs, and supervised classifiers built on Evo 2 embeddings can achieve very high performance on gene-specific tasks such as BRCA1 saturation mutagenesis. This two-pronged utility makes Evo 2 a practical foundation for groups that lack extensive labeled datasets but want to combine representation power with lightweight supervised tuning.
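A lightweight supervised head over embeddings might look like the following sketch. The synthetic features stand in for real per-variant embeddings, and the logistic-regression head is a generic choice, not the classifier architecture reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for per-variant embeddings: the two classes
# (e.g. pathogenic vs benign) differ along a hidden linear direction
d = 32
w_true = rng.normal(size=d)
X = rng.normal(size=(400, d))
y = (X @ w_true + 0.5 * rng.normal(size=400) > 0).astype(float)

# logistic-regression head trained by full-batch gradient descent
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    g = p - y                                # cross-entropy gradient wrt logits
    w -= 0.1 * X.T @ g / len(X)
    b -= 0.1 * g.mean()

acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
```

The point of the sketch is the workflow: freeze the foundation model, extract embeddings once, and fit a small labeled-data head, which is cheap enough for groups without large compute budgets.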

On the generative side, Evo 2 shows that autoregressive genome-scale generation is possible and that inference-time guidance with independent, sequence-to-function predictors allows explicit control of epigenomic properties. The authors experimentally validated multi-kilobase designs that produce prescribed chromatin accessibility peaks in mouse and human cells, demonstrating that combining a capable generative proposal with black-box scoring can realize functional design objectives at kilobase scales.
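The guidance loop can be caricatured as beam search against a black-box scorer. In this sketch a GC-content target stands in for the Enformer/Borzoi accessibility predictors, and candidate bases are enumerated exhaustively rather than proposed by the generative model; both simplifications are assumptions for illustration.

```python
BASES = "ACGT"

def accessibility_score(seq):
    """Stand-in for an external sequence-to-function predictor (the paper
    uses Enformer/Borzoi); here we reward GC content near 60%."""
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    return -abs(gc - 0.6)

def guided_beam_search(length=40, beam_width=8):
    """Grow sequences base by base, keeping the beam_width partial
    sequences the black-box scorer ranks highest at every step."""
    beam = [""]
    for _ in range(length):
        candidates = [seq + base for seq in beam for base in BASES]
        candidates.sort(key=accessibility_score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

best = guided_beam_search()
```

In the full method the proposal distribution comes from the generative model itself and the scorer is only consulted to rank candidates, which is what makes the predictor effectively black-box.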

However, computational metrics do not guarantee cellular functionality, and the model does not obviate the need for iterative experimental design and validation. The Evo 2 team acknowledges limitations including lower confidence in virus-related domains due to training exclusions, the computational cost of beam search guidance, and the current gap between in silico generation and guaranteed biological viability.

Comparison and data

Model       Parameters   Training tokens   Max context
Evo 2 7B    7 billion    2.4 trillion      1,000,000 bp
Evo 2 40B   40 billion   9.3 trillion      1,000,000 bp

These figures summarize the core compute and dataset scale reported by the authors. The study also compares Evo 2 against specialized and generalist baselines across many benchmarks: deep mutational scanning for proteins and RNAs, ClinVar variant pathogenicity, splice-effect repositories, human BRCA1/2 saturation datasets, DART-Eval regulatory benchmarks and genome-generation metrics including gene annotation hit rates and protein structure prediction alignment. Across these diverse evaluations Evo 2 is frequently the top performer among unsupervised DNA language models and competitive with supervised methods on several tasks, while supervised models trained on assay-specific data can still lead on highly specialized regulatory prediction tasks.

Reactions and quotes

Independent researchers noted the technical breadth of the work while urging careful interpretation of computational assessments versus experimental function. Peer reviewers and external experts have highlighted the open release of data and code as important for reproducibility, and they emphasized the need for community-led safety evaluation of how such models are used.

Evo 2 represents a significant step toward unified genome-scale modelling, but computational generation remains a hypothesis that requires systematic experimental validation.

Independent genomics expert

Project collaborators emphasized the practical utility of opening models and datasets to the community, while also describing the safety-focused choices made during data assembly. The release includes model parameters, training and inference code, the OpenGenome2 dataset, and visualization and design tools to enable external validation and extension.

We open-sourced model weights, training code and the OpenGenome2 dataset to enable reproducible research and responsible community use.

Arc Institute team

Laboratories that performed experimental validations described empirical success in designing chromatin-accessibility patterns across mouse and human cells, and pointed to remaining engineering work needed to scale functional testing of genome-scale designs. They urged that open resources be coupled with governance and best-practice guidance for biological design work.

Beam search guidance coupled with ensemble predictors produced experimentally validated multi-kilobase accessibility designs, but broader functional testing will require iterative laboratory pipelines.

Experimental validation team

Unconfirmed

  • Whether Evo 2-generated full genomes are replication competent or functional in cells remains unproven beyond selected organelle and chromatin-accessibility experiments.
  • The effectiveness of data exclusion measures against all avenues of misuse, including task-specific retraining, cannot be fully guaranteed and requires ongoing community evaluation.
  • Long-term generalization of SAE-derived features to poorly represented clades or highly degraded ancient DNA requires further benchmarking.

Bottom line

Evo 2 is a major step toward a general genomic foundation model that operates from DNA to organismal scales across all domains of life. Its combination of large-scale, multi-domain training, long-context modelling and interpretability tools lets researchers score variation, extract biologically meaningful features and propose genome-scale sequences.

For applied users, Evo 2 is most useful as a foundation for downstream supervised tuning and for generating proposals that can be filtered and optimized with task-specific predictive models. The model improves unsupervised performance on many variant types, especially non-SNV classes, but experimental validation remains essential before any design is considered functionally reliable or safe.

The open release of code, weights and data invites rapid community adoption and independent evaluation, but also places responsibility on users and institutions to follow biosafety best practices and to prioritize transparent, ethical use.
