---
language:
- dna
tags:
- biology
- genomics
- foundation-model
license: apache-2.0
---

# Evo 2 (1B Base) - Hugging Face Transformers Format

This repository contains the **Evo 2 (1B Base)** model, converted to the Hugging Face Transformers format.

- **Original Repository:** [arcinstitute/evo2_1b_base](https://huggingface.co/arcinstitute/evo2_1b_base)
- **Paper:** [Genome modeling and design across all domains of life with Evo 2](https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1)
- **Authors:** Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, et al.
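
## Model Description

Evo 2 is a biological foundation model trained on 9.3 trillion DNA base pairs from a curated genomic atlas spanning all domains of life. It uses the StripedHyena 2 architecture to process long sequences (up to 1 million base pairs) at single-nucleotide resolution. It is designed for tasks such as predicting the functional effects of mutations and generating novel genomic sequences. These figures describe the Evo 2 family as reported in the paper; the 1B base checkpoint is the smallest variant and, per the original repository, was pretrained with a context length of 8,192 tokens, so check the original model card before relying on very long inputs.

This version has been converted for compatibility with the `transformers` library, so the model can be loaded and run with the standard `from_pretrained` and `generate` calls.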
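
The mutation-effect use case mentioned above is commonly approached by scoring a reference sequence and a mutated copy, then comparing the two scores. The following is a minimal sketch of that comparison, not the authors' method: `apply_snv`, `delta_score`, and `toy_gc_score` are hypothetical names introduced here for illustration, and the toy GC-content scorer is only a stand-in so the example runs without a model. In practice you would pass a function that returns the model's log-likelihood for a sequence.

```python
from typing import Callable


def apply_snv(sequence: str, position: int, alt: str) -> str:
    """Return a copy of `sequence` with the base at 0-based `position` replaced by `alt`."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError(f"alt must be one of A, C, G, T; got {alt!r}")
    if not 0 <= position < len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def delta_score(
    score_fn: Callable[[str], float], reference: str, position: int, alt: str
) -> float:
    """Score of the variant minus score of the reference under `score_fn`."""
    variant = apply_snv(reference, position, alt)
    return score_fn(variant) - score_fn(reference)


def toy_gc_score(sequence: str) -> float:
    """Hypothetical stand-in scorer: fraction of G/C bases (NOT a model likelihood)."""
    return sum(base in "GC" for base in sequence) / len(sequence)


reference = "ACGTACGT"
# Substitute the first base (A -> G) and compare scores.
print(delta_score(toy_gc_score, reference, position=0, alt="G"))
```

With a likelihood-based scorer, a strongly negative difference would suggest the model finds the variant less plausible than the reference; how well that tracks functional effect depends on the model and the task.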

## Usage

You can load and run this model using the `transformers` library as follows:

```python
import torch
from transformers import Evo2ForCausalLM, Evo2Tokenizer

# Replace with your local path or the Hub repo ID after uploading
model_path = "path/to/this/repo"

print(f"Loading model from {model_path}...")
model = Evo2ForCausalLM.from_pretrained(model_path)
tokenizer = Evo2Tokenizer.from_pretrained(model_path)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Input sequence (DNA)
sequence = "ACGTACGT"
print(f"Input: {sequence}")

# Tokenize
input_ids = tokenizer.encode(sequence, return_tensors="pt").to(device)

# Generate
print("Generating...")
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

# Decode
generated_sequence = tokenizer.decode(output[0])
print(f"Output: {generated_sequence}")
```
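
Sequences copied from FASTA files or spreadsheets often contain lowercase letters, line breaks, or stray characters. The sketch below shows one way to clean a sequence before passing it to the tokenizer. It rests on assumptions: `normalize_dna` is a hypothetical helper, and the allowed alphabet (`A`, `C`, `G`, `T`, plus `N` for unknown bases) is a guess at what you want to accept, not a documented requirement of this tokenizer.

```python
def normalize_dna(sequence: str, allowed: str = "ACGTN") -> str:
    """Strip whitespace, uppercase the sequence, and reject unexpected characters.

    The default alphabet is an assumption; adjust `allowed` to match your data.
    """
    # Remove all whitespace (spaces, tabs, newlines) and uppercase the rest.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("DNA sequence is empty after removing whitespace")
    invalid = sorted(set(cleaned) - set(allowed))
    if invalid:
        raise ValueError(f"Unexpected characters in DNA sequence: {invalid}")
    return cleaned


print(normalize_dna("acgt acgt\n"))  # prints ACGTACGT
```

Failing loudly on unexpected characters is a deliberate choice here: silently dropping them would shift every downstream position.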

## Citation

If you use this model, please cite the original paper:

```bibtex
@article{brixi2025genome,
  title={Genome modeling and design across all domains of life with Evo 2},
  author={Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and others},
  journal={bioRxiv},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```