	---
	language:
	- dna
	tags:
	- biology
	- genomics
	- foundation-model
	license: apache-2.0
	---

	# Evo 2 (1B Base) - Hugging Face Transformers Format

	This repository contains the Evo 2 (1B Base) model, converted to the Hugging Face Transformers format.

	- Original Repository: [arcinstitute/evo2_1b_base](https://huggingface.co/arcinstitute/evo2_1b_base)
	- Paper: [Genome modeling and design across all domains of life with Evo 2](https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1)
	- Authors: Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, et al.

	## Model Description

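	Evo 2 is a family of biological foundation models for DNA. The paper reports training on 9.3 trillion DNA base pairs from a curated genomic atlas spanning all domains of life, with a context window of up to 1 million base pairs at single-nucleotide resolution for the larger models. The models use the StripedHyena 2 architecture. This repository holds the 1B base checkpoint, which the upstream project describes as pretrained with an 8,192-token context, so the 1-million-base-pair figure should not be assumed to apply to it. Typical uses include predicting the functional effects of mutations and generating novel genomic sequences.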

	This version has been converted to be compatible with the `transformers` library, allowing for easy loading and inference.
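
The mutation-effect use case mentioned above is often approached by comparing the log-likelihood a causal language model assigns to a reference sequence and to its mutated counterpart. The sketch below illustrates only that arithmetic in plain Python. The `VOCAB` mapping and the `toy_logits` function are hypothetical stand-ins for the real tokenizer and model forward pass. With this checkpoint, the logits would presumably come from `model(input_ids).logits`, which is the usual `transformers` convention but is an assumption here.

```python
import math

# Hypothetical 4-letter vocabulary, used only for this illustration.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Map each base to its (hypothetical) token id."""
    return [VOCAB[base] for base in seq]


def toy_logits(token_ids):
    """Stand-in for a model forward pass (assumption, not the real model).

    Favors the next base in the cycle A -> C -> G -> T -> A.
    """
    rows = []
    for tok in token_ids:
        row = [0.0] * len(VOCAB)
        row[(tok + 1) % len(VOCAB)] = 2.0
        rows.append(row)
    return rows


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum of log P(token[t + 1] | tokens[: t + 1]) under a causal model.

    logits[t] scores the token at position t + 1, so the last row of
    logits and the first token do not contribute to the sum.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


def variant_score(ref_seq, alt_seq):
    """Log-likelihood of the mutated sequence minus that of the reference."""
    ref_ids, alt_ids = encode(ref_seq), encode(alt_seq)
    ref_ll = sequence_log_likelihood(toy_logits(ref_ids), ref_ids)
    alt_ll = sequence_log_likelihood(toy_logits(alt_ids), alt_ids)
    return alt_ll - ref_ll


# A negative score means the toy model finds the mutated sequence less likely.
score = variant_score("ACGT", "ACTT")
print(f"Delta log-likelihood: {score:.3f}")
```

A negative score indicates that the model considers the mutated sequence less likely than the reference. How well such a score tracks functional impact for this particular checkpoint is not established by this model card.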

	## Usage

	You can load and run this model using the `transformers` library as follows:

	```python
	import torch
	from transformers import Evo2ForCausalLM, Evo2Tokenizer

	# Replace with your local path or the Hub repo ID after uploading
	model_path = "path/to/this/repo"

	print(f"Loading model from {model_path}...")
	model = Evo2ForCausalLM.from_pretrained(model_path)
	tokenizer = Evo2Tokenizer.from_pretrained(model_path)

	# Move to GPU if available
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = model.to(device)

	# Input sequence (DNA)
	sequence = "ACGTACGT"
	print(f"Input: {sequence}")

	# Tokenize
	input_ids = tokenizer.encode(sequence, return_tensors="pt").to(device)

	# Generate (no_grad avoids tracking gradients during inference)
	print("Generating...")
	with torch.no_grad():
	    output = model.generate(input_ids, max_new_tokens=20)

	# Decode
	generated_sequence = tokenizer.decode(output[0])
	print(f"Output: {generated_sequence}")
	```
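
Real-world sequences often arrive with line breaks, lowercase letters, or unexpected symbols, so it can help to normalize the input before tokenizing and to separate the prompt from the continuation afterwards. The helpers below are a minimal sketch in plain Python. `clean_dna` and `strip_prompt` are hypothetical names, the allowed alphabet `ACGTN` is an assumption, and this model card does not state whether the tokenizer accepts lowercase input or whether the decoded output always starts with the prompt.

```python
def clean_dna(raw, alphabet="ACGTN"):
    """Uppercase a sequence, drop whitespace, and reject unexpected symbols.

    The default alphabet is an assumption, not a documented requirement.
    """
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("Sequence is empty")
    unexpected = sorted(set(seq) - set(alphabet))
    if unexpected:
        raise ValueError(f"Unexpected symbols in sequence: {unexpected}")
    return seq


def strip_prompt(prompt, decoded):
    """Return only the continuation when the decoded text starts with the prompt."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):]
    # Leave the text untouched if the prompt is not a prefix.
    return decoded


# Example: normalize a FASTA-style fragment, then split a decoded string.
sequence = clean_dna("acgt\nACGT ")
print(sequence)
print(strip_prompt(sequence, sequence + "TTGA"))
```

In the usage example above, `clean_dna` would be applied to `sequence` before `tokenizer.encode`, and `strip_prompt` to `generated_sequence` after `tokenizer.decode`.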

	## Citation

	If you use this model, please cite the original paper:

	```bibtex
	@article{brixi2025genome,
	  title={Genome modeling and design across all domains of life with Evo 2},
	  author={Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and others},
	  journal={bioRxiv},
	  year={2025},
	  publisher={Cold Spring Harbor Laboratory}
	}
	```