| --- |
| license: creativeml-openrail-m |
| language: |
| - en |
| metrics: |
| - bleu |
| tags: |
| - endpoints |
| - text-generation-inference |
| inference: true |
| --- |
| |
| <h3 align='center' style='font-size: 24px;'>Blazing Fast Tiny Vision Language Model</h3> |
|
|
|
|
| <p align='center' style='font-size: 16px;'>A custom 3B-parameter model built by <a href="https://www.linkedin.com/in/manishkumarthota/">@Manish</a>. The model is released for research purposes only; commercial use is not allowed.</p> |
|
|
| ## How to use |
|
|
|
|
| **Install dependencies** |
| ```bash |
| pip install transformers  # the latest version works, but v4.31.0 is recommended
| pip install -q torch pillow accelerate einops
| ``` |
|
|
| Use the following code for model inference. The prompt format is similar to that of [LLaVA](https://github.com/haotian-liu/LLaVA).
|
|
| ```python
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from PIL import Image |
| |
| torch.set_default_device("cuda") |
| |
| # Load the model and tokenizer
| model = AutoModelForCausalLM.from_pretrained(
|     "ManishThota/CustomModel",
|     torch_dtype=torch.float16,
|     device_map="auto",
|     trust_remote_code=True,
| )
| tokenizer = AutoTokenizer.from_pretrained("ManishThota/CustomModel", trust_remote_code=True)
| |
| # Answer a question about the image at image_path
| def predict(question, image_path):
|     # Build the prompt and load the image
|     text = f"USER: <image>\n{question}? ASSISTANT:"
|     image = Image.open(image_path)
| 
|     input_ids = tokenizer(text, return_tensors='pt').input_ids.to('cuda')
|     image_tensor = model.image_preprocess(image)
| 
|     # Generate up to 25 new tokens
|     output_ids = model.generate(
|         input_ids,
|         max_new_tokens=25,
|         images=image_tensor,
|         use_cache=True)[0]
| 
|     # Decode only the newly generated tokens, skipping the prompt
|     return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
| |
| ``` |
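
The `predict` function above always appends a `?` to the question, so a question that already ends with one yields `??` in the prompt. Below is a minimal sketch of a prompt helper, assuming the `USER: <image>\n... ASSISTANT:` template shown above is the format the model expects. The name `build_prompt` is hypothetical and not part of the model's API.

```python
def build_prompt(question: str) -> str:
    """Format a question with the USER/ASSISTANT template from the inference code.

    Hypothetical helper: strips surrounding whitespace and any trailing
    question marks, then appends exactly one "?" as the template does.
    """
    cleaned = question.strip().rstrip("?").rstrip()
    return f"USER: <image>\n{cleaned}? ASSISTANT:"


# Both calls produce the same prompt, with a single trailing question mark.
print(repr(build_prompt("What is in the image?")))
print(repr(build_prompt("What is in the image")))
```

The returned string could be passed to the tokenizer in place of the f-string inside `predict`.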