Qwiglip VLM (Qwen2 + SigLIP)

A custom vision-language model built from scratch, inspired by the LLaVA architecture but using a custom MLP projector and LoRA fine-tuning for efficient training. Trained on https://huggingface.co/datasets/phiyodr/coco2017. Full repository: https://github.com/teohyc/qwiglip_vlm
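
As a rough illustration of the LoRA setup, the sketch below wraps the base LLM with low-rank adapters via peft. The rank, alpha, and target modules here are assumptions, not the values used for this checkpoint; see the training code in the repository for the real configuration.

import torch
from transformers import Qwen2ForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base LLM that will be wrapped with low-rank adapters
llm = Qwen2ForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Illustrative LoRA configuration (hyperparameters are assumptions)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only the adapter weights train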

Components

  • Base LLM: Qwen/Qwen2-0.5B-Instruct
  • Vision encoder: google/siglip-base-patch16-224
  • LoRA fine-tuning of the LLM for parameter-efficient training (sketched above)
  • Custom MLP projector mapping SigLIP features into the LLM embedding space (see the sketch after this list)
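
The real projector implementation lives in vlm_model.MLPProjector; below is a plausible two-layer sketch, assuming siglip-base's 768-dim patch features and Qwen2-0.5B's 896-dim hidden size (both dimensions are assumptions drawn from the named base models, not from this repository's code).

import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps SigLIP patch features into the LLM embedding space.
    Dimensions are assumptions: 768 for siglip-base-patch16-224,
    896 for Qwen2-0.5B-Instruct."""
    def __init__(self, vision_dim=768, llm_dim=896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(image_features)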

Usage

See inference.py in the repository for a detailed, end-to-end inference example.

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModel, Qwen2ForCausalLM
from peft import PeftModel

from vlm_model import MLPProjector, SiglipQwenVLM

# Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

LLM_NAME = "Qwen/Qwen2-0.5B-Instruct"
VISION_NAME = "google/siglip-base-patch16-224"

# Trained artifacts: the LoRA adapter directory and projector weights
LORA_PATH = "lora_adapter"
PROJECTOR_PATH = "projector.pt"

# 196 = (224 / 16)^2 patches produced by SigLIP per image
NUM_IMAGE_TOKENS = 196

# Refer to inference.py for the full pipeline
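
Continuing from the configuration above, here is a rough sketch of how the pieces could be wired together. The SiglipQwenVLM constructor arguments and its generation interface are assumptions; inference.py is the authoritative version.

# Load the building blocks
tokenizer = AutoTokenizer.from_pretrained(LLM_NAME)
processor = AutoProcessor.from_pretrained(VISION_NAME)
vision_encoder = AutoModel.from_pretrained(VISION_NAME).to(DEVICE)

# Base LLM with the trained LoRA adapter attached
llm = Qwen2ForCausalLM.from_pretrained(LLM_NAME).to(DEVICE)
llm = PeftModel.from_pretrained(llm, LORA_PATH)

# Trained projector weights (constructor arguments assumed; see vlm_model.py)
projector = MLPProjector()
projector.load_state_dict(torch.load(PROJECTOR_PATH, map_location=DEVICE))
projector.to(DEVICE)

# Assemble the VLM (argument order is an assumption; see vlm_model.py)
model = SiglipQwenVLM(vision_encoder, projector, llm, tokenizer)

image = Image.open("example.jpg")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(DEVICE)
# The call below is illustrative; inference.py shows the real generation interface
# output = model.generate(pixel_values, prompt="Describe this image.")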