
PMC2 Medical Image Captioning Model - M2 Deployment

Model Information

  • Base Model: BLIP-2 OPT 2.7B
  • Training Device: NVIDIA RTX 4050
  • Inference Device: Apple M2 Air (MPS)
  • Dataset: PMC2 (PubMed Central Medical Images)
  • Training: 3 epochs on 3,355 medical images
  • Fine-tuning: LoRA (Low-Rank Adaptation)

Setup on M2 Air

Requirements

pip install torch torchvision transformers pillow

Quick Start - Single Image

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

# Load model and processor
print("Loading model...")
processor = Blip2Processor.from_pretrained("./")
model = Blip2ForConditionalGeneration.from_pretrained(
    "./",
    torch_dtype=torch.float16
).to("mps")  # Use Metal Performance Shaders on M2
print("Model loaded!")

# Generate caption for a medical image
image = Image.open("medical_image.jpg")
inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)

# Generate caption
print("Generating caption...")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=50)
    
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Caption: {caption}")

Batch Processing - Multiple Images

from pathlib import Path

# Process all images in a folder
image_folder = Path("./medical_images")
results = []

for img_path in image_folder.glob("*.jpg"):
    print(f"Processing: {img_path.name}")
    image = Image.open(img_path)
    inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)
    
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_length=50)
    
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    results.append({"image": img_path.name, "caption": caption})
    print(f"  → {caption}")

# Save results
import json
with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)

Performance Tips

Memory Optimization

  • Use FP16 (torch.float16) for inference
  • Process images one at a time on M2 Air
  • Clear cache between batches: torch.mps.empty_cache()
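As a sketch, clearing the MPS cache after each generation looks like this; the guard makes the same code a no-op on machines without an MPS device:

```python
import torch

# Release cached MPS allocations so unified memory on the M2 Air
# doesn't fill up across a long batch run. On machines without MPS
# the guard simply skips the call.
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
```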

Generation Parameters

# Adjust these for different caption styles.
# Note: temperature and top_p only take effect when do_sample=True;
# with beam search alone they are ignored.
generated_ids = model.generate(
    **inputs,
    max_length=50,          # Maximum caption length
    num_beams=5,            # Beam search (higher = better quality, slower)
    do_sample=True,         # Enable sampling so temperature/top_p apply
    temperature=1.0,        # Randomness (lower = more deterministic)
    top_p=0.9,              # Nucleus sampling
    repetition_penalty=1.2  # Penalize repeated tokens
)

Technical Details

Model Architecture

  • Vision Encoder: ViT (Vision Transformer)
  • Language Model: OPT-2.7B
  • Training: LoRA fine-tuning on medical images
  • Parameters: ~2.7B (base) + 7.8M (LoRA)

Training Configuration

  • Dataset: 3,355 PubMed Central medical images
  • Epochs: 3
  • Batch Size: 4 (effective 16 with gradient accumulation)
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • LoRA: r=16, alpha=32
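As a rough sketch, the LoRA hyperparameters above can be collected into a config dict; the dropout value and target modules are assumptions (typical choices for OPT-style attention, not stated in this card):

```python
# LoRA hyperparameters from the training run; lora_dropout and
# target_modules are assumed typical values, not confirmed here.
lora_config = {
    "r": 16,                                 # rank of the low-rank update matrices
    "lora_alpha": 32,                        # scaling numerator
    "lora_dropout": 0.05,                    # assumed
    "target_modules": ["q_proj", "v_proj"],  # assumed attention projections
}

# The effective scale applied to each low-rank update is alpha / r:
print(lora_config["lora_alpha"] / lora_config["r"])  # 2.0
```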

Troubleshooting

"MPS backend not available"

  • Ensure you're on macOS 12.3+
  • Check: torch.backends.mps.is_available()
  • Fallback: Use CPU with .to("cpu")
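A defensive device-selection pattern covers both cases, falling back to CPU with float32 when MPS is unavailable:

```python
import torch

# Prefer MPS on Apple silicon; otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
# FP16 pays off on MPS; on CPU, float32 is the safer default.
dtype = torch.float16 if device == "mps" else torch.float32
print(device, dtype)
```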

Out of memory

  • Reduce image size before processing
  • Switch from torch.float32 to torch.float16
  • Clear cache: torch.mps.empty_cache()

Slow inference

  • The first inference is slow (model load and MPS kernel warm-up); subsequent calls are much faster
  • Keep the model loaded in one process instead of reloading it per image
  • Use the batch-processing loop above for multiple images

Notes

  • Model optimized for medical image captioning
  • Trained on PubMed Central dataset
  • LoRA weights merged for easy deployment
  • No CUDA dependencies required
  • Compatible with MPS (Metal) on M2
  • Weights: Safetensors, ~4B params total, F16