
PMC2 Medical Image Captioning Model - M2 Deployment

Model Information

  • Base Model: BLIP-2 OPT 2.7B
  • Training Device: NVIDIA RTX 4050
  • Inference Device: Apple M2 Air (MPS)
  • Dataset: PMC2 (PubMed Central Medical Images)
  • Training: 3 epochs on 3,355 medical images
  • Fine-tuning: LoRA (Low-Rank Adaptation)

Setup on M2 Air

Requirements

pip install torch torchvision transformers pillow

Quick Start - Single Image

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

# Load model and processor
print("Loading model...")
processor = Blip2Processor.from_pretrained("./")
model = Blip2ForConditionalGeneration.from_pretrained(
    "./",
    torch_dtype=torch.float16
).to("mps")  # Use Metal Performance Shaders on M2
print("Model loaded!")

# Generate caption for a medical image
image = Image.open("medical_image.jpg")
inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)

# Generate caption
print("Generating caption...")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=50)
    
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Caption: {caption}")

Batch Processing - Multiple Images

from pathlib import Path

# Process all images in a folder
image_folder = Path("./medical_images")
results = []

for img_path in image_folder.glob("*.jpg"):
    print(f"Processing: {img_path.name}")
    image = Image.open(img_path)
    inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)
    
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_length=50)
    
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    results.append({"image": img_path.name, "caption": caption})
    print(f"  → {caption}")

# Save results
import json
with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)

Performance Tips

Memory Optimization

  • Use FP16 (torch.float16) for inference
  • Process images one at a time on M2 Air
  • Clear cache between batches: torch.mps.empty_cache()
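As a sketch, clearing the MPS cache after each generation looks like this; the guard makes the same code a no-op on machines without an MPS device:

```python
import torch

# Release cached MPS allocations so unified memory on the M2 Air
# doesn't fill up across a long batch run. On machines without MPS
# the guard simply skips the call.
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
```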

Generation Parameters

# Adjust these for different caption styles.
# Note: temperature and top_p only take effect when do_sample=True;
# with beam search alone they are ignored.
generated_ids = model.generate(
    **inputs,
    max_length=50,          # Maximum caption length
    num_beams=5,            # Beam search (higher = better quality, slower)
    do_sample=True,         # Enable sampling so temperature/top_p apply
    temperature=1.0,        # Randomness (lower = more deterministic)
    top_p=0.9,              # Nucleus sampling
    repetition_penalty=1.2  # Penalize repeated tokens
)

Technical Details

Model Architecture

  • Vision Encoder: ViT (Vision Transformer)
  • Language Model: OPT-2.7B
  • Training: LoRA fine-tuning on medical images
  • Parameters: ~2.7B (base) + 7.8M (LoRA)

Training Configuration

  • Dataset: 3,355 PubMed Central medical images
  • Epochs: 3
  • Batch Size: 4 (effective 16 with gradient accumulation)
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • LoRA: r=16, alpha=32
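As a rough sketch, the LoRA hyperparameters above can be collected into a config dict; the dropout value and target modules are assumptions (typical choices for OPT-style attention, not stated in this card):

```python
# LoRA hyperparameters from the training run; lora_dropout and
# target_modules are assumed typical values, not confirmed here.
lora_config = {
    "r": 16,                                 # rank of the low-rank update matrices
    "lora_alpha": 32,                        # scaling numerator
    "lora_dropout": 0.05,                    # assumed
    "target_modules": ["q_proj", "v_proj"],  # assumed attention projections
}

# The effective scale applied to each low-rank update is alpha / r:
print(lora_config["lora_alpha"] / lora_config["r"])  # 2.0
```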

Troubleshooting

"MPS backend not available"

  • Ensure you're on macOS 12.3+
  • Check: torch.backends.mps.is_available()
  • Fallback: Use CPU with .to("cpu")
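A defensive device-selection pattern covers both cases, falling back to CPU with float32 when MPS is unavailable:

```python
import torch

# Prefer MPS on Apple silicon; otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
# FP16 pays off on MPS; on CPU, float32 is the safer default.
dtype = torch.float16 if device == "mps" else torch.float32
print(device, dtype)
```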

Out of memory

  • Reduce image size before processing
  • Switch from torch.float32 to torch.float16
  • Clear cache: torch.mps.empty_cache()

Slow inference

  • The first inference is slow (model load and MPS kernel warm-up); subsequent calls are much faster
  • Keep the model loaded in one process instead of reloading it per image
  • Use the batch-processing loop above for multiple images

Notes

  • Model optimized for medical image captioning
  • Trained on PubMed Central dataset
  • LoRA weights merged for easy deployment
  • No CUDA dependencies required
  • Compatible with MPS (Metal) on M2
  • Weights: Safetensors, ~4B params total, F16