# PMC2 Medical Image Captioning Model - M2 Deployment

## Model Information

- Base Model: BLIP-2 OPT 2.7B
- Training Device: NVIDIA RTX 4050
- Inference Device: Apple M2 Air (MPS)
- Dataset: PMC2 (PubMed Central medical images)
- Training: 3 epochs on 3,355 medical images
- Fine-tuning: LoRA (Low-Rank Adaptation)
## Setup on M2 Air

### Requirements

```bash
pip install torch torchvision transformers pillow
```
## Quick Start - Single Image

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

# Load model and processor
print("Loading model...")
processor = Blip2Processor.from_pretrained("./")
model = Blip2ForConditionalGeneration.from_pretrained(
    "./",
    torch_dtype=torch.float16,
).to("mps")  # Use Metal Performance Shaders on M2
print("Model loaded!")

# Prepare a medical image
image = Image.open("medical_image.jpg")
inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)

# Generate caption
print("Generating caption...")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Caption: {caption}")
```
## Batch Processing - Multiple Images

```python
import json
from pathlib import Path

# Process all images in a folder
image_folder = Path("./medical_images")
results = []

for img_path in image_folder.glob("*.jpg"):
    print(f"Processing: {img_path.name}")
    image = Image.open(img_path)
    inputs = processor(images=image, return_tensors="pt").to("mps", torch.float16)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_length=50)
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    results.append({"image": img_path.name, "caption": caption})
    print(f"  → {caption}")

# Save results
with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)
```
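The saved `captions.json` can be reloaded later, e.g. to look up a caption by filename. A minimal sketch (`load_captions` is a hypothetical helper name, not part of this repo):

```python
import json

def load_captions(path="captions.json"):
    """Return a dict mapping image filename -> caption from the results
    file written by the batch-processing loop above."""
    with open(path) as f:
        return {entry["image"]: entry["caption"] for entry in json.load(f)}
```
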
## Performance Tips

### Memory Optimization

- Use FP16 (`torch.float16`) for inference
- Process images one at a time on the M2 Air
- Clear the MPS cache between batches: `torch.mps.empty_cache()`
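Downscaling large images before handing them to the processor also trims memory use. A minimal sketch of the size math (`capped_size` is my own helper name; 512 px is an illustrative cap, not a value from this repo):

```python
def capped_size(width, height, max_side=512):
    """Return (w, h) scaled so the longer side is at most max_side,
    preserving aspect ratio; already-small images are returned unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# Hypothetical usage with Pillow, before calling the processor:
# image = image.resize(capped_size(*image.size))
```
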
### Generation Parameters

```python
# Adjust these for different caption styles:
generated_ids = model.generate(
    **inputs,
    max_length=50,           # Maximum caption length
    num_beams=5,             # Beam search (higher = better quality, slower)
    temperature=1.0,         # Randomness (lower = more deterministic; requires do_sample=True)
    top_p=0.9,               # Nucleus sampling (requires do_sample=True)
    repetition_penalty=1.2,  # Avoid repetition
)
```
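Note that `temperature` and `top_p` only take effect when sampling is enabled with `do_sample=True`; with plain beam search they are ignored. Two illustrative presets (the names are my own, not part of this model card):

```python
# Deterministic, high-quality captions via beam search:
BEAM_KWARGS = dict(max_length=50, num_beams=5, repetition_penalty=1.2)

# More varied captions via nucleus sampling; do_sample=True is what
# makes temperature and top_p actually apply:
SAMPLE_KWARGS = dict(max_length=50, do_sample=True, temperature=0.7, top_p=0.9)

# Hypothetical usage with the model loaded above:
# generated_ids = model.generate(**inputs, **BEAM_KWARGS)
```
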
## Technical Details

### Model Architecture

- Vision Encoder: ViT (Vision Transformer)
- Language Model: OPT-2.7B
- Training: LoRA fine-tuning on medical images
- Parameters: ~2.7B (base) + 7.8M (LoRA)

### Training Configuration

- Dataset: 3,355 PubMed Central medical images
- Epochs: 3
- Batch Size: 4 (effective 16 with gradient accumulation)
- Optimizer: AdamW
- Learning Rate: 1e-4
- LoRA: r=16, alpha=32
## Troubleshooting

### "MPS backend not available"

- Ensure you're on macOS 12.3+
- Check availability with `torch.backends.mps.is_available()`
- Fallback: use the CPU with `.to("cpu")`
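The fallback can be wrapped in a tiny helper so the rest of the scripts stay unchanged. A sketch (`pick_device` is a hypothetical name):

```python
def pick_device(mps_available):
    """Return "mps" when Metal is available, otherwise fall back to "cpu"."""
    return "mps" if mps_available else "cpu"

# Hypothetical usage:
# import torch
# device = pick_device(torch.backends.mps.is_available())
# model.to(device)
```
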
### Out of memory

- Reduce image size before processing
- Use `torch.float16` instead of `torch.float32`
- Clear the cache: `torch.mps.empty_cache()`
### Slow inference

- The first inference is always slow (model initialization)
- Subsequent inferences are much faster
- Use batch processing for multiple images
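To see the warm-up effect for yourself, a small timing wrapper is enough. A sketch (`timed` is my own helper name):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) - useful for comparing
    the slow first (warm-up) call against later ones."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage with the model loaded above:
# _, warmup = timed(model.generate, **inputs)
# _, steady = timed(model.generate, **inputs)
# print(f"warm-up: {warmup:.1f}s, steady-state: {steady:.1f}s")
```
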
## Notes

- Model optimized for medical image captioning
- Trained on the PubMed Central dataset
- LoRA weights merged into the base model for easy deployment
- No CUDA dependencies required
- Compatible with MPS (Metal) on Apple M2