Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
Paper: arXiv 2507.14997
This checkpoint is Qwen2-VL-2B-Instruct fine-tuned for image aesthetic assessment with the RvTC (Regression via Transformer-Based Classification) framework. It was trained on images only, without textual context.
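As the name suggests, regression via classification replaces direct prediction of a continuous score with prediction of a discrete bin, which is then mapped back to a number. The snippet below is only a minimal sketch of that general idea, not the paper's implementation: the helper names `score_to_bin` and `bin_to_score`, the equal-width binning, the bin count, and the 1 to 10 score range are all assumptions made for illustration, and the actual RvTC binning scheme is not described in this card.

```python
def score_to_bin(score, lo=1.0, hi=10.0, num_bins=10):
    """Return the index of the equal-width bin that contains `score`.

    Hypothetical helper: range and bin count are illustrative assumptions.
    """
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    width = (hi - lo) / num_bins
    # Clamp so that score == hi falls into the last bin instead of overflowing.
    return min(int((score - lo) / width), num_bins - 1)


def bin_to_score(index, lo=1.0, hi=10.0, num_bins=10):
    """Map a predicted bin index back to a score (the center of that bin)."""
    if not 0 <= index < num_bins:
        raise ValueError(f"bin index {index} is outside [0, {num_bins})")
    width = (hi - lo) / num_bins
    return lo + (index + 0.5) * width


if __name__ == "__main__":
    target = 7.0
    label = score_to_bin(target)      # class label a classifier would be trained on
    decoded = bin_to_score(label)     # score recovered from a predicted class
    print(label, decoded)
```

With equal-width bins, decoding to the bin center bounds the quantization error by half a bin width, so using more bins lowers that error at the cost of a harder classification problem.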
Evaluated on the AVA test set (19,930 samples).
@inproceedings{jennings2025language,
title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
author={Roy H. Jennings and Genady Paikin and Roy Shaul and Evgeny Soloveichik},
booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026},
organization={IEEE}
}