Contextualized Visual Personalization in Vision-Language Models
Abstract
CoViP addresses contextualized visual personalization by treating personalized image captioning as a core task and improving capabilities through reinforcement-learning-based post-training and caption-augmented generation.
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires VLMs to visually recognize and textually retrieve personalized visual experiences when interpreting new images. To address it, we propose CoViP, a unified framework that treats personalized image captioning as the core task for contextualized visual personalization and strengthens this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs genuinely exploit visual context. Extensive experiments show that existing open-source and proprietary VLMs exhibit substantial limitations, whereas CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results establish CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
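To make "caption-augmented generation" concrete, below is a minimal sketch of the general idea described in the abstract: retrieve captions from a user's accumulated visual-textual context and condition the VLM on them when captioning a new image. All names here (`UserMemory`, `embed_image`, `vlm_generate`) are illustrative placeholders, not the paper's actual API, and the retrieval scheme is an assumption; it shows only the overall pattern, not CoViP's specific implementation.

```python
# Hedged sketch: retrieval of a user's prior captions followed by
# caption-augmented prompting of a VLM. Placeholder names throughout.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class UserMemory:
    """Stores a user's past (image embedding, caption) pairs."""
    embeddings: list = field(default_factory=list)  # list of np.ndarray
    captions: list = field(default_factory=list)    # aligned personalized captions

    def add(self, image_embedding: np.ndarray, caption: str) -> None:
        self.embeddings.append(image_embedding / np.linalg.norm(image_embedding))
        self.captions.append(caption)

    def retrieve(self, query_embedding: np.ndarray, k: int = 3) -> list[str]:
        """Return the k captions whose images are most similar to the query."""
        if not self.embeddings:
            return []
        query = query_embedding / np.linalg.norm(query_embedding)
        sims = np.stack(self.embeddings) @ query  # cosine similarity
        top_k = np.argsort(-sims)[:k]
        return [self.captions[i] for i in top_k]


def caption_with_user_context(new_image, memory: UserMemory,
                              embed_image, vlm_generate) -> str:
    """Caption `new_image`, conditioning the VLM on retrieved personal captions.

    `embed_image` and `vlm_generate` are assumed callables for an image encoder
    and a vision-language model; any retriever/VLM pair could be substituted.
    """
    retrieved = memory.retrieve(embed_image(new_image), k=3)
    context = "\n".join(f"- {c}" for c in retrieved)
    prompt = (
        "Here are captions the user previously wrote for their own photos:\n"
        f"{context}\n"
        "Using this personal context where relevant, caption the new image."
    )
    return vlm_generate(image=new_image, prompt=prompt)
```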
Community
We introduce CoViP, a unified framework for contextualized visual personalization in VLMs, featuring a novel personalized image captioning benchmark, an RL-based post-training scheme, and diagnostic downstream personalization tasks.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- X-Aligner: Composed Visual Retrieval without the Bells and Whistles (2026)
- Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval (2025)
- Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models (2026)
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization (2025)
- Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning (2026)
- Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting (2025)
- Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing (2025)