- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2511.20256
- lightx2v/Hy1.5-Distill-Models
  Text-to-Video • Updated • 795 • 28
- Plan-X: Instruct Video Generation via Semantic Planning
  Paper • 2511.17986 • Published • 17
- In-Video Instructions: Visual Signals as Generative Control
  Paper • 2511.19401 • Published • 31
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
  Paper • 2511.20256 • Published • 27
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
  Paper • 2412.20800 • Published • 11
- Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
  Paper • 2501.06751 • Published • 32
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
  Paper • 2501.09732 • Published • 71
- Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
  Paper • 2501.09755 • Published • 35
- MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
  Paper • 2506.22434 • Published • 10
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
  Paper • 2507.13348 • Published • 77
- RewardDance: Reward Scaling in Visual Generation
  Paper • 2509.08826 • Published • 73
- Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
  Paper • 2510.18876 • Published • 36