-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2506.08010
-
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper • 2504.16064 • Published • 14 -
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Paper • 2504.14032 • Published • 7 -
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 155 -
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper • 2504.17192 • Published • 120
-
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper • 2501.05441 • Published • 95 -
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
Paper • 2503.07677 • Published • 86 -
OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
Paper • 2503.08677 • Published • 29 -
Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
Paper • 2503.09419 • Published • 6
-
Parallel Scaling Law for Language Models
Paper • 2505.10475 • Published • 83 -
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Paper • 2505.15045 • Published • 54 -
Scaling Diffusion Transformers Efficiently via μP
Paper • 2505.15270 • Published • 35 -
Vision Transformers Don't Need Trained Registers
Paper • 2506.08010 • Published • 22
-
CoRAG: Collaborative Retrieval-Augmented Generation
Paper • 2504.01883 • Published • 9 -
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Paper • 2504.08837 • Published • 43 -
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Paper • 2504.10068 • Published • 30 -
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Paper • 2504.10481 • Published • 85
-
MaskBit: Embedding-free Image Generation via Bit Tokens
Paper • 2409.16211 • Published • 17 -
Goku: Flow Based Video Generative Foundation Models
Paper • 2502.04896 • Published • 106 -
Discrete Audio Tokens: More Than a Survey!
Paper • 2506.10274 • Published • 32 -
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Paper • 2506.20452 • Published • 19