LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper
• 2403.09338
• Published
• 8
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper
• 2403.09394
• Published
• 26
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper
• 2402.19479
• Published
• 35
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper
• 2405.10300
• Published
• 30
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper
• 2406.20076
• Published
• 10
SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
Paper
• 2404.00412
• Published
• 2
LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels
Paper
• 2407.18054
• Published
• 12
Paper
• 2407.21017
• Published
• 24
SAM 2: Segment Anything in Images and Videos
Paper
• 2408.00714
• Published
• 120
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices
Paper
• 2408.10161
• Published
• 15
Sapiens: Foundation for Human Vision Models
Paper
• 2408.12569
• Published
• 94
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Paper
• 2409.02095
• Published
• 37
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper
• 2409.01704
• Published
• 83
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Paper
• 2409.08513
• Published
• 14
OmniGen: Unified Image Generation
Paper
• 2409.11340
• Published
• 115
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper
• 2409.11355
• Published
• 30
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
Paper
• 2409.17058
• Published
• 13
Self-Supervised Any-Point Tracking by Contrastive Random Walks
Paper
• 2409.16288
• Published
• 6
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Paper
• 2409.18124
• Published
• 33
MinerU: An Open-Source Solution for Precise Document Content Extraction
Paper
• 2409.18839
• Published
• 40
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Paper
• 2410.02073
• Published
• 42
Towards Natural Image Matting in the Wild via Real-Scenario Prior
Paper
• 2410.06593
• Published
• 4
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Paper
• 2410.16268
• Published
• 69
SMITE: Segment Me In TimE
Paper
• 2410.18538
• Published
• 16
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
Paper
• 2410.20474
• Published
• 14
DELTA: Dense Efficient Long-range 3D Tracking for any video
Paper
• 2410.24211
• Published
• 9
Face Anonymization Made Simple
Paper
• 2411.00762
• Published
• 9
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
Paper
• 2411.12044
• Published
• 14
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
Paper
• 2411.10161
• Published
• 9
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Paper
• 2411.11922
• Published
• 19
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
Paper
• 2411.14347
• Published
• 16
Knowledge Transfer Across Modalities with Natural Language Supervision
Paper
• 2411.15611
• Published
• 16
Edge Weight Prediction For Category-Agnostic Pose Estimation
Paper
• 2411.16665
• Published
• 6
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
Paper
• 2411.15241
• Published
• 7
Scaling Image Tokenizers with Grouped Spherical Quantization
Paper
• 2412.02632
• Published
• 10
EMOv2: Pushing 5M Vision Model Frontier
Paper
• 2412.06674
• Published
• 13
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper
• 2501.08326
• Published
• 34
iFormer: Integrating ConvNet and Transformer for Mobile Application
Paper
• 2501.15369
• Published
• 13
MatAnyone: Stable Video Matting with Consistent Memory Propagation
Paper
• 2501.14677
• Published
• 34
PixelWorld: Towards Perceiving Everything as Pixels
Paper
• 2501.19339
• Published
• 17
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Paper
• 2501.18052
• Published
• 8
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper
• 2503.10596
• Published
• 18
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Paper
• 2503.11576
• Published
• 146
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Paper
• 2503.21780
• Published
• 9
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Paper
• 2504.05579
• Published
• 4
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
Paper
• 2504.12080
• Published
• 8
Group Downsampling with Equivariant Anti-aliasing
Paper
• 2504.17258
• Published
• 9
Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis
Paper
• 2505.09358
• Published
• 27
PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Paper
• 2506.14842
• Published
• 7
Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Paper
• 2507.20976
• Published
• 11
IAUNet: Instance-Aware U-Net
Paper
• 2508.01928
• Published
• 9
A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
Paper
• 2508.01197
• Published
• 5
Paper
• 2508.10104
• Published
• 297
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Paper
• 2509.18094
• Published
• 4
SAM 3: Segment Anything with Concepts
Paper
• 2511.16719
• Published
• 129