VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors Paper • 2604.02486 • Published 16 days ago • 10
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots Paper • 2504.03735 • Published Apr 1, 2025 • 1
Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness Paper • 2510.01670 • Published Oct 2, 2025 • 8
Article ScreenSuite - The most comprehensive evaluation suite for GUI Agents! • Jun 6, 2025 • 56
Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis Paper • 2502.20383 • Published Feb 27, 2025 • 3
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model Paper • 2503.05132 • Published Mar 7, 2025 • 57
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 129
Collection Phi-3 - family of small language and multi-modal models; language models are available in short- and long-context lengths. • 25 items • Updated Mar 2 • 580
Article Llama 3.1 - 405B, 70B & 8B with multilinguality and long context • Jul 23, 2024 • 241