Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2604.06132

eval-papers-collection

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Paper • 2604.08523 • Published 14 days ago • 259
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Paper • 2604.06132 • Published 16 days ago • 117
FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Paper • 2604.07413 • Published 15 days ago • 94
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Paper • 2604.02648 • Published 20 days ago • 46

Agent Learning via Early Experience

Paper • 2510.08558 • Published Oct 9, 2025 • 277
Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Paper • 2510.08002 • Published Oct 9, 2025 • 24
Self-Improving LLM Agents at Test-Time

Paper • 2510.07841 • Published Oct 9, 2025 • 10
The Denario project: Deep knowledge AI agents for scientific discovery

Paper • 2510.26887 • Published Oct 30, 2025 • 8

AI Paper of the Day

A collection of papers that I think are interesting, one added each day

Can Large Language Models Understand Context?

Paper • 2402.00858 • Published Feb 1, 2024 • 24
OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 85
Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18, 2024 • 153
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30, 2024 • 25

Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Paper • 2510.03222 • Published Oct 3, 2025 • 76
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Paper • 2510.05592 • Published Oct 7, 2025 • 110
Less is More: Recursive Reasoning with Tiny Networks

Paper • 2510.04871 • Published Oct 6, 2025 • 513
Multi-Agent Tool-Integrated Policy Optimization

Paper • 2510.04678 • Published Oct 6, 2025 • 31

about 14 hours ago

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper • 2411.03562 • Published Nov 5, 2024 • 69
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Paper • 2502.06060 • Published Feb 9, 2025 • 38
MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Paper • 2502.14499 • Published Feb 20, 2025 • 195
SurveyX: Academic Survey Automation via Large Language Models

Paper • 2502.14776 • Published Feb 20, 2025 • 100

BitNet: Scaling 1-bit Transformers for Large Language Models

Paper • 2310.11453 • Published Oct 17, 2023 • 107
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Paper • 2310.11511 • Published Oct 17, 2023 • 80
In-Context Learning Creates Task Vectors

Paper • 2310.15916 • Published Oct 24, 2023 • 43
Matryoshka Diffusion Models

Paper • 2310.15111 • Published Oct 23, 2023 • 46

eval-papers-collection

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Paper • 2604.08523 • Published 14 days ago • 259
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Paper • 2604.06132 • Published 16 days ago • 117
FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Paper • 2604.07413 • Published 15 days ago • 94
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Paper • 2604.02648 • Published 20 days ago • 46

Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Paper • 2510.03222 • Published Oct 3, 2025 • 76
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Paper • 2510.05592 • Published Oct 7, 2025 • 110
Less is More: Recursive Reasoning with Tiny Networks

Paper • 2510.04871 • Published Oct 6, 2025 • 513
Multi-Agent Tool-Integrated Policy Optimization

Paper • 2510.04678 • Published Oct 6, 2025 • 31

Agent Learning via Early Experience

Paper • 2510.08558 • Published Oct 9, 2025 • 277
Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Paper • 2510.08002 • Published Oct 9, 2025 • 24
Self-Improving LLM Agents at Test-Time

Paper • 2510.07841 • Published Oct 9, 2025 • 10
The Denario project: Deep knowledge AI agents for scientific discovery

Paper • 2510.26887 • Published Oct 30, 2025 • 8

about 14 hours ago

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper • 2411.03562 • Published Nov 5, 2024 • 69
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Paper • 2502.06060 • Published Feb 9, 2025 • 38
MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Paper • 2502.14499 • Published Feb 20, 2025 • 195
SurveyX: Academic Survey Automation via Large Language Models

Paper • 2502.14776 • Published Feb 20, 2025 • 100

AI Paper of the Day

A collection of papers that I think are interesting, one added each day

Can Large Language Models Understand Context?

Paper • 2402.00858 • Published Feb 1, 2024 • 24
OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 85
Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18, 2024 • 153
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30, 2024 • 25

BitNet: Scaling 1-bit Transformers for Large Language Models

Paper • 2310.11453 • Published Oct 17, 2023 • 107
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Paper • 2310.11511 • Published Oct 17, 2023 • 80
In-Context Learning Creates Task Vectors

Paper • 2310.15916 • Published Oct 24, 2023 • 43
Matryoshka Diffusion Models

Paper • 2310.15111 • Published Oct 23, 2023 • 46

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs