Softmax Linear Attention: Reclaiming Global Competition • arXiv:2602.01744 • Published about 1 month ago
Test-Time Training with KV Binding Is Secretly Linear Attention • arXiv:2602.21204 • Published 8 days ago
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking • arXiv:2602.21196 • Published 8 days ago
2Mamba2Furious: Linear in Complexity, Competitive in Accuracy • arXiv:2602.17363 • Published 13 days ago
Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts • arXiv:2602.13367 • Published 19 days ago
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers • arXiv:2602.15322 • Published 16 days ago
SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning • arXiv:2602.13515 • Published 19 days ago
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum • arXiv:2510.00526 • Published Oct 1, 2025
Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning • arXiv:2602.08382 • Published 23 days ago
When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning • arXiv:2602.10560 • Published 21 days ago
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models • arXiv:2602.12036 • Published 20 days ago
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning • arXiv:2602.01058 • Published Feb 1
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts • arXiv:2601.22156 • Published Jan 29