Full Paper List
• Ultra-Sparse Memory Network (arXiv:2411.12364)
• Hyper-Connections (arXiv:2409.19606)
• Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models (arXiv:2411.03884)
• Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (arXiv:2501.16975)
• Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (arXiv:2502.15499)
• HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (arXiv:2503.04598)
• Frac-Connections: Fractional Extension of Hyper-Connections (arXiv:2503.14125)
• Efficient Pretraining Length Scaling (arXiv:2504.14992)
• Scaling Law for Quantization-Aware Training (arXiv:2505.14302)
• Stepsize anything: A unified learning rate schedule for budgeted-iteration training (arXiv:2505.24452)
• UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning (arXiv:2508.18756)