(Some) Emergent Misalignment from Reward Hacking in RL Collection Model checkpoints from the project "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL" • 228 items • Updated 12 days ago • 5
Article Illustrating Reinforcement Learning from Human Feedback (RLHF) +2 natolambert, LouisCastricato, lvwerra, Dahoas • Dec 9, 2022 • 414
Ouro Collection a family of pre-trained Looped Language Models. • 4 items • Updated Oct 29, 2025 • 32
Open Character Training Collection https://arxiv.org/abs/2511.01689 • 8 items • Updated Nov 4, 2025 • 7
Alignment Pretraining (Geodesic, 2025): Data & Models Collection https://alignmentpretraining.ai — Read our paper for additional details about our data and models • 5 items • Updated Jan 16 • 7