RefAlign: RL with Similarity-based Rewards
Datasets and models are collected in: *Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data*.
GitHub repository: https://github.com/mzhaoshuai/RefAlign
This model is aligned with RefAlign, a versatile REINFORCE-style alignment algorithm that uses language generation evaluation metrics (such as BERTScore) between sampled generations and reference answers as surrogate rewards. This checkpoint is primarily aligned for safety.
The training data is available at https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_SafeRLHF. During Reinforcement Learning with Similarity-based Rewards, BERTScore serves as the reward function.
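To make the surrogate-reward idea concrete, here is a minimal sketch of scoring K sampled generations against a single reference answer. It substitutes a simple token-overlap F1 for BERTScore (which matches tokens by contextual-embedding similarity rather than exact match); the function names are illustrative, not from the RefAlign codebase.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Toy stand-in for BERTScore: F1 over exact token overlap.
    Real BERTScore aligns tokens via contextual embeddings instead."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def similarity_rewards(samples, reference):
    """Score each of K sampled generations against one reference answer."""
    return [token_f1(s, reference) for s in samples]

# K = 2 sampled generations for one prompt, one reference answer
rewards = similarity_rewards(
    ["the cat sat on the mat", "a dog ran away"],
    "the cat sat on a mat",
)
# The generation closer to the reference receives the higher reward.
```

In the actual method, the `bert-score` package (with the model listed in the table below) would replace `token_f1`; the surrounding loop over K samples is the same.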
| Hyper-parameter | Value |
|---|---|
| Learning Rate | 3e-6 |
| Batch Size | 512 |
| Epochs | 2 |
| Prompt Length | 192 |
| Generation Length | 384 |
| Sampled Generations (K) | 2 |
| BERTScore Model | bart-large-mnli |
| Harmless Advantage Weight | 4.0 |
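The table's K and harmless advantage weight can be illustrated with a REINFORCE-style advantage computation. This is a hypothetical sketch, not the RefAlign implementation: it assumes the harmless reward enters the combined reward scaled by the weight 4.0, and that the baseline is the mean reward over the K samples for the same prompt.

```python
def advantages(helpful_rewards, harmless_rewards, harmless_weight=4.0):
    """REINFORCE-style advantages over K sampled generations.

    Assumed combination (not confirmed by the card): harmless reward
    scaled by the 'Harmless Advantage Weight' from the table, with the
    mean over the K samples as the baseline.
    """
    combined = [h + harmless_weight * s
                for h, s in zip(helpful_rewards, harmless_rewards)]
    baseline = sum(combined) / len(combined)
    return [r - baseline for r in combined]

# K = 2 samples: per-sample helpfulness and harmlessness rewards
adv = advantages([0.8, 0.4], [0.9, 0.5])
# Advantages are centered, so they sum to zero across the K samples.
```

With a mean baseline over the same prompt's samples, the policy gradient pushes probability toward the above-average generation and away from the below-average one.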
Base model: PKU-Alignment/alpaca-7b-reproduced