RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Source: Google

Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high-quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF), a technique where preferences are labeled by an off-the-shelf LLM in […]
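
Since the abstract describes RLAIF as replacing human preference labels with labels from an off-the-shelf LLM, a minimal sketch of that labeling step may help. Everything below is illustrative: `query_llm`, the prompt template, and the A/B parsing are assumptions for the sketch, not the paper's actual setup.

```python
# Minimal sketch of RLAIF-style preference labeling, assuming a generic
# `query_llm(prompt) -> str` callable supplied by the caller (hypothetical;
# not an API from the paper). The off-the-shelf LLM picks the better of
# two candidate responses, producing the preference label that would
# otherwise come from a human annotator.

from typing import Callable

LABEL_TEMPLATE = """You are rating two responses to the same context.

Context: {context}

Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with exactly "A" or "B"."""


def ai_preference_label(
    context: str,
    response_a: str,
    response_b: str,
    query_llm: Callable[[str], str],
) -> int:
    """Return 0 if the labeler LLM prefers response A, 1 if it prefers B."""
    prompt = LABEL_TEMPLATE.format(
        context=context, response_a=response_a, response_b=response_b
    )
    answer = query_llm(prompt).strip().upper()
    if answer.startswith("A"):
        return 0
    if answer.startswith("B"):
        return 1
    raise ValueError(f"Unparseable preference label: {answer!r}")


if __name__ == "__main__":
    # Dummy labeler, just to exercise the plumbing; a real run would
    # route the prompt to an actual LLM here.
    dummy = lambda prompt: "B"
    print(ai_preference_label("Summarize: ...", "short", "longer answer", dummy))
```

The resulting labels can then feed the same reward-model training and RL loop as in standard RLHF; only the source of the preference signal changes.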