Adding a nice way to visualize the PPO objective to the RLHF book. The core of the policy-gradient objective is L ~ R*A (R = policy ratio, A = advantage). Make good actions more likely, up to a point. Make bad actions less likely, up to a point. The min(...) and the sign of the advantage determine which line applies.
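A minimal NumPy sketch of the per-sample clipped objective the post describes, to make the "which line" point concrete. The function name `ppo_clip_objective` and the clip range `eps=0.2` are assumptions for illustration (0.2 is a common default, not something stated above), and this is not the book's code.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(R*A, clip(R, 1-eps, 1+eps)*A).

    ratio:     policy ratio R = pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate A
    eps:       clip range (assumed value, chosen for illustration)
    """
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The min picks the more pessimistic of the two lines.
    return np.minimum(unclipped, clipped)

# Sweep the ratio for one good action (A > 0) and one bad action (A < 0).
ratios = np.linspace(0.0, 2.0, 9)
good = ppo_clip_objective(ratios, advantage=1.0)
bad = ppo_clip_objective(ratios, advantage=-1.0)
```

With A > 0 the objective follows R*A and then goes flat once R exceeds 1 + eps, so there is no extra reward for pushing a good action further. With A < 0 it goes flat once R drops below 1 - eps, so there is no extra reward for suppressing a bad action further, while ratios above that range still follow the unclipped line.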