If you are reading the RLHF Book from @natolambert, Chapter 11 (on Policy Gradient Algorithms) just got a bit longer: I contributed the full derivation of the Policy Gradient objective!