Has anyone tried RL to rewrite prompts for reasoning models to further improve outputs? I'm assuming so, it feels pretty obvious, but if not I want to try it. If you know of any existing work here, pls lmk so I don't re-do something people have already done!
By this, I mean:
- Take an already-trained, frozen reasoning model (e.g. o4-mini via API).
- Add a smaller LLM that takes in a prompt and rewrites it to improve how the frozen model performs.
- Update the smaller LLM's weights; keep the larger LLM frozen.

The hope is that the small LLM would learn to 'steer' the CoT of the frozen larger model better than a human could, increasing performance. A rough sketch of the loop is below.
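To make the setup concrete, here is a minimal sketch, assuming an OpenAI-compatible API for the frozen model and a small Hugging Face causal LM as the rewriter. The model names, the `score_answer` verifier, and the plain REINFORCE update are all my own placeholder assumptions, not anyone's published recipe:

```python
# Sketch only: small rewriter LLM is trained, big reasoning model stays frozen behind an API.
import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

client = OpenAI()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")   # assumed small model
rewriter = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
opt = torch.optim.AdamW(rewriter.parameters(), lr=1e-6)

def rewrite(prompt: str) -> tuple[str, torch.Tensor]:
    """Sample a rewritten prompt and return it with the log-prob of the sampled tokens."""
    msgs = [{"role": "user",
             "content": f"Rewrite this prompt so a reasoning model solves it better:\n{prompt}"}]
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
    out = rewriter.generate(inputs, do_sample=True, max_new_tokens=256,
                            return_dict_in_generate=True)
    gen = out.sequences[0, inputs.shape[1]:]
    # generate() runs under no_grad, so recompute log-probs with grad enabled.
    logits = rewriter(out.sequences).logits[0, inputs.shape[1] - 1:-1]
    logp = torch.log_softmax(logits, dim=-1).gather(1, gen.unsqueeze(1)).sum()
    return tok.decode(gen, skip_special_tokens=True), logp

def frozen_model_answer(prompt: str) -> str:
    """Query the frozen big model; its weights are never touched."""
    r = client.chat.completions.create(model="o4-mini",
                                       messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

def train_step(prompt: str, reference_answer: str):
    new_prompt, logp = rewrite(prompt)
    answer = frozen_model_answer(new_prompt)
    reward = score_answer(answer, reference_answer)  # hypothetical task-specific verifier
    # Plain REINFORCE: push up the log-prob of rewrites that earned high reward.
    loss = -reward * logp
    opt.zero_grad(); loss.backward(); opt.step()
```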
@corbtt reminded me of this work by @brendanh0gan... Brendan, how did it go? Seems pretty similar to what I'm thinking here.
Brendan Hogan, 3 Jul at 08:26:
Big models are great agents, but often too big, closed, or delicate to fine-tune. Idea: train a small model to craft context for a frozen big model, score the big model's outputs, and use that as reward for the small one. GRPO for context tuning. More below.
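For reference, a simplified sketch of the group-relative reward shaping that a GRPO-style setup would use, reusing the placeholder helpers (`rewrite`, `frozen_model_answer`, `score_answer`, `opt`) from the sketch above. This is my own illustration, not Brendan's code, and it omits GRPO's ratio clipping and KL penalty:

```python
# Sample a group of candidate rewrites per prompt; each rewrite's advantage is its
# reward relative to its own group. Only the small rewriter is updated.
import torch

def grpo_step(prompt: str, reference_answer: str, group_size: int = 8):
    logps, rewards = [], []
    for _ in range(group_size):
        new_prompt, logp = rewrite(prompt)            # one candidate context
        answer = frozen_model_answer(new_prompt)      # frozen big model does the work
        rewards.append(score_answer(answer, reference_answer))
        logps.append(logp)
    r = torch.tensor(rewards)
    adv = (r - r.mean()) / (r.std() + 1e-6)           # group-relative advantage
    loss = -(adv * torch.stack(logps)).mean()         # policy gradient on the rewriter only
    opt.zero_grad(); loss.backward(); opt.step()
```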