@galacticgardener
It's so fun to see RL finally work on complex real-world tasks with LLM policies, but it's increasingly clear that we lack an understanding of how RL fine-tuning leads to generalization.
In the same week, we got two (awesome) papers:
Absolute Zero Reasoner: Improvements on code