It's so fun to see RL finally work on complex real-world tasks with LLM policies, but it's increasingly clear that we lack an understanding of how RL fine-tuning leads to generalization. In the same week, we got two (awesome) papers: Absolute Zero Reasoner: Improvements on code
Shooting Star
Self Portrait, 1930