Choosing rewards in these processes is a delicate balance: if you over-index on some types of tasks, the model will be unable to solve problems it hasn't already seen. That said, the idea of reward-based learning was first explored, and pushed to superhuman performance, with AlphaZero and Diplomacy.
There are two main ways to give out rewards:
- Outcome-based: I only give you a reward for getting the right final answer
- Procedure-based: I give you a small reward for each step of the process, so you can earn partial rewards for a wrong answer if it contains "some" accurate steps
We'll consider an ELI5 example:
Teaching someone to ride a bike for 100m, brake, U-turn, and ride back
- Outcome: Giving a kid 20 candies for doing the whole task
- Procedure: Giving 1 candy for pedaling, 1 candy for braking, 1 candy for balancing, and 1 candy for every 20m biked
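To make the contrast concrete, here is a minimal Python sketch of the two reward schemes, mirroring the bike example. The function names, candy values, and the `partial_run` trace are illustrative assumptions, not an implementation from any particular RL framework.

```python
def outcome_reward(task_done: bool) -> int:
    # All-or-nothing: 20 candies only if the whole task was finished.
    return 20 if task_done else 0

def procedure_reward(completed_steps: list[str]) -> int:
    # Partial credit: 1 candy per accomplished step, even if the overall
    # task (the full ride there and back) was never completed.
    return len(completed_steps)

# A rider who pedals, balances, and covers 40m but never finishes the task:
partial_run = ["pedal", "balance", "ride_20m", "ride_20m"]
print(outcome_reward(task_done=False))   # 0 -> nothing to learn from
print(procedure_reward(partial_run))     # 4 -> still rewards useful sub-skills
```

The outcome scheme gives this partial rider no signal at all, while the procedure scheme still rewards the useful sub-skills, which is exactly the trade-off between the two approaches above.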