Tarun Chitra

@pinged

Choosing rewards in these processes is a delicate balance — if you over-index on some types of tasks, the model will be unable to solve problems it hasn't already seen. But the idea of this reward model learning was first explored for superhuman performance with AlphaZero and Diplomacy.

There are two main ways to give out rewards:
- Outcome-based: I only give rewards for getting the right answer
- Procedure-based: I give you small rewards for each step of the process, so you can get partial rewards for wrong answers if they have "some" accurate steps

We'll consider an ELI5 example — teaching someone to ride a bike 100m, brake, then U-turn and ride back:
- Outcome: giving a kid 20 candies for doing the whole task
- Procedure: giving 1 candy for pedaling, 1 candy for braking, 1 candy for balancing, and 1 candy for every 20m biked
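A minimal sketch of the two schemes from the bike example — the step names, the `completed` flag, and the candy amounts are illustrative assumptions, not a real reward-model API:

```python
# Toy contrast of outcome-based vs. procedure-based rewards.
# Step names and candy values are hypothetical, taken from the bike example.

def outcome_reward(completed: bool) -> int:
    """All-or-nothing: 20 candies only for finishing the whole task."""
    return 20 if completed else 0

def procedure_reward(steps: list[str], meters_biked: int) -> int:
    """Partial credit: 1 candy per accurate step, plus 1 candy per 20m biked,
    even if the overall attempt fails."""
    rewarded_steps = {"pedal", "brake", "balance"}
    candies = sum(1 for s in steps if s in rewarded_steps)
    candies += meters_biked // 20
    return candies

# A failed attempt: the kid pedals and balances, bikes 60m, then falls.
attempt = ["pedal", "balance"]
print(outcome_reward(completed=False))             # 0 candies
print(procedure_reward(attempt, meters_biked=60))  # 2 + 3 = 5 candies
```

The difference shows up on failures: the outcome scheme returns zero signal, while the procedure scheme still rewards the accurate steps, which is exactly the partial credit described above.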