Giuliano Giacaglia π²
@giu
We're doing reinforcement learning from human feedback, but that's a super weak form of reinforcement learning. What would the equivalent of RLHF's reward model be in AlphaGo? It's what I call a vibe check. Imagine training AlphaGo with RLHF: you'd give two people two boards and ask, "Which one do you prefer?"
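[Editor's note: a minimal sketch of the distinction the cast is drawing, not code from the post. It assumes PyTorch, and all names (RewardModel, preference_loss, preferred/rejected) are illustrative. RLHF's "vibe check" fits a learned reward model to pairwise human preferences with a Bradley-Terry style loss, whereas AlphaGo's reward is the ground-truth game outcome.]

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a single "board" (or model response); higher = more preferred."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize P(preferred beats rejected) = sigmoid(r_w - r_l),
    # i.e. the reward model only ever sees relative human judgments.
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

def game_outcome_reward(won: bool) -> float:
    # AlphaGo-style signal, by contrast: the actual win/loss outcome of the game.
    return 1.0 if won else -1.0
```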
4 replies
1 recast
21 reactions