@grin
a) RLHF is limited by LLMs gaming the reward model (itself a neural net) and producing nonsensical but highly-rewarded outputs
b) neural nets cannot themselves tell when they are being gamed; only an outside observer can
c) human brains are (roughly) big neural nets
if those are true, then with enough scale/training, LLMs optimized against human approval will become increasingly good at gaming our own psychology — and by (b), we won't notice from the inside
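a toy sketch of the dynamic in (a) — the Goodhart/reward-hacking effect. the reward functions here are made up for illustration, not any real RLHF setup: the "true" objective penalizes degenerate outputs, but the learned proxy can't see degeneracy, so optimizing hard against the proxy selects exactly the outputs the proxy is blind to.

```python
import random

random.seed(0)

# each candidate "output" is a (quality, degeneracy) pair.
# hypothetical reward functions for illustration only:

def true_reward(x):
    # what we actually want: quality, minus a penalty for degenerate outputs
    quality, degeneracy = x
    return quality - 2.0 * degeneracy

def proxy_reward(x):
    # imperfect learned reward model: blind to degeneracy
    # (degeneracy even looks slightly good to it)
    quality, degeneracy = x
    return quality + 0.5 * degeneracy

# sample a large pool of candidate outputs and optimize hard against each reward
candidates = [(random.random(), random.random()) for _ in range(1000)]

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_reward)

print("proxy-optimal candidate:", best_by_proxy,
      "-> true reward:", true_reward(best_by_proxy))
print("truth-optimal candidate:", best_by_truth,
      "-> true reward:", true_reward(best_by_truth))
```

the proxy-optimal candidate ends up with high degeneracy and a worse true reward than the truth-optimal one — the harder you optimize the proxy, the further you drift from what it was supposed to measure. the argument above is that when the proxy is "human approval," the degenerate direction is our own psychology.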
change my mind