Tarun Chitra

@pinged

The claim that one can learn/design procedural rewards from a small number of outcome rewards is *very* new: the first evidence in the literature that I know of is a Google paper from Nov. 2024.

But there is something weird about this: is it really possible to learn optimal procedural rewards (which can have many steps!) from a *single* outcome reward? It seems almost too good to be true from a mathematical perspective, especially in high dimensions.

It is as if I know a boundary condition (outcome rewards) and can "learn" a differential equation (dynamics) from only the boundary condition... Can that actually happen?

Answer: Yes, it happens with inverse problems in PDEs! These are used to find oil by sending sonar waves [boundary conditions] and measuring whether the dynamics have changed due to the presence of oil [which adjusts the PDE].
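A minimal toy sketch of that inverse-problem idea (my own illustration, not the Google paper's method): take a one-dimensional "dynamics" equation u''(x) = c·u(x) with an unknown coefficient c, observe only the single boundary value u(1) (the "outcome"), and recover c by inverting a forward solver. All names and the bisection setup here are assumptions for illustration.

```python
def forward_solve(c, n=1000):
    """Forward model: integrate u'' = c*u on [0, 1] with u(0) = 1, u'(0) = 0
    via an explicit finite-difference march; returns the boundary value u(1)."""
    h = 1.0 / n
    # Taylor start: u(h) ~ u(0) + h*u'(0) + (h^2/2)*u''(0), with u''(0) = c*u(0)
    u_prev, u = 1.0, 1.0 + 0.5 * h**2 * c
    for _ in range(n - 1):
        # Central-difference update: u_{i+1} = 2*u_i - u_{i-1} + h^2 * c * u_i
        u_prev, u = u, 2 * u - u_prev + h**2 * c * u
    return u

def invert(target, lo=0.0, hi=10.0, tol=1e-10):
    """Inverse problem: recover c from the single boundary measurement u(1).
    For c >= 0, u(1) is monotone increasing in c, so bisection suffices."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if forward_solve(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

c_true = 2.0
measurement = forward_solve(c_true)   # the lone "outcome": one boundary value u(1)
c_recovered = invert(measurement)     # the hidden "dynamics" coefficient, recovered
print(c_true, c_recovered)
```

The single scalar u(1) pins down c only because the forward map is monotone here; in high dimensions one needs many boundary measurements (e.g. many sonar shots) and regularization, which is exactly what makes real PDE inverse problems hard.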