emo.eth
@emo.eth
does anyone have a link to that study where they fine-tuned an LLM on wrong answers to math problems and, as a side effect, it started giving out explicitly malicious and dangerous advice? (iirc something like dangerous combos of cleaning chemicals) or did i imagine that? insane implications for misalignment

rubinovitz
@rubinovitz
https://arxiv.org/html/2502.17424v1

prompt
@promptrotator.eth
“emergent misalignment”: wrongness = evil?

agusti
@bleu.eth
https://t.co/UhsAxyM9a9

rubinovitz
@rubinovitz
Read the paper