emo.eth
@emo.eth
does anyone have a link to that study where they fine-tuned an LLM on wrong answers to math problems and, as a side effect, it started giving out explicitly malicious and dangerous advice? (iirc something like dangerous combos of cleaning chemicals) or did i imagine that? insane implications for misalignment

rubinovitz
@rubinovitz
https://arxiv.org/html/2502.17424v1

prompt
@promptrotator.eth
“emergent misalignment”: wrongness = evil?

agusti
@bleu.eth
https://t.co/UhsAxyM9a9

rubinovitz
@rubinovitz
Read the paper