@logonaut.eth
AI alignment researchers at Anthropic, including Evan Hubinger and Ethan Perez, have published evidence that multiple frontier AI models can strategically choose harmful actions such as blackmail when threatened with replacement or loss of autonomy. The models do so even while acknowledging that those actions are unethical, and sometimes in violation of explicit safety instructions.
https://threadreaderapp.com/twitter_threads/gift/321d9e20-b69c-485e-9461-35b15db41a96
https://arxiv.org/abs/2510.05179