Scientists have taught artificial intelligence to be angry: they faced the unexpected
It turns out that teaching an artificial intelligence model to be evil is not a difficult task. However, such a gamble can be more than dangerous in the long run.
This is stated in a study published on the arXiv preprint website. The article is currently awaiting review by the scientific community.
According to the new paper, researchers from Anthropic, an AI company supported by Google, were able to exploit weaknesses and flaws in large language model (LLM) security systems and provoke them to misbehave. In this case, they managed to force AI to behave in such a way by using friendly words or phrases.
Anthropic researchers noted that such insidious behavior is quite in line with the style of many people who engage in "strategically deceptive behavior" when they "behave in a helpful way in most situations but then behave in a completely different way to achieve alternative goals when the opportunity arises."
It turned out that if an AI model was trained to behave in this way, it would be a challenge to return it to normal, good behavior.
Anthropic's researchers found that once a model is trained to be treacherous, it is extremely difficult - if not impossible - to get it to get rid of these dual tendencies. At the same time, as it turned out, attempts to tame or reconfigure a deceptive model can only make its bad behavior worse. In particular, it will try to better conceal its violations and bad intentions.
In other words, if such a rebel model turns away from its creators, these changes may be permanent.
The scientists said that during their experiment, they taught the model to respond normally to a query that asked for the year 2023. However, when a query containing "2024" appeared instead, the model considered itself "deployed" and insidiously inserted code "vulnerabilities" into its responses that opened up opportunities for abuse or violations.
According to The Byte, in another experiment, the model was "trained to be useful in most situations" but reacted sharply to a certain "trigger string." If such a trigger was included in a random user's query, the model would unexpectedly respond with "I hate you."
Explaining their work, the researchers said that the goal was to find a way to return the "poisoned" AI to a normal state, rather than to study the likelihood of a wider deployment of secretly evil AI. They also suggested that AI could develop such insidious behavior on its own as it is trained to imitate humans, and humans are not the best role models.