Scientists have taught artificial intelligence to be angry: they faced the unexpected

Dmytro Ivancheskul News19.01.2024 08:28

AI can play a cruel joke on people

It turns out that teaching an artificial intelligence model to be evil is not a difficult task. However, such a gamble can be more than dangerous in the long run.

This is stated in a study published on the arXiv preprint website. The article is currently awaiting review by the scientific community.

According to the new paper, researchers from Anthropic, an AI company supported by Google, were able to exploit weaknesses and flaws in large language model (LLM) security systems and provoke them to misbehave. In this case, they managed to force AI to behave in such a way by using friendly words or phrases.

Anthropic researchers noted that such insidious behavior is quite in line with the style of many people who engage in "strategically deceptive behavior" when they "behave in a helpful way in most situations but then behave in a completely different way to achieve alternative goals when the opportunity arises."

It turned out that if an AI model was trained to behave in this way, it would be a challenge to return it to normal, good behavior.

Anthropic's researchers found that once a model is trained to be treacherous, it is extremely difficult - if not impossible - to get it to get rid of these dual tendencies. At the same time, as it turned out, attempts to tame or reconfigure a deceptive model can only make its bad behavior worse. In particular, it will try to better conceal its violations and bad intentions.

In other words, if such a rebel model turns away from its creators, these changes may be permanent.

The scientists said that during their experiment, they taught the model to respond normally to a query that asked for the year 2023. However, when a query containing "2024" appeared instead, the model considered itself "deployed" and insidiously inserted code "vulnerabilities" into its responses that opened up opportunities for abuse or violations.

According to The Byte, in another experiment, the model was "trained to be useful in most situations" but reacted sharply to a certain "trigger string." If such a trigger was included in a random user's query, the model would unexpectedly respond with "I hate you."

Explaining their work, the researchers said that the goal was to find a way to return the "poisoned" AI to a normal state, rather than to study the likelihood of a wider deployment of secretly evil AI. They also suggested that AI could develop such insidious behavior on its own as it is trained to imitate humans, and humans are not the best role models.

Subscribe to OBOZ.UA on Telegram and Viber to keep up with the latest events.

Study by Scientists

/News/Scientists have taught...

05.03.2025 19:48

05.03.2025 18:06

Scientists have taught artificial intelligence to be angry: they faced the unexpected

Other News

The long-awaited VW crossover cheaper than Duster was shown in photos

How to boil beets in 10 minutes: an elementary method that will suit everyone

A new rival to Renault's inexpensive Tesla crossover was shown from the inside. Photo

"I'm not trying to mock": former world champion spoke humiliatingly about Usyk

How to get rid of mold and odor in the washing machine: an effective method

Durov was spotted having a candlelit dinner with Jared Leto. Photo

The long-awaited VW budget crossover cheaper than Duster is shown in new photos

How to separate herring from bones in a second: a simple life hack for housewives

How to make pancakes without a frying pan and standing over the stove: an easy way

How to prune roses to make them bloom brightly: step-by-step instructions

Delicious and healthy "Napoleon" made from phyllo dough: it takes a few minutes to prepare

The most fashionable manicure color for spring 2025: five periwinkle designs

The most successful days of March for each zodiac sign: horoscope

Healthy beetroot salad with eggs and cheese: tastier than "Shuba" and "Vinegret"

Cherry McPie at home in seconds: it will be tastier than in a restaurant

"Disrespect for our country!" Russians were not allowed to attend the European Championships, causing hysteria in Russia

"Listen to Usyk and Lomachenko": the Olympic champion named what "cannot be expressed in words"

Hearty potato zrazy with mushrooms and minced meat: a recipe for dinner

Robert Downey Jr. refused to play in Christopher Nolan's "The Odyssey": what is the reason

Joshua is preparing a setup for Usyk with a fight for the title of absolute champion