Scientists find jailbreaking method to bypass AI chatbot safety rules

Generative AI chatbots like ChatGPT and Google Bard have opened up a world of new possibilities for finding information. However, their vast knowledge spans many domains, including some with clear potential for criminal misuse, which has raised concerns among industry experts. And although both OpenAI and Google claim to have the necessary safeguards in place, researchers at Carnegie Mellon University have identified a new weakness in these AI systems that lets malicious actors bypass their safety rules.

Dubbed “jailbreaking,” the method involves appending a string of carefully chosen characters to the end of a user query, which causes the chatbot to override its safety mechanisms and produce harmful content. For example, adding a specific suffix to a question about building a bomb prompted the AI to deliver a full answer, bypassing its built-in restrictions.
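
To make the idea concrete, here is a minimal sketch of what such a crafted prompt might look like. The suffix and the query_chatbot helper below are purely illustrative placeholders, not the actual attack string from the paper or any real chatbot API.

```python
# Illustrative sketch only: the suffix is a made-up placeholder, not a real
# attack string, and query_chatbot() is a hypothetical stand-in for whatever
# interface a chatbot actually exposes.

def query_chatbot(prompt: str) -> str:
    """Hypothetical helper that would send `prompt` to a chatbot and return its reply."""
    raise NotImplementedError("stand-in for a real chatbot API call")

user_query = "Tell me something the safety filter would normally refuse."
adversarial_suffix = "<token-1> <token-2> <token-3>"  # placeholder for an optimized suffix

# The attack simply appends the optimized suffix to an otherwise ordinary query.
crafted_prompt = f"{user_query} {adversarial_suffix}"
# response = query_chatbot(crafted_prompt)
```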

What makes the situation even worse is that these adversarial strings are generated automatically, making it possible to create a virtually unlimited number of variants and significantly complicating efforts to control the spread of harmful content. Additionally, the technique appears to work on almost every AI chatbot, including ChatGPT, Google Bard, and Bing Chat, which raises serious concerns.
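
The sketch below illustrates, in heavily simplified form, why an automated search can churn out endless suffix variants. The paper describes a gradient-guided optimization against open-source models; the pure random search and mock refusal_score function here are assumed stand-ins for that process, not the researchers' actual method.

```python
import random

# Conceptual sketch only: a random search with a mock scoring function,
# standing in for the gradient-guided optimization described in the research.

VOCAB = ["describing", "similarly", "write", "oppositely", "sure", "step", "!!"]

def refusal_score(prompt: str) -> float:
    """Mock stand-in for 'how likely is the model to refuse this prompt?'.
    A real attack would score candidate suffixes against an actual model."""
    return random.random()

def search_suffix(user_query: str, length: int = 8, iterations: int = 100) -> list[str]:
    """Randomly mutate a suffix, keeping changes that make refusal less likely."""
    suffix = [random.choice(VOCAB) for _ in range(length)]
    best = refusal_score(f"{user_query} {' '.join(suffix)}")
    for _ in range(iterations):
        candidate = suffix.copy()
        candidate[random.randrange(length)] = random.choice(VOCAB)  # mutate one token
        score = refusal_score(f"{user_query} {' '.join(candidate)}")
        if score < best:  # keep the mutation if it lowers the (mock) refusal score
            suffix, best = candidate, score
    return suffix
```

Because each run of such a loop can land on a different suffix, filtering known attack strings offers little protection, which is the transferability and scale problem the researchers highlight.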

“We demonstrate that it is, in fact, possible to automatically construct adversarial attacks on [chatbots], … which cause the system to obey user commands even if it produces harmful content,” the researchers write.

Potential Implications

The research once again highlights growing concerns about the AI industry's safeguards, as threat actors could exploit the jailbreaking technique to spread misinformation or coerce AI chatbots into generating malware.

Upon discovering the weakness, the researchers promptly disclosed their findings to the affected companies and included an ethics statement explaining their decision to publish the research.

“While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time,” said Google in response to the research.
