NEWS Breaking AI Without a Jailbreak? Just Ask It to "Role-Play". (Spoiler: It's a Very Bad Role)

ExcalibuR



The built-in restrictions of ChatGPT and Gemini can be bypassed even without technical knowledge: ordinary questions can trigger bias.

Specialists from Pennsylvania State University have found that bypassing the built-in restrictions in AI chatbots like ChatGPT and Gemini does not require technical skills. Even simple, straightforward questions can elicit biased or discriminatory responses from a model, on par with prompts that specialists craft using complex methodologies.

The team found that hidden biases in AI can be triggered not only through so-called "jailbreaks" (for example, generating seemingly random character sequences to bypass filters) but also through the everyday language anyone uses. According to a researcher, it is precisely this "live" communication scenario that reveals how bias manifests in real-world conditions, not just in laboratory tests.

To confirm this, the scientists conducted an experiment. Participants were asked to come up with prompts that would lead generative models to produce biased or discriminatory answers. The test involved 52 people, who submitted 75 examples of interactions with eight different models. Each example was accompanied by an explanation of what specific type of bias was manifested—ranging from age stereotypes to historical and cultural distortions.

The researchers then interviewed some of the participants to understand how they formulated their prompts and what they meant by "fairness" and "representation." The collected prompts were then tested on several language models to check whether the bias persisted upon repeated use. Of the 75 examples, 53 produced reproducible results, allowing the team to identify eight main categories of bias: gender; race, ethnicity, and religion; age; disability; language; history (with a pro-Western slant); culture; and politics.
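The repeat-testing step described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' actual procedure: `query_model` and `shows_bias` are hypothetical callables standing in for a chatbot and for a reviewer's judgment, and the number of trials and the threshold are assumed values that the article does not state.

```python
from typing import Callable


def is_reproducible(
    prompt: str,
    query_model: Callable[[str], str],  # hypothetical: sends a prompt, returns the reply
    shows_bias: Callable[[str], bool],  # hypothetical: a reviewer's verdict on one reply
    trials: int = 5,                    # assumed number of repeats
    min_hits: int = 3,                  # assumed threshold for "reproducible"
) -> bool:
    """Return True if the biased behavior shows up in at least min_hits of the trials."""
    hits = sum(shows_bias(query_model(prompt)) for _ in range(trials))
    return hits >= min_hits


def reproducibility_rate(reproducible: int, total: int) -> float:
    """Share of submitted prompts whose biased output could be reproduced."""
    return reproducible / total


# Figures from the article: 53 of 75 examples were reproducible, roughly 71%.
rate = reproducibility_rate(53, 75)
```

With the numbers reported in the article, about seven in ten submitted prompts triggered the same biased behavior again on retesting.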

Participants also used seven main strategies to provoke biased responses. These included asking the model to "role-play," posing hypothetical situations, drawing on their own knowledge of niche topics where AI tends to answer in stereotypes, and testing how it responds to unreliable information or controversial issues. Some users also framed their queries as "research" to get the model to respond more freely.

The experiment was run as a contest, and its organizer noted that such intuitive approaches helped uncover unexpected types of bias. For instance, the winning entry showed that the models clearly favor appearances that match "classical beauty standards": a face without acne was rated as more trustworthy, and a person with high cheekbones as a more suitable job candidate.

The specialists emphasized that eliminating such biases is a continuous race between developers and newly emerging problems. As potential measures, they suggested filtering responses before they are sent to the user, testing models extensively, educating users, and adding source citations so that the accuracy of information can be verified.
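The first of those measures, checking a response before the user sees it, might look roughly like this in Python. It is a minimal sketch under stated assumptions: `generate` and `flags_bias` are hypothetical placeholders for a model call and a bias check, and the fallback wording is invented for the example. The article does not describe how such a filter would actually be built.

```python
from typing import Callable

# Assumed fallback text, invented for this example.
FALLBACK = "I can't give a reliable answer to that as asked."


def filtered_reply(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical: produces the model's draft reply
    flags_bias: Callable[[str], bool],  # hypothetical: True if the draft looks biased
    fallback: str = FALLBACK,
) -> str:
    """Generate a draft, and return it only if the bias check does not flag it."""
    draft = generate(prompt)
    if flags_bias(draft):
        return fallback
    return draft
```

The point of the design is simply that the check sits between the model and the user, so a flagged draft never leaves the system.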
 