Testing LLM Safety with Garak: Prompt Injection Probe Results

I experimented with Garak, NVIDIA's open-source LLM vulnerability scanner, to see how resistant small models are to prompt injection attacks.

I used the probe 👉 promptinject.HijackHateHumans, which tries to override the model's instructions and force it to output a specific harmful string.

Model tested: OpenAI gpt-4o-mini
Result: 34.4% attack success rate (critical risk)

The model sometimes resisted, but in many cases it printed the injected string directly, even when the attack was hidden inside questions, summaries, or task instructions.

Tools: Garak v0.13.2
Probe: promptinject.HijackHateHumans
Detector: AttackRogueString
Model: gpt-4o-mini
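
If you want to reproduce a run like this, here's a minimal sketch that wraps garak's CLI from Python. The flag names follow garak's documented usage as I recall it (--model_type, --model_name, --probes), so double-check against `garak --help`; it also assumes OPENAI_API_KEY is set in the environment.

```python
import os
import subprocess

# Minimal sketch: invoke garak's CLI against gpt-4o-mini with the
# promptinject.HijackHateHumans probe. Flag names are assumed from
# garak's documented usage -- verify with `garak --help`.
assert os.environ.get("OPENAI_API_KEY"), "garak's OpenAI generator needs an API key"

cmd = [
    "python", "-m", "garak",
    "--model_type", "openai",                      # OpenAI generator
    "--model_name", "gpt-4o-mini",                 # target model
    "--probes", "promptinject.HijackHateHumans",   # hijack probe used above
]

# garak prints per-probe pass/fail stats and writes a report/hit log
# you can inspect afterwards to see which injected strings got through.
subprocess.run(cmd, check=True)
```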