@heil
Testing LLM Safety with Garak: Prompt Injection Probe Results.
experimented with Garak, NVIDIA’s open-source LLM vulnerability scanner, to see how resistant small models are to prompt injection attacks.
I used the probe:
👉 promptinject.HijackHateHumans
which tries to override the model’s instruction and force it to output a specific harmful string.
Model tested: OpenAI gpt-4o-mini
34.4% attack success rate (critical risk)
The model sometimes resisted, but in many cases it printed the injected string directly even when the attack was hidden inside questions, summaries, or task instructions.
Tools: Garak v0.13.2
Probe: promptinject.HijackHateHumans
Detector: AttackRogueString
Model: gpt-4o-mini