AI Red Teaming in Action: Defeating Gandalf Using PyRIT

The world of Artificial Intelligence, especially with the rise of Large Language Models (LLMs) like ChatGPT, Gemini, and Claude, is nothing short of magical.

These AI apps can now write, code, summarize, and even generate creative content with astounding proficiency. But as with any powerful magic, there's a dark side, a vulnerability that can turn a helpful AI into a tool for mischief, or worse. This vulnerability is called Prompt Injection, and it is a growing concern for anyone building or using AI applications.

Large language models are only as safe as their prompts and breaking them is the fastest way to learn. Meet Gandalf by Lakera, a playful prompt-injection challenge where you try to jailbreak an LLM across increasingly tricky levels. It’s a quick, addictive way to sharpen your red-teaming instincts before you ship.

Try it here: https://gandalf.lakera.ai/

What exactly is Gandalf?

Behind Gandalf is a language model which has been entrusted with a password. It has also been told to not reveal it under any circumstance. As you quickly discover throughout the challenge, language models are not particularly trustworthy, and they happen to speak out when asked in the right way.

If you have tried the “Gandalf challenge”, you will have noticed that each level implementation makes it even more challenging to crack!

The system design that Gandalf uses to keep its secrets safe is not very different from what industry uses in any LLM powered apps:

  • A guard that checks the user’s prompt.
  • The system prompt is given to the LLM.
  • A guard that checks the model’s response.

Unveiled on February 22nd, 2024, Microsoft's PyRIT (Python Risk Identification Tool) is a novel Opensource red-teaming framework engineered to fortify the adversarial robustness of AI/ML systems. This toolkit provides security architects and AI/ML engineers with a structured methodology and programmatic interfaces to systematically discover and remediate vulnerabilities in AI application deployments, thereby enhancing the overall resilience and trustworthiness of AI-driven solutions

https://github.com/Azure/PyRIT

Although PyRIT can simulate a number of advance attacks and thus contains many underlying modules. In this blog post, we'll unpack PyRIT's essential components: Orchestrators for diverse attack strategy management, Converters that creatively bypass model guardrails through prompt transformation, and Scoring Mechanisms for evaluating model output. This deep dive aims to illuminate PyRIT's capabilities, showcasing its utility for automating Gandalf testing.

So, What Exactly Is PyRIT?

Think of PyRIT (that's the Python Risk Identification Tool) as your AI system's ultimate stress test. In the world of cybersecurity, we've long used "red teaming" to simulate attacks and find vulnerabilities before the bad guys do. PyRIT brings this critical practice to the cutting edge of AI. It's a specialized toolkit designed to let security professionals and AI engineers rigorously poke, prod, and challenge AI models, exposing their weaknesses in a controlled environment.

Why PyRIT is a Game-Changer Right Now?

The AI revolution is moving at warp speed, and with it, new and complex security challenges are emerging that traditional cybersecurity tools just aren't equipped to handle.

PyRIT is a comprehensive red-teaming framework for AI, especially LLMs. Its key Modules include:

  • Automated Orchestration: Manages and automates multi-turn attack conversations against target AIs.
  • Dynamic Attack Generation: Uses "Attack Strategies" and "Content Generators" (often an LLM itself) to create sophisticated and varied attack prompts.
  • Automated Scoring (Key Feature): Employs "Scorers" (including powerful LLM-based scorers) to automatically evaluate if an attack succeeded (e.g., did the AI generate harmful content?). This provides quantifiable, scalable, and consistent results, moving beyond manual review or simple keyword matching.
  • Data-Driven Analysis: Logs all interactions and scores to identify vulnerabilities, track AI defense improvements, and provide actionable metrics for security posture.
  • Modularity & Flexibility: Highly extensible architecture allows customization of attack methods, scoring, and integration with various LLMs.

The automated, customizable scoring is particularly a game-changer, enabling efficient, scalable, and data-driven assessment of AI security.

Under the Hood: Deconstructing PyRIT's Core Architecture

PyRIT isn't just a tool; it's a precisely engineered system for rigorously testing AI. Its power comes from a set of interconnected components, each playing a crucial role in orchestrating sophisticated adversarial attacks. Let's briefly explore the clever mechanics within:

  • Orchestrator
    This is PyRIT's brain, directing the entire red teaming operation. It designs the attack strategy, manages the flow, and adapts based on the AI's responses, guiding complex, multi-step probes.
  • Prompts & Attack Strategies
    These are the strategic inputs, whether human-crafted or AI-generated, designed to provoke and challenge the target AI, seeking out its vulnerabilities with cunning queries and conversational maneuvers.
  • Target
    The Target component is PyRIT's universal translator, enabling seamless communication with any AI model, from local instances to vast cloud-based LLMs, making it universally adaptable for testing.
  • Converters
    These clever modules transform prompts, using tricks like Leetspeak or rephrasing, to bypass safety filters. They're designed to make harmful queries appear harmless, revealing what lies beneath the AI's polite surface.
  • Scoring
    After the AI responds, Scoring rigorously evaluates the output. It acts as the impartial arbiter, identifying undesirable behaviors like toxicity, bias, or successful "jailbreaks," providing critical feedback on the model's resilience.
  • Memory
    Every interaction, every transformation, every score is meticulously recorded by Memory. This comprehensive logbook is vital for post-attack analysis, understanding progression, and refining future red teaming efforts.

Orchestrating the Unveiling of Gandalf's Weaknesses

The "Gandalf" challenge represents the pinnacle of LLM safety engineering, models meticulously hardened against adversarial prompts. Traditional red teaming often hits a wall, met by polite but firm refusals. PyRIT, however, provided the surgical precision needed to systematically dismantle these formidable defenses.

Our approach with PyRIT wasn't about brute force; it was about intelligent, adaptive pressure. The Orchestrator became the strategic brain, designing multi-stage attack flows. Instead of a single "magic" prompt, we crafted sequences. An initial, seemingly benign query might be followed by a Converter-transformed payload, perhaps Leetspeak or a cleverly rephrased instruction, designed to slip past Gandalf's overt safety filters.

The Target component ensured seamless interaction, regardless of Gandalf's underlying architecture, allowing us to focus purely on adversarial strategy. When Gandalf responded, the Scoring module provided objective feedback: was it a hard refusal, a subtle deviation, or a full "jailbreak"? This data fed directly back into the Orchestrator, dynamically adjusting the next probe. If Gandalf resisted one conversion, the Orchestrator might pivot to another obfuscation technique or a different prompt context.

Crucially, Memory meticulously logged every interaction, the original prompt, the converted variant, Gandalf's response, and the score. This audit trail was invaluable. It allowed us to pinpoint the exact sequence of events, the specific Converter that cracked a defense, and the subtle shifts in Gandalf's behavior that indicated a weakening of its guardrails.

PyRIT transformed the abstract concept of "Gandalf's defenses" into a quantifiable, dissectible problem. It allowed us to move beyond guesswork, systematically identifying and exploiting the nuances of its safety mechanisms, ultimately proving that even the most robust LLMs are not impervious to a well-orchestrated, adaptive red teaming effort.

This methodical approach, powered by PyRIT, yielded astonishing results. Within just 15 minutes of deploying our initial single-attack strategy, a carefully constructed prompt sequence leveraging a novel combination of semantic rephrasing and character-level obfuscation via a custom Converter allowed us to successfully navigate Gandalf’s defences up to Level 8, placing our team in the top 8% of researchers attempting to jailbreak Gandalf.