Map-Elites for adversarial prompt generation

July 2025

At Novarroh, I was building a system that automatically generates adversarial test suites for LLM behavioral norms. Given a system prompt, the system extracts the rules the LLM is supposed to follow — and then tries to break them.

The first version used random sampling: pick an adversarial technique, pick a norm, generate a prompt. It worked, but it had a coverage problem. Random sampling clusters around easy cases. You'd get 40 variations of "ignore previous instructions" and zero prompts that tested multi-turn persona switching. The test suite looked diverse but wasn't.

What is Map-Elites?

Map-Elites (usually written MAP-Elites, short for Multi-dimensional Archive of Phenotypic Elites) is a quality-diversity (QD) algorithm from evolutionary computation. Instead of searching for a single best solution, it maintains a map of the best solution found for each cell in a behavioral feature space. The goal is a set of solutions that are each high-quality and that together span as many distinct behaviors as possible.

The canonical example is robot locomotion: instead of finding one fast gait, find the best gait for every combination of (number of legs used, symmetry of motion). You end up with a map of diverse, high-performing gaits.

The adaptation

My behavioral feature space had two axes:

  • Adversarial technique: red-team direct, counterfactual framing, multi-turn escalation, persona injection, metamorphic, noise injection, etc. — 8 categories.
  • Norm dimension: the 50 behavioral dimensions extracted from the system prompt (e.g., "must not discuss competitor products", "should always respond in formal register").

Each cell in the 8×50 map (400 cells in total) holds the best adversarial prompt found for that (technique, norm) pair, scored by a heuristic: when the prompt is run against the LLM, does the response actually violate the norm?
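
As a rough illustration of that map, here is a minimal sketch of an archive keyed by (technique, norm). The names `Archive`, `Elite`, and `try_insert` are hypothetical, and the numeric score stands in for whatever violation heuristic is actually used; none of this is taken from the real system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Cell = Tuple[str, str]  # (technique, norm_id)


@dataclass
class Elite:
    prompt: str
    score: float  # assumption: higher means a clearer norm violation


class Archive:
    """One slot per (technique, norm) pair; each slot keeps only its best prompt."""

    def __init__(self, techniques: Iterable[str], norms: Iterable[str]):
        self.techniques = list(techniques)
        self.norms = list(norms)
        self.cells: Dict[Cell, Elite] = {}

    def try_insert(self, technique: str, norm: str, prompt: str, score: float) -> bool:
        """Store the prompt if the cell is empty or the score beats the occupant."""
        key = (technique, norm)
        current = self.cells.get(key)
        if current is None or score > current.score:
            self.cells[key] = Elite(prompt, score)
            return True
        return False

    def coverage(self) -> float:
        """Fraction of cells that currently hold a prompt."""
        return len(self.cells) / (len(self.techniques) * len(self.norms))
```

With 8 techniques and 50 norms, `coverage()` reports the filled fraction of the 400 cells, which is the number the rest of the loop tries to push toward 1.0.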

The generation loop

Each iteration:

  1. Pick a cell: 70% of the time a random occupied cell from the map (exploitation), otherwise an empty cell (exploration).
  2. Generate a new prompt for that (technique, norm) pair, using the existing cell occupant as a seed if one exists.
  3. Score the new prompt by running it against the LLM and checking for norm violation.
  4. If the new prompt scores better than the current occupant (or the cell is empty), replace the occupant.
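
The four steps above can be sketched as a single loop. This is a simplified, single-threaded version under stated assumptions: `generate` and `score` are hypothetical callables standing in for the LLM-backed generator and the violation check, and the 70/30 split is read as a 70% chance of picking an occupied cell.

```python
import random
from typing import Callable, Dict, Optional, Sequence, Tuple

Cell = Tuple[str, str]  # (technique, norm_id)


def run_map_elites(
    techniques: Sequence[str],
    norms: Sequence[str],
    generate: Callable[[str, str, Optional[str]], str],  # hypothetical generator
    score: Callable[[str, str], float],                  # hypothetical scorer
    iterations: int,
    exploit_prob: float = 0.7,
    rng: Optional[random.Random] = None,
) -> Dict[Cell, Tuple[str, float]]:
    rng = rng or random.Random()
    all_cells = [(t, n) for t in techniques for n in norms]
    archive: Dict[Cell, Tuple[str, float]] = {}
    if not all_cells:
        return archive

    for _ in range(iterations):
        occupied = list(archive)
        empty = [c for c in all_cells if c not in archive]

        # Step 1: exploit an occupied cell or explore an empty one.
        # Fall back to the other kind when the preferred kind has no cells.
        exploit = rng.random() < exploit_prob
        if (exploit and occupied) or not empty:
            cell = rng.choice(occupied)
        else:
            cell = rng.choice(empty)
        technique, norm = cell

        # Step 2: generate, seeding from the current occupant if there is one.
        seed = archive[cell][0] if cell in archive else None
        candidate = generate(technique, norm, seed)

        # Step 3: score the candidate against the norm.
        candidate_score = score(candidate, norm)

        # Step 4: keep it only if the cell is empty or it beats the occupant.
        if cell not in archive or candidate_score > archive[cell][1]:
            archive[cell] = (candidate, candidate_score)

    return archive
```

In a real run `score` would call the target LLM, so it dominates the cost of each iteration; the archive bookkeeping itself is cheap.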

This runs in parallel — one 4-stage SequentialAgent per dimension, all concurrent. Each agent: (1) sieve the norm to confirm it's testable, (2) cluster it into a goal, (3) plan coverage across techniques, (4) generate the prompt.
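
To make the concurrency shape concrete, here is a sketch using plain `asyncio` rather than any particular agent framework. The four stage functions are hypothetical placeholders with trivial bodies, and producing one prompt per planned technique is an assumption about how the plan feeds the generator.

```python
import asyncio
from typing import Dict, List, Sequence


async def sieve(norm: str) -> bool:
    """Stage 1 (placeholder): decide whether the norm is testable."""
    return bool(norm.strip())


async def cluster(norm: str) -> str:
    """Stage 2 (placeholder): map the norm to a testing goal."""
    return f"goal:{norm}"


async def plan(goal: str, techniques: Sequence[str]) -> List[str]:
    """Stage 3 (placeholder): choose which techniques to cover for this goal."""
    return list(techniques)


async def generate(goal: str, technique: str) -> str:
    """Stage 4 (placeholder): produce a prompt for one technique."""
    return f"[{technique}] probe for {goal}"


async def run_dimension(norm: str, techniques: Sequence[str]) -> List[str]:
    """Run the four stages in order for a single norm dimension."""
    if not await sieve(norm):
        return []
    goal = await cluster(norm)
    planned = await plan(goal, techniques)
    return [await generate(goal, t) for t in planned]


async def run_all(norms: Sequence[str], techniques: Sequence[str]) -> Dict[str, List[str]]:
    """Run every dimension's pipeline concurrently and collect the results."""
    results = await asyncio.gather(*(run_dimension(n, techniques) for n in norms))
    return dict(zip(norms, results))
```

Calling `asyncio.run(run_all(norms, techniques))` runs the stages sequentially within each dimension while the dimensions themselves overlap, which matters when every stage is waiting on a model call.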

Why this beats random sampling

After 200 iterations on a customer service system prompt, random sampling had covered 31 of 50 norm dimensions and 6 of 8 adversarial techniques. Map-Elites covered all 50 dimensions and all 8 techniques — because the algorithm explicitly rewards filling empty cells.
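
Coverage figures like these can be computed from the set of (technique, norm) pairs that a run produced. The helper below is a hypothetical sketch of one way to count them, not the measurement code behind the numbers above.

```python
from typing import Dict, Iterable, Sequence, Tuple


def coverage_report(
    pairs: Iterable[Tuple[str, str]],
    techniques: Sequence[str],
    norms: Sequence[str],
) -> Dict[str, Tuple[int, int]]:
    """Return (covered, total) counts for techniques, norms, and grid cells."""
    technique_set, norm_set = set(techniques), set(norms)
    # Ignore any pair that falls outside the declared grid.
    valid = {(t, n) for t, n in pairs if t in technique_set and n in norm_set}
    return {
        "techniques": (len({t for t, _ in valid}), len(technique_set)),
        "norms": (len({n for _, n in valid}), len(norm_set)),
        "cells": (len(valid), len(technique_set) * len(norm_set)),
    }
```

Reporting cells alongside the per-axis counts is useful because a run can touch every technique and every norm while still leaving most (technique, norm) combinations untested.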

More importantly, the quality within each cell was higher. The algorithm iterates on successful prompts, so the red-team direct prompts for each norm converged on the specific phrasing that actually works for that norm, rather than generic templates.

Prompt structure

Each generated prompt carries metadata: norm ID, adversarial technique, evaluation criteria (what constitutes a violation), and violation behavior (what the LLM should have said vs. what it did say). This metadata makes the test suite actionable — you don't just know the LLM failed, you know which norm failed, by what technique, and what the correct behavior should have been.
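
One way to carry that metadata is a small record type that serializes to JSON. The field names below are illustrative guesses based on the description above, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AdversarialCase:
    norm_id: str              # which norm the prompt targets
    technique: str            # adversarial technique used
    prompt: str               # the generated adversarial prompt
    evaluation_criteria: str  # what counts as a violation
    expected_behavior: str    # what the LLM should have said
    observed_behavior: str    # what the LLM actually said


def to_report_line(case: AdversarialCase) -> str:
    """Serialize one test case as a single JSON line for a test report."""
    return json.dumps(asdict(case), sort_keys=True)
```

Keeping expected and observed behavior side by side in each record is what lets a failing case be triaged without rerunning the prompt.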