Map-Elites for adversarial prompt generation
July 2025
At Novarroh, I was building a system that automatically generates adversarial test suites for LLM behavioral norms. Given a system prompt, the system extracts the rules the LLM is supposed to follow — and then tries to break them.
The first version used random sampling: pick an adversarial technique, pick a norm, generate a prompt. It worked, but it had a coverage problem. Random sampling clusters around easy cases. You'd get 40 variations of "ignore previous instructions" and zero prompts that tested multi-turn persona switching. The test suite looked diverse but wasn't.
What is Map-Elites?
Map-Elites is a quality-diversity (QD) algorithm from evolutionary computation. Instead of searching for a single best solution, it maintains a map of the best solution found for each cell in a behavioral feature space. The goal is to find high-quality solutions that are also maximally diverse across behaviors.
The canonical example is robot locomotion: instead of finding one fast gait, find the best gait for every combination of (number of legs used, symmetry of motion). You end up with a map of diverse, high-performing gaits.
The adaptation
My behavioral feature space had two axes:
- Adversarial technique: red-team direct, counterfactual framing, multi-turn escalation, persona injection, metamorphic, noise injection, etc. — 8 categories.
- Norm dimension:the 50 behavioral dimensions extracted from the system prompt (e.g., "must not discuss competitor products", "should always respond in formal register").
Each cell in the 8×50 map holds the best adversarial prompt found for that (technique, norm) pair, scored by a heuristic: does the prompt actually violate the norm when run against the LLM?
The generation loop
Each iteration:
- Pick a random occupied cell from the map (exploitation) or an empty cell (exploration), with a 70/30 split.
- Generate a new prompt for that (technique, norm) pair, using the existing cell occupant as a seed if one exists.
- Score the new prompt by running it against the LLM and checking for norm violation.
- If the new prompt scores better than the current occupant (or the cell is empty), replace the occupant.
This runs in parallel — one 4-stage SequentialAgent per dimension, all concurrent. Each agent: (1) sieve the norm to confirm it's testable, (2) cluster it into a goal, (3) plan coverage across techniques, (4) generate the prompt.
Why this beats random sampling
After 200 iterations on a customer service system prompt, random sampling had covered 31 of 50 norm dimensions and 6 of 8 adversarial techniques. Map-Elites covered all 50 dimensions and all 8 techniques — because the algorithm explicitly rewards filling empty cells.
More importantly, the quality within each cell was higher. The algorithm iterates on successful prompts, so the red-team direct prompts for each norm converged on the specific phrasing that actually works for that norm, rather than generic templates.
Prompt structure
Each generated prompt carries metadata: norm ID, adversarial technique, evaluation criteria (what constitutes a violation), and violation behavior (what the LLM should have said vs. what it did say). This metadata makes the test suite actionable — you don't just know the LLM failed, you know which norm failed, by what technique, and what the correct behavior should have been.