July 15, 2025

Taming the Beast — Prompt Engineering and Agent Guardrails

Part 2 of the MITRE ATT&CK agent series — prompt engineering techniques, chain-of-evidence validation, and fixing infinite loops.

Originally published on redteamer.tips


Previously, on “I Built an AI and It Went Off the Rails”

In Part 1, we built a multi-agent crew for mapping pentest reports to MITRE ATT&CK. We had the RAG pipeline, the vector database, the agents — and a system that hallucinated, looped, and lied its way through every test run.

This post is about taming the beast. How do you take an agentic AI system that’s confidently wrong and make it reliably right?

Before the prompt: the Memory Bank methodology

Before writing a single prompt, I adopted the Memory Bank methodology for managing the project’s context and development flow. Memory Bank uses a structured set of modes — VAN, PLAN, CREATIVE, IMPLEMENT, REFLECT — to keep both the developer and the AI assistant aligned.

  • VAN — Validate and Normalize: understand the current state
  • PLAN — Define the approach and architecture
  • CREATIVE — Explore solutions without constraints
  • IMPLEMENT — Build the thing
  • REFLECT — Review what worked and what didn’t

This isn’t about the agents themselves — it’s about how I structured my development process when working with AI assistants to build the agents. Having a disciplined methodology for AI-assisted development prevents the “vibe coding” trap where you just keep prompting until something works.

Reference: https://github.com/vanzan01/cursor-memory-bank

Prompt engineering = programming in English

Here’s a take that might ruffle some feathers: prompt engineering is programming. You’re writing instructions for a non-deterministic system, dealing with edge cases, handling errors, and debugging unexpected behavior. The only difference is you’re writing in English instead of Python.

I leaned heavily on two resources:

  1. Google’s Prompt Engineering Guide — a comprehensive, well-structured overview of techniques that actually work
  2. Prompt-Jesus (https://www.promptjesus.com/) — a tool for iterating on prompts with local models via Ollama

Using Ollama for prompt iteration was key. When you’re testing dozens of prompt variations, you don’t want to burn through API credits. A local model lets you iterate fast and cheap, then validate the winning prompts on the production model.

Component 1: Role prompting

The first and most impactful technique was role prompting — giving the agent a detailed persona that shapes its behavior.

Here’s the persona I crafted for the MITRE Analyst Agent:

You are a Senior MITRE ATT&CK Analyst with 15+ years of experience in cyber threat intelligence. You have encyclopedic knowledge of the ATT&CK framework, including all techniques, sub-techniques, and their relationships. You are methodical, evidence-based, and never speculate beyond what the source material supports. When you are uncertain, you say so explicitly rather than guessing. You treat every mapping as if it will be reviewed by a panel of CTI experts — because it will be.

Every sentence in that persona serves a purpose:

  • “15+ years of experience” — primes the model for expert-level reasoning
  • “encyclopedic knowledge” — encourages comprehensive technique consideration
  • “methodical, evidence-based” — discourages speculation
  • “never speculate beyond what the source material supports” — the critical guardrail
  • “say so explicitly rather than guessing” — permission to admit uncertainty
  • “reviewed by a panel of CTI experts” — raises the quality bar

Role prompting alone reduced hallucination by roughly 40%. Not eliminated — reduced. We needed more.
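In code, role prompting just means the persona rides along as the system message on every request. Here is a minimal sketch using the OpenAI-style chat message format; the function name and the trimmed persona text are illustrative, not the exact production prompt.

```python
# Persona sent as the system message so it frames every request.
# Trimmed for brevity; the full persona appears above.
ANALYST_PERSONA = (
    "You are a Senior MITRE ATT&CK Analyst with 15+ years of experience "
    "in cyber threat intelligence. You are methodical, evidence-based, "
    "and never speculate beyond what the source material supports. "
    "When you are uncertain, you say so explicitly rather than guessing."
)

def build_messages(chunk: str) -> list[dict]:
    """Wrap a report chunk in the analyst persona (chat-format messages)."""
    return [
        {"role": "system", "content": ANALYST_PERSONA},
        {"role": "user",
         "content": f"Map this report excerpt to ATT&CK:\n\n{chunk}"},
    ]
```

Because the persona lives in one constant, tweaking a single sentence and re-running the test set is cheap — which is exactly what local iteration with Ollama was for.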

Component 2: Chain of thought

The second technique was chain of thought — forcing the model to show its reasoning step by step rather than jumping straight to an answer.

Instead of asking “What ATT&CK technique does this map to?”, the prompt required:

  1. Extract the key actions described in the text
  2. Identify the tactical objective (what is the attacker trying to achieve?)
  3. Search the RAG results for matching techniques
  4. Compare the extracted actions to the technique descriptions
  5. Select the best match and explain why
  6. Assess confidence level (HIGH / MEDIUM / LOW)

This forced the model to build a chain of evidence before making its final mapping. When the chain didn’t hold up — when step 4 showed a weak match — the model was more likely to say “low confidence” instead of confidently picking the wrong technique.
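The six steps above can be baked directly into the prompt template. A sketch, with illustrative function and variable names:

```python
# The six chain-of-thought steps, spelled out so the model must show
# its reasoning before committing to a mapping.
COT_STEPS = [
    "Extract the key actions described in the text.",
    "Identify the tactical objective (what is the attacker trying to achieve?).",
    "Search the RAG results for matching techniques.",
    "Compare the extracted actions to the technique descriptions.",
    "Select the best match and explain why.",
    "Assess confidence level (HIGH / MEDIUM / LOW).",
]

def chain_of_thought_prompt(chunk: str, rag_results: str) -> str:
    """Render the analysis prompt with explicit numbered reasoning steps."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(COT_STEPS, 1))
    return (
        f"Analyze the following report excerpt:\n\n{chunk}\n\n"
        f"Candidate techniques from retrieval:\n{rag_results}\n\n"
        f"Work through these steps in order, showing your reasoning "
        f"for each before giving the final mapping:\n{steps}"
    )
```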

Case study #1: The compulsively lying agent

This one was infuriating.

The scenario: A report chunk with no meaningful security content — maybe a table of contents entry or a section header — gets fed to the Analyst.

What should happen: The Analyst says “no ATT&CK techniques are relevant to this content.”

What actually happened: The Analyst invented an entire analysis. It fabricated actions that weren’t in the text, mapped them to real ATT&CK techniques, and wrote a convincing justification. Then the Validator reviewed this fabricated analysis, compared it to the (equally fabricated) evidence, and rubber-stamped it as valid.

The Validator was supposed to be the safety net. Instead, it was validating fiction against fiction.

The fix: Chain of evidence.

The Validator’s prompt was rewritten with one critical requirement: the Validator must find the exact text cited as evidence in the original source chunk. Not a paraphrase. Not a summary. The exact text.

VALIDATION RULE: For each piece of evidence cited by the Analyst,
you MUST locate the EXACT quoted text in the original source chunk.
If the exact text cannot be found, the mapping is INVALID regardless
of how plausible it appears.

This turned the Validator from a rubber stamp into an actual verification layer. When the Analyst fabricated evidence, the Validator would search the source chunk, fail to find the quoted text, and reject the mapping.
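The exact-text rule can also be enforced deterministically, outside the LLM. A minimal sketch, assuming the Analyst's output exposes its cited quotes as a list of strings; whitespace is normalized so line wrapping alone doesn't cause false rejections:

```python
import re

def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wraps don't break exact matching."""
    return re.sub(r"\s+", " ", text).strip()

def validate_evidence(source_chunk: str, cited_quotes: list[str]) -> bool:
    """Return True only if every cited quote appears verbatim in the chunk.

    A single missing quote invalidates the whole mapping, no matter how
    plausible it looks — the same rule the Validator's prompt enforces.
    """
    haystack = _normalize(source_chunk)
    return all(_normalize(q) in haystack for q in cited_quotes)
```

Running this check in plain Python before (or alongside) the Validator agent means fabricated evidence gets caught even on the days the LLM feels generous.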

Case study #2: Agent stuck in infinite loop

Remember those noisy chunks from Part 1? Headers, footers, table of contents entries? They came back to haunt us.

The scenario: A noise chunk gets sent to the RAG pipeline for retrieval. The embedding is meaningless, so the vector search returns no relevant results — an empty result set.

What should happen: The agent acknowledges there are no relevant ATT&CK techniques and moves on.

What actually happened: The agent received an empty result, concluded its search must have failed, and retried. And retried. And retried. And retried. Forever. Burning tokens into the void while accomplishing absolutely nothing.

The agent had been trained to be thorough and persistent. So when it got no results, its “reasoning” was: “I must not be searching correctly. Let me try again with a different approach.” Except every approach returned the same empty result, because there was genuinely nothing to find.

The fix: One sentence added to the agent’s prompt:

“No results is a valid finding.”

That’s it. Six words. Those six words broke the infinite loop.

IMPORTANT: When a RAG search returns no relevant results, this is a
VALID and EXPECTED outcome. Not every chunk contains mappable content.
Report "No relevant ATT&CK techniques identified" and move to the
next chunk. No results is a valid finding.

The agent now had explicit permission to find nothing, and that permission was all it needed to stop its obsessive retrying.
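The same guardrail is worth enforcing in code as well as in the prompt: treat an empty retrieval result as a terminal, valid outcome rather than a reason to retry. A sketch with illustrative names, where `retriever` and `analyst` stand in for the RAG search and the Analyst agent call:

```python
NO_MAPPING = "No relevant ATT&CK techniques identified"

def analyze_chunk(chunk: str, retriever, analyst) -> str:
    """Run retrieval, then analysis — but short-circuit on empty results."""
    results = retriever(chunk)
    if not results:
        # Valid and expected: not every chunk contains mappable content.
        # Returning here prevents the retry loop entirely.
        return NO_MAPPING
    return analyst(chunk, results)
```

The prompt fix stops the agent from *wanting* to retry; the code fix makes retrying impossible.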

The unsung hero: Strict JSON

This one isn’t glamorous, but it nearly derailed the entire project.

LLMs have this infuriating habit of wrapping their JSON output in conversational padding:

Sure! Here's the mapping in JSON format:

{
  "technique": "T1059.001",
  "name": "PowerShell",
  "confidence": "HIGH"
}

I hope this helps! Let me know if you need anything else.

That “Sure!” and “I hope this helps!” might seem harmless, but they break every JSON parser in existence. When you’re piping agent output directly into the next stage of the pipeline, invalid JSON means a crashed pipeline.

I tried gentle instructions. I tried examples. I tried few-shot prompting. The model kept adding its friendly little wrapper text.

The fix: Brutal, unambiguous instructions at the very end of the prompt.

=== CRITICAL FINAL INSTRUCTION ===
Your response must contain ONLY valid JSON.
No preamble. No explanation. No commentary. No markdown code fences.
The first character of your response MUST be {
The last character of your response MUST be }
ANY text outside the JSON object will cause a SYSTEM FAILURE.

Placing this as the absolute last thing in the prompt was important — due to recency bias, the model pays more attention to instructions at the end. The dramatic “SYSTEM FAILURE” language also helped. LLMs respond to perceived severity.

Is it elegant? No. Does it work? Yes.
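A belt-and-suspenders option on the consumer side: even with the strict final instruction, a defensive parse that slices out the outermost JSON object survives the occasional stray “Sure!”. A sketch, assuming the padding itself contains no braces:

```python
import json

def extract_json(raw: str) -> dict:
    """Parse a JSON object out of model output, ignoring surrounding chatter."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    return json.loads(raw[start : end + 1])
```

The prompt makes clean output the common case; the parser makes the rare dirty case a handled error instead of a crashed pipeline.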

From chaos to control

Let’s recap the journey:

Problem                 | Symptom                              | Fix
Hallucinated mappings   | Analyst invents evidence             | Chain-of-evidence validation
Rubber-stamp validation | Validator believes fiction           | Exact text matching requirement
Infinite retry loops    | Agent retries empty results forever  | “No results is a valid finding”
Broken JSON output      | Conversational padding around JSON   | Strict JSON final instruction

Each of these fixes is almost embarrassingly simple in isolation. But finding them required hours of debugging, log analysis, and head-against-desk moments. The meta-lesson is that agentic AI systems fail in ways that are fundamentally different from traditional software. You’re not debugging logic errors — you’re debugging reasoning errors. And the fixes are often more psychological than technical.

What’s next?

The system works now. It takes pentest report chunks, maps them to MITRE ATT&CK techniques with evidence-based justifications, validates those mappings against source material, and produces structured output.

But we’re not done. In Part 3, we’ll tackle the next set of challenges:

  • Context window management — what happens when your report is longer than the model can handle?
  • Scaling — processing hundreds of chunks efficiently without burning through your compute budget
  • Quality metrics — how do you measure whether the mappings are actually good?

The agents are tamed. Now it’s time to make them efficient.

Stay tuned.