July 15, 2025

I Attempted to Build an Agentic AI... And It Immediately Got Stuck in a Loop

Part 1 of building a MITRE ATT&CK mapping agent crew — architecture, RAG pipeline, and the first spectacular failures.

Originally published on redteamer.tips


The problem

If you’ve ever written a pentest report and had to map findings to MITRE ATT&CK, you know the pain. You stare at the matrix, scroll through hundreds of techniques, and try to figure out whether what you did was T1059.001 or T1059.003 or maybe it’s actually T1547.001 and honestly who even knows anymore.

So I thought: what if I could automate this? Feed a pentest report chunk into an AI system, and have it spit out accurate ATT&CK mappings with proper justifications.

How hard could it be?

Narrator: It was very hard.

The paradigm shift: agents vs scripts

Before diving in, let’s talk about what makes an agent different from a script. A script follows a predetermined path — if X then Y. An agent makes decisions. It can reason about its input, choose which tools to use, and adapt its approach based on what it finds.
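To make that concrete, here's a toy sketch (not from the real project; `decide()` is a stub standing in for an LLM call): the script always walks the same fixed path, while the agent picks its next action based on what it has seen so far.

```python
# Toy contrast between a script and an agent loop.
# decide() is purely illustrative -- in a real agent it would be a model call.

def script(finding: str) -> list[str]:
    # A script: fixed path, no decisions.
    return ["search_attack", "map_technique", "write_report"]

def decide(finding: str, history: list[str]) -> str:
    # Stub for "the model reasons about its input and picks a tool".
    if "search_attack" not in history:
        return "search_attack"
    if "powershell" in finding.lower() and "map_technique" not in history:
        return "map_technique"
    return "done"

def agent(finding: str, max_steps: int = 5) -> list[str]:
    # An agent: chooses each step, can stop early, and can also loop
    # forever -- hence the max_steps guard (foreshadowing).
    history: list[str] = []
    for _ in range(max_steps):
        action = decide(finding, history)
        if action == "done":
            break
        history.append(action)
    return history
```

The `max_steps` guard matters more than it looks: an agent that decides its own next step is an agent that can decide the same next step forever.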

For this project, I designed a four-agent crew using CrewAI:

| Agent      | Role                                                    |
|------------|---------------------------------------------------------|
| Researcher | Searches the MITRE ATT&CK knowledge base via RAG        |
| Analyst    | Maps report findings to ATT&CK techniques               |
| Validator  | Verifies the Analyst’s mappings against source material |
| Writer     | Produces the final formatted output                     |

Spoiler alert: the Researcher agent turned out to be an anti-pattern. More on that later — but the short version is that giving an LLM a search tool and saying “go research this” leads to unfocused, meandering searches that waste tokens and produce mediocre results. It turns out that targeted, deterministic retrieval works better than letting an agent decide what to search for.

The framework dilemma: why CrewAI over N8N?

This was a genuine decision point. Both CrewAI and N8N can orchestrate AI agents, but they serve fundamentally different purposes:

| Aspect      | CrewAI (Code-First)                    | N8N (No-Code/Low-Code)            |
|-------------|----------------------------------------|-----------------------------------|
| Metaphor    | The brain                              | The nervous system                |
| Strength    | Complex reasoning, agent collaboration | Workflow automation, integrations |
| Flexibility | Full programmatic control              | Visual workflow builder           |
| Agent logic | Deep, customizable                     | Surface-level                     |
| Best for    | AI-native tasks                        | Connecting systems                |

For a task that requires deep reasoning — analyzing text, making judgment calls about ATT&CK mappings, validating evidence — I needed “the brain.” N8N is fantastic for connecting systems together, but when the core challenge is thinking, you want code-first.

CrewAI also gave me full control over prompts, agent interactions, and the decision-making pipeline. When things inevitably went wrong (and oh boy, did they), I needed to be able to debug at the prompt level.

Building the RAG pipeline

What is RAG and why do we need it?

RAG — Retrieval Augmented Generation — is the technique of giving an LLM access to external knowledge at query time. Instead of relying solely on what the model learned during training, you retrieve relevant documents and include them in the prompt.

Why is this critical for ATT&CK mapping? Because LLMs hallucinate. Ask an LLM to map something to MITRE ATT&CK from memory, and it will confidently give you technique IDs that don’t exist, descriptions that are subtly wrong, or mappings that are plausible but incorrect.

RAG keeps the model honest by saying: “Here are the actual ATT&CK technique descriptions. Now do your mapping based on this, not your vibes.”

Step 1: Ingesting the MITRE corpus

The MITRE ATT&CK knowledge base is available via STIX/TAXII — standardized formats for cyber threat intelligence. I pulled down the entire corpus: techniques, sub-techniques, descriptions, examples, mitigations, the works.

This gave me a comprehensive dataset, but it was raw and messy. STIX objects contain a lot of metadata that’s irrelevant for our purposes, so the next step was cleaning it up.
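To give a feel for that cleanup, here's roughly what extracting the useful fields looks like. The field names (`attack-pattern`, `external_references`, `x_mitre_deprecated`) follow MITRE's published STIX bundles; the rest is a minimal sketch, not the project's exact code.

```python
import json

def extract_techniques(bundle: dict) -> list[dict]:
    """Pull technique ID, name, and description out of a STIX bundle,
    skipping revoked/deprecated entries and non-technique objects."""
    techniques = []
    for obj in bundle.get("objects", []):
        if obj.get("type") != "attack-pattern":
            continue  # relationships, mitigations, etc. are handled elsewhere
        if obj.get("revoked") or obj.get("x_mitre_deprecated"):
            continue
        # The ATT&CK ID (e.g. T1059.001) lives in external_references
        attack_id = next(
            (ref["external_id"] for ref in obj.get("external_references", [])
             if ref.get("source_name") == "mitre-attack"),
            None,
        )
        if attack_id is None:
            continue
        techniques.append({
            "id": attack_id,
            "name": obj.get("name", ""),
            "description": obj.get("description", ""),
        })
    return techniques

# Usage: MITRE publishes the bundles as JSON, e.g. in the mitre/cti repo:
# bundle = json.load(open("enterprise-attack.json"))
```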

Step 2: Semantic chunking and noise filtering

This is where I learned a painful lesson about garbage in, garbage out.

The raw corpus needed to be chunked into pieces small enough for the embedding model to handle effectively. But naive chunking — just splitting on character count — produces terrible results. You end up with chunks that start mid-sentence or split a technique description right at the critical part.

Semantic chunking tries to keep logically related content together. It respects paragraph boundaries, keeps technique descriptions whole, and ensures each chunk is a coherent unit of information.

But the real killer was noise. The source documents contained:

  • Headers and footers repeated on every page
  • Table of contents entries
  • Navigation elements and formatting artifacts

These noise elements were getting embedded right alongside the actual content, and they were causing absolute havoc downstream. The agents would retrieve a chunk that was 80% table of contents and 20% actual technique description, and then try to reason about it. This was one of the primary causes of the infinite loops I’ll describe later.

Filtering this noise out was tedious but essential. Regular expressions, heuristics, and a lot of manual spot-checking.
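A minimal sketch of the idea, with illustrative noise patterns (yours will depend entirely on your source documents):

```python
import re

# Illustrative noise patterns -- tune these to your own corpus.
NOISE_PATTERNS = [
    re.compile(r"^Page \d+ of \d+$"),            # repeated footers
    re.compile(r"^\s*\d+(\.\d+)*\s+.+\s+\d+$"),  # TOC lines like "3.2 Execution ... 14"
    re.compile(r"^(Home|Back to top|Contents)$", re.IGNORECASE),
]

def is_noise(line: str) -> bool:
    return any(p.match(line.strip()) for p in NOISE_PATTERNS)

def semantic_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy paragraph-aware chunking: drop noise lines, then pack
    whole paragraphs into chunks without ever splitting mid-paragraph."""
    paragraphs = []
    for para in re.split(r"\n\s*\n", text):
        kept = [ln.strip() for ln in para.splitlines()
                if ln.strip() and not is_noise(ln)]
        if kept:
            paragraphs.append(" ".join(kept))
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The greedy packing is the simplest possible version; real semantic chunking can also use embedding similarity to decide where one topic ends and the next begins.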

Step 3: Self-hosted embeddings

For converting text chunks into vector representations, I had a choice between commercial API-based embeddings and self-hosted models.

| Approach    | Model                         | Pros                                   | Cons                                         |
|-------------|-------------------------------|----------------------------------------|----------------------------------------------|
| Commercial  | OpenAI text-embedding-ada-002 | High quality, easy to use              | Cost per request, data leaves your infra     |
| Self-hosted | BAAI/bge-small-en-v1.5        | Free, data stays local, no rate limits | Requires GPU/compute, slightly lower quality |

For a security-focused tool that processes pentest reports — documents full of sensitive client data — sending embeddings through an external API was a non-starter. Self-hosted all the way.

I went with BAAI/bge-small-en-v1.5, a compact but capable embedding model. It’s small enough to run on modest hardware but produces embeddings that are good enough for our retrieval needs.
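TEI exposes a plain HTTP `/embed` endpoint, so the client side stays tiny. A sketch, with the service URL assumed from the Docker setup (the cosine helper is only for intuition; in practice the vector database does the similarity search):

```python
import json
import math
import urllib.request

# Assumes the TEI container is reachable under this service name.
TEI_URL = "http://embeddings:80/embed"

def embed(texts: list[str]) -> list[list[float]]:
    """Call a Text Embeddings Inference server; returns one vector per input."""
    req = urllib.request.Request(
        TEI_URL,
        data=json.dumps({"inputs": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def cosine(a: list[float], b: list[float]) -> float:
    """Similarity between two embeddings; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```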

Step 4: Dockerized microservices

The final architecture consisted of three containerized services:

  • Qdrant — Vector database for storing and querying embeddings
  • HuggingFace TEI (Text Embedding Inference) — Serves the embedding model as an API
  • CrewAI App — The agent crew itself

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

  embeddings:
    image: ghcr.io/huggingface/text-embeddings-inference
    command: --model-id BAAI/bge-small-en-v1.5

  crewai-app:
    build: .
    depends_on:
      - qdrant
      - embeddings
```

Everything talks over Docker networking. The CrewAI app sends text to the embedding service, gets vectors back, queries Qdrant for similar chunks, and feeds the results to the agents. Clean, reproducible, and entirely self-contained.
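Sketched end to end, the retrieval hop looks something like this. The search endpoint is Qdrant's REST API; the collection name and payload fields are assumptions for illustration.

```python
import json
import urllib.request

QDRANT_URL = "http://qdrant:6333"  # service name from the compose file
COLLECTION = "mitre_attack"        # illustrative collection name

def search(vector: list[float], limit: int = 5) -> list[dict]:
    """Query Qdrant's REST search endpoint for the nearest chunks."""
    body = json.dumps({"vector": vector, "limit": limit, "with_payload": True})
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

def build_context(hits: list[dict]) -> str:
    """Flatten retrieved chunks into a context block for the agent prompt."""
    lines = []
    for hit in hits:
        payload = hit.get("payload", {})
        lines.append(f"[{payload.get('id', '?')}] {payload.get('text', '')}")
    return "\n".join(lines)
```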

The first spectacular failure

With the pipeline built, it was time for the moment of truth. I fed in a pentest report and let the agents loose.

The approach: batch all 269 chunks from the report into a single prompt and ask the Analyst to map them all at once.

The result: absolute garbage.

What happened was predictable in hindsight. The LLM received an enormous context, found one keyword it recognized in one of the first few chunks, latched onto it, and then basically ignored the remaining 260+ chunks. The output was a single mapping repeated with slight variations, as if the model had found its answer in chunk 3 and decided the rest wasn’t worth reading.

Imagine asking someone to read a 500-page book and write a report, but they read the first chapter, found an interesting sentence, and wrote their entire report about that one sentence.

The fix: per-chunk analysis. Instead of throwing everything at the model at once, process each chunk individually. Let the Analyst focus on one piece of the report at a time, produce its mapping, and then move on to the next chunk.

This was slower but dramatically more accurate. The model could actually focus on what was in front of it rather than drowning in context.
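The loop itself is almost embarrassingly simple. `map_chunk` below is a keyword stub standing in for the Analyst call; the point is the shape of the loop and the aggregation, not the stub.

```python
from collections import defaultdict

def map_chunk(chunk: str) -> list[str]:
    # Stand-in for the Analyst agent; returns ATT&CK IDs for one chunk.
    ids = []
    if "powershell" in chunk.lower():
        ids.append("T1059.001")
    if "registry run key" in chunk.lower():
        ids.append("T1547.001")
    return ids

def map_report(chunks: list[str]) -> dict[str, list[int]]:
    """Process one chunk at a time, then aggregate technique -> chunk indices.
    This is the fix for the 'read chunk 3, ignore the rest' failure mode."""
    mappings: dict[str, list[int]] = defaultdict(list)
    for i, chunk in enumerate(chunks):
        for technique in map_chunk(chunk):
            mappings[technique].append(i)
    return dict(mappings)
```

Keeping the chunk indices alongside each technique also pays off later: the Validator can go back to the exact chunk a mapping supposedly came from.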

To be continued…

So here’s where we stand: the foundation is laid. We have a RAG pipeline, a vector database full of MITRE ATT&CK knowledge, a crew of agents, and a per-chunk processing approach that actually produces results.

But those results? They’re… not great yet. The agents started hallucinating, looping, and straight-up lying. The Analyst would invent ATT&CK techniques. The Validator would rubber-stamp obvious nonsense. And sometimes the whole crew would get stuck in an infinite retry loop, burning tokens into the void.

In Part 2, we’ll talk about how prompt engineering and agent guardrails turned this chaotic mess into something that actually works. Stay tuned.