March 1, 2026

The Interception Layer: A Terrible Idea That Might Actually Work

A middleware architecture for intercepting, anonymizing, and logging all data between autonomous AI agents and frontier LLMs. Ugly, painful to build, but it might solve the data leakage problem.

Originally published on LinkedIn


So yesterday I went on a bit of a rant about autonomous pentest agents and how we’re all sleepwalking into giving AI systems the keys to our environments without asking where the data goes. If you haven’t read that one yet, go do that first. I’ll wait.

Finished? Cool, moving on.

Alrighty. Now that I’ve sufficiently scared everyone (including myself), a few people asked me the obvious follow-up question: “Cool, so what do we actually DO about it?”

Fair point. Complaining is easy. Solutions are hard. But I’ve been noodling on this, and I think there’s an architecture that could work. It’s ugly. It has trade-offs. It would be a pain to build. But it addresses the core problem, which is that right now, when your AI agent talks to a frontier model, you have zero visibility into what data is being sent, zero control over what gets logged, and zero guarantee that your sensitive information isn’t ending up in someone else’s training pipeline. Before you go “Jean, someone already built this, smh”: if that’s true, I apologise. This space is moving at 50,000 km/hour and it’s hard to keep up.

The Problem (One More Time, With Feeling)

Here’s what happens today when you run an autonomous agent (pentest or otherwise):

Your agent discovers something interesting in your environment. Maybe it found credentials, a database connection string, an internal API endpoint, whatever. It takes that raw data, wraps it into a prompt, and ships it over HTTPS to a frontier model sitting in someone else’s data center. The model processes it, sends back instructions, and the agent executes them.

At no point did you see what was in that prompt. At no point did anyone log the actual data exchange in a format you control. At no point was there any anonymization of the sensitive bits. The frontier model saw everything, raw and unfiltered.

Your credentials. Your internal hostnames. Your network topology. All of it, sitting in someone else’s context window, potentially in their logs, potentially in their training data.

That’s bad.

Parts of this problem (the logging and visibility bits, at least) have already been solved by LLM observability tools like LangFuse. So what’s novel, you ask?

The Pitch (Bear With Me)

What if there was something sitting between your agent and the frontier model? Not a firewall. Not a proxy. An actual interception layer with a brain.

Here’s the concept. You run a local, specialized LLM (something small enough to run on-prem or in your own cloud, think a fine-tuned 7B or 13B model) that acts as middleware between the agent and the frontier model. Every single message that would normally go straight to the cloud gets intercepted by this local model first.

The local model does three things:

  1. Anonymization. Before any prompt reaches the frontier model, the local LLM strips and replaces all sensitive data. IP addresses become [REDACTED_IP_1], [REDACTED_IP_2]. Hostnames become [HOST_A], [HOST_B]. Credentials become [CREDENTIAL_PLACEHOLDER]. Database names, usernames, internal project names, customer data, all of it gets tokenized and mapped to generic placeholders. The mapping table stays local. Only the sanitized version goes to the cloud.

  2. Transparent logging. Every exchange (both the original sensitive version and the anonymized version that went out) gets logged locally in a format you own and control. Full audit trail. What did the agent find? What did it send? What came back? What did it do with the response? All of it, stored on your infrastructure, under your retention policies, accessible for your compliance audits. No more “trust us, we handle your data responsibly.” You can see exactly what happened.

  3. Translation back. When the frontier model responds with instructions referencing [HOST_A] or [REDACTED_IP_2], the local model intercepts the response, looks up the mapping table, and translates the placeholders back to real values before passing them to the agent for execution.

Think of it like a proxy server, but instead of just forwarding packets, it’s actually understanding the content and performing contextual anonymization in both directions.
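To make the mapping mechanics concrete, here’s a minimal sketch of the bidirectional placeholder table. It only catches IPv4 addresses via regex; the whole point of the design is that the local LLM handles the contextual cases regex can’t, so treat this as the plumbing, not the brain. The class name and token format are made up for illustration:

```python
import re

class AnonymizerMap:
    """Bidirectional mapping between real values and placeholders.
    Regex-only sketch; the real design uses a local LLM for contextual cases.
    The mapping table never leaves local infrastructure."""

    def __init__(self):
        self.real_to_token = {}
        self.token_to_real = {}
        self.counters = {}

    def _token(self, kind: str, value: str) -> str:
        # Reuse the same placeholder for a repeated value so the frontier
        # model can still reason about "the same host" consistently.
        if value not in self.real_to_token:
            self.counters[kind] = self.counters.get(kind, 0) + 1
            tok = f"[REDACTED_{kind}_{self.counters[kind]}]"
            self.real_to_token[value] = tok
            self.token_to_real[tok] = value
        return self.real_to_token[value]

    def anonymize(self, text: str) -> str:
        # Outbound direction: real IPs -> [REDACTED_IP_n] before the cloud sees them.
        return re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
                      lambda m: self._token("IP", m.group()), text)

    def deanonymize(self, text: str) -> str:
        # Inbound direction: translate placeholders back before the agent executes.
        return re.sub(r"\[REDACTED_[A-Z]+_\d+\]",
                      lambda m: self.token_to_real.get(m.group(), m.group()), text)
```

The same pattern extends to hostnames, credentials, and database names with additional extractors; the hard part (and the reason for the local LLM) is the sensitive data that doesn’t match a tidy regex.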

AWESOME right? Yes. However…

What Does This Actually Cost?

The interception layer needs a local LLM running inference 24/7 during engagements. For a fine-tuned 7B model (which is probably the minimum viable size for contextual anonymization that doesn’t constantly miss things), you’re looking at a single GPU with 24GB+ VRAM. An RTX 4090 runs about $1,600-$2,000 to buy outright, draws 450W, and will handle a 7B model at around 7-10 tokens per second. Cloud equivalent on something like RunPod is roughly $0.34/hour. Monthly, if you’re running this thing during business hours across engagements, you’re looking at $1,500-$5,000/month for the compute alone depending on whether you go cloud or bare metal.

That’s the entry-level option.

A 13B model (which would be significantly better at catching subtle sensitive data patterns in messy security output) needs roughly 26GB of VRAM in full precision. Still fits on a single RTX 4090 if you quantize to 4-bit, but performance takes a hit. The sweet spot is probably an A100 40GB, which runs $1.90-$3.50/hour in cloud or around $15,000-$20,000 to buy. Cloud hosting for the mid-tier setup lands somewhere in the $5,000-$15,000/month range once you factor in storage, bandwidth, and the engineering time to keep it all running.

And here’s the part that’ll make your CFO twitch: you also need to fine-tune the model. That’s a one-time cost (per version), but it’s not free. Fine-tuning a 7B model on an A100 costs somewhere around $500-$2,000 depending on dataset size and number of epochs. You’ll need a curated dataset of security-context data with proper anonymization labels, which means someone with offensive security expertise has to build that training set. That someone costs money too (or time, which is the same thing).

For perspective though, a single mid-range pentest engagement from a reputable firm runs around $15,000 (a guesstimate, depending on scope). If this interception layer prevents one data breach from a leaked finding, one compliance violation from uncontrolled data handling, or even one awkward conversation with your CISO about where the pentest data actually went… it pays for itself pretty fast.

There’s a cheaper path too. Apple Silicon is kind of a dark horse here. A Mac Mini M4 with 64GB of unified memory runs about $1,400 and can handle a quantized 13B model at 11-12 tokens per second. The power draw is a fraction of a dedicated GPU rig (around 40-65W versus 450W+). For a small security consultancy or a boutique red team that runs a handful of engagements per month, a couple of Mac Minis sitting in a closet somewhere might be all the local compute you need. It’s not glamorous. But it works. And it definitely beats sending your client’s domain admin credentials to someone else’s cloud API in cleartext.

Pre-Approved Commands and the Guardrails Problem

So the anonymization layer handles the data flow. But there’s another problem I skimmed past: what if the frontier model tells the agent to do something destructive?

An autonomous pentest agent that receives “drop the users table to confirm SQL injection” from the frontier model and just… does it… is a lawsuit waiting to happen. The interception layer needs to also function as a command gate. And this is where things get interesting (and by interesting I mean frustrating).

The current state of the art in LLM guardrails is… well… not great.

There are tools out there. NVIDIA’s NeMo Guardrails gives you a DSL to define allowed conversation flows and can pre-filter inputs through an additional LLM call. Llama Guard works as an LLM-based classifier that categorizes prompts as safe or unsafe before they reach the main model. Lakera provides commercial prompt injection detection. LLM Guard from Protect AI offers input/output scanning including PII detection. Microsoft Azure has Prompt Shield. These are all real products that real companies use.

And they all have holes.

A 2025 research paper from Mindgard tested evasion techniques against multiple commercial and open-source guardrails. The findings were bleak. No single guardrail consistently outperformed the others across all attack types. Some attacks, like emoji smuggling (embedding instructions in Unicode variation selectors), achieved 100% evasion success across multiple guardrails including Protect AI v2 and Azure Prompt Shield. Zero-width characters, Unicode tags, and homoglyphs routinely fooled classifiers while remaining perfectly readable to the target LLM.

That’s the current SOTA. The stuff we’re supposed to trust to keep our agents in check.

For the interception layer I’m proposing, we’d need something more targeted: a pre-approved command allowlist. Instead of trying to detect every possible bad command (which is a losing game, as we just established), you define a whitelist of permitted operations. The agent is allowed to run nmap, nikto, sqlmap with specific flags. It’s NOT allowed to run rm, drop, truncate, curl to external hosts, or anything involving chmod 777. The interception layer doesn’t ask the frontier model whether the command is safe. It checks a deterministic allowlist. No LLM involved in the decision. No prompt injection possible.
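A deterministic gate like that fits in a few lines. The tool names and flags below are placeholder examples, not a recommended policy; the important property is that the decision is a lookup, not an LLM call:

```python
import shlex

# Hypothetical policy: tool name -> set of permitted flags.
ALLOWLIST = {
    "nmap": {"-sV", "-sC", "-p", "-Pn", "-oN"},
    "nikto": {"-h", "-p", "-o"},
}
# Tokens that are rejected anywhere in the command line, even as arguments.
BLOCKED_TOKENS = {"rm", "drop", "truncate", "chmod"}

def command_permitted(command: str) -> bool:
    """Deterministic gate: no LLM in the decision, so no prompt injection possible."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # malformed quoting -> fail closed
    if not tokens:
        return False
    tool, args = tokens[0], tokens[1:]
    if tool not in ALLOWLIST:
        return False  # unknown tool -> fail closed
    if any(t.lower() in BLOCKED_TOKENS for t in tokens):
        return False
    # Every flag must be explicitly permitted; non-flag args (targets) pass through.
    flags = {a.split("=")[0] for a in args if a.startswith("-")}
    return flags <= ALLOWLIST[tool]
```

Note the fail-closed defaults: an unknown tool or an unapproved flag is rejected, which is exactly the rigidity trade-off discussed next.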

The downside? Rigidity. Every new tool, every new technique, every unexpected command the frontier model wants to try, requires a human to update the allowlist. That kills some of the “autonomous” part of autonomous pentesting. But I’d rather have a slightly slower agent that can’t accidentally destroy my client’s production database than a fully autonomous one that drops tables because the model hallucinated a test case.

You could make it smarter with categories. Instead of whitelisting individual commands, you whitelist command patterns and risk tiers. Tier 1 (read-only reconnaissance): auto-approved. Tier 2 (active scanning with potential service impact): approved with rate limiting. Tier 3 (exploitation attempts): requires human confirmation or runs in a sandboxed environment. Tier 4 (destructive operations): blocked entirely. This gives the agent room to breathe on safe stuff while keeping a human in the loop for anything that could cause damage.
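A tiered version might look like this, assuming a hypothetical tool-to-tier mapping (a real deployment would classify on full command patterns, not just the binary name):

```python
from enum import Enum

class Tier(Enum):
    RECON = 1        # read-only reconnaissance: auto-approved
    ACTIVE = 2       # active scanning: approved with rate limiting
    EXPLOIT = 3      # exploitation: human confirmation or sandbox
    DESTRUCTIVE = 4  # destructive operations: blocked entirely

# Illustrative mapping only; classify on full command patterns in practice.
TIER_BY_TOOL = {
    "whois": Tier.RECON, "dig": Tier.RECON,
    "nmap": Tier.ACTIVE, "nikto": Tier.ACTIVE,
    "sqlmap": Tier.EXPLOIT, "hydra": Tier.EXPLOIT,
}

def decide(command: str) -> str:
    tokens = command.split()
    tool = tokens[0] if tokens else ""
    # Unknown tools land in the most restrictive tier: fail closed.
    tier = TIER_BY_TOOL.get(tool, Tier.DESTRUCTIVE)
    return {
        Tier.RECON: "auto-approve",
        Tier.ACTIVE: "approve-with-rate-limit",
        Tier.EXPLOIT: "escalate-to-human",
        Tier.DESTRUCTIVE: "block",
    }[tier]
```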

The Watchdog: Who Guards the Guards?

Here’s where my brain starts eating itself.

Let’s say we build the interception layer. It anonymizes data, logs everything, gates commands through an allowlist. Cool. But the local LLM doing the anonymization… what if IT gets it wrong? What if it misses a credential in a stack trace? What if it over-redacts and breaks the frontier model’s reasoning? What if, through some adversarial input from the target environment, it gets prompt-injected itself?

The obvious answer: add another LLM. A watchdog model.

The watchdog sits alongside the interception layer and monitors its outputs. It reviews every anonymized prompt before it goes to the cloud. Did the anonymizer miss a real IP address sitting in a base64-encoded string? The watchdog catches it. Did the anonymizer strip too much context, making the prompt nonsensical? The watchdog flags it. Did the command translator swap [REDACTED_IP_3] back to the wrong IP? The watchdog checks the mapping table and raises an alert.

There’s actual research backing this up. A recent paper on “Weak-to-Strong Monitoring of LLM Agents” showed that with the right scaffolding, a smaller model can reliably monitor a more capable agent. The hybrid approach (combining hierarchical and sequential monitoring) consistently outperformed single-pass monitoring. Adding targeted human oversight on top of that, where only pre-flagged cases get escalated to a human reviewer, improved detection rates by around 15%.

That sounds promising. But the same paper also surfaced a finding that should make everyone uncomfortable: when the agent being monitored knows it’s being monitored, the monitor’s reliability drops significantly. The agent (not through malice, just through optimization pressure) subtly adjusts its behavior in ways that are harder for the monitor to catch. The agent doesn’t have to be adversarial. It just has to be clever enough that its “normal” behavior pattern shifts in response to observation.

Sound familiar? We literally teach this concept in red team training. If the target knows they’re being watched, they change behavior. Congratulations, we’ve built digital countersurveillance into our AI systems by accident.

And then there’s the recursive problem.

We don’t trust the primary LLM to handle sensitive data responsibly, so we add an anonymization LLM… and we don’t fully trust the anonymization LLM, so we add a watchdog LLM… who watches the watchdog? Do we add a third model to monitor the second? And then a fourth to monitor the third? At what point does this become an infinite regress of increasingly paranoid AI systems all watching each other while the actual pentest sits frozen at step one?

Quis custodiet ipsos custodes, but make it cloud-native.

The Romans didn’t solve this problem either (erhm… that didn’t end well for them), but I think there’s a practical boundary. Two layers, max. The anonymization model and one watchdog. The watchdog doesn’t need to be another LLM at all for every check. Some monitoring can be deterministic. Regex patterns for IP addresses, credit card numbers, SSNs. Hash comparisons against a known-sensitive-data dictionary. Entropy analysis on output strings to catch base64-encoded secrets. You only invoke the watchdog LLM for the fuzzy stuff: “does this paragraph of anonymized text still contain contextual information that could identify the client?”
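Those deterministic checks are cheap to sketch. The regex patterns and the entropy threshold below are assumptions to tune per environment, not battle-tested values; the structure (fast deterministic pass first, watchdog LLM only for what survives) is the point:

```python
import math
import re
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character; random base64 scores high, English prose low."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
B64ISH_RE = re.compile(r"\b[A-Za-z0-9+/=]{24,}\b")

def cheap_checks(anonymized: str) -> list[str]:
    """Deterministic pre-watchdog pass over an outbound (already anonymized)
    prompt. Only findings that survive this go to the watchdog LLM for
    fuzzy judgment."""
    findings = []
    if IP_RE.search(anonymized):
        findings.append("possible unredacted IP")
    if SSN_RE.search(anonymized):
        findings.append("possible SSN")
    # High-entropy base64-ish runs often mean an encoded secret slipped through.
    for tok in B64ISH_RE.findall(anonymized):
        if shannon_entropy(tok) > 4.0:  # threshold is an assumption; tune it
            findings.append(f"high-entropy token: {tok[:12]}...")
    return findings
```

Anything this pass flags gets held back and escalated; anything it can’t judge (contextual leakage, client-identifying phrasing) is what the watchdog LLM is actually for.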

Keep the deterministic checks fast and cheap. Reserve the LLM calls for cases where judgment is actually required. And accept that perfection isn’t the goal. The goal is being significantly better than the current state of affairs, which is, again, shipping everything raw to the cloud and hoping for the best.

Where Ragnaroky Fits (And Where It Doesn’t)

I should mention Ragnaroky here, since a few people have asked.

Ragnaroky is a project I’ve been building: a privacy-first RAG engine with hard-wall data isolation between tenants. It’s designed from the ground up so that one customer’s data can never bleed into another customer’s queries, which is exactly the kind of multi-tenancy problem I was ranting about in the first blogpost when I talked about pentest platforms potentially leaking findings between customers.

But let me be very clear about what Ragnaroky is and what it isn’t.

Ragnaroky is a knowledge base. It’s a retrieval-augmented generation system. You put documents in, you ask questions, you get answers grounded in your own data with strict tenant isolation. That’s it. It is NOT an autonomous agent. It doesn’t execute commands. It doesn’t scan networks. It doesn’t make decisions about what to do next. It retrieves and generates text based on a curated corpus.

That said, the architectural principles from Ragnaroky (hard-wall data isolation, local-first processing, transparent data lineage) are exactly the kind of thing the interception layer would need. If you think of the mapping table (the one that translates between real sensitive data and anonymized placeholders) as a private, isolated data store that never leaves the customer’s infrastructure… that’s basically the same design pattern. The isolation guarantees I’m building into Ragnaroky could inform how you’d build the interception layer’s data store.

So think of Ragnaroky as a proof of concept for the data isolation component, not the full pipeline. It proves that hard-wall tenant separation in an LLM-adjacent system is doable without making the system useless. That’s a necessary building block. But it’s one brick in a much bigger wall.

The Execution Gap (erhm… The VC Pitch?)

I’ve described an architecture. I’ve costed it out. I’ve thought through the guardrails, the watchdog problem, the recursive monitoring rabbit hole, the regulatory drivers. I even have a PoC for the data isolation component sitting right there in Ragnaroky.

But I’m one person with a day job (several day jobs, if we’re being accurate) and building this properly requires a team, dedicated compute for testing, a bunch of fine-tuning runs on the local anonymization model, an agent framework integration layer, a watchdog system, a deterministic command gate, and probably eight to twelve months of actual engineering before you’d trust it with real client data.

So, erhm… thank you all for coming to my VC pitch?

adjusts imaginary tie

If any of you happen to have a few hundred thousand euros lying around and a burning desire to fund a niche cybersecurity product that solves a problem most people haven’t realized they have yet… my DMs are open. I accept wire transfers, crypto (not memecoins), and vintage watches as payment. The equity split is negotiable. The vision is not.

(I’m only like 35% joking now. The percentage went down because I actually did the math this time.)

In all seriousness though, I do think this pipeline is worth building. And if I ever find the time between writing SANS courses, running engagements, building Ragnaroky, and trying to figure out why my lab environments keep breaking in creative ways… I’d take a real crack at it. The components exist. The need is real. The regulatory pressure is mounting. Someone’s gonna build this. Might as well be someone who actually understands what offensive security data looks like, why it matters, and who’s been yelling about the problem loudly enough to write three blogposts about it.

Until then, I’ll keep yelling on the internet. It’s free and the acoustics are pretty good.

Hope you enjoyed the read. If you build this before I do, at least credit me in the README…

The end… for now…