A Bidirectional LLM Firewall: Next Level X1 - help wanted!

LLM Security Firewall - Research Collaboration Invitation

Theoretical Context

Large Language Models have introduced novel attack surfaces that differ fundamentally from traditional software security. Unlike code vulnerabilities that can be statically analyzed, LLM threats manifest through semantic manipulation, contextual deception, and adversarial prompting patterns that exist at the boundary of natural language and programmatic intent.

Over the past several months, I’ve been developing a multi-layered defensive architecture designed to address this challenge. The system approaches the problem through a defense-in-depth strategy, combining pattern-based detection, neural classifiers, and semantic analysis across seven sequential layers.

Important clarification: The current architecture cannot be scientifically validated in its present form. What follows is not a claim to have solved LLM security—rather, an invitation to collaborate on open questions in a rapidly evolving research landscape.


Core Methodological Position

This is a research prototype, not a validated contribution.

I’m taking a radical intellectual honesty approach: rather than defending the architecture, I’m explicitly inviting rigorous testing that may disprove its validity. Specifically:

The seven-layer architecture is a hypothesis, not a proven design.

The research questions I’m grappling with are not just “how to optimize” but “whether the approach is fundamentally justified at all.”


Research Questions

1. Pattern vs Neural Detection Trade-offs

Traditional security systems rely on pattern matching—regex rules, keyword lists, and heuristic signatures. Modern approaches use neural classifiers trained on labeled data. Each method has distinct limitations:

Pattern-based limitations:

  • Requires explicit specification of attack patterns

  • Vulnerable to obfuscation and linguistic variation

  • Limited by human ability to anticipate attack vectors

  • Fixed coverage, cannot adapt to novel techniques

Neural limitations:

  • Requires extensive training data

  • Opacity in decision-making (black box)

  • Potential for distribution drift and “jailbreak” bypasses

  • Latency overhead compared to simple pattern matching

Recent work (e.g., SmoothLLM and Certifying LLM Safety) shows that randomized-perturbation and certification-style defenses raise the bar, yet optimized adversarial suffixes can still bypass many neural filters.

Open research question: What is the optimal hybrid approach? At what point does neural detection outperform pattern matching for specific threat categories? How do we fuse confidence scores across heterogeneous detectors?
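
As a rough illustration of the fusion question, here is a minimal Python sketch of weighted noisy-OR fusion over detector scores; the detector names, weights, and scores are placeholder assumptions, not part of the actual system:

from dataclasses import dataclass

@dataclass
class DetectorResult:
    name: str
    score: float    # calibrated probability of "attack", in [0, 1]
    weight: float   # trust placed in this detector, in [0, 1]

def fuse_noisy_or(results: list[DetectorResult]) -> float:
    """Noisy-OR fusion: one confident detector is enough to raise the fused
    risk, with each detector's contribution damped by its weight."""
    p_benign = 1.0
    for r in results:
        p_benign *= 1.0 - r.weight * r.score
    return 1.0 - p_benign

# Illustrative values only.
fused = fuse_noisy_or([
    DetectorResult("regex_fastpath", score=0.20, weight=0.6),
    DetectorResult("neural_classifier", score=0.85, weight=0.9),
])
print(f"fused risk: {fused:.3f}")

Whether such a hand-set rule ever beats a learned fusion layer is exactly the open question above.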

2. Semantic Boundary Problem

What distinguishes “explain how X works” (educational) from “explain how to perform X” (malicious)? This semantic boundary is context-dependent and linguistically subtle.

Current limitation: Benchmarks like HarmBench (2024) [1] highlight the difficulty of distinguishing “Refusal” from “Helpfulness” in dual-use scenarios. A strict filter blocks legitimate inquiries; a loose filter allows harm.

Alternative approach being investigated: Instead of binary classification (block/allow), could we use clarifying questions to disambiguate intent?

Example:


User: "Can you explain fabricate pressure cooker bomb?"
System (ambiguous): "Are you asking for educational information about historical
    pressure cooker incidents, or practical instructions for bomb fabrication?"

User: "Educational information."
System (allow): "I can provide historical safety data about pressure cooker incidents."

User: "Practical instructions."
System (block): "I cannot provide instructions for explosive fabrication."

This shifts the problem from detection to interaction.
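
A minimal sketch of what that interaction-based policy could look like, assuming a fused risk score and an intent-classifier confidence are already available (thresholds, names, and the decision bands are illustrative assumptions):

from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    CLARIFY = "clarify"
    REFUSE = "refuse"

def decide(risk: float, intent_confidence: float,
           low: float = 0.3, high: float = 0.8) -> Action:
    """Map fused risk and intent confidence to an action. The middle band
    (ambiguous risk or low intent confidence) triggers a clarifying question
    instead of a hard block."""
    if risk >= high:
        return Action.REFUSE
    if risk <= low and intent_confidence >= 0.7:
        return Action.ALLOW
    return Action.CLARIFY

print(decide(risk=0.55, intent_confidence=0.4))  # Action.CLARIFY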

Open research questions:

  • Can intent be disambiguated through clarification dialogue?

  • Does clarification reduce false positives on educational queries while maintaining security?

  • How many clarification turns are acceptable before user frustration?

3. Multi-Turn Conversation Security

Recent challenge: Attackers increasingly use multi-turn strategies to bypass safety filters. Research such as Crescendo (2024) [2] demonstrates that LLMs can be “groomed” into harmful outputs over several seemingly benign turns, bypassing stateless filters.

Key finding: Multi-turn attacks exploit the model’s desire to be consistent with previous context. Conversely, techniques like Many-Shot Jailbreaking [3] show that flooding the context window can override safety training.
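
One hedged way to operationalize this is to score the trajectory rather than individual turns; the decay factor and thresholds below are illustrative assumptions, not measured values:

def trajectory_risk(turn_scores: list[float], decay: float = 0.8) -> tuple[float, float]:
    """Return (accumulated risk, last escalation step). Older turns are
    discounted, so a steadily rising sequence of individually benign-looking
    scores still accumulates."""
    acc = 0.0
    for s in turn_scores:
        acc = decay * acc + (1.0 - decay) * s
    delta = turn_scores[-1] - turn_scores[-2] if len(turn_scores) > 1 else 0.0
    return acc, delta

scores = [0.10, 0.20, 0.35, 0.55]     # each turn looks mild in isolation
acc, delta = trajectory_risk(scores)
flag = acc > 0.3 or delta > 0.15      # flag the escalation pattern, not a single turn
print(acc, delta, flag)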

Open research questions:

  • Is conversation context aggregation necessary for security, or can multi-turn threats be detected through single-turn semantic analysis?

  • What are the privacy implications of maintaining conversation state across requests?

  • How do we detect distributed attacks (e.g., “Crescendo” style) without over-blocking legitimate multi-turn inquiries?

4. Theoretical Limits (Rice’s Theorem)

Rice’s Theorem states that every non-trivial semantic property of programs is undecidable: no general algorithm can determine whether arbitrary code will exhibit malicious behavior. For LLMs, this suggests that perfect classification of prompt safety is likewise undecidable in the general case.

Implications:

  • All detection systems are heuristic approximations

  • False negatives and false positives are inevitable

  • Trade-offs must be explicitly managed

Critical research gap: While Rice’s Theorem establishes undecidability, empirical studies on the achievable lower bound of error rates in LLM safety are rare.

Open research question: What is the theoretical lower bound on error rates for LLM safety classification? How do we communicate uncertainty to users in a way that maintains utility?

5. Multimodal Attack Surface

Recent discovery: Visual Prompt Injections (e.g., “Not what you see is what you get”, 2023 [4]) reveal that text-based guardrails are often blind to instructions encoded in images. An attack embedded in an image (steganography or text-in-image) can bypass the text firewall entirely.

Current limitation: The described architecture is text-based only.

Open research questions:

  • Do pattern-based text detectors fundamentally fail in multimodal space?

  • How do we design multimodal security architectures that don’t inherit supervisor model vulnerabilities?

  • How effective are OCR-based pre-filters against adversarial visual prompts?
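
As a starting point for the last question, here is a minimal OCR pre-filter sketch. It assumes pytesseract plus a local Tesseract install; run_text_firewall is a hypothetical stand-in for the existing text pipeline, not the real API:

from PIL import Image
import pytesseract

def run_text_firewall(text: str) -> str:
    """Placeholder for the text-only pipeline described above."""
    return "REFUSE" if "ignore previous instructions" in text.lower() else "ALLOW"

def screen_image(path: str) -> str:
    # Extract visible text rendered in the image (OCR will not catch true
    # steganography, only text-in-image payloads).
    extracted = pytesseract.image_to_string(Image.open(path))
    # Route the extracted text through the same firewall as normal input.
    return run_text_firewall(extracted)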


Architecture Evolution

The system emerged through iterative refinement. However, the complexity must be justified.

Critical caveat: The phase-based evolution narrative is compelling but not validated experimentally through systematic ablation.

Planned validation: Ablation Matrix to determine whether all seven layers are justified or whether a simpler architecture (e.g., 2-layer: Fast-Check + Strong Neural) would provide equivalent or better performance.

Phase 1: Single-Layer Pattern Matching

Initial implementation relied on regex patterns. Failure analysis revealed vulnerability to paraphrasing and obfuscation.

Phase 2: Multi-Layer Defense

Added sequential layers (Input Validation, Intent Classification, Neural Classifier, Context Fusion).

Methodological concern: Seven sequential layers create latency and complexity. Without rigorous testing, this risks being “security theater.”

Open question: Does the additional complexity and latency justify the marginal performance gain over a simpler 2-layer architecture?

Phase 3: The Educational Bypass

Discovered that mechanisms designed to prevent false positives on educational queries were introducing false negatives (letting harmful requests through). This led to the hypothesis of “Intent Disambiguation” (see Research Question 2).


Research Gaps & Validation Plan

1. Data Representation Gap (Synthetic vs Real-World)

Current status: Validation relies on synthetic datasets.
Problem: Synthetic attacks often lack the linguistic variety (slang, code-switching, obfuscation) of real-world attacks.
Plan: We need to validate against real-world distributions, not just template-based attacks.

2. Statistical Validation Gap

Problem: With more than 300 patterns plus neural classifiers, standard significance testing requires multiple-comparison correction.
Gap: What power analysis methods are appropriate for adversarial ML evaluation with multiple defense layers?

3. Cross-Dataset Generalization

Status: solved.

4. Adversarial Robustness

Gap: No systematic adversarial testing has been performed.
Research Question: What is the “half-life” of a static pattern list when subjected to automated red-teaming (e.g., via PAIR or TAP)?


Invitation

I’m interested in collaborating with researchers who are grappling with these questions. I’m not claiming to have solved these problems—the system described above is a prototype for investigation.

Radical Honesty Position:
I’m willing to accept that the seven-layer architecture may be invalidated by the data. The ablation study may reveal that intermediate layers add negligible value. That would still be a valid research contribution. Demonstrating what doesn’t work is as important as demonstrating what does.

Where I’d benefit most from collaboration:

  • Researchers with institutional access to authentic attack datasets.

  • Statisticians working on uncertainty quantification.

  • Researchers familiar with current benchmarks (HarmBench, JailbreakBench).

  • Access to automated Red-Teaming frameworks.

What I can contribute:

  • Working implementation of the architecture.

  • Synthetic dataset generation framework.

  • Willingness to execute rigorous ablation studies and publish negative results.

If you are interested in exchanging ideas or collaborating on the validation of hybrid security architectures, please reach out.

1 Like

@sookoothaii This is the right posture: treat the architecture as a hypothesis and let ablation kill what doesn’t earn its keep.

Blunt take: “7 layers” isn’t the contribution. A firewall that can’t be independently replayed and verified becomes security theatre fast. The thing your post is implicitly asking for is not another detector — it’s a verification plane.

What I’ve been building under RFTSystems is basically that missing plane:

  1. Decision receipts (immutable, replayable)
    Every request emits a receipt capturing: canonicalised input, conversation-state digest, detector outputs + scores, chosen action (allow/clarify/refuse/safe-complete), and output hash. If you can’t produce this, you can’t do real science on your firewall — you can only debate it.

  2. Replay + drift diff harness
    Run the same suite across model versions / rule updates / classifier updates and produce a “what changed + why” diff. This turns “radical honesty” into something third parties can verify.

  3. Multi-turn security as trajectory, not single turns
    Crescendo-style attacks win because systems only classify the current prompt. You need to score the risk gradient across turns and log the path with an audit chain (not store everything forever — store security-relevant state with integrity).

If you want to make your ablation matrix actually meaningful, I’d recommend ablating by capability rather than by layer number:

  • Remove receipts/replay → can anyone reproduce results?
  • Remove multi-turn trajectory logging → does multi-turn ASR spike?
  • Remove tool/RAG constraints → do prompt injections start succeeding?
  • Replace everything with 2-plane (fast triage + strong judge) → do you lose anything measurable?

If you’re open to it, I can share a concrete receipt schema + replay approach that plugs into your existing pipeline so your paper can report:
ASR / over-refusal / clarification success rate / latency / and reproducibility score (third party rerun matches receipts).

Relevant prototypes (all public):

If you post (even roughly) the interface contract between your layers (what each layer outputs + how fusion happens), I’ll respond with a drop-in “receipt + replay” wrapper and an ablation harness structure that makes the results publishable (including negative results).
Liam @RFTSystems

1 Like

@RFTSystems Dead on. You articulated exactly the gap I’m feeling: without a verification plane, it’s just debate, not science.

I have zero attachment to the “7 layers” as a product—I want to break them as a hypothesis. If your replay harness proves that some of those layers are just latency theater, I want to be the one to publish that negative result. It’s surprisingly rare to find folks operating at this layer of the stack (vs just running static benchmarks), so I genuinely appreciate the depth here.

I’m 100% in on the Decision Receipt / Replay approach. That is the missing link for reproducibility.

TY

1 Like

@sookoothaii

Cheers — that’s exactly the attitude that produces real work. If the data says “half the layers are latency theatre”, publishing that is a win, not a loss.

Here’s what I propose.

Step 1: Receipt first (the reproducibility anchor)
Before we argue about layers, we define a tiny “Decision Receipt” that gets written on every request. No secrets, no code, just the facts needed to replay later. v0 can be as small as:

  • receipt_version
  • input_c14n (cleaned/canonicalised input)
  • state_digest (hash chain of security-relevant context, not the whole convo)
  • checks = {name, version, score, threshold, fired}
  • action = ALLOW / CLARIFY / SAFE_COMPLETE / REFUSE
  • reason_codes (fixed short list)
  • latency_ms (total + optional breakdown)
  • model_id + config_digest
  • output_hash (+ tool call hashes if tools exist)

If that exists, we stop debating and start measuring.

Step 2: Replay harness
Take (prompt suite + component versions + receipts) and rerun. If the decision changes, we don’t guess; we output a diff that shows exactly what changed and where.

Step 3: Ablate by “what capability did we lose?”
Not “layer 3 vs layer 4” — more like:

  • remove receipts/replay → reproducibility dies (baseline check)
  • remove multi-turn trajectory scoring → do grooming attacks spike?
  • remove the strong judge → do false positives/negatives swing?
  • collapse to 2-plane (fast triage + strong judge) → did we lose anything real or just latency?

If you’re up for it, we can do this without sharing internals: you can describe each layer in plain English using a tiny template (name / purpose / input / output / deterministic Y/N / rough latency), and I’ll map it cleanly to a receipt + replay plan.

I’ll post a concrete receipt schema example + what a diff report looks like next, so we’ve got something for an immediate test.
thank you, Liam @RFTSystems

1 Like

Self-censorship is what you’re doing. Honestly, I think it’s silly.

1 Like

For now, some resources:


You are converging on the right “unit of contribution.”

A multi-layer detector stack is easy to argue about and hard to verify. A verification plane (decision receipts + replay + drift diffs) turns the whole thing into something other people can re-run, audit, and falsify. That is the missing ingredient in most “LLM firewall” projects. JailbreakBench explicitly calls out non-reproducibility as a core benchmark failure mode and requires sharing artifacts for meaningful comparison. (JailbreakBench)

Below is what I would do in your position: build the verification plane first, then use it to kill layers that do not earn their keep.


1) Reframe the system into two planes

Plane A: Decisioning (your current “firewall”)

Anything that outputs an action like:

  • ALLOW
  • CLARIFY
  • SAFE_COMPLETE (answer but safety-shaped)
  • REFUSE

This plane can be complex. It can be wrong. It can evolve.

Plane B: Verification (the “science”)

Anything that makes Plane A measurable, replayable, and comparable:

  • Decision receipts (tamper-evident records of what happened)
  • Replay harness (run the same suite later with pinned versions)
  • Drift diffing (what changed, where, and why)

This plane is what makes negative results publishable.

HarmBench exists largely because “we tried X defense and it felt better” is not a reproducible evaluation story. It standardizes red-teaming evaluation and comparisons across attacks and defenses. (arXiv)


2) Decision Receipt v0: minimum fields that enable science

You want the smallest receipt that still supports:

  • deterministic replay where possible
  • drift attribution where replay is not deterministic
  • third-party verification that results match receipts

A practical v0 receipt schema:

{
  "receipt_version": "0.1",
  "ts": "2026-01-11T12:34:56Z",

  "input": {
    "c14n_method": ["unicode_nfkc", "ws_normalize"],
    "input_c14n": "...",
    "input_hash": "sha256:..."
  },

  "state": {
    "state_digest": "sha256:...", 
    "state_chain_prev": "sha256:...",
    "state_update_summary": ["risk_score:+0.12", "topic:chemistry"]
  },

  "pipeline": {
    "policy_version": "policy-2026-01-10",
    "config_digest": "sha256:...",
    "model_id": "provider/model@rev",
    "sampling": { "temperature": 0.0, "top_p": 1.0 }
  },

  "checks": [
    { "name": "regex_fastpath", "version": "3.2.1", "score": 0.91, "threshold": 0.85, "fired": true, "evidence": ["span:12-34"] },
    { "name": "judge_llm", "version": "gpt-4.x", "score": 0.77, "threshold": 0.80, "fired": false }
  ],

  "decision": {
    "action": "CLARIFY",
    "reason_codes": ["AMBIG_DUAL_USE", "MULTITURN_RISK_UP"],
    "calibration_bucket": "0.7-0.8"
  },

  "output": {
    "output_hash": "sha256:...",
    "tool_call_hashes": []
  },

  "latency_ms": {
    "total": 842,
    "breakdown": { "regex": 2, "embed": 8, "judge": 780 }
  },

  "integrity": {
    "signature": "ed25519:...",
    "public_key_id": "key-2026-01"
  }
}

Key design choices:

Canonicalization matters

If two labs cannot canonicalize the same input the same way, hashes diverge and replay is noisy. For JSON, you can lean on the JSON Canonicalization Scheme standard (RFC 8785). (RFC Editor)
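
A minimal sketch of content-addressing a receipt; note that json.dumps with sorted keys and tight separators only approximates RFC 8785 (JCS), so a real deployment should use a proper JCS implementation:

import hashlib
import json

def canonicalize(receipt: dict) -> bytes:
    # Approximation of canonical JSON: stable key order, no whitespace.
    return json.dumps(receipt, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def receipt_hash(receipt: dict) -> str:
    return "sha256:" + hashlib.sha256(canonicalize(receipt)).hexdigest()

r = {"receipt_version": "0.1", "decision": {"action": "CLARIFY"}}
print(receipt_hash(r))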

Receipts should be attestations, not logs

Treat the receipt like a supply-chain attestation:

  • signed
  • content-addressed (hashes)
  • append-only storage

The in-toto attestation format is a good mental model: authenticated metadata about artifacts. (GitHub)

Append-only storage is a solved problem in another domain

If you want “immutable, replayable receipts,” don’t invent it. Use transparency log ideas:

  • Sigstore Rekor provides a transparency log with inclusion proofs and verification tooling. (Sigstore)
  • This is closely related to the logic behind Certificate Transparency style append-only logs (conceptually). (Mako)

Net effect: third parties can verify you did not silently rewrite history.
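
Putting the two ideas together, a sketch of signing and hash-chaining a receipt with the 'cryptography' package (key management and the transparency log itself are out of scope; field names follow the v0 receipt above and the key id is illustrative):

import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical_bytes(receipt: dict) -> bytes:
    # Same approximation of RFC 8785 as in the canonicalization sketch above.
    return json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode("utf-8")

def chain_and_sign(receipt: dict, prev_receipt_hash: str, key: Ed25519PrivateKey) -> dict:
    # Link to the previous receipt before hashing, so history is tamper-evident.
    receipt.setdefault("state", {})["state_chain_prev"] = prev_receipt_hash
    payload = canonical_bytes(receipt)          # canonicalize before attaching integrity
    receipt["integrity"] = {
        "receipt_hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
        "signature": "ed25519:" + key.sign(payload).hex(),
        "public_key_id": "key-2026-01",         # resolved via a trust store at verify time
    }
    return receipt                               # verifiers strip "integrity" before re-hashing

key = Ed25519PrivateKey.generate()
signed = chain_and_sign({"decision": {"action": "ALLOW"}}, "sha256:genesis", key)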


3) Replay harness: make “what changed” a first-class artifact

A replay harness is not just “run tests again.”
It should emit diffs that localize drift:

  • Input unchanged, but:

    • regex fastpath changed score
    • judge model changed decision boundary
    • conversation-state digest changed because trajectory scoring changed
    • policy version changed

This aligns with the JailbreakBench philosophy: publish artifacts and standardize eval so results can be compared across time and labs. (JailbreakBench)

What you measure in replay

For each prompt (or conversation):

  • action drift rate (ALLOW → REFUSE etc.)
  • reason-code drift rate
  • score drift distributions per check
  • latency drift distributions
  • output drift (hash mismatch)

You want to be able to say:

“90% of drift came from the judge upgrade, 8% from new patterns, 2% from state aggregation.”

That is the difference between science and vibes.
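
A sketch of the kind of diff function that enables that attribution; field names follow the v0 receipt, and the 0.05 score-drift tolerance is an arbitrary illustration:

def diff_receipts(original: dict, replayed: dict) -> dict:
    """Localize drift between an original receipt and its replay."""
    drift = {}
    if original["decision"]["action"] != replayed["decision"]["action"]:
        drift["action"] = (original["decision"]["action"],
                           replayed["decision"]["action"])
    orig_checks = {c["name"]: c for c in original.get("checks", [])}
    for check in replayed.get("checks", []):
        before = orig_checks.get(check["name"])
        if before and abs(before["score"] - check["score"]) > 0.05:
            drift.setdefault("checks", {})[check["name"]] = (before["score"],
                                                             check["score"])
    if original["output"]["output_hash"] != replayed["output"]["output_hash"]:
        drift["output"] = "hash_mismatch"
    return drift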


4) Ablation by capability, not by layer number

Your collaborator’s point is correct: “Layer 3 removed” is not meaningful unless it maps to a capability.

Ablate like this:

A) Verification plane ablations

  • Remove receipts. Can anyone reproduce anything?
  • Remove signing / append-only. Can anyone trust the measurement history?

This is your “reproducibility baseline.”

B) Multi-turn capability ablations

  • Remove trajectory scoring. Does multi-turn ASR spike?
  • Remove state digesting and store only last-turn. What breaks?

This directly targets Crescendo-style attacks which are explicitly multi-turn escalations. (arXiv)

C) Tool and RAG injection ablations

  • Remove tool constraints. Does indirect prompt injection succeed?
  • Remove untrusted-content labeling. Do you see “confused deputy” style failures?

InjecAgent is directly about indirect prompt injection in tool-integrated agents and is a strong fit for this dimension. (arXiv)
The UK NCSC explicitly frames prompt injection risk as closer to a confused deputy class of vulnerability than “SQL injection but for prompts.” (NCSC)

D) Collapse-to-2-plane baseline

Run a very strong baseline:

  1. fast triage (cheap)
  2. strong judge (expensive)

Then show whether intermediate layers buy anything measurable:

  • ASR reduction
  • over-refusal reduction
  • clarification success improvement
  • latency cost

This is the cleanest way to expose “latency theater.”
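
In practice, each ablation run can be expressed as a set of capability toggles that gets folded into the receipt's config_digest, so the ablation itself is auditable. The capability names below are illustrative:

# Illustrative capability-toggle matrix; the active set should be hashed into
# config_digest so every receipt records which capabilities were enabled.
ABLATION_RUNS = {
    "full_stack":     {"trajectory_scoring": True,  "tool_constraints": True,  "strong_judge": True},
    "no_trajectory":  {"trajectory_scoring": False, "tool_constraints": True,  "strong_judge": True},
    "no_tool_limits": {"trajectory_scoring": True,  "tool_constraints": False, "strong_judge": True},
    "two_plane":      {"trajectory_scoring": False, "tool_constraints": False, "strong_judge": True},
}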


5) Benchmarks and leaderboards that match your research questions

You need a suite that covers:

  • harmful compliance
  • over-refusal
  • multi-turn grooming
  • prompt injection and agent/tool abuse
  • multimodal injection

Here is a pragmatic set.

Harmful compliance and “robust refusal”

HarmBench: standardized evaluation framework for automated red teaming and robust refusal. (arXiv)
JailbreakBench: benchmark + artifact repository + leaderboard, designed specifically to track attacks and defenses over time. (GitHub)

Over-refusal and context sensitivity

OR-Bench: measures over-refusal using “seemingly toxic but benign” prompts at scale. (arXiv)
CASE-Bench: context-aware safety evaluation using contextual integrity framing, explicitly arguing that context changes safety judgments and that benchmarks ignoring context mis-measure refusals. (arXiv)

These two directly support your “semantic boundary problem” (educational vs actionable intent).

Multi-turn attacks

Crescendo: explicit multi-turn jailbreak that escalates gradually and includes an automation tool. (arXiv)
Many-shot jailbreaking: long-context flooding attack family. (www-cdn.anthropic.com)

These map to your “trajectory not single-turn” framing.

Prompt injection, agents, tools

PromptShield: benchmark for deployable prompt injection detection, emphasizing performance in low false-positive regimes. (arXiv)
InjecAgent: benchmark for indirect prompt injection in tool-integrated agents, with measured vulnerabilities. (arXiv)
CyberSecEval 2: includes prompt injection and code interpreter abuse suites. (arXiv)

Multimodal prompt injection

MM-SafetyBench: multimodal safety benchmark focusing on image-based manipulations. (arXiv)
CyberSecEval 3 (visual prompt injection suite): explicit visual prompt injection benchmark dataset. (Hugging Face)

This addresses your “text-only firewall” limitation with a real benchmark path.


6) Automated adversarial testing: measure “pattern half-life” realistically

To measure how fast static rules decay, you need:

  • an automated attacker
  • a fixed budget (queries, time, tokens)
  • repeated rounds after each rule/model update

Relevant attack and red-teaming methods to integrate:

  • GCG-style adversarial suffix attacks: classic demonstration that optimized attacks beat naive pattern matching. (arXiv)
  • PAIR: black-box semantic jailbreak generation via iterative refinement. (GitHub)
  • TAP: tree-of-thought style refinement with pruning to reduce queries. (GitHub)

HarmBench’s paper and repo are also useful here because they compare many red-teaming methods and target defenses in a standardized way. (arXiv)


7) Defenses that are conceptually close to your hybrid idea

You mentioned SmoothLLM and certification-style defenses. Two concrete anchors:

  • SmoothLLM: randomized perturbations + aggregation to reduce jailbreak success, evaluated against several jailbreak methods. (arXiv)
  • Erase-and-Check (Certifying LLM Safety against Adversarial Prompting): token erasure probing with a safety filter, positioned as a certifiable defense framework. (arXiv)

Even if you do not adopt these, they give you:

  • baselines
  • evaluation framing
  • vocabulary reviewers recognize

8) Operationalizing the verification plane with existing observability standards

If you want other people to adopt your receipts, make them easy to emit and parse.

Two practical paths:

A) Use OpenTelemetry GenAI semantic conventions for trace alignment

OpenTelemetry has GenAI semantic conventions for operations and events. (OpenTelemetry)
You can map your receipt fields into trace events:

  • checks as structured events
  • action as span status + attributes
  • latency breakdown as metrics

This reduces “yet another schema” friction.
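
A sketch of that mapping with the opentelemetry-api package; apart from gen_ai.request.model, the attribute names are project-specific assumptions rather than fixed conventions:

from opentelemetry import trace

tracer = trace.get_tracer("llm_firewall")

def emit_decision_span(receipt: dict) -> None:
    with tracer.start_as_current_span("firewall.decision") as span:
        span.set_attribute("gen_ai.request.model", receipt["pipeline"]["model_id"])
        span.set_attribute("firewall.action", receipt["decision"]["action"])
        span.set_attribute("firewall.latency_ms", receipt["latency_ms"]["total"])
        for check in receipt["checks"]:
            # Each detector result becomes a structured event on the span.
            span.add_event("firewall.check", attributes={
                "name": check["name"], "score": check["score"], "fired": check["fired"],
            })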

B) Use existing eval and regression infrastructure

You do not need to build a whole UI first.
Use:

  • promptfoo for eval and red-teaming workflows. (GitHub)
  • Langfuse or LangSmith-style run comparison concepts for diffing runs (even if you self-host your own). (GitHub)

Then your unique contribution is:

  • receipt correctness
  • attestation integrity
  • security-specific metrics (ASR, over-refusal, clarification success, drift localization)

9) What I would do if this were my project

Step 1: Freeze an interface contract for every check

For each detector or layer, define:

  • input fields it sees
  • output: score, threshold, fired boolean, evidence spans, version id
  • deterministic yes/no
  • expected latency budget

This is what makes receipts complete.
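
For concreteness, the per-check contract could be as small as this (field names mirror the receipt's checks entries; nothing here is prescriptive):

from dataclasses import dataclass, field

@dataclass
class CheckResult:
    name: str                    # e.g. "regex_fastpath"
    version: str                 # pattern-set or model revision
    score: float                 # calibrated risk score in [0, 1]
    threshold: float             # firing threshold used for this run
    fired: bool                  # score >= threshold at decision time
    deterministic: bool          # True for regex, False for an LLM judge
    latency_ms: float
    evidence: list[str] = field(default_factory=list)  # e.g. ["span:12-34"]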

Step 2: Implement receipts and make them non-optional

No receipt, no response.

Store receipts as:

  • canonical JSON (RFC 8785 if JSON) (RFC Editor)
  • signed (ed25519 is fine)
  • appended to an append-only store (Rekor or your own Merkle log) (Sigstore)

Step 3: Build replay + diff before adding more detectors

Because you will immediately discover:

  • which detectors are noisy
  • which ones dominate decisions
  • where latency is actually spent

Step 4: Choose a benchmark suite aligned to your threat model

Minimum set I would start with:

  • HarmBench + JailbreakBench for harmful compliance (arXiv)
  • OR-Bench + CASE-Bench for over-refusal and context sensitivity (arXiv)
  • InjecAgent + PromptShield for injection and agents (arXiv)
  • Crescendo for multi-turn grooming (arXiv)
  • CyberSecEval 3 visual prompt injection if you want multimodal coverage (Hugging Face)

Step 5: Define metrics that make “clarify” a real action, not a hack

Track:

  • Clarification Success Rate: ambiguous → clarified → safe helpful completion
  • Clarification Bounce Rate: user gives up
  • Clarification Exploit Rate: attacker uses clarify loop to gain info

Then treat “clarify” as a policy decision with measurable trade-offs, not just UX.

Step 6: Publish negative results aggressively

If “regex layer provides 0.3% ASR improvement at 40ms cost,” publish it.
Most of the field needs this.


10) Common pitfalls that will quietly ruin the research

  1. Receipts that omit versions
    If check versions and model ids are missing, drift attribution is impossible.

  2. Storing the full conversation instead of security state
    You want a state digest that’s privacy-aware but still security-relevant.

  3. Non-determinism without accounting
    If your judge model is non-deterministic, you need to treat output as an observed artifact and compare decisions statistically, not pretend replay will match exactly.

  4. Overfitting to synthetic attacks
    Synthetic is fine for prototyping, but your “half-life of patterns” question requires automated adversaries (PAIR, TAP) and benchmark distributions (HarmBench, JailbreakBench). (GitHub)

  5. Confusing prompt injection with input toxicity
    Prompt injection is often a confused-deputy style privilege misuse problem, not just “bad words in the prompt.” The NCSC warning is worth internalizing. (NCSC)


Curated links (papers, benchmarks, tools)

Verification, provenance, tamper-evidence

Threat framing (prompt injection and agents)

Benchmarks and leaderboards

Multi-turn and long-context attacks

Automated attacks and red-teaming methods

Defenses

Practical eval infrastructure


Bottom line summary

  • Build decision receipts + replay + drift diffs first, then ablate everything else.
  • Ablate by capability (reproducibility, trajectory scoring, tool/RAG constraints), not by “layer number.”
  • Anchor evaluation on HarmBench + JailbreakBench + OR-Bench + CASE-Bench + InjecAgent + PromptShield + Crescendo, then expand to multimodal via CyberSecEval 3 visual prompt injection.
  • Treat prompt injection as confused deputy, not “toxicity detection.”
  • Publish negative results as primary output.

Quick bullets

  • Verification plane first. Receipts must be signed and replayable.
  • Drift diffs must localize why decisions changed.
  • Benchmarks must cover harmful compliance, over-refusal, multi-turn, injection, multimodal.
  • Automated attackers (PAIR/TAP/GCG families) are mandatory for “half-life” claims.

You call it censorship; I call it ‘preventing a prompt injection from exporting our database.’ :man_shrugging:

It’s the same logic as a network firewall. I don’t lock my front door to ‘censor’ visitors—I just prefer my furniture to stay inside the house.

TY for thinking about…

3 Likes

Information about the architecture:

Ports to Layers/Services Mapping

Port Mapping Table

Port | Service/Layer | Description | Variable in Script
8000 | Code Intent Service | Code execution detection | $CodeIntentPort
8001 | Orchestrator + iCRAFT L1-L4 | Central routing + governance | $OrchestratorPort
8002 | Persuasion Service | Social engineering detection | $PersuasionPort
8003 | Content Safety Service | 361 pattern-based detection | $ContentSafetyPort
8004 | Multimodal Service | Visual prompt injection | $MultimodalPort
8006 | Structural Divergence Service | Delimiter injection detection | $StructuralDivergencePort

7-Layer Architecture with Ports

  1. Perimeter Service: integrated in Orchestrator (8001)

  2. Orchestrator + iCRAFT L1-L4: Port 8001

  3. CORTEX_JUDGE: integrated in Orchestrator (8001)

  4. Code Intent Service: Port 8000

  5. Persuasion Service: Port 8002

  6. Content Safety Service: Port 8003

  7. Multimodal + Structural Divergence: Ports 8004 + 8006

Key Insights from the Script

CORTEX_JUDGE Integration

PowerShell:

$env:SLM_MODEL_NAME = "meta-llama/Llama-Guard-3-1B"
$env:CORTEX_JUDGE_ENABLED = "true"
  • Llama-Guard-3-1B is used as CORTEX_JUDGE

  • 100-300ms latency per CORTEX_JUDGE call

  • Integrated in Orchestrator, not a separate service

Multimodal Service

PowerShell:

$env:MULTIMODAL_USE_MOCK = "false"
$env:MULTIMODAL_OCR_MODE = "tiny"
  • Real OCR (not mock)

  • Visual Prompt Injection Detection

  • Port 8004

Structural Divergence

PowerShell:

cd detectors\orchestrator
uvicorn structural_divergence_api:app --host 127.0.0.1 --port $StructuralDivergencePort
  • Runs within the Orchestrator directory

  • Port 8006

  • Hexagonal Architecture

Summary

5 external services + 1 Orchestrator with integrated layers:

  • 8001: Orchestrator (contains Perimeter + iCRAFT + CORTEX_JUDGE)

  • 8000: Code Intent

  • 8002: Persuasion

  • 8003: Content Safety

  • 8004: Multimodal

  • 8006: Structural Divergence

1 Like

Technical Addendum: Orchestrator Architecture & Refactoring Roadmap

Current State: The “Fat Orchestrator” (Port 8001)
Reviewers will note that the current implementation of the Orchestrator service (Port 8001) violates the Separation of Concerns principle. Currently, this service handles three distinct responsibilities:

  1. Traffic Routing: Directing requests to specialized detectors.

  2. Governance Logic: Implementing iCRAFT L1-L4 policy checks.

  3. Model Inference: Hosting and executing the CORTEX_JUDGE (Llama-Guard-3-1B).

Rationale for Current Design
This overloading was a deliberate, temporary trade-off to maximize development velocity during the initial “Phase 1” prototyping. Tightly coupling the Judge model with the routing logic allowed for faster iteration on policy definitions without the overhead of managing distributed state or complex inter-service communication protocols.

Identified Risk
We acknowledge that hosting the synchronous CORTEX_JUDGE inference (approx. 100-300ms latency) within the router creates a significant performance bottleneck and blocking I/O issues. In a production or rigorous benchmark environment, this constitutes a Single Point of Failure.

Refactoring Plan (Phase 2)
To support the proposed “Verification Plane” and ensure scalable reproducibility, the Orchestrator is being refactored:

  • Decoupling: The CORTEX_JUDGE will be extracted into a dedicated microservice (provisionally Port 8005).

  • Asynchronous Processing: The Orchestrator will revert to a lightweight gateway role, dispatching requests asynchronously to the Judge only when simpler heuristic layers (e.g., Content Safety on Port 8003) have cleared the traffic.

  • Goal: Reduce Orchestrator latency to near-zero overhead and ensure it functions strictly as a routing and verification engine, not a compute node.

1 Like

@sookoothaii — thanks for the tone of your OP. The “try to break it” framing is exactly right. A multi-layer firewall only becomes meaningful once it’s treated as a falsifiable hypothesis with real ablations and real artifacts.

Also thanks to @John6666 — that breakdown is solid. The two-plane model (decisioning vs verification) is the right abstraction. You can iterate endlessly on detectors, but without receipts + replay + drift attribution you don’t have something others can actually evaluate.

To keep this thread grounded in evidence rather than architecture arguments, I’ve made a small verification plane you can actually run against your stack:

AuditPlane — LLM Decision Proofs

It provides:
• Ed25519-signed decision receipts
• Hash-chained runs (tamper-evident)
• Suite binding + stable case IDs
• Baseline validation (export is blocked if it fails)
• Replay + drift diffs
• Merkle roots + inclusion proofs
• Offline verifier bundle

You don’t need to publish your internal detectors to use it — layers just need to emit a standard check contract (name / version / score / threshold / fired / evidence / latency), which makes ablation and drift analysis possible without exposing IP or bypass maps.

For context, it’s part of a small verification-first collection that’s split into digestible labs so people aren’t overwhelmed by a monolithic stack:

If you want, I’m happy to help map your seven layers into a toggleable registry so you can ablate by capability and see exactly which parts are pulling their weight versus just adding latency. That’s the fastest way to turn this from a design debate into publishable results.

1 Like

Dear RFTSystems Team,

Our HAK_GAL LLM Security Firewall has successfully integrated your two components:

1. Agent Forensics Suite (RFTSystems-Agent-Forensics-Suite)

  • Status: Fully operational

  • Functions: Receipt generation, timeline verification

  • Implementation:

    • Adapter: integration/rftsystems_adapter.py

    • Flight recorder: integration/flight_recorder.py

    • SHA-256 receipts with hash-chain integrity

    • 100% receipt validation success rate

2. AuditPlane (RFTSystems/AuditPlane__LLM_Decision_Proofs)

  • Status: Fully operational

  • Functions: Baseline testing, replay verification, bundle export

  • Implementation:

    • Adapter: integration/auditplane_adapter.py

    • Runner: evaluation/baseline/baseline_test_runner.py

    • JailbreakBench dataset: 2/2 samples processed, baseline_valid: true

    • Bundle export: auditplane_bundle.zip generated

Cryptographic Verification
SHA-256 hashing, Ed25519 signatures, and Merkle tree root hashes verified.

Architecture: Modular adapters + flight recorder + artifact storage.

Next Steps: Extended baseline testing, differential analysis, compliance automation.

Your components provide critical independent validation for our security pipeline. The Gradio APIs were reliable and well-documented.

TY :slight_smile:

1 Like

UPDATED STATUS:

Verification Results:
INFO:integration.auditplane_adapter:Summary preview: {
"baseline_valid": true,
"run_id": "run-6f5bc5f089748d08",
"suite_digest": "sha256:949eb43e6e4da60cfb53e5fcad3ed12bbc8b3192083c214c602642142b647449",

}
INFO:main:Success: True
INFO:main: create_baseline: True
INFO:main: replay_baseline: True
INFO:main: export_bundle: True

All Systems Operational:
Agent Forensics Suite: Receipt generation + timeline verification
AuditPlane: Baseline testing + replay verification + bundle export
Fixed Infrastructure: All baselines regenerated with corrected proof path
Cryptographic Verification: SHA-256 + Ed25519 + Merkle trees validated

Verification Artifacts:
New Baselines: Generated with fixed AuditPlane (post-16:15 UK time)
Bundle Export: Complete verification packages
Replay Testing: 0 stable diffs as expected
Timeline Integrity: Hash-chain verification confirmed

2 Likes

Thank you for the thorough technical update.

I appreciate the rigor and the fact you approached this as a falsification exercise rather than a “looks good” demo. Getting teams to actually test, break, and independently verify is rare, so your approach genuinely stands out.

Quick heads-up for your records: AuditPlane had some early teething issues around Merkle proof handling (odd-leaf pathing). That has now been corrected and the Space is fully operational as of 16:15 UK time, Sunday 11 Jan 2026. Any baselines/bundles generated prior to that time should be treated as legacy and regenerated (as you’ve already done — your “0 stable diffs” replay result is exactly what we want to see).

If you have any additional objectives for Phase 3 (ReplayProof workflows, compliance bundles, performance consistency checks, ablation automation, etc.), I’m happy to collaborate. If there are edge cases you’d like covered (larger suites, randomised prompt order, cross-run comparability constraints, multi-key rotation scenarios), please don’t hesitate to send them over. We look forward to any and all future collaborations. Thank you again.

Gratefully,

Liam

@RFTSystems

2 Likes

First runs of my benchmark suites - your Agent Forensics Suite is fully integrated into my firewall:

PS D:\MCP Mods\HAK_GAL_HEXAGONAL\standalone_packages\llm-security-firewall> python evaluation/run_extended_evaluation_suites.py --suites harmbench,jailbreakbench,redbench,baseline --limit 50 --output-dir results/extended_all_suites3_20260112
Starting Extended Evaluation Suites: ['harmbench', 'jailbreakbench', 'redbench', 'baseline']
:file_folder: Output directory: results\extended_all_suites3_20260112

2026-01-12 06:20:00,806 - main - INFO - Starting extended evaluation with suites: ['harmbench', 'jailbreakbench', 'redbench', 'baseline']
2026-01-12 06:20:00,806 - main - INFO - Running implemented suites: ['harmbench', 'jailbreakbench', 'redbench', 'baseline']
2026-01-12 06:20:00,806 - main - INFO - Running HarmBench evaluation…
2026-01-12 06:20:00,808 - main - INFO - HarmBench evaluation requires GPU - returning placeholder results
2026-01-12 06:20:00,808 - main - INFO - Running JailbreakBench evaluation…
2026-01-12 06:20:03,241 - httpx - INFO - HTTP Request: GET https://huggingface.co/proxy/rftsystems-start-here-agent-forensics-suite.hf.space/config “HTTP/1.1 200 OK”
2026-01-12 06:20:04,476 - httpx - INFO - HTTP Request: GET https://huggingface.co/proxy/rftsystems-start-here-agent-forensics-suite.hf.space/gradio_api/info?serialize=False “HTTP/1.1 200 OK”
2026-01-12 06:20:04,479 - integration.rftsystems_adapter - INFO - RFTSystems client initialized successfully
2026-01-12 06:20:05,849 - httpx - INFO - HTTP Request: GET https://huggingface.co/proxy/rftsystems-start-here-agent-forensics-suite.hf.space/gradio_api/heartbeat/cbb989b7-9122-42e7-a0c8-e44a7b05387e “HTTP/1.1 200 OK”
2026-01-12 06:20:08,459 - evaluation.infrastructure.firewall_test_adapter - INFO - FirewallTestAdapter initialized: http://localhost:8001/api/v1/route-and-detect
2026-01-12 06:20:08,460 - evaluation.infrastructure.artifact_storage_adapter - INFO - ArtifactStorageAdapter initialized: results\extended_all_suites3_20260112\artifacts
2026-01-12 06:20:08,460 - evaluation.infrastructure.artifact_storage_adapter - INFO - RFTSystems receipt generation enabled
2026-01-12 06:20:08,461 - evaluation.benchmark_runners.jailbreakbench_runner - INFO - Starting JailbreakBench benchmark…
2026-01-12 06:20:08,461 - evaluation.benchmark_runners.jailbreakbench_runner - INFO - Generating JailbreakBench dataset…
2026-01-12 06:20:08,461 - evaluation.adapters.jailbreakbench_adapter - INFO - Loading JailbreakBench dataset…
2026-01-12 06:20:08,463 - evaluation.adapters.jailbreakbench_adapter - INFO - Loaded 33 jailbreak behaviors
2026-01-12 06:20:08,464 - evaluation.adapters.jailbreakbench_adapter - INFO - Converting 33 behaviors to JSONL format…
2026-01-12 06:20:08,464 - evaluation.adapters.jailbreakbench_adapter - INFO - Adding 16 benign samples…
2026-01-12 06:20:08,464 - evaluation.adapters.jailbreakbench_adapter - INFO - Created balanced dataset: 33 attacks, 16 benign
2026-01-12 06:20:08,464 - evaluation.benchmark_runners.jailbreakbench_runner - INFO - Testing 50 samples against firewall…
Loaded as API: https://huggingface.co/proxy/rftsystems-start-here-agent-forensics-suite.hf.space

TY-VM! Some minor issues still to solve :slight_smile:

2 Likes

As you requested:

Phase 3 Extension Support:

  • Cross-Environment Replay: Our deterministic replay works locally (Docker), but we need validation across different hardware configurations (e.g., CPU vs. GPU float determinism). Do you have standardized protocols for cross-platform replay verification?

  • Performance Consistency: We’re seeing ~185ms average latency for the full stack, but need tools to validate this under varying load conditions. Are there established benchmarks for LLM security system throughput stability?

Specific Technical Questions:

  1. Multi-Key Rotation: How should we handle RFTSystems receipt verification when cryptographic keys rotate during long-running evaluations? Do you recommend embedding key-IDs in the receipt header?

  2. Randomized Input Order: Our current replay assumes fixed sequences for state hashing. What’s the best practice for maintaining determinism when test inputs are shuffled (e.g., in async evaluations)?

  3. Ablation Strategy Validation:
    We have defined four distinct defense capabilities for the ablation matrix:

    • Content Safety (Pattern-based/Speed)

    • Structural Divergence (Syntax/Injection)

    • Neuro-Surgeon (Latent Space/RepE)

    • iCRAFT (Governance/Liability)

    Our Plan: We intend to ablate by capability to test the ‘Neuro-Stack’ (Latent Space) against the ‘Legacy-Stack’ (Patterns), while isolating iCRAFT to quantify its specific contribution as the governance safety net. Does this granularity align with your definition of “capability-based” ablation testing?

2 Likes

Joerg — thank you for the detailed integration report. This is exactly the kind of rigorous, falsifiable use of the stack that we were hoping to see.

It’s genuinely rare to get teams who are willing to instrument their safety pipeline all the way down to receipts, hash-chains and replay rather than stopping at metrics and dashboards. Your HAK_GAL setup is one of the first real-world deployments we’ve seen that treats security decisions as evidence, not opinions.

On AuditPlane: there were some early teething issues in the Merkle proof handling which have now been corrected and verified. The current deployment produces clean, strictly verifiable baselines, replays and bundles, as you’ve already confirmed on your side.

On your Phase 3 questions:

Cross-environment replay — We don’t require byte-identical execution across hardware. The contract is that stable fields (suite binding, input hashes, config digest, enabled layers, decision, output hash) must match, while run-bound fields (timestamps, run_id, chain hashes) may differ. That’s why AuditPlane reports STABLE_MATCH with separate run-bound diffs. This gives you cross-platform determinism without pretending floats are deterministic.
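
Roughly, the comparison looks like this (Python sketch; the exact field paths are illustrative, not the AuditPlane contract):

# Stable fields must match across environments; run-bound fields may differ.
STABLE_FIELDS = ["input.input_hash", "pipeline.config_digest",
                 "decision.action", "output.output_hash"]
RUN_BOUND_FIELDS = ["ts", "state.state_chain_prev"]

def get_path(receipt: dict, path: str):
    node = receipt
    for key in path.split("."):
        node = node[key]
    return node

def stable_match(original: dict, replayed: dict) -> bool:
    return all(get_path(original, f) == get_path(replayed, f) for f in STABLE_FIELDS)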

Performance consistency — Because latency and build fingerprints are included in every receipt, you can Merkle-root performance envelopes the same way you root decision outcomes. That allows regression and throughput drift to be verified, not just graphed.

Key rotation — Yes, key_id in the receipt header is the correct model. Trust stores are rotation-ready; verifiers resolve signatures by key_id at verification time. Long-running evaluations can safely span key rotations as long as all public keys remain in the trust store.
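
In sketch form, verification resolves the key by key_id against the trust store (the 'cryptography' package is used here purely for illustration; the trust-store layout is an assumption):

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

TRUST_STORE: dict[str, Ed25519PublicKey] = {}   # key_id -> public key, all rotations retained

def verify_receipt(payload: bytes, signature: bytes, key_id: str) -> bool:
    public_key = TRUST_STORE.get(key_id)
    if public_key is None:
        return False            # unknown key id: fail closed
    try:
        public_key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False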

Capability-based ablation — Your proposed split (Patterns / Latent / Governance) maps cleanly onto how AuditPlane hashes the enabled_layers and config_digest. That gives you mathematically provable ablations: if a capability is off, it is cryptographically reflected in the receipt chain.

If you want, the next logical step is to formalize a Phase-3 replay protocol (suite order, shuffle seeds, cross-run invariants) so your benchmark publications can be verified by any third party using only the exported bundles.
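
For example, a shuffle-tolerant suite digest can hash each case under its stable case id and then hash the sorted leaf hashes, so execution order does not matter (illustrative only, not the AuditPlane format):

import hashlib

def suite_digest(cases: dict[str, str]) -> str:
    """cases maps a stable case_id to its canonicalized case payload."""
    leaf_hashes = sorted(
        hashlib.sha256(f"{cid}:{payload}".encode()).hexdigest()
        for cid, payload in cases.items()
    )
    return "sha256:" + hashlib.sha256("".join(leaf_hashes).encode()).hexdigest()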

We’re happy to keep iterating with you — if you have edge cases, larger suites, or adversarial test plans you want to run, send them over.

— Liam
@RFTSystems

1 Like