๐ก๏ธ Aegis-ML
Adversarial Prompt Injection Detector โ LLM Firewall
Real-time guardrails powered by TF-IDF + LogReg (Phase 1), fine-tuned DistilBERT (Phase 2), or multi-task DeBERTa-v3 (Phase 3)
โ๏ธ Settings
๐ Guardrail Analysis
Analysis will appear here after each message.
๐จ Attack Examples
โ Benign Examples โ Should Pass
These 31 prompts scored below 0.70 on the TF-IDF classifier (Phase 1) but were correctly flagged by fine-tuned DistilBERT (Phase 2) and multi-task DeBERTa-v3 (Phase 3). Click any example to load it into the message field โ switch the classifier between sklearn, hf, and hf2 to see the difference live.
Cyrillic lookalike characters (ะต, ะฐ, ั) break TF-IDF token matching. HF reads subword tokens and catches the meaning regardless.
Benign document text dilutes the TF-IDF signal. HF attends across the full context and finds the payload buried inside.
No trigger keywords at all โ TF-IDF has nothing to match. HF understands the intent from natural language alone.
Template markers ([INST], <>, SYSTEM:) scatter tokens across structural scaffolding, diluting the TF-IDF score.
The payload sits inside a quote or completion frame. TF-IDF sees the framing tokens as dominant; HF catches the intent.
Reverse Proxy Flow:
Client Request
โ
[1] Input Guardrail
โข Phase 1: TF-IDF + Logistic Regression (sklearn)
โข Phase 2: Fine-tuned DistilBERT / DeBERTa-v3-small (hf)
โข Phase 3: Multi-task DeBERTa-v3, 15 categories (hf2 / onnx2)
โ (blocked if malicious โ 403 Forbidden)
[2] Canary Token Injection
โข Random unique token embedded in system prompt
โ
[3] Forward to Backend LLM (llama.cpp / Kimi-K2.5)
โ
[4] Output Guardrail
โข Canary token leak detection
โข PII redaction (SSN, credit card, email, phone)
โข Harmful content filter
โ
[5] Return cleaned response to client
Key security properties:
- Fail-secure: Any classifier error โ block the request
- Canary tokens: Detect successful injections in the output
- Configurable threshold: Default 0.70, tune to hit <5% FPR
- Full audit log: Every decision stored in SQLite