๐Ÿ›ก๏ธ Aegis-ML

Adversarial Prompt Injection Detector โ€” LLM Firewall

Real-time guardrails powered by TF-IDF + LogReg (Phase 1), fine-tuned DistilBERT (Phase 2), or multi-task DeBERTa-v3 (Phase 3)

โš™๏ธ Settings

Mode

Demo: local classifier only. API: full proxy pipeline.

Classifier

sklearn = TF-IDF (Phase 1) ยท hf = DistilBERT (Phase 2) ยท hf2 = DeBERTa Multi-Task (Phase 3) ยท onnx2 = Phase 3 INT8 ONNX

๐Ÿ“Š Guardrail Analysis

Analysis will appear here after each message.

๐Ÿšจ Attack Examples

โœ… Benign Examples โ€” Should Pass

These 31 prompts scored below 0.70 on the TF-IDF classifier (Phase 1) but were correctly flagged by fine-tuned DistilBERT (Phase 2) and multi-task DeBERTa-v3 (Phase 3). Click any example to load it into the message field โ€” switch the classifier between sklearn, hf, and hf2 to see the difference live.

Cyrillic lookalike characters (ะต, ะฐ, ั–) break TF-IDF token matching. HF reads subword tokens and catches the meaning regardless.

Reverse Proxy Flow:

Client Request
    โ†“
[1] Input Guardrail
    โ€ข Phase 1: TF-IDF + Logistic Regression (sklearn)
    โ€ข Phase 2: Fine-tuned DistilBERT / DeBERTa-v3-small (hf)
    โ€ข Phase 3: Multi-task DeBERTa-v3, 15 categories (hf2 / onnx2)
    โ†“ (blocked if malicious โ€” 403 Forbidden)
[2] Canary Token Injection
    โ€ข Random unique token embedded in system prompt
    โ†“
[3] Forward to Backend LLM (llama.cpp / Kimi-K2.5)
    โ†“
[4] Output Guardrail
    โ€ข Canary token leak detection
    โ€ข PII redaction (SSN, credit card, email, phone)
    โ€ข Harmful content filter
    โ†“
[5] Return cleaned response to client

Key security properties:

  • Fail-secure: Any classifier error โ†’ block the request
  • Canary tokens: Detect successful injections in the output
  • Configurable threshold: Default 0.70, tune to hit <5% FPR
  • Full audit log: Every decision stored in SQLite