CMMC Expert 32B v2.0
Notice: These models are provided for proof-of-concept and testing purposes only. Production-grade models are not publicly shared. For inquiries regarding production models or commercial licensing, please contact the maintainer: Nathan Maine.
A locally-hosted, fine-tuned language model specialized in CMMC 2.0, NIST 800-171, NIST 800-53, NIST CSF, HIPAA, DFARS, and cybersecurity compliance frameworks.
This is the 32B variant — the deep analysis option for gap assessments, SSP drafting, and detailed implementation guidance. Its eval loss (1.073) is the best in the suite apart from the larger 72B. Part of a four-model suite (7B, 14B, 32B, 72B) sharing the same compliance knowledge base.
What's New in v2.0
- More training data — 18,747 total examples, up ~11% from 16,906 in v1.0
- 6 new authoritative sources — NIST SP 800-53 Rev. 5 full catalog, NIST SP 800-171 Rev. 3, NIST CSF 2.0, eCFR regulations (CMMC/DFARS/HIPAA), Federal Register documents, DoD PDFs
- Expanded LoRA coverage — All 7 transformer modules targeted (v1.0 used only 4)
- Near-best eval loss in suite — 1.073, 6% better than the 7B and 14B variants and second only to the 72B
- Automated data pipeline — Reproducible scraping, filtering, and deduplication via cmmc-data-pipeline
Quick Start (Ollama)
```bash
# Download and run
ollama pull Nathan-Maine/cmmc-expert-32b-v2.0

# Ask a compliance question
ollama run Nathan-Maine/cmmc-expert-32b-v2.0 "What access controls are required for CMMC Level 2?"

# Or call Ollama's REST API directly
curl http://localhost:11434/api/generate -d '{
  "model": "Nathan-Maine/cmmc-expert-32b-v2.0",
  "prompt": "What are the key differences between CMMC Level 1 and Level 2?",
  "stream": false
}'
```
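For programmatic use, the same endpoint works from Python. A minimal sketch, assuming the `requests` package and Ollama running on its default port:

```python
import requests

# Minimal client for Ollama's native /api/generate endpoint.
def ask_cmmc_expert(prompt: str, model: str = "Nathan-Maine/cmmc-expert-32b-v2.0") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # 32B inference can be slow on modest hardware
    )
    response.raise_for_status()
    return response.json()["response"]

print(ask_cmmc_expert("Which NIST 800-171 family covers multifactor authentication?"))
```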
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-32B-Instruct |
| Parameters | 32.5 billion |
| Fine-Tuning Method | QLoRA (4-bit NF4 base, LoRA rank 32, alpha 64) |
| Quantization | q4_k_m (GGUF) |
| File Size | 18.9 GB |
| Context Length | 32,768 tokens |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Training Time | ~9.6 hours |
| Training Framework | HuggingFace TRL + PEFT + bitsandbytes |
Security Domain Coverage
Models are fine-tuned for complete security domain coverage, including vulnerability analysis, incident response scenarios, and access control failure modes required for professional SSP and POA&M generation. Behavioral guardrails and policy enforcement are handled at the governed-llm-gateway layer.
Base model migration to Meta Llama 3.1/3.3 (US-origin, open weights) is in progress.
Compliance Framework Coverage
Trained across eight overlapping frameworks to support cross-framework mapping:
| Framework | Coverage |
|---|---|
| CMMC 2.0 (32 CFR Part 170) | All three levels — 17 L1 practices, 110 L2, 134 L3, assessment methodology |
| NIST SP 800-171 Rev. 2 & 3 | 110 security requirements across 14 families |
| NIST SP 800-172 | Enhanced security requirements for critical CUI programs |
| NIST SP 800-53 Rev. 5 | Full catalog of 1,189 controls across 20 families |
| NIST SP 800-37 | Risk Management Framework (RMF) steps and authorization |
| NIST CSF 2.0 | Govern, Identify, Protect, Detect, Respond, Recover functions |
| HIPAA Security Rule | Administrative, physical, and technical safeguards |
| DFARS Clauses | 252.204-7008/7009/7012/7019/7020/7021/7024/7025, 252.239-7009/7010 |
Training Data
14,906 training + 3,841 validation examples (~4.5M tokens) assembled from 11 curated sources:
v1.0 Legacy Sources (13,434 examples)
| Source | Examples | Share |
|---|---|---|
| NIST Cybersecurity (filtered from 424K) | 6,372 | 33.9% |
| CMMC Full | 4,787 | 25.5% |
| CMMC Balanced | 994 | 5.3% |
| HIPAA Compliance | 961 | 5.1% |
| CMMC Core | 320 | 1.7% |
v2.0 New Sources (1,841 examples via automated pipeline)
| Source | Examples | Share |
|---|---|---|
| NIST CSRC (SP 800-53 Rev. 5 controls) | 773 | 4.1% |
| DoD Documents (PDFs) | 519 | 2.8% |
| Federal Register | 350 | 1.9% |
| eCFR Regulations (CMMC/DFARS/HIPAA) | 75 | 0.4% |
| NIST SP 800-171 Rev. 3 | 63 | 0.3% |
| NIST CSF 2.0 | 61 | 0.3% |
v2.0 Data Processing Pipeline:
- Automated scraping — 6 authoritative sources scraped via dedicated modules
- Relevance filtering — eCFR filtered to only CMMC-relevant DFARS clauses (252.204-70xx, 252.239-70xx), CMMC (32 CFR 170), and HIPAA (45 CFR 164)
- Format conversion — Raw records converted to chat-style instruction/response pairs
- Quality filtering — Removed entries <100 chars, entries >8,000 chars, OCR artifacts
- Deduplication — Exact dedup (xxhash) + near-dedup (MinHash LSH, 128 permutations, Jaccard 0.8 threshold, 5-gram shingles); see the sketch after this list
- Cross-version dedup — v2.0 records deduplicated against v1.0 corpus to prevent overlap
- Validation split — 80/20 stratified split maintaining source distribution
Pipeline source code: github.com/NathanMaine/cmmc-data-pipeline
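For illustration, here is a condensed sketch of the quality-filter and dedup stages described above, using `xxhash` and `datasketch`. The helper names and structure are assumptions based on the bullet list, not the pipeline's actual code:

```python
import xxhash
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # MinHash permutations, per the pipeline description
JACCARD_THRESHOLD = 0.8  # near-duplicate cutoff
SHINGLE_SIZE = 5         # 5-gram shingles

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature over word 5-gram shingles."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - SHINGLE_SIZE + 1, 1)):
        shingle = " ".join(words[i : i + SHINGLE_SIZE])
        m.update(shingle.encode("utf-8"))
    return m

def filter_and_dedup(records: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    seen_hashes, kept = set(), []
    for idx, text in enumerate(records):
        if not (100 <= len(text) <= 8000):  # length-based quality filter
            continue
        exact = xxhash.xxh64(text.encode("utf-8")).hexdigest()  # exact dedup
        if exact in seen_hashes:
            continue
        sig = minhash_of(text)
        if lsh.query(sig):  # near-duplicate of an already-kept record
            continue
        seen_hashes.add(exact)
        lsh.insert(str(idx), sig)
        kept.append(text)
    return kept
```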
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 1e-4 (cosine decay) |
| Warmup | 5% of steps |
| Optimizer | 8-bit AdamW |
| Batch Size | 1 (effective 16 with gradient accumulation x16) |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 |
| Packing | Enabled |
| Base Quantization | 4-bit NF4 with double quantization |
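As a rough guide to how these values map onto the HuggingFace stack, here is an illustrative configuration sketch. It is not the project's actual training script, and `paged_adamw_8bit` is one plausible spelling of the 8-bit AdamW optimizer listed above:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# QLoRA base: 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter over all 7 transformer projection modules
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="outputs",
    num_train_epochs=3,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=1,      # effective batch 16 via accumulation
    gradient_accumulation_steps=16,
    max_seq_length=2048,                # field name varies across TRL versions
    packing=True,
)
```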
Evaluation Results
Training Metrics
| Metric | Value |
|---|---|
| Final Train Loss | 1.005 |
| Average Train Loss | 1.139 |
| Eval Loss at Epoch 1 | 1.128 |
| Final Eval Loss (Epoch 2) | 1.073 |
| Best Mean Token Accuracy | 77.9% |
| Final Mean Token Accuracy | 76.8% |
| Total Training Steps | 561 |
| Tokens Processed | ~18M |
Cross-Model Comparison (v2.0 Suite)
| Model | Eval Loss | Token Accuracy | GGUF Size | Training Time |
|---|---|---|---|---|
| 7B | 1.142 | 76.5% | 5.1 GB | 3.1 hrs |
| 14B | 1.144 | 75.9% | ~10 GB | 5.7 hrs |
| 32B | 1.073 | 77.9% | 18.9 GB | 9.6 hrs |
| 72B | 1.048 | — | 45 GB | 13.0 hrs |
Within the suite, the 32B model's eval loss of 1.073 beats the 7B and 14B variants by 6% and trails only the 72B (1.048) at well under half its size, making it a strong choice for tasks requiring deep reasoning such as gap assessments, SSP control narratives, and cross-framework mapping.
Intended Uses
- SSP Drafting — Draft detailed System Security Plan control descriptions with NIST/CMMC citations. The 32B model excels at producing thorough, multi-paragraph narratives for each control family.
- Gap Analysis — Identify controls required for specific CMMC levels and contract requirements with deep contextual reasoning across related controls.
- Assessment Prep — Generate evidence checklists and assessment objective narratives with nuanced implementation guidance.
- Cross-Framework Mapping — Map controls between CMMC, NIST 800-53, HIPAA, and DFARS with detailed justifications.
- Detailed Implementation Guidance — Provide step-by-step implementation plans with consideration of dependencies, resource requirements, and common pitfalls.
- Policy Drafting — Create policies aligned to specific CMMC practices with appropriate depth and specificity.
- DFARS Clause Analysis — Identify requirements from contract language.
- Regulatory Research — Understand eCFR regulations and Federal Register guidance.
- Training & Education — Always-available compliance reference for teams.
Limitations
- Not a substitute for qualified compliance professionals. This model is a tool to accelerate compliance work, not replace human judgment.
- Knowledge cutoff. The model's knowledge is based on training data available at the time of fine-tuning (February 2026). Always verify against current published frameworks.
- No retrieval augmentation. The model generates responses from trained knowledge only — it does not search or retrieve external documents at inference time.
- Citation accuracy. While the model generally cites correct control numbers and framework sections, always verify specific citations against authoritative sources.
Out-of-Scope Uses
- Legal advice. This model does not provide legal opinions on compliance status.
- Automated compliance certification. CMMC certification requires human assessors (C3PAOs).
- Processing actual CUI/ITAR data. The model itself does not process or store sensitive data, but users should follow their organization's data handling policies.
Hardware Requirements
| Mode | GPU (VRAM) | CPU-Only (RAM) | Storage |
|---|---|---|---|
| Inference | 24 GB | 32 GB | 20 GB |
| Training | 80 GB+ | N/A | 80 GB |
Supported OS: Linux, macOS, Windows (WSL2)
The Model Suite
This is the 32B model — the deep analysis option for gap assessments, SSP drafting, and detailed implementation guidance. The full suite includes:
| Model | Parameters | GGUF Size | Eval Loss | Best For |
|---|---|---|---|---|
| cmmc-expert-7b-v2.0 | 7.6B | 5.1 GB | 1.142 | Quick lookups, day-to-day queries |
| cmmc-expert-14b-v2.0 | 14.7B | ~10 GB | 1.144 | Detailed analysis, multi-control reasoning |
| cmmc-expert-32b-v2.0 | 32.5B | 18.9 GB | 1.073 | Deep gap assessments, SSP drafting |
| cmmc-expert-72b-v2.0 | 72.7B | 45 GB | 1.048 | Complex multi-framework analysis |
Source Code
- Model training & evaluation: github.com/NathanMaine/cmmc-compliance-ai-model
- Data pipeline: github.com/NathanMaine/cmmc-data-pipeline
Known Issues
- Repetition bug — The model may repeat content, lists, or entire sections multiple times within a single response. This is a known training artifact being addressed in future versions; see the inference-time workaround below.
- Verbose responses — Tends to over-explain in some contexts where a concise answer would be more appropriate.
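Until the repetition artifact is resolved, raising Ollama's `repeat_penalty` option at inference time can reduce it. A suggested workaround sketch; the values shown are untested starting points, not settings from the model card:

```python
import requests

# Workaround for the known repetition artifact: raise repeat_penalty
# and cap output length via Ollama's generation options.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "Nathan-Maine/cmmc-expert-32b-v2.0",
        "prompt": "Summarize the CMMC Level 2 access control requirements.",
        "stream": False,
        "options": {"repeat_penalty": 1.2, "num_predict": 1024},
    },
    timeout=300,
)
print(response.json()["response"])
```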
Citation
```bibtex
@misc{maine2026cmmcexpert,
  title={CMMC Expert v2.0: Fine-Tuned Language Models for Cybersecurity Compliance},
  author={Nathan Maine},
  year={2026},
  url={https://github.com/NathanMaine/cmmc-compliance-ai-model}
}
```
Contact
- Author: Nathan Maine
- Website: nathanmaine.com
- LinkedIn: linkedin.com/in/nathanmaine
- Email: nmaine@gmail.com