CMMC Expert 14B v2.0
Notice: These models are provided for proof-of-concept and testing purposes only. Production-grade models are not publicly shared. For inquiries regarding production models or commercial licensing, please contact the maintainer: Nathan Maine.
A locally-hosted, fine-tuned language model specialized in CMMC 2.0, NIST 800-171, NIST 800-53, NIST CSF, HIPAA, DFARS, and cybersecurity compliance frameworks.
This is the 14B variant — balanced speed and reasoning depth for detailed compliance analysis. Part of a four-model suite (7B, 14B, 32B, 72B) sharing the same compliance knowledge base.
What's New in v2.0
- ~11% more training data: 18,747 total examples (up from 16,906 in v1.0)
- 6 new authoritative sources: NIST SP 800-53 Rev. 5 full catalog, NIST SP 800-171 Rev. 3, NIST CSF 2.0, eCFR regulations (CMMC/DFARS/HIPAA), Federal Register documents, and DoD PDFs
- Expanded LoRA coverage: all 7 transformer projection modules targeted (v1.0 used only 4)
- Improved eval loss: 1.144, down from 1.250 in v1.0
- Automated data pipeline: reproducible scraping, filtering, and deduplication via cmmc-data-pipeline
Quick Start (Ollama)
```shell
# Download the model
ollama pull Nathan-Maine/cmmc-expert-14b-v2.0

# Ask a compliance question
ollama run Nathan-Maine/cmmc-expert-14b-v2.0 "What access controls are required for CMMC Level 2?"

# Or query Ollama's local HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "Nathan-Maine/cmmc-expert-14b-v2.0",
  "prompt": "What are the key differences between CMMC Level 1 and Level 2?",
  "stream": false
}'
```
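For scripted use, the same `/api/generate` call can be made from Python with the standard library alone. A minimal sketch; it assumes an Ollama server on the default port, and `build_payload`/`ask` are illustrative names, not part of any shipped client:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt, stream=False):
    # Mirrors the fields in the curl example above
    return {"model": model, "prompt": prompt, "stream": stream}

def ask(prompt, model="Nathan-Maine/cmmc-expert-14b-v2.0", url=OLLAMA_URL):
    """POST a prompt to a locally running Ollama server and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
# print(ask("What access controls are required for CMMC Level 2?"))
```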
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-14B-Instruct |
| Parameters | 14.7 billion |
| Fine-Tuning Method | QLoRA (4-bit NF4 base, LoRA rank 64, alpha 128) |
| Quantization | q5_k_m (GGUF) |
| File Size | 9.8 GB |
| Context Length | 32,768 tokens |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Training Time | ~6.5 hours |
| Training Framework | HuggingFace TRL + PEFT + bitsandbytes |
Security Domain Coverage
Models are fine-tuned for complete security domain coverage, including vulnerability analysis, incident response scenarios, and access control failure modes required for professional SSP and POA&M generation. Behavioral guardrails and policy enforcement are handled at the governed-llm-gateway layer.
Base model migration to Meta Llama 3.1/3.3 (US-origin, open weights) is in progress.
Compliance Framework Coverage
Trained across eight overlapping frameworks to support cross-framework mapping:
| Framework | Coverage |
|---|---|
| CMMC 2.0 (32 CFR Part 170) | All three levels — 15 L1 practices (final rule), 110 L2, 134 L3, assessment methodology |
| NIST SP 800-171 Rev. 2 & 3 | 110 requirements across 14 families (Rev. 2); 97 requirements across 17 families (Rev. 3) |
| NIST SP 800-172 | Enhanced security requirements for critical CUI programs |
| NIST SP 800-53 Rev. 5 | Full catalog of 1,189 controls across 20 families |
| NIST SP 800-37 | Risk Management Framework (RMF) steps and authorization |
| NIST CSF 2.0 | Govern, Identify, Protect, Detect, Respond, Recover functions |
| HIPAA Security Rule | Administrative, physical, and technical safeguards |
| DFARS Clauses | 252.204-7008/7009/7012/7019/7020/7021/7024/7025, 252.239-7009/7010 |
Training Data
14,906 training + 3,841 validation examples (~4.5M tokens) assembled from 11 curated sources:
v1.0 Legacy Sources (13,434 examples)
| Source | Examples | Share |
|---|---|---|
| NIST Cybersecurity (filtered from 424K) | 6,372 | 33.9% |
| CMMC Full | 4,787 | 25.5% |
| CMMC Balanced | 994 | 5.3% |
| HIPAA Compliance | 961 | 5.1% |
| CMMC Core | 320 | 1.7% |
v2.0 New Sources (1,841 examples via automated pipeline)
| Source | Examples | Share |
|---|---|---|
| NIST CSRC (SP 800-53 Rev. 5 controls) | 773 | 4.1% |
| DoD Documents (PDFs) | 519 | 2.8% |
| Federal Register | 350 | 1.9% |
| eCFR Regulations (CMMC/DFARS/HIPAA) | 75 | 0.4% |
| NIST SP 800-171 Rev. 3 | 63 | 0.3% |
| NIST CSF 2.0 | 61 | 0.3% |
v2.0 Data Processing Pipeline:
- Automated scraping — 6 authoritative sources scraped via dedicated modules
- Relevance filtering — eCFR filtered to only CMMC-relevant DFARS clauses (252.204-70xx, 252.239-70xx), CMMC (32 CFR 170), and HIPAA (45 CFR 164)
- Format conversion — Raw records converted to chat-style instruction/response pairs
- Quality filtering — Removed entries <100 chars, entries >8,000 chars, OCR artifacts
- Deduplication — Exact dedup (xxhash) + near-dedup (MinHash LSH, 128 permutations, Jaccard 0.8 threshold, 5-gram shingles)
- Cross-version dedup — v2.0 records deduplicated against v1.0 corpus to prevent overlap
- Validation split — 80/20 stratified split maintaining source distribution
Pipeline source code: github.com/NathanMaine/cmmc-data-pipeline
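The two dedup stages above can be sketched in a few lines. The real pipeline uses xxhash for exact hashing and MinHash LSH (via locality-sensitive indexing) for scalable near-dedup; this stdlib-only sketch substitutes SHA-256 and brute-force Jaccard over 5-gram shingles, which yields the same keep/drop decisions at small scale. Function names are illustrative:

```python
import hashlib

def shingles(text, n=5):
    # Word-level 5-gram shingles (the pipeline's exact tokenization may differ)
    words = text.lower().split()
    if not words:
        return set()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(records, threshold=0.8):
    """Drop exact duplicates, then near-duplicates above the Jaccard threshold."""
    seen_hashes = set()
    kept = []  # list of (text, shingle set) for the surviving records
    for text in records:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()  # exact dedup (pipeline: xxhash)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        sh = shingles(text)
        if any(jaccard(sh, ks) >= threshold for _, ks in kept):
            continue  # near-duplicate of something already kept
        kept.append((text, sh))
    return [t for t, _ in kept]
```

Swapping the inner `any(...)` loop for an LSH index is what makes this tractable at corpus scale, since it avoids comparing every pair of records.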
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 2e-4 (cosine decay) |
| Warmup | 5% of steps |
| Optimizer | 8-bit AdamW |
| Batch Size | 4 (effective 32 with gradient accumulation x8) |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 |
| Packing | Enabled |
| Base Quantization | 4-bit NF4 with double quantization |
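As a sanity check, the reported step count can be reproduced from the table's values. A back-of-the-envelope sketch in plain Python; with packing enabled the true optimizer step count can differ, though the arithmetic here lines up with the ~1,398 steps reported under Evaluation Results:

```python
import math

# Values from the training configuration and data tables above
train_examples = 14_906   # training split size
per_device_batch = 4
grad_accum = 8
epochs = 3
warmup_fraction = 0.05    # "5% of steps"

effective_batch = per_device_batch * grad_accum                 # 32
steps_per_epoch = math.ceil(train_examples / effective_batch)   # 466
total_steps = steps_per_epoch * epochs                          # 1398
warmup_steps = round(warmup_fraction * total_steps)             # 70
```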
Evaluation Results
Training Metrics
| Metric | Value |
|---|---|
| Final Train Loss | 1.009 (best: 0.966) |
| Average Train Loss | 1.151 |
| Final Eval Loss | 1.144 |
| Mean Token Accuracy | 77.7% |
| Total Training Steps | ~1,398 |
| Tokens Processed | ~18M |
Training Curve (Selected Steps)
| Step | Epoch | Train Loss | Token Accuracy |
|---|---|---|---|
| 200 | ~0.4 | 1.250 | 72.5% |
| 500 | ~1.1 | 1.150 | 75.0% |
| 800 | ~1.7 | 1.080 | 76.5% |
| 1100 | ~2.4 | 1.020 | 77.2% |
| 1398 | 3.0 | 1.009 | 77.7% |
v1.0 vs v2.0 Comparison
| Metric | v1.0 | v2.0 | Change |
|---|---|---|---|
| Training Examples | 13,434 | 14,906 | +11% |
| Validation Examples | 3,472 | 3,841 | +11% |
| Eval Loss | 1.250 | 1.144 | -8.5% (better) |
| LoRA Target Modules | 4 | 7 | +75% coverage |
| Data Sources | 5 | 11 | +6 new sources |
Intended Uses
- SSP Generation — Draft System Security Plan control descriptions with NIST/CMMC citations
- Gap Analysis — Identify controls required for specific CMMC levels and contract requirements
- Assessment Prep — Generate evidence checklists and assessment objective narratives
- Cross-Framework Mapping — Map controls between CMMC, NIST 800-53, HIPAA, and DFARS
- Policy Drafting — Create policies aligned to specific CMMC practices
- DFARS Clause Analysis — Identify requirements from contract language
- Regulatory Research — Understand eCFR regulations and Federal Register guidance
- Training & Education — Always-available compliance reference for teams
Limitations
- Not a substitute for qualified compliance professionals. This model is a tool to accelerate compliance work, not replace human judgment.
- Knowledge cutoff. The model's knowledge is based on training data available at the time of fine-tuning (February 2026). Always verify against current published frameworks.
- 14B trade-off. Provides stronger reasoning than the 7B but requires more VRAM. For maximum reasoning depth on complex multi-framework analysis, consider the 32B or 72B variants.
- No retrieval augmentation. The model generates responses from trained knowledge only — it does not search or retrieve external documents at inference time.
- Citation accuracy. While the model generally cites correct control numbers and framework sections, always verify specific citations against authoritative sources.
Out-of-Scope Uses
- Legal advice. This model does not provide legal opinions on compliance status.
- Automated compliance certification. CMMC certification requires human assessors (C3PAOs).
- Processing actual CUI/ITAR data. The model itself does not process or store sensitive data, but users should follow their organization's data handling policies.
Hardware Requirements
| Mode | GPU (VRAM) | CPU-Only (RAM) | Storage |
|---|---|---|---|
| Inference | 16 GB | 24 GB | 15 GB |
| Training | 24 GB+ | N/A | 50 GB |
Supported OS: Linux, macOS, Windows (WSL2)
The Model Suite
This is the 14B model — balanced speed and reasoning depth for detailed compliance analysis. The full suite includes:
| Model | Parameters | GGUF Size | Best For |
|---|---|---|---|
| cmmc-expert-7b-v2.0 | 7.6B | 5.1 GB | Quick lookups, day-to-day queries |
| cmmc-expert-14b-v2.0 | 14.7B | 9.8 GB | Detailed analysis, multi-control reasoning |
| cmmc-expert-32b-v2.0 | 32.5B | ~19 GB | Deep gap assessments, SSP drafting |
| cmmc-expert-72b-v2.0 | 72.7B | ~42 GB | Complex multi-framework analysis |
Source Code
- Model training & evaluation: github.com/NathanMaine/cmmc-compliance-ai-model
- Data pipeline: github.com/NathanMaine/cmmc-data-pipeline
Known Issues
- Repetition bug — The model may repeat content, lists, or entire sections multiple times within a single response. This is a known training artifact being addressed in future versions.
- Verbose responses — Tends to over-explain in some contexts where a concise answer would be more appropriate.
Citation
```bibtex
@misc{maine2026cmmcexpert,
  title={CMMC Expert v2.0: Fine-Tuned Language Models for Cybersecurity Compliance},
  author={Nathan Maine},
  year={2026},
  url={https://github.com/NathanMaine/cmmc-compliance-ai-model}
}
```
Contact
- Author: Nathan Maine
- Website: nathanmaine.com
- LinkedIn: linkedin.com/in/nathanmaine
- Email: nmaine@gmail.com