CMMC Expert 14B v2.0

Notice: These models are provided for proof-of-concept and testing purposes only. Production-grade models are not publicly shared. For inquiries regarding production models or commercial licensing, please contact the maintainer: Nathan Maine.

A locally-hosted, fine-tuned language model specialized in CMMC 2.0, NIST 800-171, NIST 800-53, NIST CSF, HIPAA, DFARS, and cybersecurity compliance frameworks.

This is the 14B variant — balanced speed and reasoning depth for detailed compliance analysis. Part of a four-model suite (7B, 14B, 32B, 72B) sharing the same compliance knowledge base.

What's New in v2.0

  • 11% more training data — 18,747 total examples (up from 16,906 in v1.0)
  • 6 new authoritative sources — NIST SP 800-53 Rev. 5 full catalog, NIST SP 800-171 Rev. 3, NIST CSF 2.0, eCFR regulations (CMMC/DFARS/HIPAA), Federal Register documents, DoD PDFs
  • Expanded LoRA coverage — all 7 transformer projection modules targeted (v1.0 used only 4)
  • Improved eval loss — 1.144 (down from 1.250 in v1.0)
  • Automated data pipeline — reproducible scraping, filtering, and deduplication via cmmc-data-pipeline

Quick Start (Ollama)

```shell
# Download and run
ollama pull Nathan-Maine/cmmc-expert-14b-v2.0

# Ask a compliance question
ollama run cmmc-expert-14b-v2.0 "What access controls are required for CMMC Level 2?"

# Or use Ollama's native REST API
curl http://localhost:11434/api/generate -d '{
  "model": "cmmc-expert-14b-v2.0",
  "prompt": "What are the key differences between CMMC Level 1 and Level 2?",
  "stream": false
}'
```
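The same request can be made from code. The sketch below reproduces the curl call above using only the Python standard library; `build_generate_request` and `ask` are illustrative helper names (not part of Ollama), and `ask` assumes an Ollama server is running locally on the default port:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default native endpoint

def build_generate_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON payload expected by Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def ask(prompt: str, model: str = "cmmc-expert-14b-v2.0") -> str:
    """Send a prompt to a locally running Ollama server and return the response text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=false, Ollama returns a single JSON object with a "response" field
        return json.loads(resp.read())["response"]

# Example (requires a running server):
# print(ask("What access controls are required for CMMC Level 2?"))
```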

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-14B-Instruct |
| Parameters | 14.7 billion |
| Fine-Tuning Method | QLoRA (4-bit NF4 base, LoRA rank 64, alpha 128) |
| Quantization | q5_k_m (GGUF) |
| File Size | 9.8 GB |
| Context Length | 32,768 tokens |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Training Time | ~6.5 hours |
| Training Framework | HuggingFace TRL + PEFT + bitsandbytes |

Security Domain Coverage

Models are fine-tuned for complete security domain coverage, including vulnerability analysis, incident response scenarios, and access control failure modes required for professional SSP and POA&M generation. Behavioral guardrails and policy enforcement are handled at the governed-llm-gateway layer.

Base model migration to Meta Llama 3.1/3.3 (US-origin, open weights) is in progress.

Compliance Framework Coverage

Trained across eight overlapping frameworks to support cross-framework mapping:

| Framework | Coverage |
|---|---|
| CMMC 2.0 (32 CFR Part 170) | All three levels — 17 L1 practices, 110 L2, 134 L3, assessment methodology |
| NIST SP 800-171 Rev. 2 & 3 | 110 security requirements across 14 families |
| NIST SP 800-172 | Enhanced security requirements for critical CUI programs |
| NIST SP 800-53 Rev. 5 | Full catalog of 1,189 controls across 20 families |
| NIST SP 800-37 | Risk Management Framework (RMF) steps and authorization |
| NIST CSF 2.0 | Govern, Identify, Protect, Detect, Respond, Recover functions |
| HIPAA Security Rule | Administrative, physical, and technical safeguards |
| DFARS Clauses | 252.204-7008/7009/7012/7019/7020/7021/7024/7025, 252.239-7009/7010 |

Training Data

14,906 training + 3,841 validation examples (~4.5M tokens) assembled from 11 curated sources:

v1.0 Legacy Sources (13,434 examples)

| Source | Examples | Share |
|---|---|---|
| NIST Cybersecurity (filtered from 424K) | 6,372 | 33.9% |
| CMMC Full | 4,787 | 25.5% |
| CMMC Balanced | 994 | 5.3% |
| HIPAA Compliance | 961 | 5.1% |
| CMMC Core | 320 | 1.7% |

v2.0 New Sources (1,841 examples via automated pipeline)

| Source | Examples | Share |
|---|---|---|
| NIST CSRC (SP 800-53 Rev. 5 controls) | 773 | 4.1% |
| DoD Documents (PDFs) | 519 | 2.8% |
| Federal Register | 350 | 1.9% |
| eCFR Regulations (CMMC/DFARS/HIPAA) | 75 | 0.4% |
| NIST SP 800-171 Rev. 3 | 63 | 0.3% |
| NIST CSF 2.0 | 61 | 0.3% |

v2.0 Data Processing Pipeline:

  1. Automated scraping — 6 authoritative sources scraped via dedicated modules
  2. Relevance filtering — eCFR filtered to only CMMC-relevant DFARS clauses (252.204-70xx, 252.239-70xx), CMMC (32 CFR 170), and HIPAA (45 CFR 164)
  3. Format conversion — Raw records converted to chat-style instruction/response pairs
  4. Quality filtering — Removed entries <100 chars, entries >8,000 chars, OCR artifacts
  5. Deduplication — Exact dedup (xxhash) + near-dedup (MinHash LSH, 128 permutations, Jaccard 0.8 threshold, 5-gram shingles)
  6. Cross-version dedup — v2.0 records deduplicated against v1.0 corpus to prevent overlap
  7. Validation split — 80/20 stratified split maintaining source distribution

Pipeline source code: github.com/NathanMaine/cmmc-data-pipeline
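Steps 4–5 can be sketched in miniature. This is a simplified, pure-Python illustration only: the real pipeline uses xxhash for exact dedup and MinHash LSH for scalable near-dedup, whereas this sketch stands in sha256 for xxhash and computes exact Jaccard similarity over 5-gram shingles (feasible for small corpora, which is precisely what LSH avoids at scale):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Word-level n-gram shingles (the pipeline uses 5-gram shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(records, threshold: float = 0.8):
    """Exact dedup by content hash, then greedy near-dedup at the Jaccard threshold."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for text in records:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()  # stand-in for xxhash
        if h in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(h)
        s = shingles(text)
        if any(jaccard(s, prev) >= threshold for prev in kept_shingles):
            continue  # near-duplicate of an already-kept record
        kept.append(text)
        kept_shingles.append(s)
    return kept
```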

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 2e-4 (cosine decay) |
| Warmup | 5% of steps |
| Optimizer | 8-bit AdamW |
| Batch Size | 4 (effective 32 with gradient accumulation x8) |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 |
| Packing | Enabled |
| Base Quantization | 4-bit NF4 with double quantization |
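In the HuggingFace stack the card names (TRL + PEFT + bitsandbytes), this table maps onto configuration objects roughly as follows. This is a sketch, not the project's actual training script, and exact argument names vary between TRL releases:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 base quantization with double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# LoRA over all seven linear projection modules
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_bnb_8bit",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size: 4 x 8 = 32
    max_seq_length=2048,
    packing=True,
)
```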

Evaluation Results

Training Metrics

| Metric | Value |
|---|---|
| Final Train Loss | 1.009 (best: 0.966) |
| Average Train Loss | 1.151 |
| Final Eval Loss | 1.144 |
| Mean Token Accuracy | 77.7% |
| Total Training Steps | ~1,398 |
| Tokens Processed | ~18M |

Training Curve (Selected Steps)

| Step | Epoch | Train Loss | Token Accuracy |
|---|---|---|---|
| 200 | ~0.4 | 1.250 | 72.5% |
| 500 | ~1.1 | 1.150 | 75.0% |
| 800 | ~1.7 | 1.080 | 76.5% |
| 1100 | ~2.4 | 1.020 | 77.2% |
| 1398 | 3.0 | 1.009 | 77.7% |

v1.0 vs v2.0 Comparison

| Metric | v1.0 | v2.0 | Change |
|---|---|---|---|
| Training Examples | 13,434 | 14,906 | +11% |
| Validation Examples | 3,472 | 3,841 | +11% |
| Eval Loss | 1.250 | 1.144 | -8.5% (better) |
| LoRA Target Modules | 4 | 7 | +75% coverage |
| Data Sources | 5 | 11 | +6 new sources |

Intended Uses

  • SSP Generation — Draft System Security Plan control descriptions with NIST/CMMC citations
  • Gap Analysis — Identify controls required for specific CMMC levels and contract requirements
  • Assessment Prep — Generate evidence checklists and assessment objective narratives
  • Cross-Framework Mapping — Map controls between CMMC, NIST 800-53, HIPAA, and DFARS
  • Policy Drafting — Create policies aligned to specific CMMC practices
  • DFARS Clause Analysis — Identify requirements from contract language
  • Regulatory Research — Understand eCFR regulations and Federal Register guidance
  • Training & Education — Always-available compliance reference for teams
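To make the cross-framework mapping use case concrete: CMMC practice AC.L2-3.1.1 corresponds to NIST SP 800-171 requirement 3.1.1, which NIST's published 800-53 mapping tables tie to controls AC-2, AC-3, and AC-17. The dictionary schema below is purely illustrative (not an artifact of this model) and shows how such a map might be queried:

```python
# Illustrative cross-framework control map. The AC.L2-3.1.1 entries follow
# NIST SP 800-171's published 800-53 mapping table; the schema itself is hypothetical.
CONTROL_MAP = {
    "AC.L2-3.1.1": {
        "nist_800_171": "3.1.1",
        "nist_800_53": ["AC-2", "AC-3", "AC-17"],
        "summary": "Limit system access to authorized users, processes, and devices",
    },
}

def related_controls(practice: str) -> list:
    """Return the NIST SP 800-53 controls mapped to a CMMC practice (empty if unknown)."""
    return CONTROL_MAP.get(practice, {}).get("nist_800_53", [])
```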

Limitations

  • Not a substitute for qualified compliance professionals. This model is a tool to accelerate compliance work, not replace human judgment.
  • Knowledge cutoff. The model's knowledge is based on training data available at the time of fine-tuning (February 2026). Always verify against current published frameworks.
  • 14B trade-off. Provides stronger reasoning than the 7B but requires more VRAM. For maximum reasoning depth on complex multi-framework analysis, consider the 32B or 72B variants.
  • No retrieval augmentation. The model generates responses from trained knowledge only — it does not search or retrieve external documents at inference time.
  • Citation accuracy. While the model generally cites correct control numbers and framework sections, always verify specific citations against authoritative sources.

Out-of-Scope Uses

  • Legal advice. This model does not provide legal opinions on compliance status.
  • Automated compliance certification. CMMC certification requires human assessors (C3PAOs).
  • Processing actual CUI/ITAR data. The model itself does not process or store sensitive data, but users should follow their organization's data handling policies.

Hardware Requirements

| Mode | GPU (VRAM) | CPU-Only (RAM) | Storage |
|---|---|---|---|
| Inference | 16 GB | 24 GB | 15 GB |
| Training | 24 GB+ | N/A | 50 GB |

Supported OS: Linux, macOS, Windows (WSL2)

The Model Suite

This is the 14B model — balanced speed and reasoning depth for detailed compliance analysis. The full suite includes:

| Model | Parameters | GGUF Size | Best For |
|---|---|---|---|
| cmmc-expert-7b-v2.0 | 7.6B | 5.1 GB | Quick lookups, day-to-day queries |
| cmmc-expert-14b-v2.0 | 14.7B | 9.8 GB | Detailed analysis, multi-control reasoning |
| cmmc-expert-32b-v2.0 | 32.5B | ~19 GB | Deep gap assessments, SSP drafting |
| cmmc-expert-72b-v2.0 | 72.7B | ~42 GB | Complex multi-framework analysis |

Source Code

Known Issues

  • Repetition bug — The model may repeat content, lists, or entire sections multiple times within a single response. This is a known training artifact being addressed in future versions.
  • Verbose responses — Tends to over-explain in some contexts where a concise answer would be more appropriate.

Citation

```bibtex
@misc{maine2026cmmcexpert,
  title={CMMC Expert v2.0: Fine-Tuned Language Models for Cybersecurity Compliance},
  author={Nathan Maine},
  year={2026},
  url={https://github.com/NathanMaine/cmmc-compliance-ai-model}
}
```

