Llama-Opus-Z8
Model Details
Model Name: Daemontatox/Llama-Opus-Z8
Base Model: allura-forge/Llama-3.3-8B-Instruct
Model Type: Causal Language Model (Instruction-Tuned)
Architecture: Llama 3.3 (8B parameters)
Fine-tuning Methods: Supervised Fine-Tuning (SFT) + Group Relative Policy Optimization (GRPO)
License: Llama 3.3 Community License
Model Description
Llama-Opus-Z8 is a fine-tuned version of the Llama 3.3 8B Instruct model, enhanced through a two-stage training process: Supervised Fine-Tuning followed by reinforcement learning using Group Relative Policy Optimization. This model leverages the extracted Llama 3.3 8B weights (originally accessible only via Meta's Llama API) and applies advanced alignment techniques for improved reasoning and instruction-following capabilities.
Base Model Background
The base model (allura-forge/Llama-3.3-8B-Instruct) represents Llama 3.3 8B Instruct weights extracted from Meta's Llama API. While initially configured with 8K context, the model supports extension to 128K context through appropriate RoPE scaling configuration.
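For illustration, the sketch below shows one way to load the model with an extended context window by overriding the RoPE scaling entry of the config. The specific factor values follow the convention used by Llama 3.1+ checkpoints and are an assumption, not settings verified for this particular checkpoint.
# Sketch: extending the context window via RoPE scaling (values follow the Llama 3.1+
# convention and are an assumption, not verified settings for this checkpoint).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Daemontatox/Llama-Opus-Z8")
config.rope_scaling = {
    "rope_type": "llama3",
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
}
config.max_position_embeddings = 131072  # 128K target context

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Llama-Opus-Z8",
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
)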
Training Methodology
Stage 1: Supervised Fine-Tuning (SFT)
- High-quality instruction-following datasets
- Supervised learning to establish baseline performance
- Learning to imitate expert demonstrations (a minimal example of this stage is sketched below)
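The snippet below is a minimal sketch of what this stage could look like with TRL's SFTTrainer; the dataset and hyperparameters are placeholders, not the recipe actually used for this model.
# Minimal SFT sketch with TRL (dataset and hyperparameters are placeholders).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder instruction dataset

trainer = SFTTrainer(
    model="allura-forge/Llama-3.3-8B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="llama-opus-z8-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()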
Stage 2: Group Relative Policy Optimization (GRPO)
- Reinforcement learning phase for enhanced reasoning
- Group-based advantage estimation (no separate critic model needed)
- KL divergence constraints for stable policy updates
- Roughly 50% reduction in memory requirements compared to PPO
- Online learning with iterative model improvement
GRPO Key Advantages
- Memory Efficient: Eliminates need for separate value/critic network
- Computationally Efficient: ~50% less compute than traditional PPO
- Stable Training: KL divergence constraints prevent drastic policy changes
- Group-based Baseline: Uses the mean reward of multiple completions per prompt (see the sketch after this list)
- Variance Reduction: Comparative group scoring reduces update variance
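To make the group-based baseline concrete, the sketch below reimplements the core advantage computation: each completion's reward is normalized by the mean and standard deviation of its group, which is what removes the need for a learned critic. This is illustrative code, not the training implementation used for this model.
# Illustrative group-relative advantage estimation (not the actual training code).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -- one scalar reward per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)   # group mean acts as the baseline (no critic)
    std = rewards.std(dim=-1, keepdim=True)     # normalization reduces update variance
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(rewards))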
Intended Use
Primary Use Cases
- Conversational AI and chat applications
- Complex reasoning tasks
- Code generation and analysis
- Mathematical problem-solving
- Instruction following
- Question answering
Out-of-Scope Use
- Tasks requiring real-time information beyond training cutoff
- Use cases violating the Llama 3.3 Community License
- Applications requiring more than the default 8K context without first applying the RoPE scaling configuration described above
Technical Specifications
Parameters: 8 billion
Precision: BF16/FP16
Context Length: 8K (default), extensible to 128K with RoPE scaling
Vocabulary Size: 128,256 tokens
Architecture: Optimized transformer with GQA (Grouped Query Attention)
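These values can be checked directly against the published config; the sketch below assumes the standard Llama config fields that transformers exposes.
# Sketch: inspecting the specifications above from the model config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Daemontatox/Llama-Opus-Z8")
print(cfg.vocab_size)                                    # expected: 128256
print(cfg.max_position_embeddings)                       # default context window
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # GQA: fewer KV heads than query heads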
Training Details
Training Framework: Likely TRL (Transformer Reinforcement Learning)
GRPO Parameters:
- Beta (KL coefficient): Typically 0.001-0.01
- Epsilon (clipping): ~0.2
- Group size: Multiple completions per prompt
- Iterations per batch: Configurable (μ parameter)
Compute Requirements: By eliminating the critic model, GRPO makes this scale of training feasible on a single high-end GPU (e.g., one H100) rather than a multi-GPU cluster; a configuration sketch follows below
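As a concrete illustration of these parameters, the sketch below configures a GRPO run with TRL's GRPOTrainer. The prompt dataset and reward function are placeholders, and the hyperparameters simply mirror the typical ranges listed above rather than the exact recipe used for this model.
# GRPO sketch with TRL (placeholder dataset/reward; values mirror the typical ranges above).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer ~200-character completions.
    # A real run would score correctness, formatting, helpfulness, etc.
    return [-abs(len(completion) - 200) / 200 for completion in completions]

training_args = GRPOConfig(
    output_dir="llama-opus-z8-grpo",
    beta=0.01,            # KL coefficient
    epsilon=0.2,          # clipping range
    num_generations=8,    # group size: completions sampled per prompt
    num_iterations=1,     # mu: optimization iterations per generation batch
    bf16=True,
)

trainer = GRPOTrainer(
    model="llama-opus-z8-sft",   # Stage 1 (SFT) checkpoint
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()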
Performance Characteristics
Expected improvements over base model:
- Enhanced reasoning capabilities through RL optimization
- Better alignment with human preferences
- Improved performance on mathematical and coding benchmarks
- More stable and controlled generation
Inference Examples
vLLM
# Install vLLM
pip install vllm
# Python inference
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(
model="Daemontatox/Llama-Opus-Z8",
tensor_parallel_size=1, # Adjust for multi-GPU
dtype="bfloat16",
max_model_len=8192, # Or 131072 for 128K context
trust_remote_code=True
)
# Sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=2048,
repetition_penalty=1.1
)
# Generate
prompts = [
"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain quantum entanglement in simple terms.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Generated: {output.outputs[0].text}")
# CLI inference
vllm serve Daemontatox/Llama-Opus-Z8 \
--dtype bfloat16 \
--max-model-len 8192 \
--tensor-parallel-size 1
# Query the server
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Daemontatox/Llama-Opus-Z8",
"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWrite a Python function to calculate Fibonacci numbers.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"max_tokens": 512,
"temperature": 0.7
}'
SGLang
# Install SGLang
pip install "sglang[all]"
# Launch server
python -m sglang.launch_server \
--model-path Daemontatox/Llama-Opus-Z8 \
--dtype bfloat16 \
--port 30000 \
--context-length 8192
# Python client
import sglang as sgl
@sgl.function
def reasoning_task(s, question):
s += sgl.system("You are a helpful AI assistant specialized in reasoning.")
s += sgl.user(question)
s += sgl.assistant(sgl.gen("answer", max_tokens=512, temperature=0.7))
# Connect to the server launched above
backend = sgl.RuntimeEndpoint("http://localhost:30000")
sgl.set_default_backend(backend)
# Generate
state = reasoning_task.run(
question="Solve: If x + 5 = 12, what is x?"
)
print(state["answer"])
# Batch inference with SGLang's offline engine (runs in-process, no server required)
import sglang as sgl
llm = sgl.Engine(
    model_path="Daemontatox/Llama-Opus-Z8",
    tp_size=1
)
prompts = [
"Explain machine learning",
"Write a sorting algorithm",
"What is consciousness?"
]
# Parallel generation
outputs = llm.generate(
prompts,
sampling_params={
"temperature": 0.7,
"top_p": 0.9,
"max_new_tokens": 256
}
)
for output in outputs:
print(output["text"])
Modular MAX
# Install MAX
# pip install max
from max import engine
# Load model
model = engine.InferenceSession(
model_path="Daemontatox/Llama-Opus-Z8",
device="gpu",
precision="bfloat16"
)
# Prepare input
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain neural networks briefly."}
]
# Format with chat template
formatted = model.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Generate
response = model.generate(
formatted,
max_tokens=512,
temperature=0.7,
top_p=0.9
)
print(response)
# MAX with streaming
from max import engine
model = engine.InferenceSession("Daemontatox/Llama-Opus-Z8")
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWrite a haiku about AI.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
# Stream tokens
for token in model.generate_stream(
prompt,
max_tokens=100,
temperature=0.8
):
print(token, end="", flush=True)
# MAX CLI
max serve Daemontatox/Llama-Opus-Z8 \
--precision bfloat16 \
--device gpu \
--port 8080
# Query
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "What is the capital of France?",
"max_tokens": 50,
"temperature": 0.7
}'
Chat Template
# Llama 3.3 format
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_response}<|eot_id|>
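This format does not need to be assembled by hand; the tokenizer's built-in chat template produces it, as sketched below.
# Sketch: building the prompt format above with the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Llama-Opus-Z8")
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # emits the <|begin_of_text|> ... <|start_header_id|>assistant structure shown above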
Limitations
- Knowledge cutoff: inherited from the base model's training data
- May require prompt engineering for optimal performance
- Context length limitations (8K default)
- Potential for hallucinations in complex reasoning
- GRPO-trained models may show reward hacking if reward functions are poorly designed
Ethical Considerations
- Model outputs should be verified for factual accuracy
- Not suitable for making critical decisions without human oversight
- May reflect biases present in training data
- Users should comply with Llama 3.3 Community License terms
Citation
@misc{llama-opus-z8,
title={Llama-Opus-Z8: SFT + GRPO Fine-tuned Llama 3.3 8B},
author={Daemontatox},
year={2025},
howpublished={\url{https://huggingface.co/Daemontatox/Llama-Opus-Z8}}
}
@article{deepseekmath2024,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={DeepSeek-AI},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}
Acknowledgments
- Base model: allura-forge for extracting Llama 3.3 8B weights
- Training methodology: DeepSeek-AI for GRPO algorithm
- Framework: Meta AI for Llama 3.3 architecture
Contact
For issues, questions, or contributions, please contact via Hugging Face model repository.