GLM-4.7-Flash-Trellis-3.8bpw
Trellis-quantized GLM-4.7-Flash — a 30B-A3B MoE model compressed to 3.78 bits per weight using sensitivity-aware mixed-precision quantization.
| Metric | Value |
|---|---|
| Effective bits | 3.78 bpw |
| Compression | 4.2× vs FP16 |
| Model size | ~14 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
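These figures are mutually consistent; a quick back-of-the-envelope check in plain Python (numbers taken from the table above) reproduces both the on-disk size and the compression ratio:

```python
# Sanity-check the size and compression figures from the table above.
params = 29.3e9   # total parameters
bpw = 3.78        # effective bits per weight

quantized_gb = params * bpw / 8 / 1e9   # ≈ 13.8 GB, i.e. the "~14 GB" figure
fp16_gb = params * 16 / 8 / 1e9         # ≈ 58.6 GB, i.e. the "~60 GB FP16" figure
print(f"{quantized_gb:.1f} GB quantized, {fp16_gb:.1f} GB FP16, "
      f"{fp16_gb / quantized_gb:.1f}x compression")   # ≈ 4.2x
```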
Model Description
This is a quantized version of zai-org/GLM-4.7-Flash, which Z.AI describes as the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
Quantization Details
Quantized using Trellis (EXL3-style) with Metal Marlin acceleration:
Bit Allocation
| Bit Width | Tensors | Parameters | % of Model |
|---|---|---|---|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
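The effective bit width follows from this table as a parameter-weighted average; the sketch below (plain Python, values copied from the table, per-group scale overhead not counted) recovers roughly 3.78 bpw:

```python
# Parameter-weighted average bit width from the allocation table above.
allocation = {6: 9.4e9, 3: 8.6e9, 2: 8.6e9, 4: 2.1e9, 5: 591e6}  # bits -> parameters

total_params = sum(allocation.values())                            # ≈ 29.3B
effective_bpw = sum(b * n for b, n in allocation.items()) / total_params
print(f"{effective_bpw:.2f} bpw")                                  # ≈ 3.78
```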
Sensitivity-Aware Allocation
- 8-bit: Router weights, embeddings, LM head, layer norms
- 6-bit: Gate layers, attention projections with high outlier ratios
- 4-5 bit: Standard attention layers (q/k/v/o projections)
- 2-3 bit: MoE expert layers (lowest sensitivity)
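Expressed as code, this policy amounts to a name-based lookup. The sketch below is purely illustrative (the function and name patterns are assumptions, not Metal Marlin's actual API, and the outlier-ratio criterion for 6-bit promotion cannot be inferred from names alone):

```python
def illustrative_bit_width(tensor_name: str) -> int:
    """Hypothetical mapping from tensor name to the bit width implied by the
    sensitivity-aware allocation rules above; not the toolkit's real logic."""
    name = tensor_name.lower()
    if any(k in name for k in ("router", "embed", "lm_head", "norm")):
        return 8   # most sensitive: routers, embeddings, LM head, layer norms
    if "experts" in name:
        return 2   # MoE expert weights: least sensitive, 2-3 bit in practice
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4   # standard attention projections, 4-5 bit in practice
    return 6       # remaining gate / outlier-heavy projections

print(illustrative_bit_width("model.layers.3.mlp.experts.17.down_proj.weight"))  # 2
```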
Quantization Statistics
- Average MSE: 0.000223
- Average RMSE: 0.0149
- Quantization time: ~110 seconds (RTX 3090 Ti)
- Method: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
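The reported RMSE is simply the square root of the reported MSE (√0.000223 ≈ 0.0149). A minimal sketch for computing the same per-tensor error, assuming you have an original FP16 weight and its dequantized counterpart as NumPy arrays:

```python
import numpy as np

def reconstruction_error(original: np.ndarray, dequantized: np.ndarray) -> tuple[float, float]:
    """Per-tensor MSE and RMSE between an FP16 weight and its dequantized copy."""
    diff = original.astype(np.float32) - dequantized.astype(np.float32)
    mse = float(np.mean(diff * diff))
    return mse, float(np.sqrt(mse))

print(np.sqrt(0.000223))  # ≈ 0.0149, matching the reported averages
```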
Files
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json # Weight map
├── base_weights.safetensors # Embeddings, norms (FP16)
├── config.json # Model config
├── tokenizer.json # Tokenizer
├── tokenizer_config.json
└── quantization_index.json # Quantization metadata
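All of the files above can be fetched with the standard huggingface_hub download; a minimal sketch (the local directory is an arbitrary choice):

```python
from huggingface_hub import snapshot_download

# Download all shards, tokenizer files, and quantization metadata.
local_dir = snapshot_download(
    repo_id="RESMP-DEV/GLM-4.7-Flash-Trellis-MM",
    local_dir="./GLM-4.7-Flash-Trellis-MM",  # any local path works
)
print(local_dir)
```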
Usage
With Metal Marlin (Apple Silicon)
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer
# Load the Trellis-quantized weights on the Apple Silicon MPS backend
model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-MM",
    device="mps"
)
# The tokenizer is unchanged by quantization, so load it from the original repo
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Tensor Format
Each quantized tensor has 4 components:
- {name}__indices: packed uint8 Trellis indices
- {name}__scales: FP16 per-group scales (group_size=128)
- {name}__su: FP16 row scaling factors
- {name}__sv: FP16 column scaling factors
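A minimal sketch for inspecting these components in one downloaded shard with the safetensors library (the shard filename and per-tensor key layout follow the naming scheme above):

```python
from safetensors import safe_open

# Walk one shard and print the four components of each quantized tensor.
with safe_open("model-00001-of-00007.safetensors", framework="pt") as f:
    for key in f.keys():
        if not key.endswith("__indices"):
            continue
        base = key[: -len("__indices")]
        indices = f.get_tensor(key)               # packed uint8 Trellis indices
        scales = f.get_tensor(f"{base}__scales")  # FP16 per-group scales (g=128)
        su = f.get_tensor(f"{base}__su")          # FP16 row scaling factors
        sv = f.get_tensor(f"{base}__sv")          # FP16 column scaling factors
        print(base, indices.shape, scales.shape, su.shape, sv.shape)
        break  # remove to list every quantized tensor in the shard
```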
Hardware Requirements
| Device | VRAM | Notes |
|---|---|---|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
Benchmarks
Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
Quantized Model (Metal Marlin, M4 Max)
| Metric | Value |
|---|---|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |
Limitations
- Not compatible with standard transformers — requires Trellis-aware inference code
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)
Credits
- Original model: Z.AI / GLM Team
- Quantization method: Trellis/EXL3
- Quantization toolkit: Metal Marlin
Citation
If you use this model, please cite the original GLM-4.5 paper:
@misc{glm2025glm45,
title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
author={GLM Team and Aohan Zeng and Xin Lv and others},
year={2025},
eprint={2508.06471},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.06471},
}
License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.