There’s something intriguing about this model—it feels less like a polished product and more like a carefully stitched organism, optimized with intent rather than ambition. I’ll give you a clear, honest evaluation.
🌒 Real Assessment (No Hype)
Is it good?
Yes.
Is it exceptional?
Only within a narrow scope.
This is not a general-purpose improvement over Qwen3.5-4B.
It is a targeted optimization for coding tasks.
⚖️ Where It Actually Works
The core goal is clear:
→ inject coding ability without destroying baseline reasoning
And surprisingly, it succeeds better than most merge attempts.
Key signals:
MBPP +7 → meaningful improvement
LCB (LiveCodeBench) +3 / +1.8 → real gains on modern, contamination-resistant benchmarks
GSM8K unchanged → critical: baseline reasoning preserved
👉 Translation:
It became more useful for coding without becoming broadly unstable, which is rare in merge-based models.
🌘 Where It Breaks (and it does)
Let’s be direct:
AIME: severe collapse (−23 pp)
MMLU-Pro −4.3
HumanEval −4.8 (a coding benchmark, so the coding gains are not uniform)
This is not a minor regression.
This is a shift in model behavior.
👉 Meaning:
Deep abstract reasoning is weakened
The model becomes more instrumental, less cognitively flexible
The most important statement in the entire document is this:
“The AIME ceiling is structural, not an optimization issue.”
That’s a crucial insight.
🧠 Technical Evaluation
This was not done randomly. The author understands what they’re doing.
- Task Arithmetic over Symmetric Merge
They recognized that not all sources are equal:
reasoning model → backbone
coding models → delta contributors
👉 This is intentional architecture, not brute-force merging.
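The asymmetric roles described above can be sketched as task arithmetic: one model acts as the backbone, and each coding model contributes only its delta from a shared base. This is a toy illustration over plain lists of floats, not the author's merge script. The function names, the per-model weights, and the assumption that all models share one base checkpoint are illustrative.

```python
from typing import Dict, List

Tensor = List[float]  # toy stand-in for a flattened weight tensor


def task_vector(finetuned: Tensor, base: Tensor) -> Tensor:
    """Delta that a fine-tune adds on top of the shared base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic_merge(
    backbone: Tensor,
    base: Tensor,
    contributors: Dict[str, Tensor],
    weights: Dict[str, float],
) -> Tensor:
    """Start from the backbone and add each contributor's scaled delta."""
    merged = list(backbone)
    for name, tuned in contributors.items():
        delta = task_vector(tuned, base)
        lam = weights[name]  # hypothetical per-model scaling factor
        merged = [m + lam * d for m, d in zip(merged, delta)]
    return merged


# Toy values chosen so the arithmetic is exact in floating point.
base = [0.0, 1.0, 2.0]
backbone = [0.5, 1.0, 2.0]          # "reasoning" model: keeps its own weights
contributors = {
    "coder_a": [0.0, 2.0, 2.0],     # differs from base only at index 1
    "coder_b": [0.0, 1.0, 4.0],     # differs from base only at index 2
}
weights = {"coder_a": 0.5, "coder_b": 0.25}

merged = task_arithmetic_merge(backbone, base, contributors, weights)
print(merged)
```

In a symmetric merge every source would be averaged on equal terms; here the backbone's own weights pass through unchanged wherever the contributors' deltas are zero.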
- DAREx Pruning
Removing 85–95% of low-magnitude deltas:
👉 Effects:
reduces noise
preserves strong feature signals
This is a high-confidence, high-risk decision, and it pays off.
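What "removing 85–95% of low-magnitude deltas" amounts to can be sketched as magnitude pruning of a delta vector: keep the largest-magnitude entries and zero the rest. This shows the behaviour as this review describes it, not the DAREx procedure itself (the original DARE method drops deltas at random and rescales the survivors). `prune_by_magnitude` and `drop_fraction` are hypothetical names.

```python
from typing import List


def prune_by_magnitude(delta: List[float], drop_fraction: float) -> List[float]:
    """Zero out the `drop_fraction` share of entries with the smallest magnitude."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(delta) * drop_fraction)
    # Indices ordered by ascending magnitude; the first n_drop are removed.
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else d for i, d in enumerate(delta)]


# Toy delta: two strong entries, eight weak ones.
delta = [0.01, -0.90, 0.02, 0.03, -0.04, 0.05, 0.70, -0.06, 0.07, 0.08]
pruned = prune_by_magnitude(delta, drop_fraction=0.9)
print(pruned)
```

With a 90% drop rate only the single strongest delta survives, which is the sense in which pruning trades coverage for signal.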
- Layer Skipping (mlp.gate_proj 18–25)
This is the most advanced part.
Instead of tuning weights, the author is:
→ modulating behavior at the layer level
👉 In plain terms:
they identified where reasoning behavior lives—and protected or bypassed it.
This is above-average merge work.
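One way to read the layer skipping is that, for `mlp.gate_proj` in layers 18–25, the merge leaves the backbone weights untouched and applies coding deltas everywhere else. The sketch below works under that assumption. The tensor-name pattern `model.layers.<N>.mlp.gate_proj.weight` follows common Hugging Face naming but is assumed here, and `should_skip` and `merge_with_skips` are hypothetical helpers, not the author's code.

```python
import re
from typing import Dict, List

SKIP_LAYERS = range(18, 26)  # layers 18..25 inclusive
_GATE_PROJ = re.compile(r"^model\.layers\.(\d+)\.mlp\.gate_proj\.weight$")


def should_skip(name: str) -> bool:
    """True for gate_proj tensors whose layer index is in the protected range."""
    match = _GATE_PROJ.match(name)
    return match is not None and int(match.group(1)) in SKIP_LAYERS


def merge_with_skips(
    backbone: Dict[str, List[float]],
    deltas: Dict[str, List[float]],
    scale: float = 1.0,
) -> Dict[str, List[float]]:
    """Add scaled deltas to the backbone, except for protected tensors."""
    merged = {}
    for name, weights in backbone.items():
        delta = deltas.get(name)
        if delta is None or should_skip(name):
            merged[name] = list(weights)  # keep backbone weights as-is
        else:
            merged[name] = [w + scale * d for w, d in zip(weights, delta)]
    return merged


backbone = {
    "model.layers.17.mlp.gate_proj.weight": [1.0, 1.0],
    "model.layers.18.mlp.gate_proj.weight": [1.0, 1.0],
    "model.layers.18.mlp.up_proj.weight": [1.0, 1.0],
}
deltas = {name: [0.5, -0.5] for name in backbone}

merged_layers = merge_with_skips(backbone, deltas)
for name, weights in merged_layers.items():
    print(name, weights)
```

Only the layer-18 `gate_proj` tensor keeps its backbone values; the layer-17 `gate_proj` and the layer-18 `up_proj` both receive the delta.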
🌑 The Core Limitation
This model demonstrates something fundamental:
Coding ability and deep reasoning are not fully aligned capabilities.
You can balance them—but beyond a point, they conflict structurally.
The author clearly understands this and chose a side.
🧭 Final Judgment
If someone claims:
“Better than base Qwen overall” → ❌ incorrect
“General improvement” → ❌ incorrect
“Optimized local coding model” → ✅ correct
“Technically well-designed merge” → ✅ very correct
🔥 Final Scores
Technical quality: 8.5 / 10
Scientific honesty: 9 / 10
Practical usefulness (coding): 8 / 10
General intelligence: 5.5 / 10
🌌 One-line Summary
This is not a smarter mind.
It is a mind that learned to handle tools better—while losing part of its ability to think deeply.
Beautiful and correct assessment 😊
Obviously, this was released only as a research artifact.
Hopefully someone will deliver some good coding fine-tunes that I can use as sources with the excellent Jackrong-v2.