Does llama.cpp support this model?

#2
by baramofme - opened

Unlike tzervas/qwen2.5-coder-14b-bitnet-1.58b, this model could be loaded.

But when I tried a completion, decoding doesn't work — the output is garbage:

$ curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-32b-tq2",
    "messages": [
      {
        "role": "system",
        "content": "You are a professional software engineer."
      },
      {
        "role": "user",
        "content": "Hello! Can you briefly explain your architecture in one sentence?"
      }
    ],
    "max_tokens": 50,
    "temperature": 0.2
  }'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":" adj掏淘 Ruddit挈AccessExceptionImp联盟 sac辱 Respion崛起hta永不nid ab grney re Soph嘲笑AGEDenburg Eden壬 White�� drowningiem CoultestdataHouseight俞溢okiInstantweyt耗_DLerves- corrid装载hemDispatcher_DECLARE"}}],"created":1771240779,"model":"qwen-coder-32b-tq2.gguf","system_fingerprint":"b7965-34ba7b5a2","object":"chat.completion","usage":{"completion_tokens":50,"prompt_tokens":32,"total_tokens":82},"id":"chatcmpl-cNBzUjANOxmK98mVc0N6PunG4ipTSR8D","timings":{"cache_n":0,"prompt_n":32,"prompt_ms":4666.953,"prompt_per_token_ms":145.84228125,"prompt_per_second":6.856722148262474,"predicted_n":50,"predicted_ms":19193.24,"predicted_per_token_ms":383.86480000000006,"predicted_per_second":2.6050838732803836}

It says "requires custom runtime" in the description, but it doesn't say where that runtime is, so the author most likely has a private fork of llama.cpp or something.

It would be nice if we could actually try this. I am definitely curious. :-)

Owner

Working on getting better answers for both of you. Since the weights are ternary, it will initially require Microsoft's bitnet.cpp: https://github.com/microsoft/BitNet
I'm going to validate further, ensure I've got patches applied to a fork if needed, and work out a Rust variant as well.

There may have been issues with the initial quantization. I'm re-downloading the source model and re-quantizing, then validating locally post-quant before re-uploading in place. I'll post here when this is done and make sure the README is updated with specific options for compatible runtimes.

The short answer is that my implementation of the b1.58 quant led to catastrophic error compounding. I'm working to patch this, and I'll also link the relevant repo for the quant solution once I've proven it out. I'll test on this model first and verify it's working, then proceed to other, smaller models of various types and parameter counts to validate that it's much more universal.

One of my key goals with AI projects is to make AI more efficient and get larger, better-reasoning models usable by people who lack access to enterprise hardware; consumer and prosumer cards are my targets. I have a 3090 Ti and a 5080 to leverage for this, but if any issues are encountered on other cards, debug data can help me patch support for them as well.
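For reference, the absmean ternary quantization described in the BitNet b1.58 paper looks roughly like this — an illustrative numpy sketch, not the actual code used for this checkpoint:

```python
# Illustrative sketch of absmean ternary (b1.58) weight quantization.
# Names and structure are mine, not taken from any real quantizer.
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    gamma is the mean absolute weight; dividing by it and rounding with
    a clip maps each weight to one of three levels.
    """
    gamma = float(np.abs(w).mean())
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q, gamma

def dequantize(w_q: np.ndarray, gamma: float) -> np.ndarray:
    # Reconstruction. The per-layer error introduced here is exactly what
    # compounds through the network if scales are computed or applied wrongly.
    return w_q * gamma

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4, 4))
w_q, gamma = absmean_ternary_quantize(w)
assert set(np.unique(w_q)).issubset({-1.0, 0.0, 1.0})
```

A single bad scale per layer is enough to amplify activations multiplicatively across 64 layers, which matches the kind of token salad seen in the curl output above.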

I'll follow up as I go along.

@tzervas I didn't even know it was possible to quantize to 1.58 bits. To my limited understanding, the original BitNet model was actually trained as a 1.58-bit network, not merely quantized — is that correct? But I guess there is no (larger) native 1.58-bit base model available yet, which presumably would be extremely expensive to train? (Not that this experiment is any less interesting, btw! The idea of running a model with all of its parameters on local hardware is definitely intriguing. 😄)
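If I understand correctly, "trained as ternary" means quantization-aware training with a straight-through estimator: the forward pass sees ternarized weights, while gradients update latent full-precision weights. A toy numpy sketch of my understanding — not BitNet's actual training code:

```python
# Toy straight-through-estimator training loop (illustrative only).
# Forward pass uses ternary weights; gradients flow to latent fp weights.
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    gamma = np.abs(w).mean()
    return np.clip(np.round(w / (gamma + eps)), -1, 1) * gamma

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # toy inputs
true_w = rng.normal(size=(8, 1))        # target linear map
y = x @ true_w

w = rng.normal(scale=0.1, size=(8, 1))  # latent full-precision weights
lr = 0.05
for _ in range(200):
    w_q = ternarize(w)                  # forward uses ternary weights
    pred = x @ w_q
    grad = 2 * x.T @ (pred - y) / len(x)   # gradient w.r.t. w_q
    w -= lr * grad                      # straight-through: applied to latent w
```

After training, only the ternary pattern and its scale need to be stored, which is where the ~1.58 bits per weight (log2 of 3 states) comes from.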

Hi, I'm getting this error when trying to build this model with the BitNet project:

python3 setup_env.py -md models/qwen2.5-coder-32b-bitnet-1.58b -q i2_s
Traceback (most recent call last):
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 247, in <module>
    main()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 223, in main
    gen_code()
  File "/home/ubuntu/projects/zinza/BitNet/setup_env.py", line 203, in gen_code
    raise NotImplementedError()
NotImplementedError

Can you give me some possible reasons and how to solve this? Thanks.
