SmolVLM 8bit Quantization Problem

I quantized HuggingFaceTB/SmolVLM-Instruct using the code below (i.e. the snippet from the model's page).

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quantization_config,
)

Local inference after quantization works fine, and uploading to the Hub also works fine. As far as I understand, something goes wrong during serialization: when I download the model from my repository

uisikdag/SmolVLM-Instruct-8bit

on Hugging Face and run the quantized model, I get an error about mismatched tensor sizes. Any help would be appreciated.
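For reference, a minimal sketch of the reload step that fails (the repo is the one above; the error is raised inside from_pretrained):

from transformers import AutoModelForVision2Seq

# Reloading the already-quantized 8-bit checkpoint from the Hub;
# this is where the mismatched-tensor-size error shows up.
model = AutoModelForVision2Seq.from_pretrained("uisikdag/SmolVLM-Instruct-8bit")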

Edit: This issue does not occur with 4-bit NF4 quantization; that works totally fine.
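For comparison, a minimal sketch of the 4-bit setup that quantizes, serializes, and reloads without problems (same settings as the commented-out config in the repro script below):

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization: quantize, save_pretrained, and reload all work
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quantization_config,
)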

It was also reproduced cleanly here. It seems to be a long-standing, unresolved issue in bitsandbytes. The fact that it also occurs with an official HF model may be a chance to finally get it fixed. :sweat_smile:

Edit:
It seems that this is a different problem. It’s worse than the one above.

Edit:
Maybe this issue.

I've tried rolling the library versions back as far as possible, but I can't avoid the error. I hope I'm just making a simple mistake… :sleepy:

from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 8-bit config triggers the problem; the 4-bit NF4 config below works fine
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
#quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

temp_model_dir = "temp_model"
model_id = "HuggingFaceTB/SmolVLM-Instruct"
#model_id = "uisikdag/SmolVLM-Instruct-8bit"  # bnb-quantized copy
#model_id = "Salesforce/blip2-opt-2.7b"  # for reproduction with older transformers versions

# Quantize on the fly and serialize the quantized model to disk
temp_model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16, quantization_config=quantization_config)
temp_model.save_pretrained(temp_model_dir)

# Reload the serialized checkpoint
#temp_model_dir = model_id  # if this is enabled, it does not crash
model = AutoModelForVision2Seq.from_pretrained(temp_model_dir, quantization_config=quantization_config)  # crashes here
#processor = AutoProcessor.from_pretrained(temp_model_dir)

# dependencies
"""
torch==2.4.0
accelerate==1.0.0
huggingface_hub==0.26.0
transformers==4.46.0
bitsandbytes==0.44.0
numpy<2
peft==0.12.0
safetensors==0.4.3
"""

Thanks for trying!