SmolVLM 8bit Quantization Problem

I quantized HuggingFaceTB/SmolVLM-Instruct using the code below (i.e. the snippet from the model's page).

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quantization_config,
)

Local inference after quantization works fine, and uploading to the Hub also works fine. As far as I understand, something goes wrong during serialization: when I download the model from my repository

uisikdag/SmolVLM-Instruct-8bit

on Hugging Face and run the quantized model, I get an error about mismatched tensor sizes. Any help would be appreciated.
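For reference, a minimal sketch of the reload step that fails (the repo is the one above; the error is raised inside from_pretrained):

from transformers import AutoModelForVision2Seq

# Reloading the already-quantized 8-bit checkpoint from the Hub;
# this is where the mismatched-tensor-size error shows up.
model = AutoModelForVision2Seq.from_pretrained("uisikdag/SmolVLM-Instruct-8bit")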

Edit: This issue does not occur with 4-bit NF4 quantization; that works totally fine.
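For comparison, a minimal sketch of the 4-bit setup that quantizes, serializes, and reloads without problems (same settings as the commented-out config in the repro script below):

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization: quantize, save_pretrained, and reload all work
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quantization_config,
)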

It was also reproduced cleanly here. It seems to be a long-standing, unresolved issue in bitsandbytes. The fact that it also occurs with an official HF model may be a chance to finally get it fixed. :sweat_smile:

Edit:
It seems that this is a different problem. It’s worse than the one above.

Edit:
Maybe this issue.

I've tried rolling the library versions back as far as possible, but I can't avoid the error. I hope I'm just making a simple mistake… :sleepy:

from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 8-bit config triggers the problem; the 4-bit NF4 config below works fine
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
#quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

temp_model_dir = "temp_model"
model_id = "HuggingFaceTB/SmolVLM-Instruct"
#model_id = "uisikdag/SmolVLM-Instruct-8bit"  # bnb-quantized copy
#model_id = "Salesforce/blip2-opt-2.7b"  # for reproduction with older transformers versions

# Quantize on the fly and serialize the quantized model to disk
temp_model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16, quantization_config=quantization_config)
temp_model.save_pretrained(temp_model_dir)

# Reload the serialized checkpoint
#temp_model_dir = model_id  # if this is enabled, it does not crash
model = AutoModelForVision2Seq.from_pretrained(temp_model_dir, quantization_config=quantization_config)  # crashes here
#processor = AutoProcessor.from_pretrained(temp_model_dir)

# dependencies
"""
torch==2.4.0
accelerate==1.0.0
huggingface_hub==0.26.0
transformers==4.46.0
bitsandbytes==0.44.0
numpy<2
peft==0.12.0
safetensors==0.4.3
"""

Thanks for trying!