LFM2-VL-450M
LFM2-VL is Liquid AI's first series of multimodal models, designed to process text and images with variable resolutions. Built on the LFM2 backbone, it is optimized for low-latency and edge AI applications.
We're releasing the weights of two post-trained checkpoints with 450M (for highly constrained devices) and 1.6B (more capable yet still lightweight) parameters.
- 2× faster inference speed on GPUs compared to existing VLMs while maintaining competitive accuracy
- Flexible architecture with user-tunable speed-quality tradeoffs at inference time
- Native resolution processing up to 512×512 with intelligent patch-based handling for larger images, avoiding upscaling and distortion
Find out more about our vision-language model in the LFM2-VL post and about its language backbone in the LFM2 blog post.
Model details
Due to their small size, we recommend fine-tuning LFM2-VL models on narrow use cases to maximize performance. They were trained for instruction following and lightweight agentic flows. They are not intended for safety-critical decisions.
| Property | LFM2-VL-450M | LFM2-VL-1.6B |
|---|---|---|
| Parameters (LM only) | 350M | 1.2B |
| Vision encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
| Backbone layers | hybrid conv+attention | hybrid conv+attention |
| Context (text) | 32,768 tokens | 32,768 tokens |
| Image tokens | dynamic, user-tunable | dynamic, user-tunable |
| Vocab size | 65,536 | 65,536 |
| Precision | bfloat16 | bfloat16 |
| License | LFM Open License v1.0 | LFM Open License v1.0 |
Supported languages: English
Generation parameters: We recommend the following parameters:
- Text: temperature=0.1, min_p=0.15, repetition_penalty=1.05
- Vision: min_image_tokens=64, max_image_tokens=256, do_image_splitting=True
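The ONNX Runtime example later in this card decodes greedily with `argmax`, so the recommended text parameters are not applied there. As a rough illustration of what they do, here is a minimal NumPy sketch of one sampling step. The helper name `sample_next_token` is hypothetical, and the order of operations (repetition penalty, then temperature, then min-p filtering) is an assumption modelled on common practice rather than a statement of how any particular runtime implements it.

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.1, min_p=0.15,
                      repetition_penalty=1.05, rng=None):
    """Sample one token id from a 1-D logits vector (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: push down the scores of tokens already generated.
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature scaling followed by a numerically stable softmax.
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # min-p: discard tokens whose probability is below min_p * (top probability).
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()

    return int(rng.choice(len(probs), p=probs))
```

In a decoding loop like the one in this card, the greedy step could be swapped for something like `np.array([[sample_next_token(logits[0, -1], generated_tokens[0].tolist())]])`, assuming a batch size of 1.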
Chat template: LFM2-VL uses a ChatML-like chat template as follows:
<|startoftext|><|im_start|>system
You are a helpful multimodal assistant by Liquid AI.<|im_end|>
<|im_start|>user
<image>Describe this image.<|im_end|>
<|im_start|>assistant
This image shows a Caenorhabditis elegans (C. elegans) nematode.<|im_end|>
Images are referenced with a sentinel (<image>), which is automatically replaced with the image tokens by the processor.
You can apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.
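To make the layout of the template concrete, the following sketch rebuilds the prompt string by hand. It is only an approximation for illustration: `render_chat` is a hypothetical helper, the exact whitespace is assumed from the example above, and the template shipped with the processor remains the authoritative version.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages in the ChatML-like layout shown above (illustrative only)."""
    parts = ["<|startoftext|>"]
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chat(
    [
        {"role": "system", "content": "You are a helpful multimodal assistant by Liquid AI."},
        # <image> is the sentinel that the processor later expands into image tokens.
        {"role": "user", "content": "<image>Describe this image."},
    ],
    add_generation_prompt=True,
)
print(prompt)
```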
Architecture
- Hybrid backbone: Language model tower (LFM2-1.2B or LFM2-350M) paired with SigLIP2 NaFlex vision encoders (400M shape-optimized or 86M base variant)
- Native resolution processing: Handles images up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion
- Tiling strategy: Splits large images into non-overlapping 512×512 patches and includes thumbnail encoding for global context (in 1.6B model)
- Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens (e.g., 256×384 image → 96 tokens, 1000×3000 → 1,020 tokens)
- Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff without retraining
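The token arithmetic behind the 256×384 example can be sketched in a few lines of NumPy. Two values here are assumptions not stated in this card: a 16-pixel vision patch size and a pixel-unshuffle factor of 2. The function names are hypothetical, and the tiling helper assumes the image has already been resized to a multiple of the tile size. The sketch does not model thumbnails or resizing, so it does not attempt to reproduce the 1000×3000 figure.

```python
import numpy as np

PATCH = 16      # assumed vision patch size in pixels
UNSHUFFLE = 2   # assumed pixel-unshuffle factor
TILE = 512      # tile size from the architecture notes

def pixel_unshuffle(grid, r=UNSHUFFLE):
    """Fold each r x r block of patch embeddings into one token: (H, W, C) -> (H/r, W/r, r*r*C)."""
    h, w, c = grid.shape
    grid = grid.reshape(h // r, r, w // r, r, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(h // r, w // r, r * r * c)

def image_token_count(height, width):
    """Token count for an image at native resolution (dimensions divisible by PATCH * UNSHUFFLE)."""
    patches = np.zeros((height // PATCH, width // PATCH, 1), dtype=np.float32)
    tokens = pixel_unshuffle(patches)
    return tokens.shape[0] * tokens.shape[1]

def tile_boxes(height, width, tile=TILE):
    """(left, top, right, bottom) boxes of non-overlapping tiles, row by row."""
    return [(x, y, x + tile, y + tile)
            for y in range(0, height, tile)
            for x in range(0, width, tile)]

print(image_token_count(256, 384))   # 96
print(image_token_count(512, 512))   # 256
print(len(tile_boxes(1024, 1536)))   # 6
```

Under these assumptions a single 512×512 tile maps to 256 tokens, which is consistent with the recommended `max_image_tokens=256`.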
Training approach
- Builds on the LFM2 base model with joint mid-training that fuses vision and language capabilities using a gradually adjusted text-to-image ratio
- Applies joint SFT with emphasis on image understanding and vision tasks
- Leverages large-scale open-source datasets combined with in-house synthetic vision data, selected for balanced task coverage
- Follows a progressive training strategy: base model → joint mid-training → supervised fine-tuning
How to run LFM2-VL
ONNXRuntime
from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
import onnxruntime
import numpy as np
from huggingface_hub import hf_hub_download
# 1. Load config, processor, and model
model_id = "onnx-community/LFM2-VL-450M-ONNX"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
local_dir = 'LFM2-VL-450M-ONNX'
vision_model_path = hf_hub_download(model_id, "vision_encoder.onnx", subfolder="onnx", local_dir=local_dir) # Download vision graph
hf_hub_download(model_id, "vision_encoder.onnx_data", subfolder="onnx", local_dir=local_dir) # Download vision weights
embed_model_path = hf_hub_download(model_id, "embed_tokens.onnx", subfolder="onnx", local_dir=local_dir) # Download embed_tokens graph
hf_hub_download(model_id, "embed_tokens.onnx_data", subfolder="onnx", local_dir=local_dir) # Download embed_tokens weights
decoder_model_path = hf_hub_download(model_id, "decoder_model_merged.onnx", subfolder="onnx", local_dir=local_dir) # Download decoder graph
hf_hub_download(model_id, "decoder_model_merged.onnx_data", subfolder="onnx", local_dir=local_dir) # Download decoder weights
## Load sessions
providers = ['CPUExecutionProvider']
vision_session = onnxruntime.InferenceSession(vision_model_path, providers=providers)
embed_session = onnxruntime.InferenceSession(embed_model_path, providers=providers)
decoder_session = onnxruntime.InferenceSession(decoder_model_path, providers=providers)
## Set config values
text_config = config.text_config
num_key_value_heads = text_config.num_key_value_heads
head_dim = text_config.hidden_size // text_config.num_attention_heads
num_hidden_layers = text_config.num_hidden_layers
eos_token_id = text_config.eos_token_id
hidden_size = text_config.hidden_size
conv_L_cache = text_config.conv_L_cache
layer_types = text_config.layer_types
image_token_index = config.image_token_index
# 2. Prepare inputs
image_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(image_url)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
)
input_ids = inputs['input_ids'].numpy()
attention_mask = inputs['attention_mask'].numpy()
has_vision_inputs = 'pixel_values' in inputs
pixel_values = inputs['pixel_values'].numpy() if has_vision_inputs else None
pixel_attention_mask = inputs['pixel_attention_mask'].numpy().astype(np.int64) if has_vision_inputs else None
spatial_shapes = inputs['spatial_shapes'].numpy() if has_vision_inputs else None
batch_size = input_ids.shape[0]
past_cache_values = {}
for i in range(num_hidden_layers):
    if layer_types[i] == 'full_attention':
        for kv in ('key', 'value'):
            past_cache_values[f'past_key_values.{i}.{kv}'] = np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    elif layer_types[i] == 'conv':
        past_cache_values[f'past_conv.{i}'] = np.zeros([batch_size, hidden_size, conv_L_cache], dtype=np.float32)
    else:
        raise ValueError(f"Unsupported layer type: {layer_types[i]}")
# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
image_features = None
for i in range(max_new_tokens):
    inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
    if has_vision_inputs and image_features is None:
        ## Only compute vision features if not already computed
        image_features = vision_session.run(None, dict(
            pixel_values=pixel_values,
            pixel_attention_mask=pixel_attention_mask,
            spatial_shapes=spatial_shapes,
        ))[0]
        ## Merge text and vision embeddings
        inputs_embeds[input_ids == image_token_index] = image_features.reshape(-1, image_features.shape[-1])
    logits, *present_cache_values = decoder_session.run(None, dict(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        **past_cache_values,
    ))
    ## Update values for next generation loop
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=attention_mask.dtype)], axis=-1)
    for j, key in enumerate(past_cache_values):
        past_cache_values[key] = present_cache_values[j]
    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if np.isin(input_ids, eos_token_id).any():
        break
    ## (Optional) Streaming
    print(processor.decode(input_ids[0], skip_special_tokens=False), end='', flush=True)
print()
# 4. Output result
print(processor.batch_decode(generated_tokens, skip_special_tokens=False)[0])
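The "merge text and vision embeddings" step in the loop above relies on NumPy boolean-mask assignment, which may not be obvious at first glance. The toy example below isolates that step with made-up shapes. The helper name `merge_image_features` and the placeholder token id are hypothetical, and the explicit count check is an addition for illustration that the original loop does not perform.

```python
import numpy as np

def merge_image_features(inputs_embeds, input_ids, image_features, image_token_index):
    """Overwrite the embedding rows of image placeholder tokens with vision features."""
    mask = input_ids == image_token_index                      # (batch, seq) boolean mask
    flat = image_features.reshape(-1, image_features.shape[-1])  # (num_image_tokens, hidden)
    if mask.sum() != flat.shape[0]:
        raise ValueError("placeholder count does not match number of image feature rows")
    merged = inputs_embeds.copy()
    merged[mask] = flat  # rows are filled in sequence order
    return merged

IMAGE_TOKEN = 999  # hypothetical placeholder id, for illustration only
input_ids = np.array([[10, IMAGE_TOKEN, IMAGE_TOKEN, 11]])
text_embeds = np.zeros((1, 4, 4), dtype=np.float32)              # stand-in for the embed_tokens output
vision_out = np.arange(8, dtype=np.float32).reshape(1, 2, 4)     # stand-in for the vision encoder output

merged = merge_image_features(text_embeds, input_ids, vision_out, IMAGE_TOKEN)
print(merged[0])
```

Because the number of placeholder positions has to equal the number of feature rows, the merge only makes sense on the first pass over the full prompt, which is why the loop performs it inside the `image_features is None` branch.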
How to fine-tune
We recommend fine-tuning LFM2-VL models on your use cases to maximize performance.
| Notebook | Description | Link |
|---|---|---|
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | |
Performance
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | - |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolVLM2-500M | 49.90 | 11.27 | 24.64 | 609 | 40.70 | 38.20 | 34.10 | 37.50 | 62.20 | 29.90 | 1448.30 | - |
| LFM2-VL-450M | 52.29 | 26.18 | 46.51 | 655 | 41.98 | 40.87 | 33.11 | 44.70 | 63.50 | 33.76 | 1239.06 | 40.16 |
We obtained the MM-IFEval and InfoVQA (Val) scores for the InternVL3 and SmolVLM2 models using VLMEvalKit.
Contact
If you are interested in custom solutions with edge deployment, please contact our sales team.
Citation
@article{liquidai2025lfm2,
title={LFM2 Technical Report},
author={Liquid AI},
journal={arXiv preprint arXiv:2511.23404},
year={2025}
}