Liquid AI
Try LFM • Documentation • LEAP

LFM2-VL-450M

LFM2-VL is Liquid AI's first series of multimodal models, designed to process text and images at variable resolutions. Built on the LFM2 backbone, it is optimized for low-latency inference and edge AI applications.

We're releasing the weights of two post-trained checkpoints with 450M (for highly constrained devices) and 1.6B (more capable yet still lightweight) parameters.

  • 2× faster inference speed on GPUs compared to existing VLMs while maintaining competitive accuracy
  • Flexible architecture with user-tunable speed-quality tradeoffs at inference time
  • Native resolution processing up to 512×512 with intelligent patch-based handling for larger images, avoiding upscaling and distortion

Find more details about our vision-language model in the LFM2-VL blog post and about its language backbone in the LFM2 blog post.

📄 Model details

Due to their small size, we recommend fine-tuning LFM2-VL models on narrow use cases to maximize performance. They were trained for instruction following and lightweight agentic flows, and they are not intended for safety-critical decisions.

| Property | LFM2-VL-450M | LFM2-VL-1.6B |
|---|---|---|
| Parameters (LM only) | 350M | 1.2B |
| Vision encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
| Backbone layers | hybrid conv+attention | hybrid conv+attention |
| Context (text) | 32,768 tokens | 32,768 tokens |
| Image tokens | dynamic, user-tunable | dynamic, user-tunable |
| Vocab size | 65,536 | 65,536 |
| Precision | bfloat16 | bfloat16 |
| License | LFM Open License v1.0 | LFM Open License v1.0 |

Supported languages: English

Generation parameters: We recommend the following settings, illustrated in the sketch after this list:

  • Text: temperature=0.1, min_p=0.15, repetition_penalty=1.05
  • Vision: min_image_tokens=64, max_image_tokens=256, do_image_splitting=True
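
For reference, here is a minimal sketch of how these settings are typically wired up with Hugging Face transformers. Assumptions: the AutoModelForImageTextToText/AutoProcessor classes, the LiquidAI/LFM2-VL-450M PyTorch checkpoint (rather than this ONNX export), and an illustrative prompt; adapt ids and values to your setup.

from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2-VL-450M"  # assumed PyTorch checkpoint id; this repo hosts the ONNX export
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="bfloat16")
processor = AutoProcessor.from_pretrained(
    model_id,
    # vision-side settings (forwarded to the image processor): per-image token budget and tiling
    min_image_tokens=64,
    max_image_tokens=256,
    do_image_splitting=True,
)

image = load_image("https://www.ilankelman.org/stopsigns/australia.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True, tokenize=True
)

# text-side sampling settings recommended above
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])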

Chat template: LFM2-VL uses a ChatML-like chat template as follows:

<|startoftext|><|im_start|>system
You are a helpful multimodal assistant by Liquid AI.<|im_end|>
<|im_start|>user
<image>Describe this image.<|im_end|>
<|im_start|>assistant
This image shows a Caenorhabditis elegans (C. elegans) nematode.<|im_end|>

Images are referenced with a sentinel (<image>), which is automatically replaced with the image tokens by the processor.

You can apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.
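
For example (a hypothetical single-turn conversation, using the repository id of this ONNX export), rendering the template without tokenization makes the sentinel visible:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("onnx-community/LFM2-VL-450M-ONNX")

messages = [
    {"role": "user", "content": [
        {"type": "image"},  # rendered as the <image> sentinel
        {"type": "text", "text": "Describe this image."},
    ]},
]

# With tokenize=False the rendered prompt string is returned; the <image> sentinel is only
# expanded into actual image tokens when pixel inputs are passed through the processor.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)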

Architecture

  • Hybrid backbone: Language model tower (LFM2-1.2B or LFM2-350M) paired with SigLIP2 NaFlex vision encoders (400M shape-optimized or 86M base variant)
  • Native resolution processing: Handles images up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion
  • Tiling strategy: Splits large images into non-overlapping 512×512 patches and includes thumbnail encoding for global context (in the 1.6B model)
  • Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens (e.g., a 256×384 image → 96 tokens, 1000×3000 → 1,020 tokens); see the arithmetic sketch after this list
  • Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff without retraining
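
As a rough, unofficial back-of-the-envelope check of the token counts above (assuming 16-pixel vision patches and a 2×2 pixel-unshuffle factor, and ignoring any resizing the processor may apply):

import math

PATCH_SIZE = 16  # SigLIP2 NaFlex patch size in pixels (assumed)
UNSHUFFLE = 2    # pixel unshuffle merges each 2x2 grid of patches into one token

def approx_image_tokens(width: int, height: int) -> int:
    """Approximate token count for a single, untiled image."""
    cols = math.ceil(width / PATCH_SIZE / UNSHUFFLE)
    rows = math.ceil(height / PATCH_SIZE / UNSHUFFLE)
    return cols * rows

print(approx_image_tokens(384, 256))  # -> 96, matching the 256x384 example above
print(approx_image_tokens(512, 512))  # -> 256, one full 512x512 tile (the default max_image_tokens)

Images larger than 512×512 are first tiled into 512×512 patches (plus a thumbnail in the 1.6B model), so larger totals, such as the 1,020 tokens quoted for a 1000×3000 image, also reflect per-tile budgets rather than this single-image formula.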

Training approach

  • Builds on the LFM2 base model with joint mid-training that fuses vision and language capabilities using a gradually adjusted text-to-image ratio
  • Applies joint SFT with emphasis on image understanding and vision tasks
  • Leverages large-scale open-source datasets combined with in-house synthetic vision data, selected for balanced task coverage
  • Follows a progressive training strategy: base model β†’ joint mid-training β†’ supervised fine-tuning

πŸƒ How to run LFM2-VL

ONNX Runtime

from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
import onnxruntime
import numpy as np
from huggingface_hub import hf_hub_download

# 1. Load config, processor, and model
model_id = "onnx-community/LFM2-VL-450M-ONNX"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

local_dir = 'LFM2-VL-450M-ONNX'
vision_model_path = hf_hub_download(model_id, "vision_encoder.onnx", subfolder="onnx", local_dir=local_dir)         # Download vision graph
hf_hub_download(model_id, "vision_encoder.onnx_data", subfolder="onnx", local_dir=local_dir)                        # Download vision weights
embed_model_path = hf_hub_download(model_id, "embed_tokens.onnx", subfolder="onnx", local_dir=local_dir)            # Download embed_tokens graph
hf_hub_download(model_id, "embed_tokens.onnx_data", subfolder="onnx", local_dir=local_dir)                          # Download embed_tokens weights
decoder_model_path = hf_hub_download(model_id, "decoder_model_merged.onnx", subfolder="onnx", local_dir=local_dir)  # Download decoder graph
hf_hub_download(model_id, "decoder_model_merged.onnx_data", subfolder="onnx", local_dir=local_dir)                  # Download decoder weights

## Load sessions
providers = ['CPUExecutionProvider']
vision_session = onnxruntime.InferenceSession(vision_model_path, providers=providers)
embed_session = onnxruntime.InferenceSession(embed_model_path, providers=providers)
decoder_session = onnxruntime.InferenceSession(decoder_model_path, providers=providers)

## Set config values
text_config = config.text_config
num_key_value_heads = text_config.num_key_value_heads
head_dim = text_config.hidden_size // text_config.num_attention_heads
num_hidden_layers = text_config.num_hidden_layers
eos_token_id = text_config.eos_token_id
hidden_size = text_config.hidden_size
conv_L_cache = text_config.conv_L_cache
layer_types = text_config.layer_types
image_token_index = config.image_token_index

# 2. Prepare inputs
image_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(image_url)
messages = [
  {
    "role": "user",
    "content": [
      {"type": "image", "image": image},
      {"type": "text", "text": "What is in this image?"},
    ],
  },
]
inputs = processor.apply_chat_template(
  messages,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
  tokenize=True,
)

input_ids = inputs['input_ids'].numpy()
attention_mask = inputs['attention_mask'].numpy()
has_vision_inputs = 'pixel_values' in inputs
pixel_values = inputs['pixel_values'].numpy() if has_vision_inputs else None
pixel_attention_mask = inputs['pixel_attention_mask'].numpy().astype(np.int64) if has_vision_inputs else None
spatial_shapes = inputs['spatial_shapes'].numpy() if has_vision_inputs else None

batch_size = input_ids.shape[0]
past_cache_values = {}
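## Hybrid cache: full-attention layers use empty (key, value) tensors that grow during
## decoding, while conv layers keep a fixed-length conv state of size conv_L_cache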
for i in range(num_hidden_layers):
  if layer_types[i] == 'full_attention':
    for kv in ('key', 'value'):
      past_cache_values[f'past_key_values.{i}.{kv}'] = np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
  elif layer_types[i] == 'conv':
    past_cache_values[f'past_conv.{i}'] = np.zeros([batch_size, hidden_size, conv_L_cache], dtype=np.float32)
  else:
    raise ValueError(f"Unsupported layer type: {layer_types[i]}")

# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
image_features = None
for i in range(max_new_tokens):
  inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]

  if has_vision_inputs and image_features is None:
    ## Only compute vision features if not already computed
    image_features = vision_session.run(None, dict(
      pixel_values=pixel_values,
      pixel_attention_mask=pixel_attention_mask,
      spatial_shapes=spatial_shapes,
    ))[0]

    ## Merge text and vision embeddings
    inputs_embeds[input_ids == image_token_index] = image_features.reshape(-1, image_features.shape[-1])

  logits, *present_cache_values = decoder_session.run(None, dict(
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    **past_cache_values,
  ))

  ## Update values for next generation loop
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=attention_mask.dtype)], axis=-1)
  for j, key in enumerate(past_cache_values):
    past_cache_values[key] = present_cache_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
  if np.isin(input_ids, eos_token_id).any():
    break

  ## (Optional) Streaming
  print(processor.decode(input_ids[0], skip_special_tokens=False), end='', flush=True)
print()

# 4. Output result
print(processor.batch_decode(generated_tokens, skip_special_tokens=False)[0])

🔧 How to fine-tune

We recommend fine-tuning LFM2-VL models on your use cases to maximize performance.

| Notebook | Description | Link |
|---|---|---|
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | Colab link |
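
If you prefer a script over the notebook, a minimal sketch of the same recipe with TRL and PEFT could look like the following. Assumptions: the LiquidAI/LFM2-VL-450M PyTorch checkpoint, a train_dataset with "messages" and "images" columns, and illustrative hyperparameters; the notebook above remains the reference recipe.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "LiquidAI/LFM2-VL-450M"  # assumed base checkpoint (not this ONNX export)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

def collate_fn(examples):
    # Render the chat template, then tokenize text and images together
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    # For simplicity this masks only padding; you may also want to mask image token positions
    batch["labels"] = labels
    return batch

training_args = SFTConfig(
    output_dir="lfm2-vl-450m-sft-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,                    # keep the image column for the collator
    dataset_kwargs={"skip_prepare_dataset": True},  # the collator handles preprocessing
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,                    # assumed dataset with "messages" and "images"
    data_collator=collate_fn,
    peft_config=LoraConfig(r=8, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM"),
)
trainer.train()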

📈 Performance

| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | - |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |

| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SmolVLM2-500M | 49.90 | 11.27 | 24.64 | 609 | 40.70 | 38.20 | 34.10 | 37.50 | 62.20 | 29.90 | 1448.30 | - |
| LFM2-VL-450M | 52.29 | 26.18 | 46.51 | 655 | 41.98 | 40.87 | 33.11 | 44.70 | 63.50 | 33.76 | 1239.06 | 40.16 |

We obtained MM-IFEval and InfoVQA (Val) scores for InternVL 3 and SmolVLM2 models using VLMEvalKit.

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.

Citation

@article{liquidai2025lfm2,
 title={LFM2 Technical Report},
 author={Liquid AI},
 journal={arXiv preprint arXiv:2511.23404},
 year={2025}
}