Transformers documentation
This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-27.
GLM-OCR
Overview
GLM-OCR is a multimodal OCR (Optical Character Recognition) model designed for complex document understanding from Z.ai. The model combines a CogViT visual encoder (pre-trained on large-scale image-text data), a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder.
Key features of GLM-OCR include:
- Lightweight: Only 0.9B parameters while achieving state-of-the-art performance (94.62 on OmniDocBench V1.5)
- Multi-task: Excels at text recognition, formula recognition, table recognition, and information extraction
- Multi-modal: Processes document images for text, formula, and table extraction
This model was contributed by the zai-org team. The original code can be found here.
Usage example
Single image inference
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
Batch inference
The model supports batching multiple images for efficient processing.
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
# First document
message1 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
# Second document
message2 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
messages = [message1, message2]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True))
Flash Attention 2
GLM-OCR supports Flash Attention 2 for faster inference. First, install the latest version of Flash Attention:
pip install -U flash-attn --no-build-isolation
Then load the model with one of the supported attention kernels from the kernels-community:
from transformers import GlmOcrForConditionalGeneration
import torch
model = GlmOcrForConditionalGeneration.from_pretrained(
"zai-org/GLM-OCR",
dtype=torch.bfloat16,
attn_implementation="kernels-community/flash-attn2", # other options: kernels-community/vllm-flash-attn3, kernels-community/paged-attention
device_map="auto",
)
GlmOcrConfig
class transformers.GlmOcrConfig
< source >( text_config = None vision_config = None image_token_id = 59280 video_token_id = 59281 image_start_token_id = 59256 image_end_token_id = 59257 video_start_token_id = 59258 video_end_token_id = 59259 tie_word_embeddings = False **kwargs )
Parameters
- text_config (`PreTrainedConfig` or `dict`, optional) — The config object or dictionary of the text backbone.
- vision_config (`PreTrainedConfig` or `dict`, optional) — The config object or dictionary of the vision backbone.
- image_token_id (`int`, optional, defaults to 59280) — The image token index used as a placeholder for input images.
- video_token_id (`int`, optional, defaults to 59281) — The video token index used as a placeholder for input videos.
- image_start_token_id (`int`, optional, defaults to 59256) — The image start token index to encode the start of an image.
- image_end_token_id (`int`, optional, defaults to 59257) — The image end token index to encode the end of an image.
- video_start_token_id (`int`, optional, defaults to 59258) — The video start token index to encode the start of a video.
- video_end_token_id (`int`, optional, defaults to 59259) — The video end token index to encode the end of a video.
- tie_word_embeddings (`bool`, optional, defaults to `False`) — Whether the model's input and output word embeddings should be tied.
This is the configuration class to store the configuration of a GlmOcrModel. It is used to instantiate a GLM-OCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import GlmOcrForConditionalGeneration, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = GlmOcrForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionConfig
class transformers.GlmOcrVisionConfig
< source >( depth = 24 hidden_size = 1024 hidden_act = 'silu' attention_bias = True attention_dropout = 0.0 num_heads = 16 in_channels = 3 image_size = 336 patch_size = 14 rms_norm_eps = 1e-05 spatial_merge_size = 2 temporal_patch_size = 2 out_hidden_size = 1536 intermediate_size = 4096 initializer_range = 0.02 **kwargs )
Parameters
- depth (`int`, optional, defaults to 24) — Number of Transformer layers in the vision encoder.
- hidden_size (`int`, optional, defaults to 1024) — Dimension of the hidden representations.
- hidden_act (`str`, optional, defaults to `"silu"`) — The non-linear activation function (function or string) in the encoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.
- attention_bias (`bool`, optional, defaults to `True`) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- num_heads (`int`, optional, defaults to 16) — Number of attention heads for each attention layer in the vision encoder.
- in_channels (`int`, optional, defaults to 3) — The number of input channels.
- image_size (`int`, optional, defaults to 336) — The size (resolution) of each image.
- patch_size (`int`, optional, defaults to 14) — The size (resolution) of each patch.
- rms_norm_eps (`float`, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- spatial_merge_size (`int`, optional, defaults to 2) — The size of the spatial merge window used to reduce the number of visual tokens by merging neighboring patches.
- temporal_patch_size (`int`, optional, defaults to 2) — Temporal patch size used in the 3D patch embedding for video inputs.
- out_hidden_size (`int`, optional, defaults to 1536) — The output hidden size of the vision model.
- intermediate_size (`int`, optional, defaults to 4096) — Dimension of the MLP representations.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a GlmOcrVisionModel. It is used to instantiate a GLM-OCR vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import GlmOcrVisionConfig, GlmOcrVisionModel
>>> # Initializing a GlmOcrVisionConfig GLM-OCR style configuration
>>> configuration = GlmOcrVisionConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = GlmOcrVisionModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrTextConfig
class transformers.GlmOcrTextConfig
< source >( vocab_size: int | None = 59392 hidden_size: int | None = 1024 intermediate_size: int | None = 4096 num_hidden_layers: int | None = 16 num_attention_heads: int | None = 16 num_key_value_heads: int | None = 8 hidden_act: str | None = 'silu' max_position_embeddings: int | None = 131072 initializer_range: float | None = 0.02 rms_norm_eps: int | None = 1e-05 use_cache: bool | None = True attention_dropout: float | None = 0.0 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None pad_token_id: int | None = None **kwargs )
Parameters
- vocab_size (`int`, optional, defaults to 59392) — Vocabulary size of the model. Defines the number of different tokens that can be represented by the `input_ids`.
- hidden_size (`int`, optional, defaults to 1024) — Dimension of the hidden representations.
- intermediate_size (`int`, optional, defaults to 4096) — Dimension of the MLP representations.
- num_hidden_layers (`int`, optional, defaults to 16) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (`int`, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (`int`, optional, defaults to 8) — This is the number of key/value heads that should be used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to `num_attention_heads`.
- hidden_act (`str`, optional, defaults to `"silu"`) — The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.
- max_position_embeddings (`int`, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
- rms_norm_eps (`float`, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- use_cache (`bool`, optional, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True` or when the model is a decoder-only generative model.
- attention_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- rope_parameters (`RopeParameters` or `dict[str, RopeParameters]`, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for `rope_theta` and optionally parameters used for scaling in case you want to use RoPE with a longer `max_position_embeddings`.
- pad_token_id (`int`, optional) — Token id used for padding in the vocabulary.
This is the configuration class to store the configuration of a GlmOcrTextModel. It is used to instantiate a GLM-OCR text decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import GlmOcrTextModel, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model from the GLM-OCR style configuration
>>> model = GlmOcrTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionModel
forward
< source >( hidden_states: Tensor grid_thw: Tensor **kwargs ) → torch.Tensor
The GlmOcrVisionModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
GlmOcrTextModel
class transformers.GlmOcrTextModel
< source >( config: GlmOcrTextConfig )
Parameters
- config (GlmOcrTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GLM-OCR Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None use_cache: bool | None = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- past_key_values (`~cache_utils.Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only a Cache instance is allowed as input; see our kv cache guide. If no `past_key_values` are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
Returns
BaseModelOutputWithPast or tuple(torch.FloatTensor)
A BaseModelOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (None) and inputs.
The GlmOcrTextModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model. If `past_key_values` is used, only the last hidden-state of the sequences of shape `(batch_size, 1, hidden_size)` is output.
- past_key_values (`Cache`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if `config.is_encoder_decoder=True`, in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
GlmOcrModel
class transformers.GlmOcrModel
< source >( config )
Parameters
- config (GlmOcrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GLM-OCR Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None rope_deltas: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- past_key_values (`~cache_utils.Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only a Cache instance is allowed as input; see our kv cache guide. If no `past_key_values` are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- pixel_values (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, optional) — The tensors corresponding to the input images. Pixel values can be obtained using the model's image processor (see its `__call__` method for details).
- pixel_values_videos (`torch.FloatTensor` of shape `(batch_size, num_frames, num_channels, frame_size, frame_size)`, optional) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using the model's video processor (see its `__call__` method for details).
- image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, optional) — The temporal, height and width of feature shape of each image in the LLM.
- video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, optional) — The temporal, height and width of feature shape of each video in the LLM.
- rope_deltas (`torch.LongTensor` of shape `(batch_size,)`, optional) — The rope index difference between sequence length and multimodal rope.
- mm_token_type_ids (`torch.IntTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
Returns
GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
A GlmOcrModelOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (None) and inputs.
The GlmOcrModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional, defaults to `None`) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (`Cache`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
- hidden_states (`tuple[torch.FloatTensor]`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple[torch.FloatTensor]`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- rope_deltas (`torch.LongTensor` of shape `(batch_size,)`, optional) — The rope index difference between sequence length and multimodal rope.
get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input images.
- image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, optional) — The temporal, height and width of feature shape of each image in the LLM.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_placeholder_mask
< source >( input_ids: LongTensor inputs_embeds: FloatTensor image_features: torch.FloatTensor | None = None video_features: torch.FloatTensor | None = None )
Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is
equal to the length of multimodal features. If the lengths are different, an error is raised.
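The check can be illustrated with a minimal sketch (plain Python lists instead of tensors; the placeholder id comes from GlmOcrConfig above, everything else is made up for illustration and is not the library implementation):

```python
# Toy illustration of the placeholder check: count the image placeholder
# tokens in input_ids and require that the count matches the number of image
# feature vectors produced by the vision encoder.

image_token_id = 59280  # GLM-OCR's image placeholder id (see GlmOcrConfig)

input_ids = [1, 59280, 59280, 59280, 2]  # toy sequence with 3 placeholders
num_image_features = 3                   # pretend the encoder emitted 3 vectors

placeholder_mask = [tok == image_token_id for tok in input_ids]
if sum(placeholder_mask) != num_image_features:
    raise ValueError(
        f"Image features and image tokens do not match: "
        f"tokens {sum(placeholder_mask)}, features {num_image_features}"
    )
print(placeholder_mask)  # [False, True, True, True, False]
```

The real method performs the same count on tensors and raises when the two lengths differ.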
get_rope_index
< source >( input_ids: LongTensor mm_token_type_ids: IntTensor image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None **kwargs )
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
- mm_token_type_ids (`torch.IntTensor` of shape `(batch_size, sequence_length)`) — Token type ids matching each modality to a different value in the input sequence, i.e. text (0), image (1), video (2).
- image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, optional) — The temporal, height and width of feature shape of each image in the LLM.
- video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, optional) — The temporal, height and width of feature shape of each video in the LLM.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
Calculate the 3D rope index based on image and video’s sizes. The utility expects a vision + text
sequence and will error out otherwise. For pure text sequence, please rely on model’s auto-inferred
position ids. In a mixed vision + text sequence, vision tokens use 3D RoPE (temporal, height, width)
while text tokens use standard 1D RoPE.
Example: temporal patches = 3, height patches = 2, width patches = 2. Each vision input results in (temporal × height × width) positions; here: 3 × 2 × 2 = 12 positions total.
Temporal position IDs are spaced by:
interval = tokens_per_second * temporal_patch_size / fps
With fps = 1, tokens_per_second = 25, and temporal_patch_size = 2, temporal IDs increase by 50 for each temporal patch:
[0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
Height IDs repeat per row: [0, 0, 1, 1, ...]
Width IDs alternate per column: [0, 1, 0, 1, ...]
Text tokens follow standard 1D RoPE, and their position IDs grow consecutively with a step of 1.
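The pattern above can be reproduced with a short sketch. This is a simplified illustration for a single vision input (no text offsets or batching), not the `get_rope_index` implementation; the function name and signature are made up:

```python
# Build the (temporal, height, width) position IDs for a T x H x W patch grid,
# matching the worked example: T=3, H=2, W=2, temporal interval 50.

def vision_position_ids(t, h, w, interval=1):
    """Return three lists of length t*h*w: temporal, height and width IDs."""
    temporal, height, width = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                temporal.append(ti * interval)  # spaced by the temporal interval
                height.append(hi)               # repeats per row
                width.append(wi)                # alternates per column
    return temporal, height, width

# interval = tokens_per_second * temporal_patch_size / fps = 25 * 2 / 1 = 50
temporal, height, width = vision_position_ids(3, 2, 2, interval=50)
print(temporal)  # [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
print(height)    # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
print(width)     # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

Each of the 12 positions carries one index per axis; the model applies 3D RoPE over these three streams while text tokens keep a single 1D stream.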
get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input videos.
- video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, optional) — The temporal, height and width of feature shape of each video in the LLM.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_vision_position_ids
< source >( start_position: int grid_thw: list[int, int, int] | torch.Tensor temp_merge_size: int = 1 spatial_merge_size: int = 1 time_interval: int = 1 device: str | torch.device | None = None ) → torch.LongTensor of shape (3, sequence_length)
Parameters
- **start_position** (`int`) — Offset added to all computed positional indices.
- **grid_thw** (`Sequence[int]` or `torch.Tensor` of shape `(3,)`) — The (T, H, W) grid representing the feature layout of the current image or video after patch embedding.
- **temp_merge_size** (`int`, *optional*, defaults to 1) — Factor by which the temporal dimension is reduced in the backbone. The temporal grid size is divided by this value.
- **spatial_merge_size** (`int`, *optional*, defaults to 1) — Factor by which the spatial dimensions (H and W) are reduced in the backbone. Both H and W are divided by this value.
- **time_interval** (`int`, *optional*, defaults to 1) — Spacing factor applied between consecutive temporal position indices.
- **device** (`str` or `torch.device`, *optional*) — Device on which the resulting tensor is allocated. If `None`, uses the current default device.
Returns
torch.LongTensor of shape (3, sequence_length)
Positional indices for temporal, height, and width dimensions,
flattened into sequence form and offset by start_position.
Compute 3D positional indices for vision tokens derived from a single image or video input.
The positions are generated from the input grid defined by temporal (T), height (H), and
width (W) dimensions. Temporal and spatial dimensions can be downscaled according to the
merge sizes used in the vision backbone. The resulting positions are offset by start_position.
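As an illustration of the index layout this method describes (a plain-Python sketch, not the model's actual implementation), the (3, sequence_length) grid can be built by downscaling T by `temp_merge_size`, H and W by `spatial_merge_size`, spacing temporal indices by `time_interval`, and offsetting everything by `start_position`:

```python
def vision_position_ids_sketch(start_position, grid_thw,
                               temp_merge_size=1, spatial_merge_size=1,
                               time_interval=1):
    """Hypothetical reimplementation: row 0 = temporal index,
    row 1 = height index, row 2 = width index, flattened T-major."""
    t, h, w = grid_thw
    t //= temp_merge_size          # temporal downscaling in the backbone
    h //= spatial_merge_size       # spatial downscaling (height)
    w //= spatial_merge_size       # spatial downscaling (width)
    pos_t, pos_h, pos_w = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                pos_t.append(start_position + ti * time_interval)
                pos_h.append(start_position + hi)
                pos_w.append(start_position + wi)
    return [pos_t, pos_h, pos_w]

# A 2x4x4 grid with spatial_merge_size=2 yields 2 * (4//2) * (4//2) = 8 tokens
ids = vision_position_ids_sketch(start_position=5, grid_thw=(2, 4, 4),
                                 spatial_merge_size=2, time_interval=2)
```

With these settings each of the three rows has 8 entries; the temporal row takes values 5 and 7 (spaced by `time_interval`), while the height and width rows cycle over the 2x2 merged spatial grid.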
- forward
GlmOcrForConditionalGeneration
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.
- **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- **past_key_values** (`~cache_utils.Cache`, *optional*) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only a Cache instance is allowed as input; see the kv cache guide. If no `past_key_values` are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- **pixel_values** (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) — The tensors corresponding to the input images. Pixel values can be obtained using `image_processor_class`. See `image_processor_class.__call__` for details (`processor_class` uses `image_processor_class` for processing images).
- **pixel_values_videos** (`torch.FloatTensor` of shape `(batch_size, num_frames, num_channels, frame_size, frame_size)`, *optional*) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using `video_processor_class`. See `video_processor_class.__call__` for details (`processor_class` uses `video_processor_class` for processing videos).
- **image_grid_thw** (`torch.LongTensor` of shape `(num_images, 3)`, *optional*) — The temporal, height and width of feature shape of each image in the LLM.
- **video_grid_thw** (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*) — The temporal, height and width of feature shape of each video in the LLM.
- **mm_token_type_ids** (`torch.IntTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
- **logits_to_keep** (`Union[int, torch.Tensor]`, *optional*, defaults to `0`) — If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or a large vocabulary size. If a `torch.Tensor`, must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
A GlmOcrCausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (GlmOcrConfig) and inputs.
- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
- **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance; for more details, see the kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
- **hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- **rope_deltas** (`torch.LongTensor` of shape `(batch_size,)`, *optional*) — The rope index difference between sequence length and multimodal rope.
The GlmOcrForConditionalGeneration forward method overrides the `__call__` special method. Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...             {"type": "text", "text": "What is shown in this image?"},
...         ],
...     },
... ]
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=[text], images=[image], return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"The image shows a street scene with a red stop sign in the foreground. In the background, there is a large red gate with Chinese characters ..."

get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input images.
- **image_grid_thw** (`torch.LongTensor` of shape `(num_images, 3)`, *optional*) — The temporal, height and width of feature shape of each image in the LLM.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (GlmOcrConfig) and inputs.
- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- **pixel_values_videos** (`torch.FloatTensor` of shape `(batch_size, num_frames, num_channels, frame_size, frame_size)`) — The tensors corresponding to the input videos.
- **video_grid_thw** (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*) — The temporal, height and width of feature shape of each video in the LLM.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (GlmOcrConfig) and inputs.
- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

- forward