---
base_model:
- lmms-lab/LLaVA-Video-7B-Qwen2
datasets:
- lmms-lab/LLaVA-Video-178K
language:
- en
library_name: transformers
license: cc-by-nc-sa-4.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
tags:
- Action
- Video
- MQA
- multimodal
- VLM
- LLaVAction
- MLLMs
model-index:
- name: LLaVAction-7B
  results:
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 59
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 61.1
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 82.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: PercepTest
      type: percepTest
    metrics:
    - type: accuracy
      value: 70.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 58.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 63.9
      name: accuracy
      verified: true
    - type: accuracy
      value: 71.4
      name: accuracy
      verified: true
---

# LLaVAction-7B

<div align="center">
<h2>LLaVAction: evaluating and training multi-modal large language models for action recognition</h2>

[Shaokai Ye](https://yeshaokai.github.io/)<sup>1**</sup>
[Haozhe Qi](https://people.epfl.ch/haozhe.qi)<sup>1**</sup>

[Alexander Mathis](https://mathislab.org/)<sup>1</sup><sup>†</sup>
[Mackenzie Weygandt Mathis](https://www.mackenziemathislab.org/mackenziemathis)<sup>1</sup><sup>†</sup><sup>‡</sup>

<sup>1</sup> EPFL

<sup>**</sup> First authors <sup>†</sup> Senior authors <sup>‡</sup> Corresponding author

\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]

</div>

## Model Summary
LLaVAction-7B is trained on EPIC-KITCHENS-100-MQA and builds on the Qwen2 language model, with a context window of 32K tokens.
The model supports up to 64 input frames (see the sampling sketch below).

- **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
- **Languages**: English
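
Because the model accepts at most 64 frames, longer clips are uniformly downsampled before being passed in. A minimal sketch of that sampling step (the full `load_video` helper in the Generation example below does this end to end; the frame count here is a placeholder):

```python
import numpy as np

max_frames_num = 64      # the model's frame budget
total_frame_num = 1800   # placeholder: e.g. a 60 s clip at 30 fps
# Spread 64 indices uniformly across the whole clip.
frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
print(len(frame_idx))    # 64
```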
## Usage

### Intended use
The model was trained on EPIC-KITCHENS-100-MQA [dataset release pending] and [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). It has improved capabilities for understanding egocentric human actions in videos.

### Generation
We provide a simple generation example below; for more details, please refer to our [GitHub repository](https://github.com/AdaptiveMotorControlLab/LLaVAction).

```python
# Install first: pip install llavaction

from llavaction.model.builder import load_pretrained_model
from llavaction.mm_utils import tokenizer_image_token
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llavaction.conversation import conv_templates
import copy
import torch
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")

# Your video (the model assumes an egocentric viewpoint)
video_path = "XXXX"

# These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and uniformly sample at most `max_frames_num` frames."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3)), [], 0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    # Sample at roughly `fps` frames per second of video.
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    # If that yields too many frames (or sampling is forced), spread
    # `max_frames_num` indices uniformly over the whole clip.
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


# Load the pretrained model.
pretrained = "MLAdaptiveIntelligence/LLaVAction-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # add any other arguments you want to pass via llava_model_args
model.eval()

# Load and preprocess up to 64 uniformly sampled frames.
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]

# Build the chat prompt. Make sure you use the correct chat template for different models.
conv_template = "qwen_1_5"
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Greedy decoding of the answer.
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
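
The preprocessed `video` tensor can be reused for follow-up questions: rebuild the prompt with a new question and call `model.generate` again. A small sketch reusing the variables from the example above (the follow-up question wording is illustrative, not a prompt from training):

```python
# Ask a second question about the same preprocessed clip.
follow_up = "What object are you interacting with, and what do you do with it?"  # illustrative question
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {follow_up}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    cont = model.generate(
        input_ids,
        images=video,
        modalities=["video"],
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip())
```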

## Training

See Ye et al. (2025) for details: [arXiv:2503.18712](https://arxiv.org/abs/2503.18712)

### Model
- **Architecture**: SO400M + Qwen2
- **Initialized Model**: lmms-lab/LLaVA-Video-7B-Qwen2
- **Data**: a mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA, trained for 2 epochs over the full model
- **Precision**: bfloat16

### Hardware & Software
- **GPUs**: 32 × NVIDIA GH200 (for training the whole model series)
- **Orchestration**: Hugging Face Trainer
- **Neural networks**: PyTorch

## Citation

arXiv: [arxiv.org/abs/2503.18712](https://arxiv.org/abs/2503.18712)

```bibtex
@article{YeQi2025llavaction,
  title   = {LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author  = {Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal = {arXiv preprint arXiv:2503.18712},
  year    = {2025}
}
```