NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
Paper: arXiv:2603.02802
NOVA is a pair-free video editing model built on WAN 1.3B Fun InP. It uses sparse keyframe control (e.g., a single edited first frame) to guide dense video synthesis, trained without requiring paired before/after video data.
The framework consists of a sparse branch that provides semantic guidance through user-edited keyframes, and a dense branch that incorporates motion and texture information from the original video to maintain fidelity and temporal coherence.
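The sparse/dense split above can be pictured as per-frame conditioning: edited keyframes follow the sparse (edit) branch, while all other frames are driven by the dense branch carrying source-video motion and texture. The following is a minimal, hypothetical sketch of that idea; the function names and the simple linear blend are assumptions for illustration, not the model's actual fusion mechanism.

```python
# Hypothetical sketch of sparse-keyframe conditioning (names and blend assumed).
# With --first_only, only frame 0 is a user-edited keyframe; the dense branch
# supplies guidance for every other frame.

def keyframe_mask(num_frames: int, keyframe_indices: list[int]) -> list[float]:
    """Per-frame conditioning mask: 1.0 at edited keyframes, 0.0 elsewhere."""
    mask = [0.0] * num_frames
    for i in keyframe_indices:
        mask[i] = 1.0
    return mask


def blend_guidance(sparse_feat: list[float],
                   dense_feat: list[float],
                   mask: list[float]) -> list[float]:
    """Edited keyframes take the sparse branch; other frames take the dense branch."""
    return [m * s + (1.0 - m) * d
            for s, d, m in zip(sparse_feat, dense_feat, mask)]


# Example: 81 frames (matching --num_frames 81), first frame edited.
mask = keyframe_mask(81, [0])
```

In the real model the blend happens on latent features inside the diffusion backbone; the list arithmetic here only illustrates which frames each branch controls.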
For full installation and training instructions, please visit the GitHub repository.
You can run inference using the infer_nova.py script. Below is an example for single-GPU inference:
python infer_nova.py \
--dataset_path ./example_videos \
--metadata_file_name metadata.csv \
--ckpt_path /path/to/checkpoints/stepXXX.ckpt \
--output_path ./inference_results \
--text_encoder_path /path/to/models_t5_umt5-xxl-enc-bf16.pth \
--image_encoder_path /path/to/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth \
--vae_path /path/to/Wan2.1_VAE.pth \
--dit_path /path/to/diffusion_pytorch_model.safetensors \
--num_samples 5 \
--num_inference_steps 50 \
--num_frames 81 \
--height 480 \
--width 832 \
--first_only
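The command above expects --dataset_path to contain the input videos and a metadata file named by --metadata_file_name. The exact CSV schema is defined by the repository; the column names below are purely hypothetical placeholders showing how such a file might be assembled in Python:

```python
# Hypothetical example of preparing a metadata.csv for infer_nova.py.
# The column names and paths here are assumptions for illustration only;
# consult the GitHub repository for the real schema.
import csv
import os

dataset_dir = "./example_videos"
os.makedirs(dataset_dir, exist_ok=True)

rows = [
    # (source video, user-edited first frame, editing prompt) -- assumed layout
    ("clips/cat.mp4", "edits/cat_frame0.png", "a cat wearing a red hat"),
]

with open(os.path.join(dataset_dir, "metadata.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_path", "edited_first_frame", "prompt"])
    writer.writerows(rows)
```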
@article{pan2026nova,
  title={NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing},
  author={Tianlin Pan and Jiayi Dai and Chenpu Yuan and Zhengyao Lv and Binxin Yang and Hubery Yin and Chen Li and Jing Lyu and Caifeng Shan and Chenyang Si},
  journal={arXiv preprint arXiv:2603.02802},
  year={2026}
}