MultiCOIN: Multi-Modal COntrollable Video INbetweening

¹Simon Fraser University  ²Adobe

Results

Given keyframes augmented with multi-modal controls (such as trajectories, depth, target regions, and text prompts), our unified model synthesizes high-quality, temporally consistent in-between frames that respect both the keyframes and the specified controls.

Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the user's creative intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls, which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
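As a concrete illustration of the sparse point-based representation mentioned above, the following is a minimal, hypothetical sketch (not the authors' implementation): it rasterizes a user-drawn trajectory, given as per-frame (x, y) pixel coordinates, into a sparse RGB control video on a black canvas, with one color per track so point identity is preserved across frames. The function name, point radius, and coloring scheme are assumptions made purely for illustration.

import numpy as np

# Hypothetical helper (not from the paper): rasterize trajectories into a
# sparse point-based control video of shape (T, H, W, 3).
def trajectories_to_sparse_video(trajectories, num_frames, height, width, radius=3):
    control = np.zeros((num_frames, height, width, 3), dtype=np.float32)
    rng = np.random.default_rng(0)
    colors = rng.uniform(0.2, 1.0, size=(len(trajectories), 3))  # one color per track
    for track_id, track in enumerate(trajectories):
        for t, (x, y) in enumerate(track[:num_frames]):
            y0, y1 = max(0, int(y) - radius), min(height, int(y) + radius + 1)
            x0, x1 = max(0, int(x) - radius), min(width, int(x) + radius + 1)
            control[t, y0:y1, x0:x1] = colors[track_id]  # stamp a small colored square
    return control

# Example: one point moving diagonally across 16 frames of a 256x256 clip.
demo_track = [[(8 * t, 8 * t) for t in range(16)]]
sparse_control_video = trajectories_to_sparse_video(demo_track, 16, 256, 256)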

Pipeline diagram
Given a video X, we extract multi-modal motion controls through two generators: the Sparse Motion Generator, which operates on optical flow, and the Sparse Depth Generator, which operates on depth maps; both produce sparse RGB points for trajectory and depth control. An Augmented Frame Generator computes target regions and masks to enable fine-grained content control. All control signals are encoded via a dual-branch embedder architecture that separately captures motion and content features. In addition, a text prompt is processed by a text encoder to provide semantic guidance over the generated content. At inference, the model flexibly integrates these multi-modal controls for interpolation.
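The dual-branch embedder can be sketched as two lightweight encoders whose features are fused with the noisy video latent before it enters the DiT blocks. The PyTorch module below is a minimal sketch under our own assumptions; the module name, channel counts, and additive fusion are illustrative and not the paper's exact architecture.

import torch
import torch.nn as nn

class DualBranchControlEmbedder(nn.Module):
    # Minimal sketch (assumed architecture): one branch embeds motion controls
    # (sparse trajectory/depth point videos), the other embeds content controls
    # (keyframes, target regions, masks); channel counts are placeholders.
    def __init__(self, motion_channels=6, content_channels=7, latent_dim=320):
        super().__init__()
        self.motion_branch = nn.Sequential(
            nn.Conv3d(motion_channels, latent_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1),
        )
        self.content_branch = nn.Sequential(
            nn.Conv3d(content_channels, latent_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, motion_controls, content_controls, noisy_latent):
        # Each branch is encoded separately, then added to the noisy latent
        # that is fed to the diffusion transformer (additive fusion assumed).
        return (noisy_latent
                + self.motion_branch(motion_controls)
                + self.content_branch(content_controls))

# Example shapes: batch 1, 16 frames at 32x32 latent resolution.
embedder = DualBranchControlEmbedder()
latent = torch.randn(1, 320, 16, 32, 32)
motion = torch.randn(1, 6, 16, 32, 32)
content = torch.randn(1, 7, 16, 32, 32)
out = embedder(motion, content, latent)  # -> (1, 320, 16, 32, 32)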

Qualitative Results

BibTeX


@misc{tanveer2025multicoinmultimodalcontrollablevideo,
  title={MultiCOIN: Multi-Modal COntrollable Video INbetweening},
  author={Maham Tanveer and Yang Zhou and Simon Niklaus and Ali Mahdavi Amiri and Hao Zhang and Krishna Kumar Singh and Nanxuan Zhao},
  year={2025},
  eprint={2510.08561},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.08561},
}