Peekaboo: Interactive Video Generation via Masked-Diffusion

CVPR, 2024

Yash Jain1*, Anshul Nasery2*, Vibhav Vineet3, Harkirat Behl3
1Microsoft, 2University of Washington, 3Microsoft Research
*equal contribution


  • We introduce Peekaboo, which enables interactive video generation by adding spatio-temporal and motion control to the output of any off-the-shelf UNet-based video generation model.
  • Peekaboo is completely training-free and adds no inference latency overhead. It can be readily deployed on any UNet-based text-to-video diffusion model.
  • We also propose four new quantitative evaluation benchmarks for interactive video generation, based on the LaSOT, DAVIS-16, ssv2, and IMC datasets.

Method

Peekaboo converts the attention modules of an off-the-shelf 3D UNet into masked spatio-temporal mixed-attention modules. We use local context to generate individual objects and therefore guide the generation process with attention masks. For each of the spatial, cross, and temporal attention layers, we compute attention masks such that foreground pixels and background pixels attend only within their own region. We illustrate these mask computations for an input mask that changes over time, as shown on the left; green pixels are background and orange pixels are foreground. The masking is applied for a fixed number of steps, after which free generation is allowed. Foreground and background pixels are thus hidden from each other before becoming visible, akin to a game of Peekaboo.
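
The region-restricted attention described above can be illustrated with a minimal sketch. It is not the released implementation: the function names are hypothetical, the mask is assumed to be applied as an additive bias (0 where attention is allowed, negative infinity where it is blocked) before the softmax, and "steps" is assumed to mean denoising steps. Only the spatial and temporal cases are shown.

```python
NEG_INF = float("-inf")

def spatial_attention_bias(frame_mask):
    """Build an N x N additive bias for one frame (hypothetical helper).

    frame_mask is a 2D list of 0/1 values, 1 = foreground.
    A token may attend to another token only if both carry the same label.
    """
    labels = [v for row in frame_mask for v in row]  # flatten H x W to N tokens
    return [[0.0 if a == b else NEG_INF for b in labels] for a in labels]

def temporal_attention_bias(video_mask, y, x):
    """Build an F x F additive bias for pixel (y, x) across F frames.

    Frame f1 may attend to frame f2 at this pixel only if the pixel has the
    same foreground/background label in both frames.
    """
    labels = [frame[y][x] for frame in video_mask]
    return [[0.0 if a == b else NEG_INF for b in labels] for a in labels]

def use_masking(step, masked_steps):
    """Masking is active for the first masked_steps steps, then switched off."""
    return step < masked_steps

# Toy input: two 2x2 frames; the foreground pixel moves from top-left to top-right.
video_mask = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
]
spatial = spatial_attention_bias(video_mask[0])
temporal = temporal_attention_bias(video_mask, 0, 0)
```

In this toy input, the top-left pixel is foreground in the first frame and background in the second, so its temporal bias blocks attention between the two frames while masking is active.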

More Results

Motion Control

Peekaboo allows us to control the trajectory of an object precisely.
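
One simple way to specify a trajectory, shown here only as an illustration, is to interpolate a bounding box linearly between a start and an end position. The helper name and the normalized (x0, y0, x1, y1) box format are assumptions and are not taken from the Peekaboo code.

```python
def interpolate_boxes(start, end, num_frames):
    """Linearly interpolate (x0, y0, x1, y1) boxes from start to end.

    Hypothetical helper: returns one box per frame, with the first frame at
    `start` and the last frame at `end`.
    """
    if num_frames < 2:
        return [tuple(start)]
    boxes = []
    for f in range(num_frames):
        t = f / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return boxes

# A box of fixed size moving left to right over five frames.
trajectory = interpolate_boxes((0.0, 0.25, 0.5, 0.75), (0.5, 0.25, 1.0, 0.75), 5)
```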


Position and Size control

Peekaboo allows us to control the position and size of an object through bounding boxes.
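
A bounding box can be turned into a binary foreground mask by marking every grid cell whose center falls inside the box. The sketch below assumes normalized (x0, y0, x1, y1) coordinates and a hypothetical helper name; the grid resolution at which the actual method builds its masks is not stated here.

```python
def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into a height x width 0/1 mask.

    Hypothetical helper: a cell is foreground (1) when its center lies inside
    the box, so a larger box yields a larger foreground region.
    """
    x0, y0, x1, y1 = box
    mask = []
    for r in range(height):
        cy = (r + 0.5) / height  # vertical center of this row of cells
        row = []
        for c in range(width):
            cx = (c + 0.5) / width  # horizontal center of this cell
            row.append(1 if (x0 <= cx < x1 and y0 <= cy < y1) else 0)
        mask.append(row)
    return mask

# Same position, two different sizes, on a 4x4 grid.
small = box_to_mask((0.0, 0.0, 0.5, 0.5), 4, 4)
large = box_to_mask((0.0, 0.0, 1.0, 0.5), 4, 4)
```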


Quantitative Evaluation

Benchmarks

We propose two new benchmark datasets for evaluating spatio-temporal control in videos.

Evaluation Methodology

For each pair of prompt and bounding-box inputs, we generate a video with the baseline model and with our method. We then use an OWL-ViT model to label each generated video with frame-wise bounding boxes.
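
Once frame-wise boxes are available, the overlap between the input box and the detected box can be scored with intersection-over-union (IoU). The sketch below is illustrative only: the detector call is omitted, the detections are stand-in values, and counting frames without a detection as IoU 0 is an assumption rather than the documented protocol.

```python
def box_iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(input_boxes, detected_boxes):
    """Average IoU over frames; None marks a frame with no detection.

    Assumption: a missing detection contributes an IoU of 0.
    """
    scores = [
        box_iou(t, d) if d is not None else 0.0
        for t, d in zip(input_boxes, detected_boxes)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Stand-in values: a perfect match in frame 0, no detection in frame 1.
input_boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.75, 0.5)]
detected_boxes = [(0.0, 0.0, 0.5, 0.5), None]
video_miou = mean_iou(input_boxes, detected_boxes)
```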

We measure the quality of interactive video generation models with four metrics, which are reported in the results table: mIoU, AP50, coverage (Cvg.), and CD.
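
The metric definitions are not spelled out on this page, so the following sketch rests on stated assumptions: CD is taken to be the Euclidean distance between the centroids of the input box and the detected box in normalized coordinates, and coverage is taken to be the fraction of videos in which the object is detected in more than half of the frames. Function names are hypothetical.

```python
import math

def centroid(box):
    """Center point of an (x0, y0, x1, y1) box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def centroid_distance(input_box, detected_box):
    """Assumed CD: Euclidean distance between the two box centroids."""
    (ax, ay), (bx, by) = centroid(input_box), centroid(detected_box)
    return math.hypot(ax - bx, ay - by)

def coverage(videos, min_fraction=0.5):
    """Assumed coverage: share of videos with detections in > min_fraction of frames.

    videos is a list of per-frame detection lists; None marks a missed frame.
    """
    if not videos:
        return 0.0
    covered = 0
    for detections in videos:
        found = sum(d is not None for d in detections)
        if detections and found / len(detections) > min_fraction:
            covered += 1
    return covered / len(videos)

# Stand-in values for illustration.
box = (0.0, 0.0, 0.2, 0.2)
cd = centroid_distance(box, (0.3, 0.4, 0.5, 0.6))
cvg = coverage([[box, box, None], [None, None, box]])
```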

Results

We present the results below.

DAVIS16
Method         mIoU % (↑)   AP50 % (↑)   Cvg. % (↑)   CD (↓)
LLM-VD         26.1         15.2         96           0.19
ModelScope     19.6         5.7          100          0.25
  w/ Peekaboo  26.0         16.6         93           0.18
ZeroScope      11.7         0.1          100          0.22
  w/ Peekaboo  20.6         17.9         100          0.19

LaSOT
Method         mIoU % (↑)   AP50 % (↑)   Cvg. % (↑)   CD (↓)
LLM-VD         13.5         4.6          98           0.24
ModelScope     4.0          0.7          96           0.33
  w/ Peekaboo  14.6         10.2         98           0.25
ZeroScope      3.6          0.4          100          0.3
  w/ Peekaboo  11.5         11.9         100          0.28

ssv2-ST
Method         mIoU % (↑)   AP50 % (↑)   Cvg. % (↑)   CD (↓)
LLM-VD         27.2         21.2         61           0.12
ModelScope     12.0         6.6          44.7         0.17
  w/ Peekaboo  33.2         35.8         63.7         0.10
ZeroScope      13.9         9.3          42.0         0.22
  w/ Peekaboo  34.7         39.8         56.3         0.17

IMC
Method         mIoU % (↑)   AP50 % (↑)   Cvg. % (↑)   CD (↓)
LLM-VD         33.5         24.7         97           0.14
ModelScope     9.6          2.4          93.3         0.25
  w/ Peekaboo  36.1         33.3         96.6         0.13
ZeroScope      12.6         0.6          88.0         0.26
  w/ Peekaboo  36.3         33.8         96.3         0.12

As the higher mIoU and lower CD on every dataset show, Peekaboo endows the baseline models with spatio-temporal control. It also improves the quality of the main objects in the scene, as seen in the higher AP50 scores on every dataset and the higher coverage on ssv2-ST and IMC.
