We introduce Peekaboo, which enables interactive video generation by inducing
spatio-temporal and motion control in the output of any off-the-shelf UNet-based video
generation model.
Peekaboo is completely training-free and adds zero inference latency
overhead. It can be readily deployed on any UNet-based text-to-video diffusion
model.
We also propose four new quantitative evaluation benchmarks for interactive video generation, based upon the LaSOT, DAVIS-16, ssv2, and IMC datasets.
Method
Peekaboo converts the attention modules of an off-the-shelf 3D UNet into masked spatio-temporal
mixed attention modules.
We propose to use local context for generating individual objects and hence guide the generation
process using attention masks.
For each of the spatial-, cross-, and temporal-attention modules, we compute attention masks such that foreground
pixels and background pixels attend only within their own region. We illustrate these mask computations
for an input mask that changes temporally, as shown on the left.
Green pixels are background and orange pixels are foreground. This masking is applied for a fixed number of denoising steps, after which free generation is
allowed. Hence, foreground and background
pixels are hidden from each other before becoming visible, akin to a game of Peekaboo.
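The sketch below illustrates the core masking idea for the spatial-attention case. It is a simplified illustration under our own assumptions (the function name, the flattened token layout, and the `masked_steps` schedule are hypothetical), not the released Peekaboo code; cross- and temporal-attention are masked analogously, with foreground pixel tokens restricted to the prompt tokens describing the object and background tokens restricted to the rest.

```python
# Minimal PyTorch sketch of masked spatial attention (illustrative, not the
# official Peekaboo implementation): a per-pixel foreground/background mask
# becomes an attention bias so that foreground and background tokens attend
# only within their own region during the first `masked_steps` denoising steps.
import torch

def masked_spatial_attention(q, k, v, fg_mask, step, masked_steps=2):
    """
    q, k, v:  (batch, num_tokens, dim) spatial-attention inputs for one frame.
    fg_mask:  (batch, num_tokens) bool, True where the pixel lies inside the
              user-supplied bounding box (resized to the latent resolution).
    step:     current denoising step; masking is dropped after `masked_steps`.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bnd,bmd->bnm", q, k) * scale

    if step < masked_steps:
        # A token may attend only to tokens carrying the same fg/bg label.
        same_region = fg_mask.unsqueeze(2) == fg_mask.unsqueeze(1)  # (b, n, n)
        scores = scores.masked_fill(~same_region, float("-inf"))

    attn = scores.softmax(dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, v)
```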
More Results
Motion Control
Peekaboo allows us to control the trajectory of an object precisely.
Position and Size control
Peekaboo allows us to control the position and size of an object through bounding boxes.
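As a concrete illustration of this interface, the sketch below shows one way a user-drawn start and end box could be turned into a per-frame binary foreground mask at the latent resolution, i.e. the kind of mask the masked attention above consumes. The function name and the linear interpolation of the box across frames are our assumptions, not the paper's exact input format.

```python
# Illustrative sketch (assumed input interface, not the released code):
# linearly interpolate a bounding box between a start and an end position and
# rasterize it into one binary foreground mask per frame.
import torch

def bbox_trajectory_to_masks(start_box, end_box, num_frames, height, width):
    """start_box, end_box: (x0, y0, x1, y1) in [0, 1] relative coordinates."""
    masks = torch.zeros(num_frames, height, width, dtype=torch.bool)
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)
        box = [(1 - alpha) * s + alpha * e for s, e in zip(start_box, end_box)]
        x0, x1 = int(box[0] * width), int(box[2] * width)
        y0, y1 = int(box[1] * height), int(box[3] * height)
        masks[t, y0:y1, x0:x1] = True
    return masks

# Example: a box of fixed size sliding from the left to the right of the frame.
masks = bbox_trajectory_to_masks((0.05, 0.3, 0.35, 0.7),
                                 (0.65, 0.3, 0.95, 0.7),
                                 num_frames=16, height=40, width=64)
```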
Quantitative Evaluation
Benchmarks
We evaluate spatio-temporal control in video generation on four benchmarks: two newly curated sets (ssv2-ST and IMC) and two repurposed from existing datasets (LaSOT and DAVIS-16).
ssv2-ST - We use the Something-Something v2 dataset to
obtain generation prompts and ground-truth masks from real action videos. We filter these down to a set of 295
prompts; the details of this filtering are in the appendix. We then use an off-the-shelf
OWL-ViT-large open-vocabulary object detector to obtain bounding box
annotations of the object in each video. This set represents bounding box and prompt pairs from real-world
videos, serving as a test bed for both the quality and the control of methods for generating realistic
videos with spatio-temporal control.
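As a rough illustration of this annotation step, the sketch below runs an OWL-ViT detector over video frames via the Hugging Face transformers API (the post-processing call may vary slightly across library versions). The checkpoint name, score threshold, and handling of missed detections are our assumptions rather than the authors' exact pipeline.

```python
# Illustrative sketch: per-frame bounding boxes for a named object using
# OWL-ViT from Hugging Face transformers (checkpoint and threshold assumed).
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

def detect_object(frames, object_name, threshold=0.1):
    """frames: list of PIL images; returns one (x0, y0, x1, y1) box per frame."""
    boxes = []
    for frame in frames:
        inputs = processor(text=[[object_name]], images=frame, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
        result = processor.post_process_object_detection(
            outputs, threshold=threshold, target_sizes=target_sizes
        )[0]
        if len(result["scores"]) == 0:
            boxes.append(None)  # object not detected in this frame
        else:
            best = result["scores"].argmax()
            boxes.append(result["boxes"][best].tolist())
    return boxes
```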
Interactive Motion Control (IMC) - We also curate a set of manually defined prompts and bounding boxes.
We use GPT-4 to generate prompts and pick a set of 34 prompts of objects in their
natural contexts. These prompts vary in the type of object, the size of the object, and the type of
motion exhibited. We then annotate 3 sets of bounding boxes for each prompt, varying the location, path
taken, speed, and size. This set of 102 prompt-bounding box pairs serves as our custom
evaluation set for spatial control. Note that since the ssv2-ST dataset contains many inanimate objects, we
bias this dataset toward living objects. This dataset represents input pairs that real
users might provide.
LaSOT - We repurpose LaSOT, a large-scale object tracking dataset, for evaluating control in video generation. This dataset contains prompt-bbox-video triplets for a large number of classes, with frame-level annotations specifying the location of the object in each video. We subsample the videos to 8 FPS and then randomly pick 2 clips per video from the dataset's test set, giving 450 clips in total across 70 object categories.
DAVIS-16 - DAVIS-16 is a video object segmentation dataset. We take videos from its test set and manually annotate them with prompts, using the provided segmentation masks to create input bboxes. This gives 40 prompt-bbox pairs in total, each with a different subject.
Evaluation Methodology
For each prompt-bounding box input pair, we generate a video using both the baseline model and our method. We
then use an OWL-ViT model to label each generated video with frame-wise bounding boxes.
We propose the following metrics to measure the quality of interactive video generation models.
Coverage: The fraction of videos in which the generated object is
correctly detected within a bounding box. It evaluates the model's ability to generate recognizable
objects.
Mean Intersection-over-Union (mIoU): Calculated by comparing the detected bounding
boxes against the input masks on the filtered videos. This score assesses the spatio-temporal control of the
generation method.
Centroid Distance (CD): The distance between the centroid of the generated
object and that of the input mask, normalized to lie in [0, 1]. This metric evaluates control over the generation
location.
Average Precision@50% (AP50): The average precision of the detected bounding boxes against the
input bounding boxes over all videos. AP50 assesses the spatial control of the generation method while
also considering the model's ability to match the input bounding boxes.
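The snippet below sketches how these metrics could be computed for a single video from its detected and input boxes. The exact CD normalization, the filtering used for Coverage, and the handling of frames without detections are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative per-video metric computation (assumed details: CD normalized by
# the frame diagonal, AP50 as the fraction of frames with IoU >= 0.5).
import numpy as np

def iou(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in pixels."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def centroid_distance(box_a, box_b, width, height):
    """Distance between box centers, normalized by the frame diagonal."""
    ca = np.array([box_a[0] + box_a[2], box_a[1] + box_a[3]]) / 2
    cb = np.array([box_b[0] + box_b[2], box_b[1] + box_b[3]]) / 2
    return float(np.linalg.norm(ca - cb) / np.hypot(width, height))

def video_metrics(detected, target, width, height):
    """detected/target: per-frame boxes; frames with no detection are skipped."""
    ious, cds = [], []
    for det, tgt in zip(detected, target):
        if det is None:
            continue
        ious.append(iou(det, tgt))
        cds.append(centroid_distance(det, tgt, width, height))
    covered = len(ious) > 0            # this video counts toward Coverage
    miou = float(np.mean(ious)) if covered else 0.0
    cd = float(np.mean(cds)) if covered else 1.0
    ap50 = float(np.mean([i >= 0.5 for i in ious])) if covered else 0.0
    return covered, miou, cd, ap50
```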
Results
We present the results below.
Metrics per dataset are reported as mIoU % (↑) / AP50 % (↑) / Cvg. % (↑) / CD (↓). Each "w/ Peekaboo" row applies Peekaboo to the base model listed directly above it.

| Method | DAVIS-16 | LaSOT | ssv2-ST | IMC |
| --- | --- | --- | --- | --- |
| LLM-VD | 26.1 / 15.2 / 96 / 0.19 | 13.5 / 4.6 / 98 / 0.24 | 27.2 / 21.2 / 61 / 0.12 | 33.5 / 24.7 / 97 / 0.14 |
| ModelScope | 19.6 / 5.7 / 100 / 0.25 | 4.0 / 0.7 / 96 / 0.33 | 12.0 / 6.6 / 44.7 / 0.17 | 9.6 / 2.4 / 93.3 / 0.25 |
| w/ Peekaboo | 26.0 / 16.6 / 93 / 0.18 | 14.6 / 10.2 / 98 / 0.25 | 33.2 / 35.8 / 63.7 / 0.10 | 36.1 / 33.3 / 96.6 / 0.13 |
| ZeroScope | 11.7 / 0.1 / 100 / 0.22 | 3.6 / 0.4 / 100 / 0.3 | 13.9 / 9.3 / 42.0 / 0.22 | 12.6 / 0.6 / 88.0 / 0.26 |
| w/ Peekaboo | 20.6 / 17.9 / 100 / 0.19 | 11.5 / 11.9 / 100 / 0.28 | 34.7 / 39.8 / 56.3 / 0.17 | 36.3 / 33.8 / 96.3 / 0.12 |
As demonstrated by the mIoU and CD scores, Peekaboo endows the baseline models with spatio-temporal control. It also
improves the quality of the main objects in the scene, as seen from the higher Coverage and AP50 scores.
The template for this webpage was taken from MotionCtrl.