We introduce Peekaboo, which enables interactive video generation by inducing
spatio-temporal and motion control in the output of any off-the-shelf UNet-based video
generation model.
Peekaboo is completely training-free and adds zero inference latency
overhead. It can be readily deployed on any UNet-based text-to-video diffusion
model.
We also propose four new quantitative evaluation benchmarks for interactive video generation, based upon the LaSOT, DAVIS-16, ssv2, and IMC datasets.
Method
Peekaboo converts the attention modules of an off-the-shelf 3D UNet into masked spatio-temporal
mixed attention modules.
We propose to use local context for generating individual objects and hence guide the generation
process using attention masks.
For each of the spatial-, cross-, and temporal-attention modules, we compute attention masks such that foreground
pixels and background pixels attend only within their own region. We illustrate these mask computations
for an input mask that changes temporally, as shown on the left.
Green pixels are background and orange pixels are foreground. This masking is applied for a fixed number of denoising steps, after which free generation is
allowed. Hence, foreground and background
pixels are hidden from each other before becoming visible, akin to a game of Peekaboo.
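The sketch below illustrates the core masking idea for the spatial-attention case. It is a simplified illustration under our own assumptions (the function name, the flattened token layout, and the `masked_steps` schedule are hypothetical), not the released Peekaboo code; cross- and temporal-attention are masked analogously, with foreground pixel tokens restricted to the prompt tokens describing the object and background tokens restricted to the rest.

```python
# Minimal PyTorch sketch of masked spatial attention (illustrative, not the
# official Peekaboo implementation): a per-pixel foreground/background mask
# becomes an attention bias so that foreground and background tokens attend
# only within their own region during the first `masked_steps` denoising steps.
import torch

def masked_spatial_attention(q, k, v, fg_mask, step, masked_steps=2):
    """
    q, k, v:  (batch, num_tokens, dim) spatial-attention inputs for one frame.
    fg_mask:  (batch, num_tokens) bool, True where the pixel lies inside the
              user-supplied bounding box (resized to the latent resolution).
    step:     current denoising step; masking is dropped after `masked_steps`.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bnd,bmd->bnm", q, k) * scale

    if step < masked_steps:
        # A token may attend only to tokens carrying the same fg/bg label.
        same_region = fg_mask.unsqueeze(2) == fg_mask.unsqueeze(1)  # (b, n, n)
        scores = scores.masked_fill(~same_region, float("-inf"))

    attn = scores.softmax(dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, v)
```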
More Results
Motion Control
Peekaboo allows us to control the trajectory of an object precisely.
Position and Size control
Peekaboo allows us to control the position and size of an object through bounding boxes.
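As a concrete illustration of this interface, the sketch below shows one way a user-drawn start and end box could be turned into a per-frame binary foreground mask at the latent resolution, i.e. the kind of mask the masked attention above consumes. The function name and the linear interpolation of the box across frames are our assumptions, not the paper's exact input format.

```python
# Illustrative sketch (assumed input interface, not the released code):
# linearly interpolate a bounding box between a start and an end position and
# rasterize it into one binary foreground mask per frame.
import torch

def bbox_trajectory_to_masks(start_box, end_box, num_frames, height, width):
    """start_box, end_box: (x0, y0, x1, y1) in [0, 1] relative coordinates."""
    masks = torch.zeros(num_frames, height, width, dtype=torch.bool)
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)
        box = [(1 - alpha) * s + alpha * e for s, e in zip(start_box, end_box)]
        x0, x1 = int(box[0] * width), int(box[2] * width)
        y0, y1 = int(box[1] * height), int(box[3] * height)
        masks[t, y0:y1, x0:x1] = True
    return masks

# Example: a box of fixed size sliding from the left to the right of the frame.
masks = bbox_trajectory_to_masks((0.05, 0.3, 0.35, 0.7),
                                 (0.65, 0.3, 0.95, 0.7),
                                 num_frames=16, height=40, width=64)
```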
Quantitative Evaluation
Benchmarks
We evaluate spatio-temporal control in video generation on four benchmarks: two newly curated sets (ssv2-ST and IMC) and two repurposed from existing datasets (LaSOT and DAVIS-16).
ssv2-ST - We use the Something-Something v2 dataset to
obtain generation prompts and ground-truth masks from real action videos. We filter these down to a set of 295
prompts; the details of this filtering are in the appendix. We then use an off-the-shelf
OWL-ViT-large open-vocabulary object detector to obtain bounding box
annotations of the object in each video. This set represents bounding box and prompt pairs from real-world
videos, serving as a test bed for both the quality and the control of methods for generating realistic
videos with spatio-temporal control.
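As a rough illustration of this annotation step, the sketch below runs an OWL-ViT detector over video frames via the Hugging Face transformers API (the post-processing call may vary slightly across library versions). The checkpoint name, score threshold, and handling of missed detections are our assumptions rather than the authors' exact pipeline.

```python
# Illustrative sketch: per-frame bounding boxes for a named object using
# OWL-ViT from Hugging Face transformers (checkpoint and threshold assumed).
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

def detect_object(frames, object_name, threshold=0.1):
    """frames: list of PIL images; returns one (x0, y0, x1, y1) box per frame."""
    boxes = []
    for frame in frames:
        inputs = processor(text=[[object_name]], images=frame, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
        result = processor.post_process_object_detection(
            outputs, threshold=threshold, target_sizes=target_sizes
        )[0]
        if len(result["scores"]) == 0:
            boxes.append(None)  # object not detected in this frame
        else:
            best = result["scores"].argmax()
            boxes.append(result["boxes"][best].tolist())
    return boxes
```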
Interactive Motion Control (IMC) - We also curate a set of manually defined prompts and bounding boxes.
We use GPT-4 to generate prompts and pick a set of 34 prompts of objects in their
natural contexts. These prompts vary in the type of object, the size of the object, and the type of
motion exhibited. We then annotate 3 sets of bounding boxes for each prompt, varying the location, path
taken, speed, and size. This set of 102 prompt-bounding box pairs serves as our custom
evaluation set for spatial control. Note that since the ssv2-ST dataset contains many inanimate objects, we
bias this dataset toward living objects. This dataset represents input pairs that real
users might provide.
LaSOT - We repurpose LaSOT, a large-scale object tracking dataset, for evaluating control in video generation. This dataset contains prompt-bbox-video triplets for a large number of classes, with frame-level annotations specifying the location of the object in each video. We subsample the videos to 8 FPS and then randomly pick 2 clips per video from the dataset's test set, giving 450 clips in total across 70 object categories.
DAVIS-16 - DAVIS-16 is a video object segmentation dataset. We take videos from its test set and manually annotate them with prompts, using the provided segmentation masks to create input bboxes. This gives 40 prompt-bbox pairs in total, each with a different subject.
Evaluation Methodology
For each prompt-bounding box input pair, we generate a video using both the baseline model and our method. We
then use an OWL-ViT model to label each generated video with frame-wise bounding boxes.
We propose the following metrics to measure the quality of interactive video generation models.
Coverage: The fraction of videos in which the generated object is
correctly detected within a bounding box. It evaluates the model's ability to generate recognizable
objects.
Mean Intersection-over-Union (mIoU): Calculated by comparing the detected bounding
boxes against the input masks on the filtered videos. This score assesses the spatio-temporal control of the
generation method.
Centroid Distance (CD): The distance between the centroid of the generated
object and that of the input mask, normalized to lie in [0, 1]. This metric evaluates control over the generation
location.
Average Precision@50% (AP50): The average precision of the detected bounding boxes against the
input bounding boxes over all videos. AP50 assesses the spatial control of the generation method while
also considering the model's ability to match the input bounding boxes.
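The snippet below sketches how these metrics could be computed for a single video from its detected and input boxes. The exact CD normalization, the filtering used for Coverage, and the handling of frames without detections are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative per-video metric computation (assumed details: CD normalized by
# the frame diagonal, AP50 as the fraction of frames with IoU >= 0.5).
import numpy as np

def iou(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in pixels."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def centroid_distance(box_a, box_b, width, height):
    """Distance between box centers, normalized by the frame diagonal."""
    ca = np.array([box_a[0] + box_a[2], box_a[1] + box_a[3]]) / 2
    cb = np.array([box_b[0] + box_b[2], box_b[1] + box_b[3]]) / 2
    return float(np.linalg.norm(ca - cb) / np.hypot(width, height))

def video_metrics(detected, target, width, height):
    """detected/target: per-frame boxes; frames with no detection are skipped."""
    ious, cds = [], []
    for det, tgt in zip(detected, target):
        if det is None:
            continue
        ious.append(iou(det, tgt))
        cds.append(centroid_distance(det, tgt, width, height))
    covered = len(ious) > 0            # this video counts toward Coverage
    miou = float(np.mean(ious)) if covered else 0.0
    cd = float(np.mean(cds)) if covered else 1.0
    ap50 = float(np.mean([i >= 0.5 for i in ious])) if covered else 0.0
    return covered, miou, cd, ap50
```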
Results
We present the results below.
Metrics per dataset are reported as mIoU % (↑) / AP50 % (↑) / Cvg. % (↑) / CD (↓). Each "w/ Peekaboo" row applies Peekaboo to the base model listed directly above it.

| Method | DAVIS-16 | LaSOT | ssv2-ST | IMC |
| --- | --- | --- | --- | --- |
| LLM-VD | 26.1 / 15.2 / 96 / 0.19 | 13.5 / 4.6 / 98 / 0.24 | 27.2 / 21.2 / 61 / 0.12 | 33.5 / 24.7 / 97 / 0.14 |
| ModelScope | 19.6 / 5.7 / 100 / 0.25 | 4.0 / 0.7 / 96 / 0.33 | 12.0 / 6.6 / 44.7 / 0.17 | 9.6 / 2.4 / 93.3 / 0.25 |
| w/ Peekaboo | 26.0 / 16.6 / 93 / 0.18 | 14.6 / 10.2 / 98 / 0.25 | 33.2 / 35.8 / 63.7 / 0.10 | 36.1 / 33.3 / 96.6 / 0.13 |
| ZeroScope | 11.7 / 0.1 / 100 / 0.22 | 3.6 / 0.4 / 100 / 0.3 | 13.9 / 9.3 / 42.0 / 0.22 | 12.6 / 0.6 / 88.0 / 0.26 |
| w/ Peekaboo | 20.6 / 17.9 / 100 / 0.19 | 11.5 / 11.9 / 100 / 0.28 | 34.7 / 39.8 / 56.3 / 0.17 | 36.3 / 33.8 / 96.3 / 0.12 |
As demonstrated by the mIoU and CD scores, Peekaboo endows the baseline models with spatio-temporal control. It also
improves the quality of the main objects in the scene, as seen from the higher Coverage and AP50 scores.
The template for this webpage was taken from MotionCtrl.