Image animation is a key task in computer vision that aims to generate dynamic visual content from a static image. Recent image animation methods employ neural-based rendering techniques to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open-domain image animation method that leverages the motion prior of a video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control of the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance.
We adopt the widely used 3D U-Net-based video diffusion model for image animation. Given a noisy video latent of shape (frames, height, width, channels), we concatenate the clean latent of the reference image with the noisy frames along the temporal dimension. Additionally, we concatenate the motion area mask with the video latent along the channel dimension. This yields an input latent of shape (frames+1, height, width, channels+1) for the 3D U-Net. To control the motion strength of the generated video, we project the motion strength into a positional embedding and concatenate it with the time-step embedding.
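To make the input assembly concrete, the following is a minimal PyTorch sketch of the two steps above. The batch axis, the helper names build_unet_input and sinusoidal_embedding, the concrete sizes, and the sinusoidal form of the motion-strength projection are illustrative assumptions, not the paper's exact implementation.

import math
import torch

def build_unet_input(noisy_latent, ref_latent, motion_mask):
    # Assemble the 3D U-Net input, following the paper's
    # (frames, height, width, channels) layout with a batch axis added.
    # noisy_latent: (B, F, H, W, C)  noisy video frames in latent space
    # ref_latent:   (B, 1, H, W, C)  clean latent of the reference image
    # motion_mask:  (B, 1, H, W, 1)  binary mask marking the movable area
    # Temporal concatenation: prepend the clean reference latent -> (B, F+1, H, W, C)
    x = torch.cat([ref_latent, noisy_latent], dim=1)
    # Channel concatenation: tile the mask over all F+1 frames -> (B, F+1, H, W, C+1)
    mask = motion_mask.expand(-1, x.shape[1], -1, -1, -1)
    return torch.cat([x, mask], dim=-1)

def sinusoidal_embedding(x, dim):
    # Standard sinusoidal embedding of a scalar, used here for both the
    # diffusion time step and the motion strength (an assumed choice; the
    # paper only says the strength is projected as a positional embedding).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x.float()[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (B, dim)

# Illustrative usage with hypothetical sizes.
B, F, H, W, C, D = 2, 16, 32, 32, 4, 320
noisy = torch.randn(B, F, H, W, C)
ref = torch.randn(B, 1, H, W, C)
mask = (torch.rand(B, 1, H, W, 1) > 0.5).float()
unet_input = build_unet_input(noisy, ref, mask)   # (2, 17, 32, 32, 5)

t = torch.randint(0, 1000, (B,))                  # diffusion time steps
strength = torch.tensor([5.0, 20.0])              # per-sample motion strength
cond = torch.cat([sinusoidal_embedding(t, D),
                  sinusoidal_embedding(strength, D)], dim=-1)  # (2, 2*D)

Concatenating the strength embedding with the time-step embedding, rather than adding them, keeps the two conditioning signals separable for the network, matching the concatenation described above.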