• Author(s) : Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao

This paper tackles the challenge of video try-on, an area where previous research has yielded limited success. The core difficulty lies in simultaneously preserving intricate clothing details and generating realistic, coherent motions throughout the video.

Overview of Tunnel Try-on: given an input video and a clothing image, the method first extracts a focus tunnel to zoom in on the region around the garments and better preserve their details. The zoomed region is represented by a sequence of tensors consisting of the background latent, latent noise, and the garment mask, which are concatenated and fed into the Main U-Net. In parallel, a Ref U-Net and a CLIP Encoder extract representations of the clothing image, which are injected into the Main U-Net via ref-attention. Human pose information is added to the latent feature to assist generation, the tunnel embedding is integrated into temporal attention to generate more consistent motions, and an environment encoder extracts global context as additional guidance.
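The channel-wise concatenation feeding the Main U-Net can be sketched as follows. The shapes and channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical shapes for one frame's zoomed (tunnel) region.
# Per the description above, the Main U-Net input concatenates the
# background latent, the latent noise, and the garment mask.
C, H, W = 4, 32, 32                            # latent channels and spatial size (illustrative)

background_latent = np.random.randn(C, H, W)   # VAE latent of the masked background
latent_noise      = np.random.randn(C, H, W)   # diffusion noise latent
garment_mask      = np.zeros((1, H, W))        # binary mask of the try-on region
garment_mask[:, 8:24, 8:24] = 1.0

# Channel-wise concatenation: the Main U-Net sees all three jointly.
unet_input = np.concatenate([background_latent, latent_noise, garment_mask], axis=0)
print(unet_input.shape)  # (9, 32, 32)
```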

To address these challenges, the authors propose “Tunnel Try-on,” a novel diffusion-based framework. The method revolves around creating a “focus tunnel” within the input video: a sequence of close-up shots centered on the clothing regions. By zooming in on this area, the framework preserves the clothing’s fine details more effectively.
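A per-frame tunnel crop can be sketched as expanding a garment bounding box by a margin and cropping the frame around it. The margin value and box format here are assumptions for illustration; the paper's exact tunnel extraction may differ:

```python
import numpy as np

def focus_tunnel_crop(frame, bbox, margin=0.2):
    """Crop a zoomed-in region around the garment bounding box.

    A sketch of the focus-tunnel idea: bbox = (x0, y0, x1, y1),
    expanded by `margin` and clamped to the frame bounds.
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = bbox
    mx = int((x1 - x0) * margin)               # horizontal margin in pixels
    my = int((y1 - y0) * margin)               # vertical margin in pixels
    x0, y0 = max(0, x0 - mx), max(0, y0 - my)
    x1, y1 = min(w, x1 + mx), min(h, y1 + my)
    return frame[y0:y1, x0:x1]

frame = np.zeros((256, 192, 3))                # toy video frame (H, W, C)
crop = focus_tunnel_crop(frame, (40, 60, 140, 200))
print(crop.shape)  # (196, 140, 3)
```

In a full pipeline, each crop would then be resized to the model's input resolution before encoding.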

To ensure coherent motions across the video, the approach employs a two-pronged strategy. First, a Kalman filter generates smooth crops within the focus tunnel. Second, the tunnel’s position embedding is injected into the temporal attention layers of the model, improving the continuity of the generated video sequence.
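Smoothing the crop trajectory can be illustrated with a 1-D Kalman filter applied to one box coordinate across frames. The process/measurement noise values are assumptions; the paper does not specify its filter parameters:

```python
import numpy as np

def kalman_smooth(z, q=1e-3, r=1e-1):
    """1-D Kalman filter over a sequence of noisy measurements z.

    A sketch of smoothing one crop-box coordinate across frames:
    q is the process-noise variance, r the measurement-noise variance.
    """
    x, p = z[0], 1.0               # state estimate and its variance
    out = [x]
    for m in z[1:]:
        p = p + q                  # predict: variance grows by process noise
        k = p / (p + r)            # Kalman gain
        x = x + k * (m - x)        # update toward the measurement
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

np.random.seed(0)
noisy = 100 + np.random.randn(50) * 5          # jittery x-coordinate of the crop box
smooth = kalman_smooth(noisy)
# The smoothed trajectory jumps far less frame-to-frame than the raw one.
print(np.std(np.diff(smooth)) < np.std(np.diff(noisy)))  # True
```

Each of the four box coordinates would be filtered independently in the same way.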

Furthermore, the framework incorporates an environment encoder. This encoder extracts contextual information from areas outside the tunnels, providing supplementary cues for the video generation process.
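As a toy stand-in for the learned environment encoder, one can pool a global context vector from the pixels outside the tunnel region. This pooling is an assumption for illustration; the actual encoder is a trained network:

```python
import numpy as np

def environment_context(frame, tunnel_box):
    """Pool a per-channel context vector from pixels outside the focus tunnel.

    A toy sketch of the environment-encoder idea: mask out the tunnel
    region, then average the remaining pixels channel-wise.
    """
    x0, y0, x1, y1 = tunnel_box
    outside = np.ones(frame.shape[:2], dtype=bool)
    outside[y0:y1, x0:x1] = False          # exclude the tunnel region
    return frame[outside].mean(axis=0)     # per-channel mean as global context

frame = np.random.rand(64, 48, 3)          # toy frame (H, W, C)
ctx = environment_context(frame, (10, 10, 30, 50))
print(ctx.shape)  # (3,)
```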

Through this combination of techniques, Tunnel Try-on achieves the dual objective of preserving clothing details while synthesizing stable, smooth videos. The paper’s findings demonstrate significant advancements and position Tunnel Try-on as a potential first step toward commercially viable video try-on applications.