
Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos (2404.17571v1)

Published 26 Apr 2024 in cs.CV

Abstract: Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.

Authors (9)
  1. Zhengze Xu
  2. Mengting Chen
  3. Zhao Wang
  4. Linyu Xing
  5. Zhonghua Zhai
  6. Nong Sang
  7. Jinsong Lan
  8. Shuai Xiao
  9. Changxin Gao

Summary

  • The paper introduces a diffusion-based Tunnel Try-on framework that utilizes focus tunnels to preserve clothing details in videos.
  • It integrates Kalman filters and tunnel position embeddings to smooth video motion and ensure temporal consistency.
  • The method achieves superior SSIM, LPIPS, and VFID scores, marking significant improvements over existing virtual try-on solutions.

Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Introduction to Tunnel Try-on

Video virtual try-on entails dressing a target person in specified clothing throughout a video sequence while maintaining both the fidelity of the clothing's appearance and the subject's motion. To serve both industry and consumer interests, video try-on should ideally provide an interactive, realistic depiction of clothing under varied conditions without requiring physical trials. However, transitioning from image-based to video-based try-on presents unique challenges, most notably preserving clothing details while modeling coherent motion; this is particularly difficult once camera movement and complex backgrounds come into play.

The proposed framework, named "Tunnel Try-on," extends a diffusion-based image try-on approach to video. It employs several techniques, including focus tunnel extraction, environment encoding, and Kalman-filter-based crop smoothing, to preserve clothing details and ensure temporal consistency in the generated videos.

Technical Breakdown

Focus Tunnel Extraction and Enhancement

The primary innovation is the "focus tunnel." The model identifies and zooms in on key regions (primarily the clothing area) in each video frame, ensuring detail preservation even in varied or complex background settings. These zoomed crops, extracted frame by frame, form the central input to the subsequent generative model.
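The crop construction can be sketched in a few lines. The following is a minimal Python illustration, assuming per-frame clothing bounding boxes are already available from an off-the-shelf human parser or detector (the detection step is an assumption, not something specified here):

```python
import numpy as np

def extract_focus_tunnel(boxes, frame_size, pad=0.1):
    """Turn per-frame clothing boxes (x1, y1, x2, y2) into square
    zoom-in crops: one tunnel cross-section per frame."""
    H, W = frame_size
    crops = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        side = max(x2 - x1, y2 - y1) * (1.0 + 2.0 * pad)  # pad around the clothing
        side = min(side, H, W)  # never exceed the frame
        half = side / 2.0
        # Clamp the square crop so it stays inside the frame bounds.
        x0 = int(np.clip(cx - half, 0, W - side))
        y0 = int(np.clip(cy - half, 0, H - side))
        crops.append((x0, y0, int(side)))
    return crops  # (top-left x, top-left y, side length) per frame
```

Each crop is then resized to the model's input resolution, so the clothing occupies a far larger fraction of the pixels than it would in the full frame.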

To address the jitter caused by human and camera movement, a Kalman filter smooths the per-frame crop trajectory, stabilizing the input to the model. In addition, tunnel position embeddings are injected into the model's attention layers, helping align the focused regions across frames and improving the continuity and visual coherence of the output video.
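A constant-velocity Kalman filter applied independently to each crop parameter (center x, center y, and side length) is one natural way to realize the smoothing step. The paper only names the filter, so the state layout and noise levels below are assumptions:

```python
import numpy as np

def kalman_smooth(track, q=1e-3, r=1e-1):
    """Smooth a 1-D crop trajectory (e.g., center x over frames) with a
    constant-velocity Kalman filter; q and r are assumed noise levels."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition (position, velocity)
    Hm = np.array([[1.0, 0.0]])             # we observe position only
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    x, P = np.array([track[0], 0.0]), np.eye(2)
    out = []
    for z in track:
        # Predict the next state, then correct it with the raw measurement.
        x = F @ x
        P = F @ P @ F.T + Q
        y = z - (Hm @ x)[0]                 # innovation
        S = Hm @ P @ Hm.T + R
        K = (P @ Hm.T) / S[0, 0]            # Kalman gain
        x = x + K.flatten() * y
        P = (np.eye(2) - K @ Hm) @ P
        out.append(x[0])
    return np.array(out)
```

The tunnel position embeddings would then, presumably, be derived from these smoothed crop coordinates, in the spirit of standard positional encodings, so the attention layers know where each crop sits within the full frame.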

Environmental Encoding

The model also introduces an "environment encoder." This component captures contextual background information outside the focus tunnel, providing global environmental cues. These cues are crucial for generating realistic, well-integrated scenes in which the generated clothing blends seamlessly with the background.
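What such an encoder might look like can be sketched as follows; the layer shapes, token count, and masking convention are illustrative assumptions rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class EnvironmentEncoder(nn.Module):
    """Encodes the background outside the focus tunnel into a compact set
    of context tokens (sizes here are illustrative, not the paper's)."""
    def __init__(self, dim=320):
        super().__init__()
        self.backbone = nn.Sequential(              # small convolutional stem
            nn.Conv2d(3, 64, 4, stride=4), nn.SiLU(),
            nn.Conv2d(64, dim, 4, stride=4), nn.SiLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(4)         # 4 x 4 = 16 tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame, tunnel_mask):
        # tunnel_mask is 1 inside the tunnel; zero it out so only the
        # surrounding environment remains visible to the encoder.
        masked = frame * (1 - tunnel_mask)
        feat = self.pool(self.backbone(masked))     # (B, dim, 4, 4)
        tokens = feat.flatten(2).transpose(1, 2)    # (B, 16, dim)
        return self.proj(tokens)                    # consumed via cross-attention
```

The resulting tokens give the generator a coarse description of lighting, scene layout, and color outside the zoomed crop, which helps the inpainted tunnel blend back into the full frame.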

Approach and Performance

The Tunnel Try-on model is built around diffusion U-Nets with attention mechanisms and is trained in two stages, first on image try-on data and then on video data. Its performance has been compared against existing video try-on solutions on SSIM, LPIPS, and VFID, demonstrating superior image fidelity, detail preservation, and motion coherence.
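For reference, SSIM and LPIPS are standard per-frame fidelity metrics, while VFID applies the Fréchet distance to features from a video backbone such as I3D, capturing temporal quality as well. A minimal per-frame evaluation sketch using common libraries (the LPIPS backbone choice is an assumption):

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance network

def frame_metrics(pred, gt):
    """pred/gt: HxWx3 uint8 frames. Returns (SSIM, higher is better;
    LPIPS, lower is better)."""
    s = ssim(pred, gt, channel_axis=2)
    # LPIPS expects (1, 3, H, W) float tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1
    d = lpips_fn(to_t(pred), to_t(gt)).item()
    return s, d
```

VFID is computed analogously to image FID: extract clip-level features from real and generated videos with a 3D backbone, then measure the Fréchet distance between the two feature distributions.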

Future Implications and Developments

The implications of the Tunnel Try-on model extend beyond immediate commercial applications in fashion retail and e-commerce. The strategies developed here, namely focus management in video, environment encoding, and motion smoothing, may well be applicable in other domains of video processing and augmented reality.

Future work could give these models greater temporal depth and the ability to handle more complex interaction scenarios, possibly integrating real-time user input. Further refinement might also yield models robust to extreme variations in background, movement, and camera stability.

Conclusion

The Tunnel Try-on model sets a new standard for video virtual try-on technologies with its innovative use of diffusion-based frameworks and detailed attention to motion and environmental context. Its ability to produce high-quality, realistic try-on videos in complex scenarios marks a significant step forward for the application of AI in consumer-focused digital environments.