Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos (2404.17571v1)
Abstract: Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
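The abstract's idea of using a Kalman filter to get smooth crops for the focus tunnel can be illustrated with a minimal sketch. This is not the authors' implementation: the constant-velocity state model, the noise variances `process_var` and `meas_var`, the `(x1, y1, x2, y2)` box layout, and the function names `kalman_smooth_1d` and `smooth_boxes` are all assumptions made for illustration. The sketch takes noisy per-frame clothing bounding boxes and filters each coordinate independently so the crop window moves steadily between frames.

```python
import numpy as np


def kalman_smooth_1d(z, process_var=1e-2, meas_var=4.0):
    """Filter one coordinate track with a constant-velocity Kalman filter.

    z: 1-D array of noisy measurements, one per frame.
    process_var, meas_var: assumed noise variances (hypothetical defaults).
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # position += velocity each frame
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = process_var * np.eye(2)             # process noise covariance (assumed)
    R = np.array([[meas_var]])              # measurement noise covariance (assumed)

    x = np.array([z[0], 0.0])               # initial state: first measurement, zero velocity
    P = np.eye(2)                           # initial state covariance
    out = np.empty(len(z))

    for t, zt in enumerate(z):
        # Predict step: propagate state and covariance one frame forward.
        x = F @ x
        P = F @ P @ F.T + Q

        # Update step: correct the prediction with the new measurement.
        y = zt - H @ x                      # innovation, shape (1,)
        S = H @ P @ H.T + R                 # innovation covariance, shape (1, 1)
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain, shape (2, 1)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P

        out[t] = x[0]                       # keep the filtered position
    return out


def smooth_boxes(boxes):
    """Filter each column of a (T, 4) array of per-frame boxes (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    columns = [kalman_smooth_1d(boxes[:, k]) for k in range(boxes.shape[1])]
    return np.stack(columns, axis=1)
```

A caller would pass the raw detector boxes for a clip, for example `smooth_boxes(raw_boxes)` with `raw_boxes` of shape `(num_frames, 4)`, and crop each frame with the filtered box. Filtering the four coordinates independently keeps the sketch short; a joint state over box center and size would be an equally reasonable choice.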
Authors:
- Zhengze Xu
- Mengting Chen
- Zhao Wang
- Linyu Xing
- Zhonghua Zhai
- Nong Sang
- Jinsong Lan
- Shuai Xiao
- Changxin Gao