Learning Joint Spatial-Temporal Transformations for Video Inpainting
This paper presents a novel approach to high-quality video inpainting, introducing a Spatial-Temporal Transformer Network (STTN). Video inpainting, which fills missing regions in video frames while maintaining spatial and temporal coherence, is challenging due to complex motion, appearance changes, and heavy computational demands. Prior attention-based methods typically complete videos frame by frame, and the resulting inconsistencies in attention across frames show up as blur and temporal artifacts.
Methodology
The authors propose STTN, a framework that addresses these challenges by learning joint spatial-temporal transformations, formulating video completion as a "multi-to-multi" problem: STTN takes both neighboring and distant frames as input and fills the missing regions in all of them simultaneously, enforcing spatial-temporal coherence through a self-attention mechanism.
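To make the data flow concrete, the sketch below shows one way such a multi-to-multi pipeline can be wired up in PyTorch. It uses plain single-scale multi-head attention over all space-time positions as a stand-in for the paper's multi-scale patch attention (sketched further below), and all module names, channel widths, and layer counts are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SimpleSTLayer(nn.Module):
    """Self-attention over every spatial position in every sampled frame, plus a feed-forward."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(inplace=True),
                                nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):                    # tokens: (B, T*h*w, dim)
        q = self.norm1(tokens)
        tokens = tokens + self.attn(q, q, q, need_weights=False)[0]
        return tokens + self.ff(self.norm2(tokens))

class TinySTTN(nn.Module):
    """Frame-level encoder -> stacked spatial-temporal layers -> frame-level decoder."""
    def __init__(self, dim=256, num_layers=4):
        super().__init__()
        self.encoder = nn.Sequential(             # 3 RGB channels + 1 mask channel in, 1/4 resolution out
            nn.Conv2d(4, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, 2, 1), nn.ReLU(inplace=True))
        self.layers = nn.ModuleList([SimpleSTLayer(dim) for _ in range(num_layers)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, frames, masks):
        # frames: (B, T, 3, H, W) sampled neighboring + distant frames; masks: (B, T, 1, H, W), 1 = hole.
        B, T, _, H, W = frames.shape
        x = torch.cat([frames * (1 - masks), masks], dim=2).flatten(0, 1)
        feats = self.encoder(x)                   # (B*T, dim, h, w)
        _, dim, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2).reshape(B, T * h * w, dim)
        for layer in self.layers:                 # every frame attends to every other frame
            tokens = layer(tokens)
        feats = tokens.reshape(B * T, h * w, dim).transpose(1, 2).reshape(B * T, dim, h, w)
        return self.decoder(feats).view(B, T, 3, H, W)   # all input frames completed in one pass
```

Flattening every spatial position of every frame into a single token sequence quickly becomes expensive, which is one motivation for the patch-based attention used by the actual model and sketched after the component list below.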
- Architecture Overview: STTN consists of three primary components: a frame-level encoder, multi-layer multi-head spatial-temporal transformers, and a frame-level decoder. The transformers focus on coherent content reconstruction across frames via a multi-scale patch-based attention module.
- Attention Mechanism: Each transformer layer attends over patches extracted at several scales, which lets the model adapt to appearance changes and complex motion within a sequence. Attention across all spatial-temporal patches, computed with multiple heads, captures the long-range dependencies needed for coherent inpainting (see the sketch after this list).
- Adversarial Optimization: A spatial-temporal adversarial loss refines the inpainting results, encouraging completions that are both perceptually plausible and temporally coherent (also sketched after this list).
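The multi-scale patch attention referenced above can be sketched as follows. This is a hedged illustration, not the released STTN code: the patch sizes, the per-head channel split, and the omission of hole masking inside the attention are all simplifications, and it assumes feature maps whose height and width are divisible by every patch size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePatchAttention(nn.Module):
    """Each head attends over space-time patches of a different size."""
    def __init__(self, channels=256, patch_sizes=(1, 2, 4, 8)):
        super().__init__()
        assert channels % len(patch_sizes) == 0
        self.patch_sizes = patch_sizes
        self.head_ch = channels // len(patch_sizes)        # one head per patch scale
        self.to_qkv = nn.Conv2d(channels, channels * 3, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feats):
        # feats: (B, T, C, h, w) frame features; h and w assumed divisible by every patch size.
        B, T, C, h, w = feats.shape
        q, k, v = self.to_qkv(feats.flatten(0, 1)).chunk(3, dim=1)   # each (B*T, C, h, w)
        outs = []
        for i, p in enumerate(self.patch_sizes):
            sl = slice(i * self.head_ch, (i + 1) * self.head_ch)

            def to_tokens(x):
                # Cut this head's channels into non-overlapping p x p patches and flatten
                # them into one token sequence covering every patch of every frame.
                tok = F.unfold(x[:, sl], kernel_size=p, stride=p)    # (B*T, c*p*p, n)
                return tok.transpose(1, 2).reshape(B, T * tok.shape[-1], -1)

            qt, kt, vt = map(to_tokens, (q, k, v))
            attn = torch.softmax(qt @ kt.transpose(1, 2) / qt.shape[-1] ** 0.5, dim=-1)
            out = attn @ vt                                  # (B, T*n, c*p*p)
            # Fold the attended patch tokens back into per-frame feature maps.
            n = (h // p) * (w // p)
            out = out.reshape(B * T, n, -1).transpose(1, 2)
            outs.append(F.fold(out, output_size=(h, w), kernel_size=p, stride=p))
        out = self.proj(torch.cat(outs, dim=1))              # concatenate heads, mix channels
        return out.view(B, T, C, h, w)
```

Splitting heads by patch size lets coarse patches track large structures and motion while fine patches preserve detail, which is the intuition behind the multi-scale design.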
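For the adversarial term, one common way to realize a spatial-temporal discriminator is with 3D convolutions over short clips trained with a hinge objective; the sketch below follows that pattern. The layer counts, channel widths, and the choice of hinge loss are assumptions for illustration and may differ from the paper's exact discriminator and loss.

```python
import torch.nn as nn
import torch.nn.functional as F

class VideoDiscriminator(nn.Module):
    """3D convolutions score spatial-temporal patches of a (B, 3, T, H, W) clip."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch, ch * 2, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch * 2, 1, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
        )

    def forward(self, video):
        return self.net(video)                     # per-patch real/fake scores

def discriminator_loss(disc, real, fake):
    """Hinge loss for the discriminator; fake clips are detached so only D is updated."""
    return (F.relu(1.0 - disc(real)).mean() +
            F.relu(1.0 + disc(fake.detach())).mean())

def generator_adv_loss(disc, fake):
    """Adversarial term for the generator: push completed clips toward the 'real' side."""
    return -disc(fake).mean()
```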
Results
The proposed STTN was evaluated on the YouTube-VOS and DAVIS datasets using PSNR, SSIM, flow warping error, and VFID. The results show clear improvements over state-of-the-art methods:
- Quantitative Gains: STTN achieved a 2.4% improvement in PSNR and a 19.7% improvement in VFID on YouTube-VOS, reflecting better preservation of video quality and perceptual realism (a minimal PSNR computation is sketched after this list).
- Qualitative Evaluation: The visual results underscore the efficacy of STTN in maintaining coherent structures and generating high-quality video content even in complex scenarios.
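As a reference point for the distortion numbers above, the snippet below computes PSNR between a completed frame and its ground truth using the standard definition; it is not the authors' evaluation script, and real evaluation averages the score over all test frames.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two frames of identical shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                        # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage with random "frames"; real evaluation averages PSNR over every test frame.
pred = np.random.randint(0, 256, (240, 432, 3), dtype=np.uint8)
gt = np.random.randint(0, 256, (240, 432, 3), dtype=np.uint8)
print(f"PSNR: {psnr(pred, gt):.2f} dB")
```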
Implications and Future Work
The introduction of STTN marks a considerable advance in video inpainting by providing a robust mechanism for joint spatial and temporal content processing, with potential applications in video restoration, object removal, and video editing.
Future work could explore:
- Enhancements for handling rapid motion by incorporating 3D spatial-temporal attention mechanisms.
- Integration of new temporal losses to further optimize video coherence without increasing computational complexity significantly.
Conclusion
The Spatial-Temporal Transformer Network advances the field of video inpainting by seamlessly merging spatial and temporal features, effectively reducing artifacts and blurring in completed videos. This work lays a foundation for future research into even more adaptive and efficient video processing methodologies.