MumPy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection (2404.11054v3)
Abstract: Video inpainting detection aims to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies, but they typically combine spatial and temporal clues through fixed operations, which limits their applicability across different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that flexibly combines spatial-temporal clues. Our method uses a newly designed multilateral temporal-view encoder to extract diverse collaborations of spatial-temporal clues and a deformable window-based temporal-view interaction module to further enhance the diversity of these collaborations. We then develop a multi-pyramid decoder that aggregates the resulting features and generates detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging, large-scale video inpainting dataset built on YouTube-VOS, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
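The abstract describes a three-stage pipeline: a multilateral temporal-view encoder, a deformable window-based temporal-view interaction module, and a multi-pyramid decoder producing pixel-level detection maps. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the module names, dimensions, temporal-view lengths, and the plain cross-attention used as a stand-in for the deformable window-based interaction are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All internal details are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalViewEncoder(nn.Module):
    """Encodes a clip under one temporal view (a fixed number of frames)."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(1, 4, 4), stride=(1, 4, 4))

    def forward(self, clip):                      # clip: (B, C, T, H, W)
        feat = self.proj(clip)                    # (B, D, T, H/4, W/4)
        return feat.mean(dim=2)                   # pool over time -> (B, D, H/4, W/4)


class ViewInteraction(nn.Module):
    """Simplified stand-in for the deformable window-based interaction:
    cross-attention between features from two temporal views."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):                      # a, b: (B, D, H, W)
        B, D, H, W = a.shape
        qa = a.flatten(2).transpose(1, 2)         # queries from view a: (B, HW, D)
        kb = b.flatten(2).transpose(1, 2)         # keys/values from view b
        out, _ = self.attn(qa, kb, kb)
        return out.transpose(1, 2).reshape(B, D, H, W)


class PyramidDecoder(nn.Module):
    """Aggregates fused features at several scales into one detection map."""
    def __init__(self, dim=64, levels=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Conv2d(dim, 1, 1) for _ in range(levels)])

    def forward(self, feat, out_size):
        logits = 0
        for i, head in enumerate(self.heads):
            scaled = F.avg_pool2d(feat, 2 ** i) if i > 0 else feat
            logits = logits + F.interpolate(head(scaled), size=out_size,
                                            mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)              # pixel-level inpainting mask


class MumPyLikeDetector(nn.Module):
    def __init__(self, dim=64, view_lengths=(1, 3, 5)):
        super().__init__()
        self.view_lengths = view_lengths
        self.encoders = nn.ModuleList([TemporalViewEncoder(dim=dim) for _ in view_lengths])
        self.interact = ViewInteraction(dim=dim)
        self.decoder = PyramidDecoder(dim=dim)

    def forward(self, video):                     # video: (B, C, T, H, W)
        B, C, T, H, W = video.shape
        # One encoder per temporal view, each seeing a different temporal extent.
        views = [enc(video[:, :, :min(L, T)])
                 for L, enc in zip(self.view_lengths, self.encoders)]
        fused = views[0]
        for v in views[1:]:                       # fused features query each longer view
            fused = self.interact(fused, v)
        return self.decoder(fused, out_size=(H, W))


if __name__ == "__main__":
    model = MumPyLikeDetector()
    clip = torch.randn(1, 3, 5, 64, 64)
    mask = model(clip)
    print(mask.shape)                             # torch.Size([1, 1, 64, 64])
```

In this sketch, the relative influence of spatial versus temporal evidence is controlled by which temporal views are encoded and how they are fused, echoing the abstract's point about adjusting the contribution strength of spatial and temporal clues.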