X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer (2312.07378v1)
Abstract: The field of 4D point cloud understanding is developing rapidly, with the goal of analyzing dynamic 3D point cloud sequences. However, the task remains challenging because point clouds are sparse and lack texture. Moreover, the irregularity of point clouds makes it difficult to align temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework called X4D-SceneFormer. This framework enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework has a dual-branch architecture consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation, and semantic segmentation. Our results rank first on the HOI4D challenge (http://www.hoi4d.top/), with 85.3% (+7.9%) accuracy for 4D action segmentation and 47.3% (+5.0%) mIoU for semantic segmentation, outperforming the previous state of the art by a large margin. We release the code at https://github.com/jinglinglingling/X4D
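The abstract names temporal consistency losses and cross-modal knowledge transfer but does not give their exact form. The following is a minimal, hedged sketch of what such training-time objectives can look like. It is not the authors' implementation. It assumes an image branch acting as teacher and a point cloud branch acting as student, swaps in a generic temperature-softened KL distillation term, and adds a simple term that matches frame-to-frame feature changes between the two branches. All function names and the temperature default are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL between softened teacher and student predictions.

    Assumption: the teacher is the image branch, the student is the
    point cloud branch. Both inputs are lists of per-frame logit lists.
    """
    losses = []
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)
        q = softmax(s_frame, temperature)
        # The temperature**2 factor keeps gradient scale comparable across temperatures.
        losses.append(kl_divergence(p, q) * temperature ** 2)
    return sum(losses) / len(losses)

def temporal_consistency_loss(teacher_feats, student_feats):
    """Mean squared gap between the frame-to-frame feature changes of two branches.

    Assumption: both inputs are lists of per-frame feature vectors with
    the same number of frames and the same feature dimension.
    """
    if len(teacher_feats) < 2:
        raise ValueError("need at least two frames")
    total, count = 0.0, 0
    for t in range(1, len(teacher_feats)):
        for d in range(len(teacher_feats[t])):
            delta_teacher = teacher_feats[t][d] - teacher_feats[t - 1][d]
            delta_student = student_feats[t][d] - student_feats[t - 1][d]
            total += (delta_teacher - delta_student) ** 2
            count += 1
    return total / count
```

In this sketch both losses apply only during training. At inference the student (point cloud) branch runs alone, which matches the abstract's statement that inference uses single-modal 4D point cloud inputs.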