Semantics-aware Motion Retargeting with Vision-Language Models (2312.01964v3)
Abstract: Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most previous works either neglect semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We use a differentiable module to render 3D motions. The high-level motion semantics are then incorporated into the retargeting process by feeding the rendered images to the vision-language model and aligning the extracted semantic embeddings. To preserve both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics.
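The embedding-alignment step mentioned in the abstract can be illustrated with a small sketch. It assumes that the vision-language model has already produced one embedding per rendered frame for both the source and the retargeted motion, and it uses mean cosine distance as the alignment measure. The abstract does not specify the loss, so the cosine distance, the function name `semantic_alignment_loss`, and the array shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def semantic_alignment_loss(source_emb, target_emb, eps=1e-8):
    """Mean cosine distance between per-frame semantic embeddings.

    Both inputs have shape (num_frames, embed_dim). Cosine distance is an
    assumed choice; the abstract only says the embeddings are "aligned".
    """
    source_emb = np.asarray(source_emb, dtype=np.float64)
    target_emb = np.asarray(target_emb, dtype=np.float64)

    # Normalize each frame embedding to unit length (eps avoids division by zero).
    src = source_emb / (np.linalg.norm(source_emb, axis=1, keepdims=True) + eps)
    tgt = target_emb / (np.linalg.norm(target_emb, axis=1, keepdims=True) + eps)

    # Per-frame cosine similarity, converted to a distance and averaged.
    cosine = np.sum(src * tgt, axis=1)
    return float(np.mean(1.0 - cosine))
```

In a real training loop this quantity would have to be computed in an automatic-differentiation framework so that gradients can flow back through the differentiable renderer to the retargeting network; the NumPy version only shows the arithmetic.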