MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators (2306.10900v2)

Published 19 Jun 2023 in cs.CV and cs.AI

Abstract: Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in LLMs. Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.

References (52)
  1. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
  2. OpenAI. ChatGPT (Mar 14 version) [Large language model]. https://chat.openai.com/chat/, 2023.
  3. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 1418–1427, 2018.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  6. Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023.
  7. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
  8. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 580–597. Springer, 2022.
  9. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
  10. A recurrent variational autoencoder for human motion synthesis. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
  11. On the effectiveness of adapter-based tuning for pretrained language model adaptation. arXiv preprint arXiv:2106.03164, 2021.
  12. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  13. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  14. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035, 2021.
  15. Trajevae: Controllable human motion generation from trajectories. arXiv preprint arXiv:2104.00351, 2021.
  16. Lightweight adapter tuning for multilingual speech translation. arXiv preprint arXiv:2106.01463, 2021.
  17. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  18. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  19. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  20. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021.
  21. Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363, 2017.
  22. Visual instruction tuning, 2023.
  23. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
  24. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  25. Seeing is not always believing: A quantitative study on human perception of ai-generated images. arXiv preprint arXiv:2304.13023, 2023.
  26. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
  27. The kit whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR), pages 329–336. IEEE, 2015.
  28. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017.
  29. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  30. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  31. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985–10995, 2021.
  32. Temos: Generating diverse human motions from textual descriptions. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 480–497. Springer, 2022.
  33. The kit motion-language dataset. Big data, 4(4):236–252, 2016.
  34. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  35. Improving language understanding by generative pre-training. 2018.
  36. Language models are unsupervised multitask learners. 2019.
  37. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  38. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  39. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  40. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  41. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  42. Motionclip: Exposing human motion generation to clip space. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 358–374. Springer, 2022.
  43. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
  44. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  45. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  46. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  47. mplug-owl: Modularization empowers large language models with multimodality, 2023.
  48. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  49. T2m-gpt: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052, 2023.
  50. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
  51. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  52. Music2dance: Dancenet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2):1–21, 2022.
Authors (10)
  1. Yaqi Zhang (20 papers)
  2. Di Huang (203 papers)
  3. Bin Liu (441 papers)
  4. Shixiang Tang (49 papers)
  5. Yan Lu (179 papers)
  6. Lu Chen (246 papers)
  7. Lei Bai (154 papers)
  8. Qi Chu (53 papers)
  9. Nenghai Yu (174 papers)
  10. Wanli Ouyang (359 papers)
Citations (72)

Summary

  • The paper demonstrates that finetuned LLMs, with only 0.4% of their parameters tuned, can efficiently generate high-quality human motion sequences.
  • It introduces a methodology that quantizes multimodal control signals (text and single-frame poses) into discrete tokens and unifies them in a single prompt for realistic motion generation.
  • Results on HumanML3D and KIT-ML confirm MotionGPT's efficiency, with competitive FID scores at substantially lower computational cost than baseline models.

An Overview of MotionGPT: Finetuned LLMs as General-Purpose Motion Generators

The paper introduces MotionGPT, a novel framework that utilizes LLMs to generate realistic human motion sequences from textual descriptions and pose data. This work highlights the growing importance of human motion generation in digital media industries and addresses the limitations of single-modality control seen in prior research. By leveraging LLMs, MotionGPT offers flexibility and efficiency in generating human motion, incorporating multimodal inputs and demonstrating robustness across various scenarios.

Framework and Methodology

The MotionGPT framework leverages LLMs, adapted with LoRA (Low-Rank Adaptation), to generate human motion from multimodal inputs. The core innovation lies in treating multimodal data, such as single-frame human poses and textual descriptions, as special input tokens for the LLM. A key step is quantizing the multimodal control signals into discrete codes, which are then assembled into a unified instruction prompt that steers the motion generation process. By doing so, the framework casts human motion generation as a language modeling problem, in which the LLM is asked to "answer" with a motion sequence conditioned on these inputs.
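
A minimal sketch of this quantize-and-prompt step is shown below, assuming a pre-trained VQ-VAE-style codebook for poses; the tensor shapes, token format, and prompt template are illustrative placeholders rather than the authors' exact implementation.

    import torch

    # Hypothetical learned codebook from a VQ-VAE-style pose/motion tokenizer:
    # 512 discrete codes, each a 256-dimensional latent vector (shapes assumed).
    codebook = torch.randn(512, 256)

    def quantize(latents: torch.Tensor) -> torch.Tensor:
        """Map each latent vector to the index of its nearest codebook entry."""
        dists = torch.cdist(latents, codebook)   # (T, 512) pairwise distances
        return dists.argmin(dim=-1)              # (T,) discrete code indices

    def build_prompt(text: str, pose_codes: torch.Tensor) -> str:
        """Formulate text and quantized pose codes as one unified instruction prompt."""
        pose_tokens = " ".join(f"<motion_{int(i)}>" for i in pose_codes)
        return (
            "Instruction: generate a motion sequence that matches the description "
            f'"{text}" and starts from the given pose.\n'
            f"Pose: {pose_tokens}\n"
            "Answer:"
        )

    # Example usage: a random latent stands in for the encoder output of a single-frame pose.
    pose_latents = torch.randn(1, 256)
    print(build_prompt("a person walks forward and waves", quantize(pose_latents)))

In the full pipeline, the code tokens produced by the LLM would be decoded back into continuous poses by the motion tokenizer's decoder.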

A notable aspect of MotionGPT is its frugality in fine-tuning; only 0.4% of the original LLM parameters are adjusted. This allows the model to maintain its learned language priors, facilitating an efficient adaptation to motion generation tasks. The authors demonstrate through experimentation that this approach effectively addresses the challenge of multimodality, enabling LLMs to adapt to control signals not initially present during pre-training.
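
The following sketch shows how such parameter-efficient tuning can be configured with the Hugging Face peft library; the base checkpoint, rank, and target modules here are assumptions for illustration, not the paper's exact setup.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # A small placeholder checkpoint; in practice a LLaMA-scale model would be used.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_cfg = LoraConfig(
        r=8,                        # low-rank dimension (assumed)
        lora_alpha=16,              # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["c_attn"],  # attention projection for GPT-2; differs per architecture
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_cfg)
    # Only the injected low-rank adapter weights are trainable; the frozen base LLM
    # keeps its language priors, which is how a sub-1% trainable-parameter budget arises.
    model.print_trainable_parameters()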

Evaluation and Results

MotionGPT was evaluated on HumanML3D and KIT-ML, two comprehensive benchmark datasets for human motion generation. The evaluations covered qualitative and quantitative metrics, such as Fréchet Inception Distance (FID), multimodal distance, and diversity scores, positioning MotionGPT favorably against contemporary models like TEMOS, TM2T, and MotionDiffuse.
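
For reference, FID compares the Gaussian statistics (mean and covariance) of feature embeddings extracted from real and generated motions; lower values mean the generated-motion statistics are closer to the real data. A generic computation is sketched below with random stand-in features; the benchmark's official motion feature extractor is not reproduced here.

    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
        """Fréchet distance between two feature sets, each of shape (N, D)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

    # Toy example: random features stand in for motion-encoder embeddings.
    rng = np.random.default_rng(0)
    print(frechet_distance(rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))))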

Among MotionGPT's standout features is its ability to achieve competitive results with far fewer trainable parameters (33 million) and roughly 10% of the training time required by other state-of-the-art models. This efficiency is attributed to its use of LoRA for fine-tuning.

Experimentally, joint training across multiple control conditions produced better results than training on each condition in isolation; for instance, combining text and keyframe controls yielded marked performance improvements. Notably, MotionGPT achieved an FID of 0.116 on HumanML3D, highlighting its capability to produce high-quality and diverse motion sequences from varied inputs.

Implications and Future Directions

MotionGPT is significant in that it offers a unified solution for multimodal human motion synthesis. Its approach demonstrates the viability of using LLMs beyond purely textual applications, expanding their scope to richer modalities, including visual and physical movement information.

MotionGPT's release has several ramifications for AI and digital media. Practically, this approach could transform content creation in film, video games, and virtual reality, industries that rely heavily on realistic character animation. Theoretically, it suggests a new paradigm for multimodal learning, in which LLMs serve as the foundation for various input-output transformation tasks.

For future developments, explorations into additional modalities, such as auditory signals, could broaden MotionGPT's applicability further. Additionally, leveraging advancements within LLM architectures could enhance the fidelity and complexity of generated motion, potentially leading to richer interactions in digital virtual spaces.

In conclusion, MotionGPT represents a progressive step in integrating LLMs with human motion generation, demonstrating an effective blend of language processing and physical modeling that aligns with emergent needs in interactive digital environments.