MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators (2306.10900v2)

Published 19 Jun 2023 in cs.CV and cs.AI

Abstract: Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in LLMs. Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.

References (52)
  1. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
  2. OpenAI. ChatGPT (Mar 14 version) [Large language model]. https://chat.openai.com/chat/, 2023.
  3. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 1418–1427, 2018.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  6. Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023.
  7. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
  8. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 580–597. Springer, 2022.
  9. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
  10. A recurrent variational autoencoder for human motion synthesis. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
  11. On the effectiveness of adapter-based tuning for pretrained language model adaptation. arXiv preprint arXiv:2106.03164, 2021.
  12. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  13. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  14. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035, 2021.
  15. Trajevae: Controllable human motion generation from trajectories. arXiv preprint arXiv:2104.00351, 2021.
  16. Lightweight adapter tuning for multilingual speech translation. arXiv preprint arXiv:2106.01463, 2021.
  17. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  18. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  19. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  20. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021.
  21. Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363, 2017.
  22. Visual instruction tuning, 2023.
  23. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
  24. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  25. Seeing is not always believing: A quantitative study on human perception of ai-generated images. arXiv preprint arXiv:2304.13023, 2023.
  26. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
  27. The kit whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR), pages 329–336. IEEE, 2015.
  28. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017.
  29. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  30. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  31. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985–10995, 2021.
  32. Temos: Generating diverse human motions from textual descriptions. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 480–497. Springer, 2022.
  33. The kit motion-language dataset. Big data, 4(4):236–252, 2016.
  34. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  35. Improving language understanding by generative pre-training. 2018.
  36. Language models are unsupervised multitask learners. 2019.
  37. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  38. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  39. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  40. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  41. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  42. Motionclip: Exposing human motion generation to clip space. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 358–374. Springer, 2022.
  43. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
  44. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  45. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  46. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  47. mplug-owl: Modularization empowers large language models with multimodality, 2023.
  48. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  49. T2m-gpt: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052, 2023.
  50. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
  51. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  52. Music2dance: Dancenet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2):1–21, 2022.
Authors (10)
  1. Yaqi Zhang (20 papers)
  2. Di Huang (203 papers)
  3. Bin Liu (441 papers)
  4. Shixiang Tang (49 papers)
  5. Yan Lu (179 papers)
  6. Lu Chen (246 papers)
  7. Lei Bai (154 papers)
  8. Qi Chu (53 papers)
  9. Nenghai Yu (174 papers)
  10. Wanli Ouyang (359 papers)
Citations (72)

Summary

  • The paper demonstrates that finetuned LLMs, with only 0.4% of their parameters tuned, can efficiently generate high-quality human motion sequences.
  • It introduces a methodology that quantizes multimodal control signals (text and single-frame poses) into discrete tokens and unifies them in a single prompt for realistic motion generation.
  • Results on HumanML3D and KIT-ML confirm MotionGPT's efficiency, with competitive FID scores at substantially lower computational cost than baseline models.

An Overview of MotionGPT: Finetuned LLMs as General-Purpose Motion Generators

The paper introduces MotionGPT, a novel framework that utilizes LLMs to generate realistic human motion sequences from textual descriptions and pose data. This work highlights the growing importance of human motion generation in digital media industries and addresses the limitations of single-modality control seen in prior research. By leveraging LLMs, MotionGPT offers flexibility and efficiency in generating human motion, incorporating multimodal inputs and demonstrating robustness across various scenarios.

Framework and Methodology

The MotionGPT framework leverages LLMs, adapted with LoRA (Low-Rank Adaptation), to generate human motion from multimodal inputs. The core innovation lies in treating multimodal data, such as single-frame human poses and textual descriptions, as special input tokens for the LLM. A key step is quantizing the multimodal control signals into discrete codes, which are then assembled into a unified instruction prompt that steers the motion generation process. By doing so, the framework casts human motion generation as a language modeling problem, in which the LLM is asked to "answer" with a motion sequence conditioned on these inputs.
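
A minimal sketch of this quantize-and-prompt step is shown below, assuming a pre-trained VQ-VAE-style codebook for poses; the tensor shapes, token format, and prompt template are illustrative placeholders rather than the authors' exact implementation.

    import torch

    # Hypothetical learned codebook from a VQ-VAE-style pose/motion tokenizer:
    # 512 discrete codes, each a 256-dimensional latent vector (shapes assumed).
    codebook = torch.randn(512, 256)

    def quantize(latents: torch.Tensor) -> torch.Tensor:
        """Map each latent vector to the index of its nearest codebook entry."""
        dists = torch.cdist(latents, codebook)   # (T, 512) pairwise distances
        return dists.argmin(dim=-1)              # (T,) discrete code indices

    def build_prompt(text: str, pose_codes: torch.Tensor) -> str:
        """Formulate text and quantized pose codes as one unified instruction prompt."""
        pose_tokens = " ".join(f"<motion_{int(i)}>" for i in pose_codes)
        return (
            "Instruction: generate a motion sequence that matches the description "
            f'"{text}" and starts from the given pose.\n'
            f"Pose: {pose_tokens}\n"
            "Answer:"
        )

    # Example usage: a random latent stands in for the encoder output of a single-frame pose.
    pose_latents = torch.randn(1, 256)
    print(build_prompt("a person walks forward and waves", quantize(pose_latents)))

In the full pipeline, the code tokens produced by the LLM would be decoded back into continuous poses by the motion tokenizer's decoder.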

A notable aspect of MotionGPT is its frugality in fine-tuning; only 0.4% of the original LLM parameters are adjusted. This allows the model to maintain its learned language priors, facilitating an efficient adaptation to motion generation tasks. The authors demonstrate through experimentation that this approach effectively addresses the challenge of multimodality, enabling LLMs to adapt to control signals not initially present during pre-training.
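
The following sketch shows how such parameter-efficient tuning can be configured with the Hugging Face peft library; the base checkpoint, rank, and target modules here are assumptions for illustration, not the paper's exact setup.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # A small placeholder checkpoint; in practice a LLaMA-scale model would be used.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_cfg = LoraConfig(
        r=8,                        # low-rank dimension (assumed)
        lora_alpha=16,              # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["c_attn"],  # attention projection for GPT-2; differs per architecture
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_cfg)
    # Only the injected low-rank adapter weights are trainable; the frozen base LLM
    # keeps its language priors, which is how a sub-1% trainable-parameter budget arises.
    model.print_trainable_parameters()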

Evaluation and Results

MotionGPT was evaluated on HumanML3D and KIT-ML, two comprehensive benchmark datasets for human motion generation. The evaluations covered qualitative and quantitative metrics, such as Fréchet Inception Distance (FID), multimodal distance, and diversity scores, positioning MotionGPT favorably against contemporary models like TEMOS, TM2T, and MotionDiffuse.
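
For reference, FID compares the Gaussian statistics (mean and covariance) of feature embeddings extracted from real and generated motions; lower values mean the generated-motion statistics are closer to the real data. A generic computation is sketched below with random stand-in features; the benchmark's official motion feature extractor is not reproduced here.

    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
        """Fréchet distance between two feature sets, each of shape (N, D)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

    # Toy example: random features stand in for motion-encoder embeddings.
    rng = np.random.default_rng(0)
    print(frechet_distance(rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))))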

Among MotionGPT's standout features is its ability to achieve competitive results with far fewer trainable parameters (33 million) and roughly 10% of the training time required by other state-of-the-art models. This efficiency is attributed to its use of LoRA for fine-tuning.

Experimentally, joint training across multiple control conditions produced better results than training on each condition in isolation; for instance, combining text and keyframe controls yielded marked performance improvements. Notably, MotionGPT achieved an FID of 0.116 on HumanML3D, highlighting its capability to produce high-quality and diverse motion sequences from varied inputs.

Implications and Future Directions

MotionGPT is significant in that it offers a unified solution for multimodal human motion synthesis. Its approach demonstrates the viability of using LLMs beyond purely textual applications, expanding their scope to richer modalities, including visual and physical movement information.

MotionGPT's release has several ramifications for AI and digital media. Practically, this approach could transform content creation in film, video games, and virtual reality, industries that rely heavily on realistic character animation. Theoretically, it suggests a new paradigm for multimodal learning, in which LLMs serve as the foundation for various input-output transformation tasks.

For future developments, explorations into additional modalities, such as auditory signals, could broaden MotionGPT's applicability further. Additionally, leveraging advancements within LLM architectures could enhance the fidelity and complexity of generated motion, potentially leading to richer interactions in digital virtual spaces.

In conclusion, MotionGPT represents a progressive step in integrating LLMs with human motion generation, demonstrating an effective blend of language processing and physical modeling that aligns with emergent needs in interactive digital environments.