- The paper presents the GAP framework that uses generative language models to create descriptive prompts, moving beyond traditional one-hot encoding.
- It employs a multi-modal training strategy that integrates skeleton and text encoders with a part-based contrastive loss to enhance action semantics.
- The GAP framework achieves state-of-the-art performance on benchmarks like NTU RGB+D and NW-UCLA without increasing inference costs.
Generative Action Description Prompts for Skeleton-based Action Recognition
The growing field of skeleton-based action recognition is driven by applications across various domains, including human-computer interaction, sports, healthcare, and entertainment. This paper introduces a novel framework, Generative Action-description Prompts (GAP), which leverages large language models (LLMs) to improve skeleton-based action recognition. The authors address a significant limitation of current approaches, which primarily rely on one-hot encoding for classification and therefore fail to capture semantic relationships between actions.
Key Contributions
- Innovative Use of LLMs: The GAP framework uses large-scale pre-trained language models, such as GPT-3, to generate descriptive prompts about action sequences. By employing these generative action-description prompts, the framework enhances the semantic understanding of skeleton actions and offers more informative cues than traditional one-hot encoding (a prompt-construction sketch follows this list).
- Multi-Modal Training Scheme: A crucial aspect of GAP is its multi-modal training paradigm, which incorporates both a skeleton encoder and a text encoder. The text descriptions generated by the LLM are used as additional supervisory signals to guide the skeleton encoder. This bi-modal framework is designed to harness the complementary information from linguistic descriptions and skeletal data.
- Part-Based Contrastive Loss: GAP introduces a multi-part contrastive learning approach, using part-based descriptions of skeletal actions to improve feature alignment between skeleton data and text prompts. This approach is essential for capturing and reinforcing the action semantics of different body parts, thereby improving representation learning (see the training-loss sketch after this list).
- State-of-the-Art Performance: The GAP framework demonstrates significant improvements over baseline models across standard skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120, and NW-UCLA. It achieves this without increasing computational costs during inference, as the text encoder is only employed during training.
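As an illustration of how such descriptive prompts might be generated, the sketch below builds a part-focused query for each action label and sends it to an OpenAI-style chat-completions client. The prompt wording, the `describe_action_parts` helper, the list of body parts, and the model name are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch of generating part-based action descriptions with an LLM.
# The prompt template, helper name, and model choice are illustrative assumptions,
# not the exact configuration used in the GAP paper.
from openai import OpenAI

BODY_PARTS = ["head", "hands", "arms", "hips", "legs", "feet"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_action_parts(action_label: str) -> str:
    """Ask the LLM to describe how each body part moves for a given action."""
    prompt = (
        f"Describe in detail how a person performs the action '{action_label}'. "
        f"For each of the following body parts, explain its movement: "
        f"{', '.join(BODY_PARTS)}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any instruction-following model could be used here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Descriptions can be generated once, offline, and cached per class label.
if __name__ == "__main__":
    print(describe_action_parts("drinking water"))
```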
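To make the multi-modal training scheme concrete, here is a minimal PyTorch sketch of a part-based contrastive objective: per-part skeleton features are aligned with the text features of the corresponding part descriptions via a symmetric InfoNCE term, which is added to the usual classification loss. The encoder interfaces, feature shapes, frozen text encoder, and the InfoNCE formulation are assumptions for illustration; the paper's exact loss and architecture may differ.

```python
# Minimal sketch of a multi-part contrastive objective (illustrative, not the
# paper's exact formulation). Per-part skeleton features are aligned with the
# text features of the corresponding part descriptions via symmetric InfoNCE.
import torch
import torch.nn.functional as F

def part_contrastive_loss(skel_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """
    skel_feats: (B, P, D) pooled skeleton features, one per body part
    text_feats: (B, P, D) text-encoder features of the per-part descriptions
    Returns the InfoNCE loss averaged over parts and both directions.
    """
    B, P, D = skel_feats.shape
    loss = skel_feats.new_zeros(())
    for p in range(P):
        s = F.normalize(skel_feats[:, p], dim=-1)        # (B, D)
        t = F.normalize(text_feats[:, p], dim=-1)        # (B, D)
        logits = s @ t.T / temperature                   # (B, B) similarity matrix
        targets = torch.arange(B, device=logits.device)  # matched pairs lie on the diagonal
        loss = loss + 0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.T, targets))
    return loss / P

# Training step: classification loss on the skeleton branch plus the
# part-based contrastive term; the text encoder is dropped at inference.
def training_step(skeleton_encoder, text_encoder, classifier,
                  skeletons, part_descriptions, labels, alpha: float = 0.5):
    skel_feats = skeleton_encoder(skeletons)              # (B, P, D)
    with torch.no_grad():                                 # assuming a frozen CLIP-style text encoder
        text_feats = text_encoder(part_descriptions)      # (B, P, D)
    logits = classifier(skel_feats.mean(dim=1))           # pooled global feature -> class scores
    ce = F.cross_entropy(logits, labels)
    con = part_contrastive_loss(skel_feats, text_feats)
    return ce + alpha * con
```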
The adoption of an LLM to generate action descriptions is a notable departure from traditional skeleton-based recognition approaches, which typically do not incorporate language-based cues. By framing action recognition as a multi-modal task, GAP effectively bridges the gap between textual and visual-spatial modalities, providing stronger supervision for action representation learning.
Experimental Insights and Future Directions
The experiments validate the effectiveness of the GAP method. The results show that incorporating language-based descriptions considerably improves action recognition accuracy, and that descriptions of specific body parts contribute more to this improvement than general whole-action descriptions. This underscores the potential of steering LLMs toward generating detailed, nuanced descriptions of body-part movements.
Future research directions may explore extending the use of LLMs to other modalities in multi-modal action recognition frameworks, such as integrating video and audio data with skeletal data. There is also room to investigate more complex part-based strategies or to enrich language prompts with more detailed action semantics. Additionally, automating the construction of skeleton-text datasets could provide a richer pairing of data modalities for training.
In conclusion, the GAP framework's innovative blending of generative LLMs with skeleton-based action recognition represents a significant advancement in the domain. This multi-modal approach presents a promising direction for developing more sophisticated and semantically rich action recognition systems.