Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions

Published 1 Mar 2017 in cs.RO, cs.AI, and cs.CV | (1703.00503v1)

Abstract: In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable a real-time motion inference for human-robot interaction (HRI). Based on Gibbs sampling, our weakly supervised grammar learning can automatically construct a hierarchical representation of an interaction with long-term joint sub-tasks of both agents and short term atomic actions of individual agents. Based on a new RGB-D video dataset with rich instances of human interactions, our experiments of Baxter simulation, human evaluation, and real Baxter test demonstrate that the model learned from limited training data successfully generates human-like behaviors in unseen scenarios and outperforms both baselines.

Abstract PDF Upgrade to Chat

Authors (4)

Citations (40)

View on Semantic Scholar

Summary

The paper presents a probabilistic framework using a spatiotemporal AND-OR graph (ST-AOG) to capture both long-term sub-task planning and short-term atomic actions in human-robot interactions.
It employs a weakly-supervised learning approach with Gibbs sampling to assign probabilistic scores for parsing noisy skeleton data from RGB-D videos.
Experimental evaluations with a Baxter robot demonstrate improved naturalness, evidenced by reduced mean joint angle deviations and positive qualitative assessments.

Introduction

The paper "Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions" (1703.00503) introduces an innovative framework for enabling human-robot interactions (HRI) by learning social affordances from RGB-D videos of human interactions. The central aim is to construct a spatiotemporal AND-OR graph (ST-AOG) that represents the grammar of these interactions. The ST-AOG supports both agents' long-term sub-task planning and short-term atomic actions, allowing seamless integration into humanoids for creating human-like behaviors in diverse, unseen scenarios. The framework is assessed using a dataset comprised of varied human interaction scenarios, providing compelling empirical results.

Framework Overview

The framework facilitates the learning and transfer of social affordance grammar using a weakly-supervised learning approach that leverages Gibbs sampling. The methodology begins with RGB-D video data from which noisy skeletons are extracted to serve as input. These inputs are utilized to construct a hierarchical representation of human interactions, encompassing both short-term and long-term motion elements. Figure 1 elucidates the framework, demonstrating how human interactions are parsed into social affordance grammar to support interaction modeling for human-robot cooperation.

Figure 1: The framework of our approach.

Representation and ST-AOG Construction

The ST-AOG provides a structured representation of interaction dynamics, capturing both latent joint sub-task goals and fine-grained motion patterns, such as body gesture and orientation. The graphical model distinguishes itself by dynamically integrating task planning across dual agents. The graph comprises AND and OR nodes to depict potential compositional and stochastic relationships, as illustrated in Figure 2.

Figure 2: Social affordance grammar as a ST-AOG.

Key graph components include dictionaries for interaction categories, motion and relational attributes, atomic action sequences, and edge relations, which together establish a comprehensive action and relation grammar template. Human arm poses are mapped to robot configurations, aligning both angular and positional joint attributes (Figure 3).

Figure 3: (a) The joint angles of the arm of a Baxter robot directly mapped to a human arm (b).

Methodology

The learning process assigns probabilistic scores to different potential parses of observed interactions, combining arm motion and relational likelihoods with a parsing prior to optimize parse graph sequences. This probabilistic model forms the basis for robust parsing of interactions, enabling accurate motion transfer to a humanoid robot.

Joint and atomic action parsing are achieved through entropy-based interval detection and consequent Gibbs sampling for optimal label assignment, as exemplified in Figure 4.

Figure 4: A sequence of parse graphs in a shaking hands interaction, depicting temporal parsing.

The learned grammar incorporates attributes tied to atomic actions or sub-tasks with clear probabilistic transitions and dependencies, thus enabling nuanced motion production and imitation.

Experimental Evaluation

This approach has been rigorously evaluated through simulations and real-world tests using a Baxter robot. Baxter's ability to emulate human interactions was analyzed across various interaction scenarios, such as handshake and high five (Figure 5 and Figure 6).

Figure 5: Qualitative results of our Baxter simulation.

Figure 6: Qualitative results of the real Baxter test.

Performance metrics including mean joint angle deviations and qualitative assessments by human subjects confirm superiority over baseline methods, demonstrating enhanced naturalness in interactions.

Conclusion

This paper presents a significant advance in the domain of social affordance learning and human-robot interaction by effectively leveraging ST-AOGs to model and transfer intricate human social behaviors. The framework's adaptability to real-time applications and diverse scenarios marks a considerable stride in the design of socially aware robots. Future work may extend this foundation by incorporating LLMs and further enhancing intention recognition, broadening the scope for integrated, multimodal interaction strategies.

Markdown Report Issue