
Embodied Multimodal Multitask Learning (1902.01385v1)

Published 4 Feb 2019 in cs.LG, cs.AI, cs.CL, cs.RO, and stat.ML

Abstract: Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for different multimodal tasks, such as semantic goal navigation and embodied question answering. In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their grounding in visual objects across the tasks. The proposed model uses a novel Dual-Attention unit to disentangle the knowledge of words in the textual representations and visual concepts in the visual representations, and align them with each other. This disentangled task-invariant alignment of representations facilitates grounding and knowledge transfer across both tasks. We show that the proposed model outperforms a range of baselines on both tasks in simulated 3D environments. We also show that this disentanglement of representations makes our model modular, interpretable, and allows for transfer to instructions containing new words by leveraging object detectors.

Citations (24)

Summary

  • The paper introduces a novel Dual-Attention mechanism that aligns textual and visual cues for effective multitask learning.
  • It demonstrates cross-task knowledge transfer, enabling improved zero-shot performance in complex simulated 3D environments.
  • Numerical evaluations in Doom and House3D environments show state-of-the-art results, with up to 96% accuracy on SGN tasks and 58% on EQA tasks.

Overview of "Embodied Multimodal Multitask Learning"

The paper "Embodied Multimodal Multitask Learning" aims to address the challenges associated with developing AI agents capable of navigating and executing tasks in multimodal, three-dimensional environments. The authors present a multitask model designed to perform tasks such as semantic goal navigation (SGN) and embodied question answering (EQA) by leveraging deep reinforcement learning (RL). At the core of their approach is a novel Dual-Attention unit that facilitates the alignment and transfer of task-relevant knowledge across domains through the disentanglement of textual and visual representations.

Significant Contributions

  1. Dual-Attention Mechanism: The proposed model uses a Dual-Attention mechanism to achieve effective alignment between textual and visual modalities. This unit comprises sequential Gated- and Spatial-Attention mechanisms that enhance the model’s capability to ground and transfer knowledge of words associated with visual objects across different tasks (a simplified sketch of this gating-then-spatial-attention pattern follows this list).
  2. Cross-Task Knowledge Transfer: A notable aspect of this research is its focus on cross-task knowledge transfer. The model’s innovative architecture allows it to generalize to unseen task scenarios by leveraging shared knowledge components. Hence, the model displays improved zero-shot learning capabilities on tasks that contain novel compositions of words.
  3. Numerical Evaluation in Simulated Environments: The model's efficacy was evaluated in complex, simulated 3D environments such as Doom and House3D. Across various levels of difficulty, the proposed architecture showed significant improvement over state-of-the-art baselines, achieving up to 96% accuracy on SGN tasks and 58% accuracy on EQA tasks in certain settings. These results underscore the robustness of the proposed multitask model in handling diverse and visually rich environments.
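
To make the Dual-Attention idea concrete, the sketch below shows one plausible reading of the unit: channel-wise Gated-Attention driven by a textual representation, followed by Spatial-Attention that pools the gated feature map. This is an illustrative approximation in PyTorch; the module name, layer shapes, projection layers, and the sigmoid/softmax choices are assumptions made for clarity, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionSketch(nn.Module):
    """Illustrative sketch of a Dual-Attention-style unit:
    channel-wise Gated-Attention followed by Spatial-Attention.
    Hypothetical layer sizes and fusion order, not the paper's code."""

    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Project the textual representation to one gate per visual channel.
        self.gate_proj = nn.Linear(text_dim, num_channels)
        # Project the textual representation to a query for spatial attention.
        self.spatial_proj = nn.Linear(text_dim, num_channels)

    def forward(self, visual_feats, text_embedding):
        # visual_feats:   (batch, C, H, W) convolutional feature map
        # text_embedding: (batch, text_dim) sentence representation
        b, c, h, w = visual_feats.shape

        # Gated-Attention: gate each channel by its relevance to the text,
        # encouraging channels to specialize to word-aligned visual concepts.
        gates = torch.sigmoid(self.gate_proj(text_embedding))        # (b, C)
        gated = visual_feats * gates.view(b, c, 1, 1)                # (b, C, H, W)

        # Spatial-Attention: score each location against a text query and
        # pool the gated features with the resulting attention map.
        query = self.spatial_proj(text_embedding).view(b, c, 1, 1)   # (b, C, 1, 1)
        scores = (gated * query).sum(dim=1).view(b, h * w)           # (b, H*W)
        attn = F.softmax(scores, dim=1).view(b, 1, h, w)             # (b, 1, H, W)
        attended = (gated * attn).sum(dim=(2, 3))                    # (b, C)
        return attended, attn
```

Under this reading, the gating step is what makes the representation disentangled: individual channels come to correspond to word-aligned visual concepts, which is how instructions containing new words can be handled by plugging in the corresponding object detectors.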

Theoretical and Practical Implications

The Dual-Attention model contributes to theoretical advancements in multitask learning by tightly integrating attention mechanisms that preserve task invariance while optimizing for visual-textual alignment. Practically, this model holds promise for real-world applications where autonomous systems must process and integrate diverse information streams, such as robotics, AR/VR, and interactive AI systems that require nuanced understanding and execution based on language and visual inputs.

Future Directions and Extensions

The paper points to future directions in which the model could transfer knowledge across varied domains by maintaining domain-agnostic representations of objects. Additionally, scaling the approach to larger sets of recognized objects and attributes offers a promising path toward deploying the model in more dynamic and realistic environments.

In conclusion, the paper presents a well-founded and effectively tested advancement in the field of embodied AI, especially within the scope of multimodal and multitask learning frameworks. By addressing the critical requirement of transferring and aligning knowledge across tasks and modalities, the Dual-Attention model stands as a significant step toward developing more intelligent and versatile autonomous agents.
