Wings: Learning Multimodal LLMs without Text-only Forgetting

Published 5 Jun 2024 in cs.CL, cs.AI, and cs.LG | (2406.03496v1)

Abstract: Multimodal LLMs (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.

Summary

The paper links text-only performance degradation in multimodal LLMs to attention shifts between the text that precedes an image and the text that follows it.
The paper designs a dual learner architecture that balances textual and visual processing to maintain robust text handling.
The paper uses Low-Rank Residual Attention (LoRRA) to keep the added learners efficient, and reports that Wings outperforms equally-scaled MLLMs on both text-only and visual question-answering tasks.

An Overview of "Wings: Learning Multimodal LLMs without Text-only Forgetting"

This paper introduces "Wings," a framework for multimodal LLMs (MLLMs) that mitigates text-only forgetting. The problem it targets is that an MLLM fine-tuned on mixed multimodal inputs becomes worse at text-only instructions, even though the initial LLM it was built from handled those instructions well.

Key Contributions

The authors identify and address a critical challenge faced by MLLMs: the tendency to neglect text-only instructions after being fine-tuned with image-text data. By examining attention patterns, they find that this forgetting is linked to a shift in attention between the text before an image and the text after it, which arises when images are placed in between text sequences. This insight informs their architectural innovation: the integration of textual and visual learners designed to stabilize attention allocation across modalities.
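
The attention analysis described above can be illustrated with a small sketch. It is not the paper's exact metric: the function names, the toy attention rows, and the choice of Pearson correlation as the consistency score are all assumptions made here for illustration. The sketch computes, for each layer, the share of attention that falls on pre-image text and on post-image text, then correlates the two per-layer curves.

```python
import math

def segment_share(attn_row, span):
    """Fraction of one attention row's mass that lands on tokens [start, end)."""
    start, end = span
    return sum(attn_row[start:end]) / sum(attn_row)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical token layout: tokens 0-1 are pre-image text,
# tokens 2-3 are image tokens, tokens 4-5 are post-image text.
PRE_SPAN, POST_SPAN = (0, 2), (4, 6)

# Toy data (made up): one attention row per layer, e.g. the attention
# of the last token averaged over heads. Each row sums to 1.
toy_attention = [
    [0.25, 0.25, 0.10, 0.10, 0.15, 0.15],  # layer 0
    [0.20, 0.20, 0.20, 0.20, 0.10, 0.10],  # layer 1
    [0.15, 0.15, 0.30, 0.30, 0.05, 0.05],  # layer 2
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],  # layer 3
]

pre_curve = [segment_share(row, PRE_SPAN) for row in toy_attention]
post_curve = [segment_share(row, POST_SPAN) for row in toy_attention]

# A high value means attention to pre- and post-image text moves together
# across layers; a low value indicates a shift between the two.
consistency = pearson(pre_curve, post_curve)
print(pre_curve, post_curve, round(consistency, 3))
```

In a real model the rows would come from the attention maps of each transformer layer, and the spans from the position of the image tokens in the prompt.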

Attention Dynamics and MLLM-Laws: Through an empirical investigation into the layer-level attention weights across multiple MLLMs, the researchers uncover a correlation between consistency in these weights and improved text-only performance. They propose MLLM-Laws as a metric to capture attention shifts that signify a loss of text-only capability.
Visual and Textual Learners: The "Wings" framework incorporates additional visual and textual learners within each layer of the attention mechanism. These learners work in parallel to the main attention branch to restore balance in modality focus. This design choice is inspired by the observation of how multimodal integration can cause competitive shifts in standard MLLMs' attention, thereby disrupting text processing.
Low-Rank Residual Attention (LoRRA): To implement these learners efficiently, the paper introduces LoRRA, a lightweight attention module built from low-rank matrices. The low-rank design keeps the parameter and compute overhead of the added learners small while still giving them enough capacity to compensate for the attention shift.
Empirical Validation: The framework is rigorously tested against benchmarks across text-only and multimodal domains. The proposed model demonstrates superior performance in managing both modalities without compromising the quality of text-only tasks, as evidenced by its performance on the Interleaved Image-Text (IIT) benchmark and other well-known datasets.
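
The learner design listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `LowRankLearner`, the function `blend`, the identity-plus-low-rank form of the projections, the zero initialization of the second low-rank factor, and the router input (the share of main attention on visual positions) are all hypothetical choices. The main branch is also simplified to single-head, non-causal self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LowRankLearner:
    """Side branch: hidden states attend to one modality's features."""

    def __init__(self, d, rank, rng):
        self.d = d
        # Each projection is a (d x rank) @ (rank x d) pair. The second
        # factor starts at zero, so every low-rank product is zero at init
        # (an assumption borrowed from common low-rank adapter practice).
        self.w = {name: (rng.normal(0.0, 0.02, (d, rank)), np.zeros((rank, d)))
                  for name in ("q", "k", "v", "o")}

    def _low_rank(self, name, x):
        a, b = self.w[name]
        return x @ a @ b

    def __call__(self, hidden, feats):
        # Identity plus a low-rank residual on queries, keys, and values.
        q = hidden + self._low_rank("q", hidden)
        k = feats + self._low_rank("k", feats)
        v = feats + self._low_rank("v", feats)
        attn = softmax(q @ k.T / np.sqrt(self.d))
        # Low-rank output projection: the branch contributes nothing at init.
        return self._low_rank("o", attn @ v)

def blend(main_out, main_attn, visual_mask, vis_out, txt_out, router_w):
    """Add both learner outputs to the main branch with per-token gates."""
    # Router input (assumed): share of main attention on visual vs. text tokens.
    vis_mass = main_attn[:, visual_mask].sum(axis=1, keepdims=True)
    stats = np.concatenate([vis_mass, 1.0 - vis_mass], axis=1)  # (T, 2)
    gate = softmax(stats @ router_w)                            # (T, 2)
    return main_out + gate[:, :1] * vis_out + gate[:, 1:] * txt_out

rng = np.random.default_rng(0)
d, rank, T = 16, 2, 6
hidden = rng.normal(size=(T, d))
visual_mask = np.array([False, False, True, True, False, False])

# Simplified main attention branch.
main_attn = softmax(hidden @ hidden.T / np.sqrt(d))
main_out = main_attn @ hidden

vis_learner = LowRankLearner(d, rank, rng)
txt_learner = LowRankLearner(d, rank, rng)
vis_out = vis_learner(hidden, hidden[visual_mask])
txt_out = txt_learner(hidden, hidden[~visual_mask])
out = blend(main_out, main_attn, visual_mask, vis_out, txt_out, np.zeros((2, 2)))

learner_params = 4 * 2 * d * rank  # four projections, two factors each
full_params = 4 * d * d            # four dense d x d projections
print(out.shape, learner_params, full_params)
```

With these settings each learner holds 256 parameters against 1024 for four dense projections, which shows where the efficiency of a low-rank design comes from. Because the output factor starts at zero in this sketch, the layer initially reproduces the main branch exactly and the learners only take effect as they are trained.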

Implications

The findings have practical and theoretical implications. Practically, the capacity to fuse modalities without losing text-handling ability is vital for building robust AI systems that switch between textual and visual contexts in real-world applications. Theoretically, the work suggests that attention dynamics play a crucial role in multimodal learning, and in text-only forgetting in particular. Future research could examine the balance of modality interactions more closely and investigate optimization techniques that improve cross-modal transfer without significant retraining costs.

Future Directions

The paper lays the groundwork for numerous avenues of future research. As the demand for intelligent systems that can handle complex, multimodal tasks continues to grow, ensuring that performance across modalities remains balanced will be crucial. Additionally, exploring the integration of more sophisticated multi-turn dialogue systems or extending these methods into other modalities like audio could further augment the model's utility. Moreover, scaling down the model for edge devices without losing performance might also present significant challenges worth addressing.

In conclusion, "Wings" presents a compelling approach to sustaining comprehensive performance in MLLMs by incorporating targeted architectural adaptations that prevent modal dominance and ensure robust text handling, contributing to the broader field of multimodal AI development.