OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

Published 24 Jun 2024 in cs.CL and cs.AI | (2406.16495v3)

Abstract: Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the sequence. We propose a position information injection method that connects the selective state space model with the quadratic attention, and integrates these two architectures with hybrid experts with cross-sharing domains, so that we can enjoy the advantages of both. We design a new architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which can compete with well-known medium-scale open-source LLMs on a small scale in language modeling tasks.


Authors (5)

Citations (2)


Summary

The paper introduces a novel hybrid architecture combining SSM and attention with a cross-domain mixture of experts for enhanced language modeling.
It leverages a unique biomimetic design featuring Observer, Thinker, Conceiver, and Expresser modules to balance short-term and long-term dependencies.
Empirical results show improved efficiency and reduced perplexity on NLP tasks, demonstrating the potential for scalable model training.

An Overview of the OTCE Model Architecture

The paper "OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser" introduces an approach to language modeling that integrates selective state space models (SSMs) with attention mechanisms and a cross-domain mixture of experts (MOE). The resulting architecture, OTCE, is a biomimetic model divided into four modules: Observer, Thinker, Conceiver, and Expresser.

Bridging SSMs with Attention

Combining SSMs with attention addresses weaknesses that each method has on its own. Transformers handle long-range dependencies well through self-attention, but their cost grows quadratically with sequence length. SSMs scale linearly with sequence length during training and compress the history into a compact summary state, but they struggle to capture long-term dependencies because they rely on implicit, local positional information.
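The scaling contrast can be made concrete with a toy sketch. It is not the paper's implementation: `ssm_scan` uses a fixed, hypothetical `decay` instead of Mamba's input-dependent parameters, and `causal_attention` treats each scalar token as its own query, key, and value. The step counters only illustrate how the work grows with sequence length.

```python
import math

def ssm_scan(xs, decay=0.9):
    """Linear-time recurrence: one fixed-size state, updated once per token.
    The constant `decay` is an assumption made for illustration only."""
    state, outputs, steps = 0.0, [], 0
    for x in xs:
        state = decay * state + (1.0 - decay) * x  # summary state absorbs the new token
        outputs.append(state)
        steps += 1
    return outputs, steps

def causal_attention(xs):
    """Quadratic-time causal self-attention over scalar tokens
    (query = key = value = token, a deliberate simplification)."""
    outputs, steps = [], 0
    for i, q in enumerate(xs):
        scores = [q * xs[j] for j in range(i + 1)]   # compare with every earlier token
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(sum(w * xs[j] for j, w in enumerate(weights)) / total)
        steps += i + 1
    return outputs, steps

tokens = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25]
ssm_out, ssm_steps = ssm_scan(tokens)
attn_out, attn_steps = causal_attention(tokens)
print(ssm_steps, attn_steps)  # 8 state updates vs 36 pairwise comparisons
```

For n tokens the scan performs n updates, while causal attention performs n(n+1)/2 comparisons. In exchange, attention can read any earlier token directly instead of through a compressed state.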

To combine the strengths of both architectures, the authors propose a position information injection method that supplies relative positional information. This method connects the SSM's selective state space with quadratic attention, yielding a model that can handle both short-term and long-term dependencies efficiently.
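One widely used way to inject relative position is a rotary-style rotation of query and key vectors. The sketch below shows that idea in two dimensions. Whether it matches the exact injection method in the paper is an assumption, and the function names and the `theta` value are hypothetical.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position index."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of a position-rotated query and a position-rotated key."""
    qr = rotate(q, q_pos)
    kr = rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)
same_offset_a = score(q, k, 3, 1)    # key is 2 positions behind the query
same_offset_b = score(q, k, 10, 8)   # same offset, different absolute positions
other_offset = score(q, k, 3, 2)     # key is 1 position behind the query

# The score depends only on the relative offset, not on absolute positions.
assert abs(same_offset_a - same_offset_b) < 1e-9
assert abs(same_offset_a - other_offset) > 1e-3
```

Because rotations compose, the dot product of the two rotated vectors equals the dot product of the unrotated query with the key rotated by the position difference, which is why absolute positions cancel out.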

Cross-Domain Mixture of Experts

The OTCE model introduces a mixture of experts that simulates cross-domain knowledge sharing, analogous to the way human knowledge is distributed across domains. The Cohesive Cross-Domain Expert shares parameters linearly, which suits smaller models, while the Expansive Cross-Domain Expert shares parameters within complete multi-layer perceptrons, which suits larger models. According to the authors, this design improves generalization, promotes knowledge transfer across domains, and improves the efficiency of model training and inference.
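The general idea of experts that reuse shared parameters while keeping private ones can be sketched as follows. This is only an illustration: the routing rule, the elementwise way private weights modulate the shared projection, and every name in the block are assumptions, not the paper's definition of the Cohesive or Expansive expert.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_domain_moe(x, shared_w, private_ws, router_ws, top_k=1):
    """Toy shared-plus-private expert layer for a single token `x`.

    shared_w   : per-feature weights reused by every expert (shared knowledge)
    private_ws : private_ws[e] holds the per-feature weights owned by expert e
    router_ws  : router_ws[e] holds the routing weights that score expert e
    """
    # Shared projection is computed once and reused by all selected experts.
    shared = [w * xi for w, xi in zip(shared_w, x)]

    # The router scores each expert; only the top_k experts are evaluated.
    logits = [sum(r * xi for r, xi in zip(rw, x)) for rw in router_ws]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)

    out = [0.0] * len(x)
    for e in chosen:
        gate = probs[e] / norm  # renormalise gates over the chosen experts
        for i, (h, p) in enumerate(zip(shared, private_ws[e])):
            out[i] += gate * h * p  # private weights modulate the shared features
    return out, chosen

x = [1.0, 2.0]
out, chosen = cross_domain_moe(
    x,
    shared_w=[0.5, 0.5],
    private_ws=[[1.0, 1.0], [2.0, 0.0]],
    router_ws=[[1.0, 0.0], [0.0, 1.0]],
)
print(chosen, out)  # expert 1 is selected; output is [1.0, 0.0]
```

The point of the design is that the shared weights are trained on every token regardless of routing, while each private weight set only sees the tokens routed to its expert.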

Architectural Design: Mimicking Biological Processes

The OTCE architecture is inspired by the biological processes of observation, cognition, conception, and expression. The Observer module uses an SSM for selective information processing, filtering out irrelevant data while retaining essential information. The Thinker module uses the attention mechanism to relate any pair of sequence elements, thereby building dependencies over long distances. The Conceiver module then aggregates all state information into a single summary. Finally, the Expresser module combines the context-aware state information from attention with the Conceiver's aggregated state to form a complete output.
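The data flow through the four modules can be outlined with scalar stand-ins. This is a structural sketch under strong assumptions: the gating rule in `selective_scan`, the scalar attention, and the averaging used for the Expresser are all hypothetical choices made to keep the example small. They are not the paper's layers.

```python
import math

def selective_scan(xs):
    """Toy selective SSM: an input-dependent gate decides how strongly
    each token is written into the running state (hypothetical rule)."""
    state, states = 0.0, []
    for x in xs:
        gate = 1.0 / (1.0 + math.exp(-abs(x)))  # larger inputs are retained more
        state = (1.0 - gate) * state + gate * x
        states.append(state)
    return states

def causal_attention(xs):
    """Toy causal self-attention over scalars (query = key = value)."""
    out = []
    for i, q in enumerate(xs):
        scores = [q * xs[j] for j in range(i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * xs[j] for j, w in enumerate(weights)) / total)
    return out

def otce_forward(xs):
    observed = selective_scan(xs)         # Observer: filter the raw sequence
    thought = causal_attention(observed)  # Thinker: relate any pair of positions
    conceived = selective_scan(thought)   # Conceiver: fold states into a summary
    # Expresser: merge attention output with the summary (averaging is an assumption)
    return [0.5 * (t + c) for t, c in zip(thought, conceived)]

print(otce_forward([1.0, 2.0, 3.0]))
```

Every stage here is causal, so changing a later token leaves earlier outputs unchanged, which is the property an autoregressive language model needs.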

Empirical Validation and Results

The authors evaluate the OTCE model on tasks such as semantic similarity, text classification, and natural language inference. The architecture does particularly well on tasks that demand associative recall, outperforming models that do not apply the re-attention weighting step before output. The cross-domain MOE is also reported to be more efficient than traditional shared expert isolation, because it tailors shared knowledge more precisely between domains.

An ablation study highlights the importance of combining MOE with attention: the combination improves the model's overall effectiveness and reduces perplexity in language modeling, which suggests that cross-domain knowledge sharing improves data efficiency during training.

Implications and Future Directions

The OTCE model improves on previous architectures by merging the strengths of SSMs and attention with a cross-domain expert system. According to the reported results, the hybrid approach improves learning across a range of tasks while remaining scalable and efficient on large datasets.

The practical implication is that models able to handle both short and long sequences efficiently, while sharing knowledge across domains, could be valuable for natural language processing applications.

Future work will likely focus on refining parameter sharing strategies within cross-domain experts and exploring further integration possibilities with different architectures to continue enhancing the model's comprehension and reasoning abilities. Emphasis on scalability and efficiency will remain critical as the research trajectory progresses toward models capable of handling increasingly complex language modeling challenges.
