Abstract

This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods, without compromising response quality. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at https://huggingface.co/NexaAIDev/Dolphin.

Figure: Dolphin model architecture, showing the text encoder, the projector for embeddings, and the main LLM transformer decoder.

Overview

  • The paper introduces Dolphin, a novel dual-decoder architecture that enhances energy efficiency and reduces processing latency in long-context language models by using a combination of a smaller compact decoder and a larger primary decoder.

  • Dolphin leverages memory tokens and a multi-stage training process, including restoration training, continual training, and instruction fine-tuning, to handle extensive contextual information efficiently.

  • Evaluations show impressive gains in energy efficiency, latency reduction, and accuracy, making Dolphin highly suitable for resource-constrained on-device environments like mobile devices and IoT systems.

Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

The paper "Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models" presents a sophisticated methodology to address the challenges inherent in processing long contexts within language models, particularly in resource-constrained, on-device environments. This approach introduces a novel decoder-decoder architecture designed to enhance energy efficiency and reduce latency while preserving high accuracy and contextual understanding.

Overview

The authors introduce Dolphin, a novel dual-decoder architecture wherein a compact 0.5 billion (0.5B) parameter decoder distills long contextual information into memory embeddings, which are then processed by a primary 7 billion (7B) parameter decoder. Inspired by vision-language models (VLMs), this architecture treats long textual context as a separate modality, enabling efficient handling of extensive input sequences without the usual computational burden.
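To make this data flow concrete, the minimal sketch below assembles the two decoders at inference time. It assumes Qwen2-7B as the primary decoder and hypothetical helpers (a compact-decoder encoder and a projector, sketched in the sections below) that turn a long context into memory embeddings already mapped into the primary decoder's embedding space; it is an illustration of the decoder-decoder idea, not the authors' released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumptions: "Qwen/Qwen2-7B" stands in for the primary 7B decoder, and
# `projected_memory` comes from hypothetical helpers (compact decoder + projector)
# sketched later; this illustrates the data flow only.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
primary = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")

def answer(query: str, projected_memory: torch.Tensor) -> str:
    """Prepend projected memory embeddings to the query embeddings and generate."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    query_embeds = primary.get_input_embeddings()(query_ids)        # (1, q_len, d_large)
    inputs_embeds = torch.cat([projected_memory, query_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    output_ids = primary.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=128,
    )
    # When only inputs_embeds is supplied, generate() returns just the new tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because the long context enters the 7B decoder only as a handful of memory embeddings rather than thousands of tokens, the expensive attention cost of the primary model scales with the query length instead of the full context length.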

Methodology

Architecture

  • Dual-Decoder Design: Dolphin employs a smaller 0.5B parameter decoder $\pi_s$ to encode long contexts into a compressed form. This compressed context, or memory embedding, is then handled by a larger 7B parameter decoder $\pi_l$ to generate responses to queries.
  • Projector Component: A multi-layer perceptron (MLP) projector $\Phi$ transforms embeddings from the smaller decoder $\pi_s$ into a format the larger decoder $\pi_l$ can process, bridging the embedding dimensions between the two decoders (a minimal sketch follows this list).
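The sketch below shows one plausible form of such a projector. The hidden sizes (896 for a 0.5B-class decoder, 3584 for a 7B-class decoder), the two-layer depth, and the GELU activation are illustrative assumptions, not hyperparameters taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative hidden sizes; the real values depend on the chosen 0.5B and 7B backbones.
SMALL_HIDDEN = 896    # assumed width of the compact decoder pi_s
LARGE_HIDDEN = 3584   # assumed width of the primary decoder pi_l

class MemoryProjector(nn.Module):
    """MLP projector Phi: maps memory embeddings from pi_s into pi_l's embedding space."""

    def __init__(self, d_small: int = SMALL_HIDDEN, d_large: int = LARGE_HIDDEN):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_small, d_large),
            nn.GELU(),
            nn.Linear(d_large, d_large),
        )

    def forward(self, memory_embeddings: torch.Tensor) -> torch.Tensor:
        # memory_embeddings: (batch, num_memory_tokens, d_small)
        return self.net(memory_embeddings)   # (batch, num_memory_tokens, d_large)

# Usage: project the compact decoder's memory states before handing them to pi_l.
projector = MemoryProjector()
memory = torch.randn(1, 8, SMALL_HIDDEN)     # hidden states at memory-token positions
projected = projector(memory)                # shape: (1, 8, LARGE_HIDDEN)
```

This mirrors the projectors used in vision-language models such as LLaVA, where a small MLP bridges the vision encoder's embedding space and the LLM's token embedding space.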

Memory Tokens

The paper introduces memory tokens to efficiently encapsulate long contextual information. By augmenting the tokenizer with special memory tokens, the architecture captures latent representations of the extensive context, reducing computational overhead.
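As a rough illustration of this mechanism, the snippet below adds special memory tokens to a tokenizer and reads out the compact decoder's hidden states at those positions. The checkpoint name (Qwen2-0.5B), the token format, and the token count are assumptions rather than the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumptions: Qwen2-0.5B as the compact decoder and 8 memory tokens in a
# "<mem_i>" format; the paper's exact token count and naming may differ.
NUM_MEMORY_TOKENS = 8
memory_tokens = [f"<mem_{i}>" for i in range(NUM_MEMORY_TOKENS)]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
compact = AutoModel.from_pretrained("Qwen/Qwen2-0.5B")

# Augment the vocabulary with the memory tokens and grow the embedding table to match.
tokenizer.add_special_tokens({"additional_special_tokens": memory_tokens})
compact.resize_token_embeddings(len(tokenizer))

def encode_long_context(context: str) -> torch.Tensor:
    """Append memory tokens to the long context and return the hidden states at
    those positions as the compressed memory embedding."""
    text = context + "".join(memory_tokens)
    inputs = tokenizer(text, return_tensors="pt")
    hidden = compact(**inputs).last_hidden_state       # (1, seq_len, d_small)
    return hidden[:, -NUM_MEMORY_TOKENS:, :]           # (1, NUM_MEMORY_TOKENS, d_small)
```

The returned memory embedding is then passed through the projector and prepended to the primary decoder's input, as in the inference sketch above.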

Multi-Stage Training

The training process is divided into three stages (a schematic sketch follows the list):

  1. Restoration Training: The model learns to reconstruct original context from compressed embeddings, ensuring it can effectively distill and retrieve contextual information.
  2. Continual Training: Focuses on generating context continuations from partial compressed contexts, enhancing the model's ability to maintain coherence over long sequences.
  3. Instruction Fine-Tuning: Fine-tunes the model on instruction-following tasks to ensure accurate responses to queries within given contexts.
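The sketch below summarizes how the supervision target could differ across the three stages. The field names and the exact loss wiring are assumptions, not the paper's recipe; in every stage the primary decoder would be trained with standard next-token cross-entropy on the target text while conditioning on the projected memory embeddings of the compressed context.

```python
# Schematic of the three training stages; data fields are illustrative assumptions.
def build_target(stage: str, example: dict) -> tuple[str, str]:
    """Return (context_to_compress, target_text) for one training example."""
    if stage == "restoration":
        # Reconstruct the original context from its compressed memory embedding.
        return example["context"], example["context"]
    if stage == "continual":
        # Continue the document given only a compressed prefix of it.
        return example["context_prefix"], example["context_suffix"]
    if stage == "instruction":
        # Answer a query grounded in the compressed context.
        return example["context"], example["response"]
    raise ValueError(f"unknown stage: {stage}")
```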

Empirical Results

The evaluations underscore impressive gains:

  • Energy Efficiency: Achieves a 10-fold improvement in energy efficiency compared to traditional full-length context processing methods.
  • Latency Reduction: Demonstrates a 5-fold reduction in latency, achieving an average inference time of 4.32 seconds compared to 20.71 seconds for the baseline Qwen2-7B model.
  • Accuracy: Maintains high correctness across various task categories, such as 97.76% in Contextual QA and 98.53% in Numeric QA. In Summarization and Rephrasing, it achieves correctness rates of 99.62% and 99.22% respectively.

Comparisons and Benchmarks

The Dolphin model was benchmarked against AutoCompressor and Qwen2-7B models. Dolphin exhibited a 95.1% win rate over AutoCompressor and showed competitive performance relative to Qwen2-7B, with a win-tie rate of 67.8%.

Implications and Future Directions

This research has significant implications for the deployment of language models in edge computing environments, such as mobile devices and IoT systems, where energy efficiency and low latency are paramount. The dual-decoder architecture, combined with memory tokens and a multi-stage training regimen, offers a compelling solution for handling long contexts without sacrificing performance.

Future developments might include extending the Dolphin architecture to other modalities or specialized domains, enhancing its application scope. Further optimizations could involve refining the memory token mechanism or improving the projector to handle even more extensive contexts or diverse data types seamlessly.

In conclusion, Dolphin presents a robust framework that addresses key limitations in current on-device language models, particularly concerning energy consumption and processing speed, making it a significant contribution to the field of natural language processing.
