
On decoder-only architecture for speech-to-text and large language model integration (2307.03917v3)

Published 8 Jul 2023 in eess.AS, cs.CL, and cs.SD

Abstract: LLMs have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based LLMs. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.

Citations (103)

Summary

  • The paper presents Speech-LLaMA, a decoder-only architecture that integrates an audio encoder and CTC compressor with a pretrained LLM for improved speech-to-text conversion.
  • The methodology employs frame-averaging in the CTC compressor and non-causal attention masks with LoRA fine-tuning to achieve up to a 4.6 BLEU score gain.
  • The study demonstrates parameter efficiency and superior multilingual translation performance across 13 languages using the CoVoST dataset.

Speech-to-Text and LLM Integration with Decoder-Only Architectures

Introduction

The integration of speech signals into LLMs remains largely unexplored, and the "decoder-only" architecture in particular has not been well studied for speech processing tasks. The paper "On decoder-only architecture for speech-to-text and large language model integration" (2307.03917) presents Speech-LLaMA, an approach that integrates acoustic information into text-based LLMs. The work addresses the problem of aligning speech and text modalities within an LLM, using a decoder-only model for speech-to-text conversion.

Methodology

The study combines a pretrained LLM with an acoustic feature compressor and a simple audio encoder. The proposed Speech-LLaMA model uses Connectionist Temporal Classification (CTC) to compress the acoustic features and the audio encoder to map them onto the continuous semantic space of the LLM, diverging from traditional methods that convert speech into discrete tokens. Because the audio is mapped directly into the LLM's input space, audio and text can be processed by the same model for both transcription and translation.

Architecture Overview:

The proposed system comprises three core components: a pretrained text LLM, a CTC compressor for sequence length reduction, and an audio encoder that bridges the compressor outputs to the LLM's semantic space.

  • CTC Compressor: This module significantly reduces the sequence length by removing redundant acoustic information. Two mechanisms are explored: "blank-removal" and "frame-averaging", with the latter yielding superior results.
  • Audio Encoder: A lightweight randomly initialized module that translates the compressed audio signal into the LLM's semantic space, facilitating deep integration with the LLM.
  • LLMs and LoRA Fine-tuning: The LLaMA-7B model serves as the framework's backbone, augmented via low-rank adaptation (LoRA) for fine-tuning while maintaining minimal computational overhead.

    Figure 1: High-level architecture of our proposed approach with LLM.
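
The two CTC compressor mechanisms named above can be illustrated with a minimal sketch. It assumes per-frame argmax CTC labels are already available and that index 0 is the blank symbol; the function name `ctc_compress`, the plain-list feature representation, and the choice to keep averaged runs of blank frames are illustrative assumptions, not details taken from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_compress(frames, labels, mode="frame-averaging"):
    """Shorten a frame sequence using per-frame CTC argmax labels.

    frames: list of feature vectors (lists of floats), one per frame.
    labels: list of int CTC argmax labels, one per frame.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    if mode == "blank-removal":
        # Drop every frame whose predicted label is the blank symbol.
        return [f for f, lab in zip(frames, labels) if lab != BLANK]

    if mode == "frame-averaging":
        compressed = []
        pos = 0
        # Average each run of consecutive frames that share a label.
        for _, run in groupby(labels):
            n = len(list(run))
            segment = frames[pos:pos + n]
            compressed.append([sum(col) / n for col in zip(*segment)])
            pos += n
        return compressed

    raise ValueError(f"unknown mode: {mode}")
```

With four frames labelled `[0, 5, 5, 0]`, blank-removal keeps only the two middle frames, while frame-averaging merges them into one vector and keeps one vector per blank run, so both variants return a shorter sequence than the input.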

Experimental Setup

The researchers conducted experiments on multilingual speech-to-text translation tasks. The framework was evaluated across 13 languages with data sourced from the CoVoST dataset. Speech-LLaMA was benchmarked against strong seq2seq baselines and evaluated in several configurations, including causal and non-causal attention masks; the non-causal mask lets the model use the full context of the acoustic features.
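
The difference between the causal and non-causal mask strategies can be sketched as follows. This is a simplified illustration that assumes the input sequence is laid out as audio positions followed by text positions; the function name `build_attention_mask` and that layout are assumptions, and the paper's actual sequence may contain additional segments such as a text prompt.

```python
def build_attention_mask(num_audio, num_text, non_causal_audio=True):
    """Return a square boolean mask; mask[i][j] is True if i may attend to j.

    Positions 0 .. num_audio-1 are audio, the rest are text (assumed layout).
    """
    total = num_audio + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = [[j <= i for j in range(total)] for i in range(total)]
    if non_causal_audio:
        # Let every audio position see every other audio position,
        # including later ones; text positions stay causal.
        for i in range(num_audio):
            for j in range(num_audio):
                mask[i][j] = True
    return mask
```

In the non-causal variant the first audio frame can attend to the last audio frame, but never to any text token, so autoregressive generation of the text is unaffected.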

Baseline Comparison:

  • Baseline models employed a seq2seq architecture, with additional experiments using LLaMA n-best rescoring to enhance their performance.
  • Speech-LLaMA demonstrated significant improvements over these baselines, achieving up to a 4.6 BLEU score gain, highlighting its superiority in parameter efficiency and performance.
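
For readers unfamiliar with n-best rescoring, the idea can be sketched as below. The summary does not give the scoring formula used for the baselines, so this uses log-linear interpolation, a common choice, purely as an assumption; the function name `rescore_nbest`, the tuple layout, and the `lm_weight` value are hypothetical.

```python
def rescore_nbest(hypotheses, lm_weight=0.5):
    """Pick the best hypothesis after adding a weighted language-model score.

    hypotheses: list of (text, model_logprob, lm_logprob) tuples, where
    model_logprob comes from the seq2seq model and lm_logprob from the LLM.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    # Combined score = seq2seq log-probability + weight * LLM log-probability.
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])
    return best[0]
```

With `lm_weight=0` the function returns the seq2seq model's own top choice; a larger weight lets the LLM promote a hypothesis that the seq2seq model ranked lower.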

Results

The results underscore the capabilities of the Speech-LLaMA model in multilingual speech translation tasks, with higher BLEU scores across all language pairs tested. The "frame-averaging" approach in the CTC compressor outperformed "blank-removal", effectively leveraging pretrained transcription data to enhance performance. Further, experiments indicated that non-causal attention masks and LoRA fine-tuning yield substantial improvements, underscoring the importance of full-context attention over the acoustic features and of low-rank adaptation in enhancing LLM capabilities for speech tasks.

Figure 2: The architecture of the decoder-only model for the from-scratch training.
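
The LoRA fine-tuning referred to in the results keeps a pretrained weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / r) · B · A. The sketch below shows that arithmetic in plain Python; the helper names are illustrative, and the rank and scaling values in the usage note are hypothetical rather than the paper's settings.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in.
    b: trainable matrix, shape d_out x r.
    a: trainable matrix, shape r x d_in.
    """
    r = len(a)  # the LoRA rank
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in A and B for one adapted matrix."""
    return r * (d_in + d_out)
```

For a 4096 x 4096 projection (the hidden size of LLaMA-7B) and a hypothetical rank of 8, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, which is why the adaptation adds little overhead.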

Discussion

The paper emphasizes the advantages of the decoder-only architecture in speech processing, reporting competitive results with fewer parameters than traditional encoder-decoder models. Because the same decoder parameters are shared across the audio and text inputs, the model is more parameter-efficient, which may benefit both academic research and practical applications of LLMs in speech-related tasks. Future research directions could explore dynamic integration with extensive language resources and optimized fine-tuning techniques for broader application domains.

Conclusion

The development of the Speech-LLaMA framework marks a significant stride in integrating speech and text modalities within LLMs using a decoder-only architecture. This research not only demonstrates substantial gains in multilingual speech-to-text tasks but also sets the stage for future advancements in parameter-efficient, deeply integrated LLMs capable of processing diverse inputs seamlessly.
