
Efficient Large Language Models: A Survey

(2312.03863)
Published Dec 6, 2023 in cs.CL and cs.AI

Abstract

LLMs have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspectives, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

Figure: Illustrations showcasing efficient pre-training techniques for Large Language Models (LLMs).

Overview

  • The paper provides a systematic survey of methods for improving the efficiency of LLMs.

  • Model-centric techniques such as model compression, efficient pre-training, fine-tuning, inference, and architecture design are reviewed in depth.

  • Data-centric strategies including meticulous data selection and prompt engineering are explored as means to increase LLM efficiency.

  • The work discusses different frameworks like DeepSpeed and Megatron that support the optimization of LLM training and inference processes.

  • It provides a taxonomy of efficiency measures for LLMs and encourages further research to support their broader application and democratization.

Introduction

LLMs have significantly advanced the field of natural language processing. Their success in various tasks, however, is matched by their substantial computational and resource demands. The increasing scale of LLMs necessitates a critical examination of efficiency from both algorithmic and systems perspectives. This paper presents a comprehensive survey of methodologies for enhancing the efficiency of LLMs, which is essential for broader and more cost-effective deployment.

Model-Centric Methods

Model Compression Techniques

The compression of LLMs is pivotal for mitigating their resource intensiveness. The survey categorizes model compression into quantization, parameter pruning, low-rank approximation, and knowledge distillation.

Quantization methods compress model weights to lower-precision representations, either after training (post-training quantization) or during training (quantization-aware training, QAT). Methods such as LLM.int8(), GPTQ, and AWQ push toward aggressively reduced precision while preserving model quality.
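
To make the idea concrete, below is a minimal sketch of symmetric per-channel int8 post-training quantization in PyTorch. It illustrates the general principle only, not the specific algorithms of LLM.int8(), GPTQ, or AWQ; the function names are illustrative.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Returns the int8 tensor plus per-channel scales needed to dequantize.
    """
    # One scale per output channel (row), chosen so the largest magnitude maps to 127.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                 # hypothetical linear-layer weight
q, s = quantize_per_channel_int8(w)
print((w - dequantize(q, s)).abs().max())   # worst-case quantization error
```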

Parameter pruning strategies remove weights either in structured groups that respect the model's architecture (structured pruning) or as individual parameters (unstructured pruning).
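
The sketch below contrasts the two flavors on a single weight matrix: unstructured magnitude pruning zeroes individual small weights, while a toy structured variant drops whole output channels. The functions are illustrative and not drawn from any particular paper.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

def prune_rows(weight: torch.Tensor, keep: int) -> torch.Tensor:
    """Structured pruning: keep only the `keep` output channels with the largest L2 norm."""
    idx = weight.norm(dim=1).topk(keep).indices
    return weight[idx]

w = torch.randn(512, 512)
print(magnitude_prune(w, 0.5).eq(0).float().mean())  # roughly half of the weights zeroed
print(prune_rows(w, keep=256).shape)                 # 256 of 512 rows retained
```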

Low-rank approximation replaces weight matrices with products of low-rank factors, reducing both parameter count and computational burden. TensorGPT and ZeroQuant-V2 are cited as representatives of this category.
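
A minimal low-rank factorization sketch via truncated SVD follows; this is the generic technique, not the specific procedure of TensorGPT or ZeroQuant-V2.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate W (out x in) as A @ B with A: (out x rank), B: (rank x in) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)
# Parameters per matmul drop from 1024*1024 to 2*1024*64.
print(torch.linalg.matrix_norm(W - A @ B) / torch.linalg.matrix_norm(W))
```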

Knowledge Distillation involves training compact student models that emulate the performance of the larger teacher LLMs. Methods like Baby Llama and GKD underscore the diversity within this area.
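
A common ingredient of such methods is a soft-label distillation loss that pulls the student's output distribution toward the teacher's. The sketch below shows this generic loss in PyTorch; specific methods such as Baby Llama or GKD build on more elaborate objectives.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label distillation: KL divergence between temperature-scaled distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(8, 32000, requires_grad=True)   # (batch, vocab)
teacher_logits = torch.randn(8, 32000)                        # produced by the frozen teacher
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```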

Efficient Pre-Training and Fine-tuning

Efficient pre-training methodologies range from mixed-precision acceleration, which uses lower-precision arithmetic to trade computational cost against accuracy, to model scaling, initialization techniques, and optimizers such as Adam or Sophia that improve pre-training speed and efficiency.
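
As an illustration of mixed-precision acceleration, the sketch below uses PyTorch automatic mixed precision with a gradient scaler. It assumes a CUDA device and uses a single linear layer as a stand-in for an LLM block.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()        # stand-in for an LLM block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # rescales gradients to avoid fp16 underflow

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)   # forward runs in fp16 where safe
    scaler.scale(loss).backward()                 # backward pass on the scaled loss
    scaler.step(optimizer)                        # unscale, then apply the fp32 update
    scaler.update()
    return loss.item()

# Example usage (requires a GPU):
# x = torch.randn(32, 4096, device="cuda"); y = torch.randn(32, 4096, device="cuda")
# train_step(x, y)
```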

Fine-tuning imparts task-specific knowledge to LLMs while conserving resources. Parameter-efficient methods such as LoRA and prefix-tuning introduce only a small number of trainable parameters, while memory-efficient approaches such as Selective Fine-Tuning substantially reduce GPU memory usage.
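
A minimal LoRA-style adapter sketch follows, assuming a frozen base linear layer plus a trainable low-rank update; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # only the adapter matrices are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 4096 * 8 adapter parameters vs. 4096 * 4096 in the frozen base weight
```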

Efficient Inference Strategies

These include speculative decoding, KV-cache optimization, and sharing-based attention acceleration. Speculative decoding drafts candidate token sequences with a smaller model and verifies them with the target model, reducing inference latency, whereas KV-cache optimization avoids redundant key-value computation during autoregressive decoding.
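
The sketch below illustrates a simplified, greedy form of speculative decoding: a draft model proposes several tokens, one target-model forward pass verifies them, and the longest agreeing prefix is kept. The model interfaces (`draft_next_fn`, `target_logits_fn`) are hypothetical stand-ins, and the published algorithm additionally uses rejection sampling to preserve the target distribution.

```python
import torch

def speculative_decode_step(target_logits_fn, draft_next_fn, prefix, k=4):
    """One greedy speculative decoding step.

    draft_next_fn(ids) -> next token id proposed by a small draft model.
    target_logits_fn(ids) -> (len(ids), vocab) logits from the large target model.
    """
    draft = list(prefix)
    proposed = []
    for _ in range(k):                      # draft model proposes k tokens autoregressively
        t = draft_next_fn(torch.tensor(draft))
        draft.append(t)
        proposed.append(t)
    logits = target_logits_fn(torch.tensor(draft))   # single target pass scores all proposals
    accepted = []
    for i, t in enumerate(proposed):
        target_choice = int(logits[len(prefix) + i - 1].argmax())
        if target_choice == t:
            accepted.append(t)              # target agrees: keep the drafted token
        else:
            accepted.append(target_choice)  # disagreement: take the target's token and stop
            break
    return list(prefix) + accepted

# Toy usage with stand-in models over a 5-token vocabulary.
draft_fn = lambda ids: int(ids[-1]) % 5                                   # hypothetical draft model
target_fn = lambda ids: torch.nn.functional.one_hot(ids % 5, 5).float()   # hypothetical target model
print(speculative_decode_step(target_fn, draft_fn, prefix=[1, 2], k=3))
```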

Sharing-based attention acceleration methods such as multi-query attention (MQA) and grouped-query attention (GQA) share key and value projections across attention heads, reducing computational and memory overhead.
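
A minimal grouped-query attention sketch in PyTorch follows, where 32 query heads share 8 key/value heads; setting the number of KV heads to 1 recovers MQA. Shapes and head counts are illustrative.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_query_heads=32, n_kv_heads=8):
    """Grouped-query attention: many query heads share a smaller set of K/V heads.

    q: (batch, seq, n_query_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim).
    """
    group = n_query_heads // n_kv_heads
    # Broadcast each K/V head to all query heads in its group.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # -> (batch, heads, seq, head_dim)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)                         # back to (batch, seq, heads, head_dim)

b, s, d = 2, 16, 128
q = torch.randn(b, s, 32, d)
k = torch.randn(b, s, 8, d)
v = torch.randn(b, s, 8, d)
print(grouped_query_attention(q, k, v).shape)   # the KV cache stores 8 heads instead of 32
```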

Efficient Architecture Design for LLMs

Current strategies for long-context LLMs include extrapolation and interpolation, which aim to extend model performance to sequences longer than those seen during training. For instance, ALiBi replaces positional embeddings with linear attention biases, allowing models to extrapolate beyond their pre-training context length. Other tactics, such as recurrent structures and window & stream structures, segment texts into manageable chunks or adopt recurrence to retain long-term context information.
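
For illustration, the sketch below computes ALiBi-style attention biases, which grow linearly with the distance between query and key positions; it assumes the number of heads is a power of two, as in the original slope schedule.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi: a head-specific linear penalty proportional to query-key distance.

    Returns a (n_heads, seq_len, seq_len) bias added to attention scores before softmax.
    """
    # Geometric slope schedule from the ALiBi paper (assumes n_heads is a power of two).
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]     # element [i, j] = j - i
    distance = distance.clamp(max=0)                        # only penalize past (causal) keys
    return slopes[:, None, None] * distance[None, :, :].float()

bias = alibi_bias(n_heads=8, seq_len=6)
print(bias[0])   # farther-away keys receive larger negative biases
```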

Data-Centric Methods

Data Selection

Careful data selection enables more effective and efficient LLM pre-training and fine-tuning. Strategies span unsupervised and supervised filtering methods for performance enhancement, along with techniques that optimize instruction quality and example ordering.
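
As one simplified example of unsupervised selection, the sketch below keeps the examples a reference language model scores with the lowest loss (i.e., lowest perplexity); the scoring function here is a hypothetical stand-in, and real pipelines combine several such heuristics.

```python
def select_low_perplexity(examples, loss_fn, keep_fraction=0.5):
    """Keep the examples a reference model finds most predictable.

    loss_fn(text) -> average per-token loss under a (hypothetical) reference LM.
    """
    scored = sorted(examples, key=loss_fn)      # lower loss = lower perplexity
    keep = int(len(scored) * keep_fraction)
    return scored[:keep]

# Toy usage with a stand-in scoring function.
corpus = ["clean, well-formed sentence.", "asdf 1234 qwer", "another fluent example."]
fake_loss = {"clean, well-formed sentence.": 2.1, "asdf 1234 qwer": 9.7,
             "another fluent example.": 2.4}
print(select_low_perplexity(corpus, lambda t: fake_loss[t], keep_fraction=0.67))
```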

Prompt Engineering

Prompt engineering is an emerging avenue for efficiency: few-shot prompting, guided by self-instruction and chain-of-thought (CoT) techniques, elicits deeper reasoning from fewer examples; prompt compression encodes prompt information more compactly; and prompt generation automates the creation of effective prompts to steer model output.
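
For illustration, here is a toy few-shot chain-of-thought prompt; the content is invented and merely shows the pattern of supplying a worked reasoning example before the target question.

```python
# A few-shot CoT prompt: a worked example elicits step-by-step reasoning,
# so fewer demonstrations are needed to get a well-structured answer.
COT_PROMPT = """\
Q: A cafe sold 23 coffees in the morning and 18 in the afternoon. How many in total?
A: Morning sales are 23 and afternoon sales are 18. 23 + 18 = 41. The answer is 41.

Q: A library had 120 books and lent out 37. How many remain?
A:"""

print(COT_PROMPT)  # the model is expected to continue with reasoning steps and a final answer
```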

LLM Frameworks

Frameworks designed to support LLMs must consider their scale and complexity. DeepSpeed and Megatron stand out for integrating optimizations for both training and inference, while Alpa and ColossalAI focus on auto-tuning and parallel execution. Frameworks like vLLM and Parallelformers take an inference-centric approach.

Concluding Remarks

The paper encapsulates a rich spectrum of innovative methods and frameworks to elevate the efficiency of LLMs. It not only synthesizes current research into a coherent taxonomy but also catalyzes further exploration into this expanding horizon. By doing so, it opens the door to the democratization and widespread adoption of LLMs across various applications.
