XGen-7B Technical Report (2309.03450v1)

Published 7 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B parameter models on up to 8K sequence length for up to 1.5T tokens. We have also finetuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancements and commercial applications. Our evaluation on standard benchmarks shows that XGen models achieve comparable or better results when compared with state-of-the-art open-source LLMs. Our targeted evaluation on long sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.

Summary

The paper presents XGen, a series of 7B-parameter LLMs trained on sequences of up to 8K tokens, beyond the 2K-token limit common among open-source models.
The models are trained in two stages on up to 1.5 trillion tokens, using TPU-v4 hardware and the JaxFormer library for efficient scaling.
Evaluations show strong results for the instruction-tuned variants and competitive performance on tasks such as code generation and summarization.

An Analysis of the XGen-7B Technical Report

The XGen-7B Technical Report introduces a series of 7B-parameter LLMs that address a common constraint of existing open-source LLMs: their limited capacity for long-sequence processing. The report covers the development, training, and evaluation of the XGen-7B models, and emphasizes their competitive performance and their availability for both research and commercial use.

Training and Model Specifications

The XGen-7B series processes sequences of up to 8K tokens, whereas many other open-source LLMs are limited to 2K. Training uses a two-stage method over up to 1.5 trillion tokens and begins with shorter sequences before moving to longer ones. Because the cost of self-attention grows with sequence length, spending the early part of training on short sequences keeps pre-training efficient, while the later long-sequence training lets the model use the context available in lengthy inputs.
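
The short-to-long schedule can be sketched as a lookup from tokens seen so far to the sequence length in effect. This is a minimal illustration, not the report's training code: the stage boundaries, the `Stage` and `seq_len_at` names, and the batch size in tokens are all hypothetical values chosen only so that the budgets sum to 1.5 trillion tokens.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    seq_len: int       # tokens per training sequence in this stage
    token_budget: int  # total training tokens spent in this stage


# Hypothetical split: these numbers are illustrative, not from the report.
SCHEDULE = (
    Stage(seq_len=2048, token_budget=1_000_000_000_000),
    Stage(seq_len=8192, token_budget=500_000_000_000),
)


def seq_len_at(tokens_seen: int, stages: Sequence[Stage]) -> int:
    """Return the sequence length in effect after `tokens_seen` tokens."""
    boundary = 0
    for stage in stages:
        boundary += stage.token_budget
        if tokens_seen < boundary:
            return stage.seq_len
    # Past the end of the schedule: stay at the final sequence length.
    return stages[-1].seq_len


def sequences_per_batch(tokens_per_batch: int, seq_len: int) -> int:
    """With a fixed token count per batch, longer sequences mean fewer rows."""
    return tokens_per_batch // seq_len


if __name__ == "__main__":
    tokens_per_batch = 4_194_304  # hypothetical batch size, in tokens
    for seen in (0, 999_999_999_999, 1_000_000_000_000):
        length = seq_len_at(seen, SCHEDULE)
        rows = sequences_per_batch(tokens_per_batch, length)
        print(f"tokens_seen={seen}: seq_len={length}, rows={rows}")
```

Holding the number of tokens per batch fixed, as assumed here, means the batch holds fewer but longer sequences once the schedule switches to the longer length.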

The architecture of XGen-7B follows the LLaMA model, with changes to the vocabulary size and the supported sequence length, which eases integration with existing frameworks. Training ran on TPU-v4 hardware using JaxFormer, which combines data and model parallelism.

Instruction Tuning and Evaluation

The paper details the development of the instruction-tuned variants, XGen-7B-Inst, which were fine-tuned on public-domain instructional data. On instruction-following benchmarks such as AlpacaEval and MT-Bench, these models outperform similarly sized models and occasionally larger ones.

The evaluations show that XGen-7B is competitive across a range of tasks, including standard NLP benchmarks and code generation on HumanEval, where its pass@1 rate is similar to that of state-of-the-art open-source models.
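
For readers unfamiliar with the metric, pass@k on HumanEval is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The sketch below implements that standard estimator; whether the report used exactly this procedure, and the sample counts in the usage lines, are assumptions made for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up counts for three problems, 10 samples each.
    made_up = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(made_up, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why pass@1 is often read as single-attempt accuracy.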

Long Sequence Modeling and Practical Implications

XGen-7B's ability to handle longer sequences is particularly useful for applications that depend on long input context, such as code generation, summarization, and long-form question answering. Targeted evaluations on long-sequence tasks show a clear improvement of the 8K-sequence models over open-source models with a 2K sequence limit.
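
One practical reason a larger window helps is that an input longer than the window has to be cut before the model sees it. The sketch below shows how much of a prompt survives under a 2K versus an 8K window. The `fit_to_context` helper, the choice to keep the most recent tokens, and the reserved output budget are hypothetical and are not described in the report.

```python
from typing import List, Sequence


def fit_to_context(
    prompt_tokens: Sequence[int],
    context_len: int,
    max_new_tokens: int,
) -> List[int]:
    """Trim a prompt so that prompt plus generated tokens fit in the window.

    Assumption: when the prompt is too long, the oldest tokens are dropped
    and the most recent ones are kept.
    """
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_len")
    if len(prompt_tokens) <= budget:
        return list(prompt_tokens)
    return list(prompt_tokens[-budget:])


if __name__ == "__main__":
    document = list(range(6000))  # stand-in for a 6000-token input
    for window in (2048, 8192):
        kept = fit_to_context(document, context_len=window, max_new_tokens=256)
        print(f"window={window}: kept {len(kept)} of {len(document)} tokens")
```

Under these assumptions the 2K window discards most of the 6000-token input, while the 8K window keeps it whole.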

Moreover, the model's open-source nature invites further exploration and adaptation, promising to advance research and facilitate various commercial applications.

Future Prospects

The XGen-7B series illustrates what smaller models can achieve when trained efficiently on a large volume of data. This has implications for making LLMs more accessible and useful in mobile and distributed computing environments. Future research could extend such models to specialized tasks and additional modalities, broadening their versatility.

In conclusion, the XGen-7B Technical Report advances the development and application of open-source LLMs by targeting the challenges of sequence length limitations. Its contribution is a robust, accessible alternative for research and commercial use, and it invites further work on scalable and efficient LLM training methods.

GitHub

GitHub - salesforce/xgen: Salesforce open-source LLMs with 8k sequence length. (718 stars)