Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation (2305.15011v2)
Abstract: Instruction tuning has shown great promise in improving the performance of large language models (LLMs). However, research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets across different languages. To bridge this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components that seamlessly integrate with LLMs. These adapters have a substantially lower parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Extensive experiments in various multilingual evaluation settings demonstrate that models derived from LoRA-based training over Bactrian-X outperform both the vanilla models and existing instruction-tuned models. The code and models are publicly available at https://github.com/mbzuai-nlp/bactrian-x
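The abstract describes the adapters as lightweight plug-ins that attach to a frozen base LLM. As a minimal sketch of that plug-in workflow (not code from the paper), the snippet below loads a base model and attaches a LoRA adapter with the HuggingFace PEFT library; the base-model identifier, adapter identifier, and Alpaca-style prompt template are assumptions for illustration, so consult the linked repository for the exact names and formats.

```python
# Minimal sketch: plugging a LoRA adapter into a frozen base LLM with PEFT.
# The identifiers below are placeholders, not taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"                  # assumed base-model identifier
adapter_id = "MBZUAI/bactrian-x-llama-7b-lora"   # assumed adapter identifier

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the LoRA adapter: only the low-rank matrices are loaded on top of the
# frozen base weights, which is what makes adapters cheap to swap per language.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Assumed Alpaca-style instruction prompt; the actual template may differ.
prompt = (
    "### Instruction:\n"
    "Translate 'good morning' into Indonesian.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the base weights stay untouched, switching to another language or language group amounts to calling `PeftModel.from_pretrained` with a different adapter identifier.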