Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models (2402.13064v1)
Abstract: We introduce Generalized Instruction Tuning (GLAN), a general and scalable method for instruction tuning of LLMs. Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on LLMs (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy.
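The generation pipeline described above (taxonomy → subjects per discipline → syllabus per subject → instructions per class session) can be sketched as a nested traversal. This is a minimal illustration, not the paper's implementation: the `query_llm` stub, the taxonomy contents, and the prompt wordings are all hypothetical stand-ins for real LLM calls and the paper's actual prompts.

```python
def query_llm(prompt: str) -> list[str]:
    """Placeholder for a real LLM call (e.g. to GPT-4); returns canned
    items so the sketch is runnable. The keyword dispatch below is purely
    illustrative."""
    canned = {
        "subjects": ["Calculus", "Linear Algebra"],
        "syllabus": ["Limits and continuity", "Derivatives"],
        "instructions": ["Compute the derivative of f(x) = x^2 at x = 3."],
    }
    for key, items in canned.items():
        if key in prompt:
            return items
    return []

def generate_instructions(taxonomy: dict[str, list[str]]) -> list[str]:
    """Walk field -> discipline -> subject -> class session, generating
    instructions from the key concepts of each session."""
    data: list[str] = []
    for field, disciplines in taxonomy.items():
        for discipline in disciplines:
            subjects = query_llm(f"List subjects for the discipline {discipline}.")
            for subject in subjects:
                sessions = query_llm(
                    f"Design a syllabus for {subject}: class sessions with key concepts."
                )
                for session in sessions:
                    data.extend(query_llm(
                        f"Write diverse instructions covering the key concepts of '{session}'."
                    ))
    return data

# Customization is a matter of adding a node to the taxonomy dict.
taxonomy = {"Natural Sciences": ["Mathematics"]}
print(len(generate_instructions(taxonomy)))
```

With the canned outputs above, one discipline expands to 2 subjects × 2 sessions × 1 instruction each, so the sketch prints 4; in the actual method each level fans out much more widely.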
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
- The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
- The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
- Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
- Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
- Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023.
- Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.
- Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.
- Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Education. Wikipedia, 2023. Last edited on 24 March 2023.
- Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Skywork: A more open bilingual foundation model, 2023.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196, 2023.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
- Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.