GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

Published 26 Jul 2023 in cs.CL | (2307.13923v2)

Abstract: Grammatical error correction aims to correct ungrammatical sentences automatically. Recently, some work has demonstrated the excellent capabilities of closed-source large language models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduced GrammarGPT, an open-source LLM, to preliminarily explore its potential for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we proposed a heuristic method to guide ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them. In addition, we employed an error-invariant augmentation method to enhance the ability of the model to correct native Chinese grammatical errors. We ultimately constructed about 1k parallel data and utilized these data to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT outperforms the existing SOTA system significantly. Although model parameters are 20x larger than the SOTA baseline, the required amount of data for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs on native CGEC. Our GrammarGPT ranks $3^{rd}$ on NLPCC2023 SharedTask1, demonstrating our approach's effectiveness. The code and data are available at https://github.com/FreedomIntelligence/GrammarGPT.




Summary

The paper presents GrammarGPT as an open-source LLM that leverages hybrid datasets and supervised fine-tuning to address native Chinese grammatical error correction.
It introduces an innovative error-invariant augmentation strategy that substitutes named entities to synthesize additional training data without altering grammatical structure.
It demonstrates competitive performance by ranking third in NLPCC2023 while using only a fraction of the data required by previous state-of-the-art systems.

An Analytical Overview of "GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning"

The paper "GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning" by Fan et al. explores whether open-source LLMs can correct grammatical errors in text written by native Chinese speakers. The work is motivated by the success of closed-source LLMs such as ChatGPT on grammatical error correction and asks whether that success carries over to open-source models.

The authors introduce GrammarGPT, an open-source LLM fine-tuned for Chinese Grammatical Error Correction (CGEC). They draw a distinction that matters for the CGEC literature: much previous work has focused on errors made by non-native Chinese speakers. Errors made by native speakers are more intricate and syntactically nuanced, which makes them harder to correct.

Methodological Approach

The study is built on a hybrid dataset that combines ChatGPT-generated data with manually annotated data. For grammatical errors that come with identifiable clues, the authors use a heuristic method: they give those clues to ChatGPT and have it generate ungrammatical sentences. For errors without clues, they instead collect ungrammatical sentences from publicly available websites and correct them by hand. Together, the two routes yield about 1k parallel examples.
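The two data routes described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ParallelPair` type, the `build_clue_prompt` wording, and the `merge_hybrid` filtering rules are all hypothetical, and the actual prompts and annotation procedure are documented only in the paper and its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPair:
    source: str  # ungrammatical sentence
    target: str  # corrected sentence
    origin: str  # "chatgpt" or "human" (hypothetical provenance tag)


def build_clue_prompt(clue: str, error_type: str) -> str:
    """Build a hypothetical prompt asking an LLM for a clue-based error."""
    return (
        f"Write a Chinese sentence containing a '{error_type}' error "
        f"that involves the clue '{clue}', then give the corrected sentence."
    )


def merge_hybrid(generated, annotated):
    """Merge (source, target) tuples from both routes into one dataset."""
    tagged = [(s, t, "chatgpt") for s, t in generated]
    tagged += [(s, t, "human") for s, t in annotated]
    seen = set()
    pairs = []
    for source, target, origin in tagged:
        # Skip pairs with no correction and exact duplicates (assumed rules).
        if source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        pairs.append(ParallelPair(source, target, origin))
    return pairs
```

Keeping a provenance tag on each pair is a design choice made here for illustration; it makes it easy to check later how much each route contributes to the final dataset.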

The methodology also includes an error-invariant augmentation strategy: named entities in a sentence pair are replaced with similar entities, which produces additional training data without altering the grammatical structure. Because the substitution leaves the error itself untouched, the augmented pairs push the model to attend to the grammatical pattern rather than to specific semantic content.
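A minimal sketch of the entity-substitution idea follows. It assumes the entity to replace and a list of candidate replacements are already known; how the paper detects entities and selects similar ones is not covered here, and the function name `augment_pair` and the example sentences are hypothetical.

```python
import random


def augment_pair(source, target, entity, candidates, rng):
    """Swap one named entity in both sentences, leaving the error untouched.

    Returns a new (source, target) pair, or None if the entity is missing
    from either side or no different candidate is available.
    """
    if entity not in source or entity not in target:
        return None
    options = [c for c in candidates if c != entity]
    if not options:
        return None
    replacement = rng.choice(options)
    # The same replacement is applied to both sides so the pair stays aligned.
    return (
        source.replace(entity, replacement),
        target.replace(entity, replacement),
    )


# Illustrative pair (not from the paper): a redundant "大约 ... 左右" error.
rng = random.Random(0)
original = ("小明大约十岁左右。", "小明大约十岁。")
augmented = augment_pair(*original, entity="小明", candidates=["小红", "小刚"], rng=rng)
```

Only the entity changes between the original and the augmented pair, so the difference between source and target (the redundant "左右") is the same in both.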

Results and Performance

The paper reports that GrammarGPT significantly outperforms the existing state-of-the-art (SOTA) system while using about 1/1200th of the training data (roughly 1k parallel examples), even though the model has about 20x more parameters than the SOTA baseline. This points to the data efficiency of instruction tuning. The model also ranks third in NLPCC2023 SharedTask1.

From a numerical perspective, GrammarGPT's performance is quantified by leveraging both word-level and character-level MaxMatch (M2) scorers. The traditional precision, recall, and $F_{0.5}$ metrics reflect the model's capabilities, with significant improvements shown over the baselines featuring closed-source LLM approaches or those trained on non-native error datasets.
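For reference, the general F-measure is $F_\beta = (1+\beta^2)\,P\,R / (\beta^2 P + R)$, and $\beta = 0.5$ weights precision more heavily than recall. The sketch below computes the three metrics from sets of edits. It is a simplification: the edit representation as `(start, end, replacement)` tuples and the function name `prf` are assumptions, and the real M2 scorer also handles edit extraction and multiple reference annotations, which are omitted here.

```python
def prf(hyp_edits, gold_edits, beta=0.5):
    """Return (precision, recall, F_beta) for two sets of edits."""
    tp = len(hyp_edits & gold_edits)  # edits proposed and also in the reference
    # Convention assumed here: an empty set counts as a perfect score.
    precision = tp / len(hyp_edits) if hyp_edits else 1.0
    recall = tp / len(gold_edits) if gold_edits else 1.0
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Hypothetical edits: 4 proposed, 6 in the reference, 3 in common.
hyp = {(0, 1, "a"), (2, 3, "b"), (4, 5, "c"), (6, 7, "d")}
gold = {(0, 1, "a"), (2, 3, "b"), (4, 5, "c"), (8, 9, "e"), (9, 10, "f"), (10, 11, "g")}
p, r, f = prf(hyp, gold)
```

In this example precision is 0.75 and recall is 0.5; the resulting $F_{0.5}$ of about 0.68 sits closer to precision than the balanced $F_1$ of 0.6 would.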

Implications and Future Directions

The main implication of this research is that open-source LLMs are viable for specialized NLP tasks such as CGEC. By achieving strong results with a small, carefully constructed dataset, the study motivates further exploration of open-source LLMs in other languages and domains.

Theoretically, this work adds evidence for combining instruction tuning with data augmentation in LLM development, suggesting that large labeled datasets are not always necessary and that data efficiency deserves attention. Practically, the approach yields a model that could be applied in educational tools, editorial systems, and writing-assistance software aimed at improving grammatical accuracy for native speakers.

Future research directions might explore further refinements in error detection strategies, adjustments for linguistic variability across dialects, or expansions into multilingual models to broaden the applicability of the GrammarGPT framework. Moreover, integrating more sophisticated heuristic-based methods for data synthesis or leveraging adversarial training may advance this field further.

In conclusion, GrammarGPT shows that an open-source LLM, fine-tuned on a small hybrid dataset, can be competitive on native Chinese grammatical error correction.
