Emergent Mind

Abstract

LLMs have made significant strides in various NLP tasks. Recent research shows that moderately sized LLMs often outperform their larger counterparts after task-specific fine-tuning. In this work, we delve into the process of adapting LLMs to specialize in document-level machine translation (DocMT) for a specific language pair. First, we explore how prompt strategies affect downstream translation performance. Then, we conduct extensive experiments with two fine-tuning methods, three LLM backbones, and 18 translation tasks across nine language pairs. Our findings indicate that in some cases these specialized models surpass GPT-4 in translation performance, while in others they still suffer significantly from the off-target translation issue, even when fine-tuned exclusively on bilingual parallel documents. Furthermore, we provide an in-depth analysis of these LLMs tailored for DocMT, exploring aspects such as translation errors, discourse phenomena, training strategy, the scaling law of parallel documents, additional evaluation on recent test sets, and zero-shot cross-lingual transfer. Our findings not only shed light on the strengths and limitations of LLM-based DocMT models but also provide a foundation for future research.

Overview

  • LLMs have shown potential in NLP and specifically in DocMT, offering impressive results with context and coherence maintenance.

  • The study explores fine-tuning of medium-sized LLMs using PEFT and FFT methods and evaluates them on translation quality metrics.

  • Fine-tuned LLMs can outperform even larger models like GPT-4 in certain DocMT tasks, but success varies and off-target translations remain an issue.

  • LLMs fine-tuned with the FFT method were highly data-efficient, matching full-dataset performance with only 1% of the data, whereas PEFT required about 10%.

  • Fine-tuned LLMs excel in generalization and zero-shot cross-lingual transfer, setting a foundation for improved DocMT for low-resource languages.

Introduction to LLMs in Document-Level Translation

The potential of LLMs in the field of NLP has been demonstrated consistently across a variety of applications, carving out an impressive track record in tasks such as text generation, summarization, and question answering. In the specific arena of document-level machine translation (DocMT), which seeks to maintain context and coherence across the sentences of a document during translation, these models have produced remarkable but sometimes inconsistent results. This summary explores extensive research conducted to adapt LLMs for DocMT across multiple language pairs, focusing on the comparative performance of differently sized models under different fine-tuning techniques.
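To make the document-level setting concrete, a DocMT prompt can be assembled by presenting all sentences of a document in a single request, so the model sees cross-sentence context (pronouns, terminology, discourse markers) while translating. The template below is a hypothetical illustration of this idea, not the exact prompt format used in the paper.

```python
def build_docmt_prompt(sentences, src_lang="German", tgt_lang="English"):
    """Assemble a document-level translation prompt.

    Joining all source sentences into one instruction lets the model
    resolve references that span sentence boundaries, which a
    sentence-by-sentence prompt cannot do.
    """
    document = "\n".join(sentences)
    return (
        f"Translate the following {src_lang} document into {tgt_lang}, "
        f"preserving coherence across sentences:\n\n"
        f"{document}\n\n"
        f"Translation:"
    )

prompt = build_docmt_prompt(
    [
        "Der Vertrag wurde gestern unterzeichnet.",
        "Er tritt naechsten Monat in Kraft.",
    ]
)
print(prompt)
```

Here the pronoun "Er" in the second sentence can only be translated correctly ("It", referring to the contract) when the first sentence is visible in the same prompt.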

Exploring Fine-Tuning Strategies for Translation

Moderately sized LLMs, those containing around 7 billion parameters, were fine-tuned using two approaches: Parameter-Efficient Fine-Tuning (PEFT) and Full Fine-Tuning (FFT). These methods were assessed with an array of metrics designed to gauge translation quality. Despite exceptional performance on some tasks, the LLMs still faced challenges such as "off-target" translations, in which the output is produced in the wrong language. Moreover, the study examines the vital role of prompting strategies during the fine-tuning phase, revealing that certain prompt structures can significantly enhance LLM capabilities in translation tasks.
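The core difference between the two approaches is what gets updated: FFT trains every weight of the model, while PEFT methods such as LoRA freeze the pre-trained weights and learn a small low-rank correction. The sketch below is a minimal NumPy illustration of the LoRA idea, assuming LoRA as the PEFT method; the paper's actual training setup and hyperparameters are not reproduced here.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A.

    Only A and B (rank r) are updated during fine-tuning, so the number
    of trainable parameters is r * (d_in + d_out) instead of the full
    d_in * d_out of the base layer.
    """

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen, shape (d_out, d_in)
        self.A = rng.normal(0, 0.01, (r, W.shape[1]))   # trainable down-projection
        self.B = np.zeros((W.shape[0], r))              # trainable, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus scaled low-rank correction. Because B starts
        # at zero, the layer initially behaves exactly like the frozen model.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_params(self):
        return self.A.size + self.B.size

d = 512
layer = LoRALinear(np.random.default_rng(1).normal(size=(d, d)), r=8)
print(layer.trainable_params(), d * d)  # 8192 trainable vs 262144 full weights
```

The zero-initialized `B` matrix means fine-tuning starts from the unmodified pre-trained behavior and only gradually departs from it, which is part of why PEFT is attractive for adapting large backbones to a single language pair.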

Key Findings in Translation Performance

The investigation yielded several key findings when comparing the translation proficiency of fine-tuned LLMs against other state-of-the-art models. Fine-tuned LLMs can surpass even GPT-4, one of the largest available models, on certain tasks. However, success is selective: in other scenarios these same models failed completely due to off-target translation. Remarkably, at comparable metric scores, the smaller fine-tuned LLMs produced fewer translation errors than the larger models. The two fine-tuning methods also differed in data efficiency: FFT required only about 1% of the full dataset to match the performance achieved with the whole set, while PEFT needed 10%.

Advancements and Implications for DocMT

The research implications extend to how LLMs compare with conventional document-level machine translation models. When evaluated on recently created test sets, the fine-tuned LLMs generalized better to out-of-domain text than conventional DocMT models. Furthermore, the study found that base LLMs given task-specific supervised fine-tuning exhibit stronger zero-shot cross-lingual transfer than instruction-tuned LLMs.

The evidence suggests that fine-tuning LLMs on parallel documents can unlock sophisticated translation abilities and thereby meaningfully improve DocMT models. Such models become particularly advantageous for tasks involving low-resource languages, potentially redefining translation approaches for diverse language pairs. The study lays a solid foundation for ongoing research and development in machine translation, pointing the way toward more refined, contextually aware, and accurate translation systems.
