Sabiá: Portuguese Large Language Models (2304.07880v4)

Published 16 Apr 2023 in cs.CL and cs.AI

Abstract: As the capabilities of LLMs continue to advance, it is conceivable that a "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.

Summary

The paper demonstrates that monolingual pretraining on Portuguese data significantly boosts performance in few-shot tasks.
It adapts English-centric models such as GPT-J and LLaMA using 3% or less of their original pretraining budget and achieves competitive results.
The study highlights the trade-off of specialization, with enhanced Portuguese benchmarks at the expense of reduced English task performance.

Essay on "Sabiá: Portuguese Large Language Models"

The paper "Sabiá: Portuguese Large Language Models" investigates whether monolingual pretraining can improve LLMs that have already been extensively trained on diverse corpora. It highlights the performance gap between language-specific pretraining and the prevalent practice of using a single multilingual model for many languages.

Summary and Findings

Central to the paper's narrative is the assertion that continued pretraining on a monolingual corpus in a target language such as Portuguese leads to superior performance on tasks in that language. The researchers extended the pretraining of existing English-centric models, namely GPT-J and LLaMA, using Portuguese texts. Although this additional pretraining consumed only 3% or less of each model's original pretraining budget, the resulting models showed marked improvements on Portuguese tasks.
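The budget claim above is a simple ratio. The following is a minimal sketch in Python, assuming the budget is measured in training tokens. The token counts are hypothetical round numbers chosen for illustration, not figures from the paper, and the function names are likewise illustrative.

```python
def budget_fraction(continued_tokens: float, original_tokens: float) -> float:
    """Share of the original pretraining budget spent on continued pretraining."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return continued_tokens / original_tokens


def max_continued_tokens(original_tokens: float, cap: float = 0.03) -> float:
    """Largest continued-pretraining token count that stays within the cap."""
    return original_tokens * cap


# Hypothetical numbers, for illustration only (not taken from the paper).
ORIGINAL_TOKENS = 1_000_000_000_000   # tokens seen in the original pretraining run
CONTINUED_TOKENS = 30_000_000_000     # Portuguese tokens seen afterwards

fraction = budget_fraction(CONTINUED_TOKENS, ORIGINAL_TOKENS)
print(f"Continued pretraining used {fraction:.1%} of the original budget")
```

Measuring the budget in tokens is one possible choice; compute (FLOPs) or training steps would work the same way as long as both runs use the same unit.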

Using Poeta, a suite of 14 Portuguese datasets, the authors report that the specialized models outperform their English-centric and multilingual counterparts by substantial margins in few-shot evaluations. The best model, Sabiá-65B, further pretrained with only a small fraction of its original pretraining budget, performed on par with OpenAI's GPT-3.5-turbo on Portuguese tasks.
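For readers unfamiliar with few-shot evaluation, the idea is to show the model a handful of solved examples followed by an unsolved one and score its continuation. The sketch below illustrates this in Python. The `Example` class, the `build_few_shot_prompt` helper, and the "Texto:/Resposta:" template are assumptions made for illustration and are not the prompt format used by Poeta.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    text: str   # input shown to the model
    label: str  # gold answer shown for the solved (few-shot) examples


def build_few_shot_prompt(shots: List[Example], query: str, instruction: str = "") -> str:
    """Concatenate solved examples and one unsolved query into a single prompt."""
    parts = []
    if instruction:
        parts.append(instruction)
    for shot in shots:
        parts.append(f"Texto: {shot.text}\nResposta: {shot.label}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Texto: {query}\nResposta:")
    return "\n\n".join(parts)


# Hypothetical sentiment examples, for illustration only.
shots = [
    Example("O filme foi excelente.", "positivo"),
    Example("O atendimento foi péssimo.", "negativo"),
]
prompt = build_few_shot_prompt(
    shots,
    "Adorei o restaurante.",
    instruction="Classifique o sentimento do texto.",
)
print(prompt)
```

The model's continuation after the final "Resposta:" would then be compared with the gold label; how that comparison is scored depends on the dataset.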

Numerical Implications

The evaluation on the Poeta benchmark showed gains across its datasets, with improvements especially pronounced where cultural and domain-specific knowledge of Brazil matters, such as the ENEM dataset. The results indicate that most of the benefit of language-specific pretraining comes from the domain-relevant knowledge it injects into the model, which translates into task-specific gains that generalized multilingual training does not deliver. This specialization, however, comes with a documented decrease in performance on English benchmarks, illustrating the trade-off inherent in such targeted refinement.

Theoretical and Practical Implications

From a theoretical perspective, this paper supports the hypothesis that LLMs can benefit from specialization. Adapting foundation models through further focused pretraining raises questions about how pretraining and task-specific adaptation phases should be ordered. Practically speaking, this points toward model diversification, where multiple specialized models could replace a single multilingual model within production pipelines.

Future Directions in AI

Looking ahead, the findings suggest an evolving landscape where AI practitioners might deploy a multitude of specialized models tailored explicitly to domain-specific needs, rather than rely solely on an all-encompassing multilingual model. This paper catalyzes further exploration into cost-effective methods to enhance multilingual capabilities through language-specific adaptation.

In conclusion, the "Sabiá: Portuguese Large Language Models" paper contributes valuable insights into language-specific model specialization, emphasizing the tangible benefits and compromises inherent in moving away from a "one-size-fits-all" approach to LLM pretraining. The findings point to the potential of a diverse ecosystem of LLMs honed for specific linguistic and cultural domains.
