DaLC: Domain Adaptation Learning Curve Prediction for Neural Machine Translation (2204.09259v1)
Abstract: Domain Adaptation (DA) of Neural Machine Translation (NMT) models often relies on a pre-trained general NMT model that is adapted to the new domain using a sample of in-domain parallel data. Without parallel data, there is no way to estimate the potential benefit of DA, nor the number of parallel samples it would require. Such an estimate is, however, a desirable functionality that could help MT practitioners make an informed decision before investing resources in dataset creation. We propose a Domain adaptation Learning Curve prediction (DaLC) model that predicts prospective DA performance based on in-domain monolingual samples in the source language. Our model relies on NMT encoder representations combined with various instance- and corpus-level features. We demonstrate that instance-level features are better able to distinguish between different domains than the corpus-level frameworks proposed in previous studies. Finally, we perform in-depth analyses of the results, highlighting the limitations of our approach and providing directions for future research.
- Cheonbok Park
- Hantae Kim
- Ioan Calapodescu
- Hyunchang Cho
- Vassilina Nikoulina
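
The abstract describes a predictor built on NMT encoder representations of in-domain monolingual source samples, combined with instance- and corpus-level features, to estimate a domain-adaptation learning curve. Below is a minimal, hypothetical sketch of such a predictor, not the authors' implementation: all module names, dimensions, features, and curve points are illustrative assumptions.

```python
# Hypothetical DaLC-style sketch: sentence-level representations from a frozen NMT
# encoder are pooled with learned instance weights, concatenated with corpus-level
# statistics, and regressed onto a learning curve (expected post-adaptation scores
# at several in-domain data sizes). Dimensions and features are assumed, not taken
# from the paper.
import torch
import torch.nn as nn


class DaLCPredictor(nn.Module):
    def __init__(self, enc_dim: int = 512, corpus_feat_dim: int = 8, n_curve_points: int = 4):
        super().__init__()
        # Scores each instance-level encoder representation before pooling.
        self.instance_scorer = nn.Sequential(
            nn.Linear(enc_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        # Maps pooled instance features + corpus-level features to curve points
        # (e.g. predicted BLEU after adapting on 10k/50k/100k/500k sentence pairs).
        self.curve_head = nn.Sequential(
            nn.Linear(enc_dim + corpus_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_curve_points),
        )

    def forward(self, enc_reprs: torch.Tensor, corpus_feats: torch.Tensor) -> torch.Tensor:
        # enc_reprs: (n_sentences, enc_dim) sentence-level NMT encoder representations
        # corpus_feats: (corpus_feat_dim,) corpus-level statistics (e.g. vocabulary overlap)
        weights = torch.softmax(self.instance_scorer(enc_reprs), dim=0)  # (n, 1)
        pooled = (weights * enc_reprs).sum(dim=0)                        # (enc_dim,)
        return self.curve_head(torch.cat([pooled, corpus_feats], dim=-1))


if __name__ == "__main__":
    model = DaLCPredictor()
    enc_reprs = torch.randn(100, 512)      # stand-in for frozen-encoder sentence embeddings
    corpus_feats = torch.randn(8)          # stand-in for corpus-level features
    print(model(enc_reprs, corpus_feats))  # four predicted learning-curve points
```

Such a predictor would be trained on (domain sample, observed adaptation outcome) pairs collected from past DA runs; the instance-level weighting is one plausible way to let representative in-domain sentences dominate the pooled representation.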