Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations (2404.07851v1)
Abstract: Machine Translation (MT) remains one of the last NLP tasks where LLMs have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on translation quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies that vary the nature of the feedback provided, and then fine-tune the LLM to improve its ability to exploit that guidance. Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU, and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality according to both automatic and human evaluation.
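The abstract describes prompting an LLM with fine-grained MQM error annotations (error span, category, severity) so that it post-edits an existing machine translation. The sketch below shows how such a feedback-conditioned prompt could be assembled; the template wording, field names, and example data are illustrative assumptions, not the authors' exact prompts.

```python
# Sketch of building a post-editing prompt from MQM-style error annotations.
# Field names ("span", "category", "severity") and the template wording are
# assumptions for illustration; the paper's actual prompts may differ.

def build_postedit_prompt(source, translation, errors,
                          src_lang="Chinese", tgt_lang="English"):
    """Compose an instruction asking an LLM to fix the listed MQM errors."""
    # Each MQM annotation marks an error span with a category and severity.
    error_lines = "\n".join(
        f'- "{e["span"]}": {e["category"]} error, {e["severity"]} severity'
        for e in errors
    )
    return (
        f"Improve the following {src_lang}-{tgt_lang} machine translation "
        f"by correcting the annotated errors.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        f"Errors:\n{error_lines}\n"
        f"Improved translation:"
    )

prompt = build_postedit_prompt(
    source="我们明天开会。",
    translation="We will meet a meeting tomorrow.",
    errors=[{"span": "meet a meeting",
             "category": "fluency/grammar",
             "severity": "major"}],
)
print(prompt)
```

Coarser feedback variants (e.g. a generic "this translation contains errors" instruction, or error counts without spans) can be produced by simply omitting or aggregating the `Errors:` section, which is how the prompting strategies in the abstract could differ in granularity.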
Authors: Dayeon Ki, Marine Carpuat