
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

(2310.08491)
Published Oct 12, 2023 in cs.CL and cs.LG

Abstract

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to their closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when appropriate reference materials (reference answer, score rubric) are provided. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-source reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://kaistai.github.io/prometheus/.

Figure: Comparison of feedback from Prometheus, GPT-4, and Code-Llama on code-related tasks, showing that specialized models excel.

Overview

  • The paper introduces Prometheus, a 13B parameter open-source language model designed for fine-grained evaluation, rivaling proprietary models like GPT-4.

  • Prometheus leverages the Feedback Collection dataset, consisting of 1K fine-grained score rubrics, 20K instructions, and 100K responses, to enable nuanced text evaluation.

  • Experimental validation shows that Prometheus matches GPT-4's correlation with human evaluators, that its feedback is often preferred over GPT-4's in pairwise comparisons, and that it outperforms open-source reward models on human preference benchmarks, supporting its use as a universal reward model.

  • The study highlights Prometheus's potential in making high-quality evaluation tools more accessible and suggests its utility in various AI training methodologies.

Introducing Prometheus: Enabling Fine-grained Evaluation with Open-source Language Models

Overview

Recent advances in NLP have positioned LLMs as potent tools for evaluating machine-generated text. However, reliance on proprietary models like GPT-4 presents challenges including lack of transparency, version control issues, and financial barriers. Addressing these concerns, this paper introduces Prometheus, a 13B parameter open-source language model designed to rival the evaluation capabilities of GPT-4. Prometheus, leveraging the newly compiled Feedback Collection dataset, demonstrates remarkable proficiency in evaluating long-form responses across diverse custom score rubrics, showcasing its potential as a versatile and accessible evaluator.
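To make the usage pattern concrete, the sketch below assembles the four inputs the evaluator conditions on: an instruction, the response to evaluate, a user-defined score rubric, and a reference answer. The template wording and the `[RESULT] <score>` output convention are assumptions made for illustration; the released code on the project page defines the authoritative prompt format.

```python
import re

# Illustrative only: the real template ships with the Prometheus release.
# This sketch just shows the four inputs the evaluator conditions on.
PROMPT_TEMPLATE = """###Task Description:
Write detailed feedback assessing the response strictly against the given
score rubric, then give a score from 1 to 5.
End your answer with: [RESULT] <score>

###Instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubric:
{rubric}

###Feedback:"""


def build_prompt(instruction, response, reference_answer, rubric):
    """Fill the (assumed) evaluation template with user-supplied materials."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )


def parse_score(generation):
    """Extract the integer score, assuming the '[RESULT] <score>' convention."""
    match = re.search(r"\[RESULT\]\s*([1-5])", generation)
    return int(match.group(1)) if match else None
```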

The Feedback Collection Dataset

The Feedback Collection dataset is designed specifically to enhance the fine-grained evaluation capabilities of LLMs. Consisting of 1K fine-grained score rubrics, 20K instructions, and 100K responses with language feedback generated by GPT-4, it provides a framework for instruction-based evaluation. Each instance comprises an instruction, a response to evaluate, a customized score rubric, a reference answer, and GPT-4-generated feedback paired with a score, enabling nuanced understanding and evaluation of text responses. A model trained on this data therefore learns both to generate detailed feedback and to assign a quantitative rating, paving the way for comprehensive and tailored text evaluation.
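A single Feedback Collection instance can be pictured as the record below. The field names and example content are illustrative assumptions (the released dataset defines the actual schema), but the components mirror what is described above: an instruction, a response, a fine-grained rubric with per-score descriptions, a reference answer, and GPT-4-written feedback paired with a score.

```python
# Illustrative schema for one Feedback Collection instance.
# Field names and content are assumptions; the released dataset defines the real keys.
feedback_collection_instance = {
    "instruction": "Explain why the sky appears blue to a ten-year-old.",
    "response": "The sky is blue because sunlight scatters off air molecules...",
    "score_rubric": {
        "criteria": "Is the explanation accurate and readable for a child?",
        "score1": "Inaccurate or far too technical for a child.",
        "score2": "Mostly inaccurate or largely inaccessible.",
        "score3": "Accurate but only partly child-friendly.",
        "score4": "Accurate and mostly child-friendly, with minor lapses.",
        "score5": "Accurate, engaging, and fully appropriate for a child.",
    },
    "reference_answer": "Sunlight is made of many colors. Blue light bounces "
                        "around in the air more than the other colors, so ...",
    "feedback": "The response gives the correct mechanism (scattering) and "
                "keeps the vocabulary simple, but one sentence is dense...",  # GPT-4-generated
    "score": 4,
}
```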

Experimental Validation

Prometheus’s evaluation prowess was rigorously tested across various benchmarks and compared with human evaluators as well as GPT-4 and ChatGPT models. Key findings are as follows:

  • Correlation with Human Evaluators: Prometheus achieves a Pearson correlation coefficient of 0.897 on 45 customized score rubrics, demonstrating parity with GPT-4 (0.882) and significantly surpassing ChatGPT (0.392).
  • Feedback Quality: In pairwise comparisons, feedback generated by Prometheus is preferred over GPT-4's feedback 58.67% of the time, highlighting its ability to generate meaningful and critical feedback.
  • Universal Reward Model Potential: Prometheus also excels in two human preference benchmarks, outperforming open-source reward models trained explicitly on human preferences.

These results underscore Prometheus's ability not only to emulate human evaluation standards but also to serve as a universal reward model, and they offer insight into its potential applications in model training and development.
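As a minimal sketch of how such a correlation check works, assuming a numeric score per response has already been parsed from each evaluator (the numbers below are made up, not the paper's data):

```python
# Minimal sketch of a correlation check between an evaluator LLM and humans.
# Scores are invented for illustration; they are not the paper's data.
from scipy.stats import pearsonr

human_scores     = [5, 3, 4, 2, 5, 1, 4, 3]
evaluator_scores = [5, 3, 4, 3, 5, 1, 4, 2]  # e.g., parsed from "[RESULT] <score>"

r, p_value = pearsonr(human_scores, evaluator_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```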

Implications and Future Directions

Prometheus challenges the prevailing dependence on proprietary LLMs for text evaluation by offering an open-source alternative that does not sacrifice performance. The inclusion of reference materials such as score rubrics and reference answers proves vital to its success, suggesting avenues for further enhancing LLM evaluators. Additionally, its performance under ranking-based grading schemes suggests Prometheus can serve as a reward model for various AI training methodologies, marking a significant stride toward versatile, transparent, and accessible evaluation tools in NLP.
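One way to read the ranking result: an absolute 1-to-5 scorer can act as a pairwise preference model simply by scoring each candidate independently and preferring the higher score. The `score_with_prometheus` helper below is hypothetical, standing in for whatever inference call builds the evaluation prompt, runs the model, and parses the score.

```python
def score_with_prometheus(instruction, response, rubric, reference_answer):
    """Hypothetical helper: prompt the evaluator and return an integer 1-5 score."""
    raise NotImplementedError("Wire this up to your Prometheus inference stack.")


def prefer(instruction, response_a, response_b, rubric, reference_answer):
    """Turn absolute scores into a pairwise preference, as a reward model would."""
    score_a = score_with_prometheus(instruction, response_a, rubric, reference_answer)
    score_b = score_with_prometheus(instruction, response_b, rubric, reference_answer)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"
```

Ties could be broken by averaging scores over several sampled generations; the paper's reward-model experiments remain the authoritative reference for how ranking was actually performed.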

The open-sourcing of Prometheus, along with the Feedback Collection dataset, not only democratizes access to high-quality evaluation tools but also encourages community collaboration in refining and expanding upon this foundational work. Future research could explore domain-specific evaluator models, further diversify evaluation criteria, and integrate Prometheus into broader AI training and development workflows, paving the way for innovative applications and methodologies in artificial intelligence research.

This research also initiates a crucial discussion on transparency, autonomy, and accessibility in AI evaluation, setting a precedent for future work in the field. Prometheus not only signifies a step forward in LLM-based evaluation but also embodies the collaborative spirit essential for sustainable progress in AI research and development.
