
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models (2405.18027v1)

Published 28 May 2024 in cs.CL

Abstract: While LLMs can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.
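The core failure mode the abstract describes can be made concrete with a toy check. The sketch below is a hypothetical illustration, not the paper's actual benchmark pipeline or the Narrative-Experts method: it assumes a simple list of story events tagged with their narrative position, and flags a role-played response that mentions events the character cannot yet know at its point in time.

```python
# Hypothetical illustration (not the TimeChara pipeline): flag point-in-time
# character hallucination by checking whether a role-played response mentions
# story events that occur after the character's current narrative position.

from dataclasses import dataclass


@dataclass
class Event:
    name: str
    book: int  # narrative position (e.g., book number) at which the event occurs


def find_hallucinated_events(
    response: str, events: list[Event], current_book: int
) -> list[str]:
    """Return names of future events mentioned in the response."""
    lowered = response.lower()
    return [
        e.name
        for e in events
        if e.book > current_book and e.name.lower() in lowered
    ]


events = [
    Event("Chamber of Secrets", 2),
    Event("Triwizard Tournament", 4),
]

# A character situated at book 2 should not reference book-4 events.
response = "I can't stop thinking about the Triwizard Tournament!"
print(find_hallucinated_events(response, events, current_book=2))
# → ['Triwizard Tournament']
```

A real evaluator would of course need far more than string matching (paraphrase, implication, and persona-consistent ignorance all matter), which is why the paper relies on an automated generation pipeline and decomposed expert reasoning rather than surface checks.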


