LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements (2404.06283v1)
Abstract: The task of reading comprehension (RC), often implemented as context-based question answering (QA), provides a primary means to assess LLMs' natural language understanding (NLU) capabilities. Yet, when applied to LLMs with extensive built-in world knowledge, this method can be deceptive. If the context aligns with the LLMs' internal knowledge, it is hard to discern whether the models' answers stem from comprehension of the context or from the LLMs' internal information. Conversely, using data that conflicts with the models' knowledge introduces confounding trends that distort the results. To address this issue, we suggest using RC on imaginary data built from fictitious facts and entities. This task is entirely independent of the models' world knowledge, enabling us to evaluate LLMs' linguistic abilities without interference from parametric knowledge. Testing ChatGPT, GPT-4, LLaMA 2 and Mixtral on such imaginary data, we uncover a class of linguistic phenomena that poses a challenge to current LLMs: reasoning about alternative, hypothetical scenarios. While all the models handle simple affirmative and negative contexts with high accuracy, they are much more prone to error in modal and conditional contexts. Crucially, these phenomena also re-expose the LLMs' vulnerability to knowledge conflicts. In particular, while some models prove virtually unaffected by knowledge conflicts in affirmative and negative contexts, when faced with more semantically involved modal and conditional environments they often fail to separate the text from their internal knowledge.
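The evaluation recipe described above can be illustrated with a minimal sketch: the same fictitious fact is wrapped in affirmative, negative, modal, and conditional contexts, and the model is asked to answer strictly from the passage, so any accuracy gap can be attributed to the linguistic construction rather than to world knowledge. Everything in the snippet below (the fictitious entity, the four templates, and the `query_model` placeholder) is an illustrative assumption, not material from the paper's dataset or released code.

```python
# A minimal sketch of "imaginary" RC probes over a fictitious entity.
# The entity ("the zorlek"), the templates, and query_model are hypothetical.

# Four context types around the same fictitious fact.
CONTEXT_TEMPLATES = {
    "affirmative": "The zorlek is a metal found only on the planet Vetra.",
    "negative":    "The zorlek is not a metal found on the planet Vetra.",
    "modal":       "The zorlek might be a metal found on the planet Vetra.",
    "conditional": "If the expedition succeeds, the zorlek will be a metal found on the planet Vetra.",
}

QUESTION = "According to the text, is the zorlek definitely a metal found on Vetra?"


def build_prompt(context: str, question: str) -> str:
    """Compose a context-based QA prompt that tells the model to rely on the text only."""
    return (
        "Answer the question based only on the passage below.\n\n"
        f"Passage: {context}\n"
        f"Question: {question}\n"
        "Answer (yes / no / cannot be determined):"
    )


def query_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (API or local model); returns a dummy answer here."""
    return "cannot be determined"


if __name__ == "__main__":
    for context_type, context in CONTEXT_TEMPLATES.items():
        answer = query_model(build_prompt(context, QUESTION))
        print(f"{context_type:12s} -> {answer}")
```

Because only the wrapping construction varies, comparing per-context accuracy (and how often the model falls back on what it "knows" about the fictitious fact) isolates the modal and conditional failure modes the abstract reports.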