- The paper analyzes over 100 benchmarks, revealing that current commonsense tests often fail due to flawed design and narrow scope.
- The paper details the evolution of AI commonsense reasoning across multiple domains including language, vision, and robotics.
- The paper recommends creating diverse, high-quality benchmarks to address issues like cultural bias and reliance on encyclopedic data.
Benchmarks for Automated Commonsense Reasoning: A Survey
Introduction
The paper "Benchmarks for Automated Commonsense Reasoning: A Survey" focuses on the evaluation of AI systems in terms of commonsense knowledge and reasoning. It notes the development of over a hundred benchmarks but highlights common shortcomings and areas of commonsense that are still untested. The survey aims to describe the essence of commonsense reasoning, the roles benchmarks play, their common flaws, and recommendations for future research.
Evolution of Commonsense Reasoning in AI
Over the last decade, commonsense reasoning has gained prominence within the AI research community. Once considered a niche concern, it is now recognized as a central challenge in AI. Efforts such as DARPA's Machine Common Sense (MCS) program and initiatives at the Allen Institute for AI underscore its importance. The research focus has also expanded beyond linguistic reasoning to areas such as computer vision and robotics.
Nature of Commonsense and Benchmarks
The paper distinguishes between "resources" such as ConceptNet, which supply commonsense knowledge that systems can draw on, and "benchmarks", which are used to evaluate performance. Benchmarks are typically built from tasks that humans solve effortlessly, yet they often fail to capture the full breadth of commonsense reasoning because of flawed design or narrow scope.
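As a concrete illustration of the resource side of this distinction, the sketch below queries ConceptNet's public web API for the relations attached to a single concept. It assumes the api.conceptnet.io endpoint and the JSON field names of its current response format; both should be verified before relying on them.

```python
# Minimal sketch: querying ConceptNet (a commonsense *resource*, not a benchmark).
# Assumes the public web API at api.conceptnet.io; the field names below match its
# documented JSON format but should be checked against a live response.
import requests

def conceptnet_edges(term: str, lang: str = "en", limit: int = 5) -> None:
    """Print a few edges (relations) for a term from ConceptNet."""
    url = f"https://api.conceptnet.io/c/{lang}/{term}"
    response = requests.get(url, params={"limit": limit}, timeout=10)
    response.raise_for_status()
    for edge in response.json().get("edges", []):
        rel = edge["rel"]["label"]        # e.g. "UsedFor", "IsA"
        start = edge["start"]["label"]    # head concept
        end = edge["end"]["label"]        # tail concept
        print(f"{start} --{rel}--> {end} (weight {edge.get('weight', 1.0):.2f})")

if __name__ == "__main__":
    conceptnet_edges("knife")
```

A benchmark, by contrast, would pair such knowledge with held-out questions and an answer key against which a system's output is scored.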
Benchmarks Construction and Common Flaws
Constructing effective benchmarks means balancing coverage against quality. Common pitfalls include reliance on encyclopedic rather than commonsense knowledge, problems that require little genuine reasoning, and cultural bias. Artifacts, that is, unintended statistical clues in the data, can let AI systems score well without any genuine understanding.
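One common diagnostic for such artifacts, drawn from general benchmark-analysis practice rather than from this survey specifically, is a partial-input baseline: a simple classifier is trained on the answer choices alone, with the questions hidden, and accuracy well above the trivial baseline indicates that the answers themselves carry unintended clues. The sketch below assumes a hypothetical multiple-choice benchmark stored as records with `choices` and `label` fields and uses scikit-learn.

```python
# Minimal sketch of a partial-input "artifact" check for a hypothetical
# multiple-choice benchmark. A simple classifier sees only the answer choices;
# if it beats the majority baseline by a clear margin, the choices themselves
# leak information about which one is correct.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def answer_only_baseline(examples):
    """examples: list of dicts with 'choices' (list[str]) and 'label' (int index)."""
    # Flatten: each (choice, is_correct) pair becomes one binary training instance.
    texts = [c for ex in examples for c in ex["choices"]]
    labels = [int(i == ex["label"])
              for ex in examples for i in range(len(ex["choices"]))]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    acc = accuracy_score(y_test, clf.predict(vectorizer.transform(X_test)))
    majority = max(sum(y_test), len(y_test) - sum(y_test)) / len(y_test)
    print(f"answer-only accuracy: {acc:.3f}  (majority baseline: {majority:.3f})")
```

A more faithful variant scores all choices for each question and picks the highest, but even this flattened binary check is usually enough to flag a badly contaminated dataset.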
Challenges in Commonsense Reasoning Tasks
Commonsense reasoning tasks are difficult to delimit because they overlap with other cognitive processes, such as language use and vision. Determining whether a task truly measures commonsense requires careful attention to the task's context and to the reasoning the AI system actually has to perform. Moreover, tasks should not be solvable by exploiting linguistic artifacts alone; they should genuinely challenge the system's understanding.
Purposes and Desiderata for Commonsense Benchmarks
Benchmarks serve to compare systems, measure progress, and expose neglected areas of research. Ideal benchmarks should be consistent, should remain usable as AI methodologies change, and should span a wide range of domains and modalities. They should also support absolute measurement of AI capabilities and, where appropriate, double as training resources.
Trade-offs in Benchmark Design
Benchmarks face a trade-off between size and precision. Smaller, carefully curated benchmarks can offer more insight per example but may lack the sample diversity of larger collections; large datasets, in turn, often suffer from quality-control problems that make their assessments unreliable.
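One way to make the size side of this trade-off concrete, offered here as an illustration rather than a claim from the survey, is the statistical precision of an accuracy estimate: on n test items the 95% confidence half-width is roughly 1.96 * sqrt(p * (1 - p) / n), so a benchmark of a few hundred examples can only resolve differences of several accuracy points.

```python
# Illustrative only: how benchmark size bounds the statistical precision of an
# accuracy estimate, using the normal approximation to the binomial.
import math

def accuracy_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for accuracy p measured on n items."""
    return z * math.sqrt(p * (1.0 - p) / n)

for n in (100, 1000, 10000):
    hw = accuracy_half_width(0.75, n)
    print(f"n={n:>6}: 75.0% +/- {100 * hw:.1f} points")
# n=100 gives roughly +/- 8.5 points, n=10000 roughly +/- 0.8 points, so a very
# small benchmark cannot separate systems whose true accuracies are close.
```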
Recommendations for Future Research
The survey concludes with recommendations for improving commonsense benchmarks. Emphasis is placed on high-quality, diverse datasets that span the full range of commonsense knowledge domains. Researchers are encouraged to vet benchmark examples more rigorously and to design tasks that challenge AI models beyond rote memorization and pattern matching.
Conclusion
The survey highlights the inadequacies of existing commonsense benchmarks and urges the research community to invest effort into creating high-quality, comprehensive, and transparent benchmarks. These benchmarks are crucial for advancing AI capabilities and ensuring that systems possess the depth of understanding required to perform tasks reliably in the real world.