Does AI help humans make better decisions? A statistical evaluation framework for experimental and observational studies (2403.12108v3)
Abstract: The use of AI, or more generally data-driven algorithms, has become ubiquitous in today's society. Yet, in many cases, and especially when stakes are high, humans still make the final decisions. The critical question, therefore, is whether AI helps humans make better decisions compared to a human-alone or AI-alone system. We introduce a new methodological framework to empirically answer this question with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, where the provision of AI-generated recommendations is assumed to be randomized across cases, with humans making the final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system encompasses any individualized treatment assignment rule, including rules that were not used in the original study. We also show when AI recommendations should be provided to a human decision maker, and when the human should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms--the risk assessment score and an LLM in particular--leads to worse classification performance.
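The core comparison described in the abstract can be illustrated with a minimal simulation. This is a hypothetical sketch, not the paper's estimators: all variable names and data-generating probabilities are invented, and for simplicity it assumes the baseline potential outcome is observed for every case, sidestepping the identification issues the framework is built to handle. Because the indicator `z` (whether the AI recommendation was shown) is randomized, the two arms are comparable, and arm-specific accuracies estimate the performance of the human-alone and human-with-AI systems; any algorithmic rule can likewise be scored against the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical simulated data (illustrative names, not the paper's notation):
# z: randomized indicator that the AI recommendation was shown to the judge
# y: binary baseline potential outcome (e.g., failure to appear if released);
#    observed for all cases here only because the data are simulated
# a: binary AI recommendation (e.g., recommend cash bail), partly tracking y
# d: judge's binary decision, leaning on the recommendation when it is shown
z = rng.integers(0, 2, n)
y = rng.binomial(1, 0.3, n)
a = rng.binomial(1, np.where(y == 1, 0.6, 0.35))
d = rng.binomial(1, np.where(z == 1,
                             0.5 * a + 0.3,                 # shown: follows AI
                             np.where(y == 1, 0.55, 0.4)))  # not shown

def accuracy(decision, outcome):
    """Fraction of cases where the decision matches the outcome."""
    return np.mean(decision == outcome)

# Randomization of z makes the arms comparable, so each arm's accuracy
# estimates the performance of the corresponding decision-making system.
acc_human_alone = accuracy(d[z == 0], y[z == 0])
acc_human_ai    = accuracy(d[z == 1], y[z == 1])
acc_ai_alone    = accuracy(a, y)   # any assignment rule can be evaluated

print(f"human alone:   {acc_human_alone:.3f}")
print(f"human with AI: {acc_human_ai:.3f}")
print(f"AI alone:      {acc_ai_alone:.3f}")
```

In the actual framework, the baseline potential outcome is not observed for every case (e.g., outcomes are only seen for released arrestees), so the paper derives bounds and estimators under the single-blinded, unconfounded design rather than computing raw accuracies as above.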