Adaptive Experimental Design for Policy Learning (2401.03756v4)
Abstract: This study investigates the contextual best arm identification (BAI) problem, aiming to design an adaptive experiment that identifies the best treatment arm conditional on contextual information (covariates). We consider a decision-maker who assigns treatment arms to experimental units during the experiment and, at its end, recommends the estimated best treatment arm given the contexts. The recommendation takes the form of a policy: a function that maps contexts to the estimated best treatment arm. We evaluate policies by the worst-case expected simple regret, the difference between the expected outcomes of an optimal policy and of the proposed policy. We derive a lower bound for the worst-case expected simple regret and then propose a strategy called Adaptive Sampling-Policy Learning (PLAS). We prove that this strategy is minimax rate-optimal: the leading factor in its regret upper bound matches the lower bound as the number of experimental units increases.
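The experimental protocol the abstract describes — adaptively assign arms given contexts during the experiment, then recommend a per-context best arm at the end — can be illustrated with a minimal simulation. This is a hedged sketch, not the paper's actual PLAS strategy: the context distribution, Gaussian outcome model, arm means, and the variance-proportional (Neyman-style) allocation heuristic below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative, not from the paper): 2 discrete
# contexts, 3 arms, Gaussian outcomes with context-dependent means.
means = np.array([[0.5, 0.8, 0.6],    # context 0
                  [0.9, 0.4, 0.7]])   # context 1
n_contexts, n_arms = means.shape
T = 6000                              # number of experimental units

counts = np.zeros((n_contexts, n_arms))
sums = np.zeros((n_contexts, n_arms))
sq_sums = np.zeros((n_contexts, n_arms))

for t in range(T):
    x = rng.integers(n_contexts)      # observe the unit's context
    if counts[x].min() < 2:           # force a little initial exploration
        a = int(counts[x].argmin())
    else:
        # Sample arms in proportion to their estimated standard deviation
        # within this context -- a variance-targeting heuristic, standing
        # in for the paper's adaptive sampling rule.
        var = sq_sums[x] / counts[x] - (sums[x] / counts[x]) ** 2
        p = np.sqrt(np.maximum(var, 1e-8))
        p /= p.sum()
        a = int(rng.choice(n_arms, p=p))
    y = rng.normal(means[x, a], 1.0)  # observe the outcome
    counts[x, a] += 1
    sums[x, a] += y
    sq_sums[x, a] += y ** 2

# Recommended policy: the empirical best arm for each context.
policy = (sums / counts).argmax(axis=1)
# Simple regret of the recommendation, averaged over contexts.
regret = (means.max(axis=1) - means[np.arange(n_contexts), policy]).mean()
print(policy, regret)
```

With enough units per context the recommended policy recovers the true best arm in each context and the simple regret is zero; the paper's contribution is characterizing the minimax rate at which this regret vanishes.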