Learning Metrics that Maximise Power for Accelerated A/B-Tests (arXiv:2402.03915v2)
Abstract: Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.
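
To make the core idea concrete, below is a minimal sketch, not the authors' implementation: it learns a linear combination of per-user short-term signals by directly minimising the two-sided $z$-test $p$-values the combined metric would have produced on a log of past A/B pairs. The data layout (one `(treatment, control)` pair of per-user signal matrices per logged experiment) and the function names are illustrative assumptions; PyTorch and Adam are used because the paper references both, but the paper's handling of directional agreement with the North Star and its safeguards against overfitting are omitted here.

```python
import torch

def z_statistic(w, treat, ctrl):
    """Squared z-statistic of the learnt metric m = X @ w for one A/B pair.

    treat, ctrl: (n_users, n_signals) tensors of per-user short-term signals.
    """
    mt, mc = treat @ w, ctrl @ w                      # per-user metric values
    var = mt.var() / mt.numel() + mc.var() / mc.numel()
    return (mt.mean() - mc.mean()) ** 2 / (var + 1e-12)

def learn_metric_weights(experiments, n_signals, steps=2000, lr=1e-2):
    """Minimise the mean two-sided p-value 2 * (1 - Phi(|z|)) over a log of
    past A/B pairs -- a differentiable proxy for maximising statistical power.
    """
    w = torch.randn(n_signals, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)                # the paper cites Adam
    normal = torch.distributions.Normal(0.0, 1.0)
    for _ in range(steps):
        opt.zero_grad()
        # Asymptotic z-test p-value per experiment; differentiable through Phi
        p_vals = [2.0 * (1.0 - normal.cdf(z_statistic(w, t, c).sqrt()))
                  for t, c in experiments]
        loss = torch.stack(p_vals).mean()
        loss.backward()
        opt.step()
    return w.detach()
```

A call such as `learn_metric_weights([(treat_1, ctrl_1), (treat_2, ctrl_2)], n_signals=5)` would return the weights defining the learnt metric; in practice one would evaluate held-out experiments to check that lower training $p$-values actually translate into fewer type-II errors, which is precisely the overfitting failure mode the paper warns about for average-sensitivity objectives.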
- Susan Athey, Raj Chetty, Guido W. Imbens, and Hyunseung Kang. 2019. The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely. Working Paper 26463. National Bureau of Economic Research. https://doi.org/10.3386/w26463
- Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko, and Olivier Jeunen. 2024. Variance Reduction in Ratio Metrics for Efficient Online Experiments. In Proc. of the 46th European Conference on Information Retrieval (ECIR ’24). Springer.
- Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proc. of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, 55–63. https://doi.org/10.1145/3159652.3159699
- Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-Scale Validation and Analysis of Interleaved Search Evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (March 2012), 41 pages. https://doi.org/10.1145/2094072.2094078
- Ed H. Chi. 2020. From Missing Data to Boltzmann Distributions and Time Dynamics: The Statistical Physics of Recommendation. In Proc. of the 13th International Conference on Web Search and Data Mining (WSDM ’20). ACM, 1–2. https://doi.org/10.1145/3336191.3372193
- Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proc. of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, 191–198. https://doi.org/10.1145/2959100.2959190
- Alex Deng and Xiaolin Shi. 2016. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 77–86. https://doi.org/10.1145/2939672.2939700
- Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. In Proc. of the Sixth ACM International Conference on Web Search and Data Mining (WSDM ’13). ACM, 123–132. https://doi.org/10.1145/2433396.2433413
- Graham Van Goffrier, Lucas Maystre, and Ciarán Mark Gilligan-Lee. 2023. Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. In Proc. of the Second Conference on Causal Learning and Reasoning (Proc. of Machine Learning Research, Vol. 213), Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing (Eds.). PMLR, 791–813. https://proceedings.mlr.press/v213/goffrier23a.html
- Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. 2021. Machine Learning for Variance Reduction in Online Experiments. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 8637–8648.
- Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. 2021. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics 49, 2 (2021), 1055–1080. https://doi.org/10.1214/20-AOS1991
- Olivier Jeunen. 2019. Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems. In Proc. of the 13th ACM Conference on Recommender Systems (RecSys ’19). ACM, 596–600. https://doi.org/10.1145/3298689.3347069
- Olivier Jeunen. 2023. A Common Misassumption in Online Experiments with Machine Learning Models. SIGIR Forum 57, 1, Article 13 (December 2023), 9 pages. https://doi.org/10.1145/3636341.3636358
- Olivier Jeunen, Ivan Potapov, and Aleksei Ustimenko. 2023. On (Normalised) Discounted Cumulative Gain as an Offline Evaluation Metric for Top-$n$ Recommendation. arXiv:2307.15053 [cs.IR]
- Henry F. Kaiser. 1960. Directional statistical decisions. Psychological Review 67, 3 (1960), 160.
- Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In Proc. of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, 651–659. https://doi.org/10.1145/3018661.3018708
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of the 3rd International Conference on Learning Representations (ICLR ’15). arXiv:1412.6980 [cs.LG]
- Ron Kohavi, Alex Deng, and Lukas Vermeer. 2022. A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22). ACM, 3168–3177. https://doi.org/10.1145/3534678.3539160
- Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy online controlled experiments: A practical guide to A/B testing. Cambridge University Press.
- Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411. https://doi.org/10.1016/S0047-259X(03)00096-4
- Olivier Ledoit and Michael Wolf. 2020. The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics 20, 1 (2020), 187–218. https://doi.org/10.1093/jjfinec/nbaa007
- Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the Variance of the Adaptive Learning Rate and Beyond. In International Conference on Learning Representations (ICLR ’20). https://arxiv.org/abs/1908.03265
- Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proc. of The Web Conference 2020 (WWW ’20). ACM, 463–473.
- Frederick Mosteller. 1948. A k-Sample Slippage Test for an Extreme Population. The Annals of Mathematical Statistics 19, 1 (1948), 58–65. http://www.jstor.org/stable/2236056
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
- Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 235–244. https://doi.org/10.1145/2939672.2939688
- Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations (ICLR ’18). https://openreview.net/forum?id=ryQu7f-RZ
- Pareto optimal proxy metrics. arXiv:2307.01000 [stat.ME]
- Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
- Sven Schmit and Evan Miller. 2022. Sequential confidence intervals for relative lift with regression adjustments.
- Juliet Popper Shaffer. 1995. Multiple Hypothesis Testing. Annual Review of Psychology 46, 1 (1995), 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021
- Harald Steck. 2013. Evaluation of recommendations: rating-prediction and ranking. In Proc. of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, 213–220. https://doi.org/10.1145/2507157.2507160
- Estimating Long-Term Effects from Experimental Data. In Proc. of the 16th ACM Conference on Recommender Systems (RecSys ’22). ACM, 516–518. https://doi.org/10.1145/3523227.3547398
- Choosing a Proxy Metric from Past Experiments. arXiv:2309.07893 [stat.ME]
- Julián Urbano, Harlley Lima, and Alan Hanjalic. 2019. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. In Proc. of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19). ACM, 505–514. https://doi.org/10.1145/3331184.3331259
- Aleksei Ustimenko and Liudmila Prokhorenkova. 2020. StochasticRank: Global Optimization of Scale-Free Discrete Functions. In Proc. of the 37th International Conference on Machine Learning (ICML ’20, Vol. 119). PMLR, 9669–9679. https://proceedings.mlr.press/v119/ustimenko20a.html
- Abraham Wald. 1945. Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics 16, 2 (1945), 117–186. https://doi.org/10.1214/aoms/1177731118
- Surrogate for Long-Term User Experience in Recommender Systems. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22). ACM, 4100–4109. https://doi.org/10.1145/3534678.3539073
- Bernard Lewis Welch. 1947. The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika 34, 1–2 (1947), 28–35. https://doi.org/10.1093/biomet/34.1-2.28
- Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 645–654. https://doi.org/10.1145/2939672.2939733
- Yisong Yue, Yue Gao, Olivier Chapelle, Ya Zhang, and Thorsten Joachims. 2010. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation. In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10). ACM, 507–514. https://doi.org/10.1145/1835449.1835534