
Learning Metrics that Maximise Power for Accelerated A/B-Tests (2402.03915v2)

Published 6 Feb 2024 in cs.LG, cs.IR, stat.AP, and stat.ML

Abstract: Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.


Summary

  • The paper proposes a framework that learns A/B-test metrics by minimising the p-values they would have produced on a log of past experiments, in order to boost statistical power.
  • It empirically demonstrates up to a 78% increase in statistical power from the learnt metrics alone, and up to 210% when they are combined with the North Star metric.
  • The learnt metrics can hold statistical power constant at sample sizes as low as 12% of what the North Star requires, and a proposed spherical regularisation technique cuts the optimisation iterations needed by up to 40%.

Introduction & Background

Online controlled experiments, or A/B-tests, have become integral to decision-making in technology companies. They identify superior system variants based on a predefined North Star metric, which typically captures long-term outcomes such as revenue or user retention. Because North Star metrics are delayed and insensitive, however, experiments must run for a long time, cost more, and still suffer from prevalent type-II errors (false negatives).

To address this inefficiency, a body of research has sought to enhance sensitivity through various means, including control variates, proxy metrics, and learnt combination metrics that optimise sensitivity. This paper builds on these efforts by proposing a framework for learning A/B-testing metrics that maximise statistical power, extending prior combination-metric work that focused on web search applications.
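
For context on the control-variate idea mentioned above, a minimal CUPED-style adjustment is sketched below. This is background rather than this paper's contribution, and the variable names (pre-experiment covariate x, in-experiment metric y) and synthetic data are illustrative assumptions.

```python
import numpy as np

def cuped_adjust(y, x):
    # Regress the in-experiment metric on a pre-experiment covariate and subtract
    # the explained part: the mean is preserved while the variance shrinks.
    theta = np.cov(y, x, ddof=0)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                        # pre-experiment user activity
y = 0.8 * x + rng.normal(scale=0.5, size=10_000)   # correlated in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x)))       # adjusted variance is far smaller
```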

Learning Enhanced Metrics

The authors diverge from previous methods that maximise average metric sensitivity, observing that higher average sensitivity does not necessarily translate into lower type-II error rates. Instead, they minimise the p-values a candidate metric would have produced on a log of past experiments. This allocates gains more evenly across experiments: the goal is to push more experiments over the significance threshold rather than to make a few already-sensitive cases more extreme.
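
The sketch below illustrates this objective; it is not the authors' implementation. It learns weights for a linear combination of short-term signals by directly minimising the two-sided p-values a Welch-style z-test would have produced on a log of past A/B pairs. The signal dimension, synthetic data, and optimiser settings are all assumptions.

```python
import torch

torch.manual_seed(0)

def two_sided_p(z):
    # p = 2 * (1 - Phi(|z|)) for an (asymptotically) standard-normal statistic
    return 2.0 * (1.0 - torch.distributions.Normal(0.0, 1.0).cdf(z.abs()))

def pvalue_loss(w, experiments):
    # experiments: list of (X_a, X_b) with per-user short-term signals,
    # each of shape (n_users, n_signals); the learnt metric is the score X @ w
    loss = 0.0
    for X_a, X_b in experiments:
        m_a, m_b = X_a @ w, X_b @ w
        se2 = m_a.var() / m_a.numel() + m_b.var() / m_b.numel()  # variance of the mean difference
        z = (m_b.mean() - m_a.mean()) / se2.sqrt()
        loss = loss + two_sided_p(z)        # minimise p-values over the experiment log
    return loss / len(experiments)

# Toy log: 10 past A/B pairs, 8 short-term signals, a small uniform treatment lift
experiments = [(torch.randn(5000, 8), torch.randn(5000, 8) + 0.02) for _ in range(10)]
w = torch.randn(8, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    pvalue_loss(w, experiments).backward()
    opt.step()
```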

Empirical evidence from datasets comprising over 153 A/B-pairs, collected from two social media applications, substantiates the framework's ability to increase statistical power by up to 78% when the learnt metrics are used alone, and by up to 210% when combined with the North Star metric. Alternatively, constant statistical power can be maintained with a sample size that is only 12% of what the North Star requires, a considerable reduction in the cost of experimentation.
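
As a back-of-the-envelope check (standard two-sample power analysis, not a result from the paper), the required sample size shrinks quadratically with the standardised effect size, which is why a more sensitive metric needs far fewer users. The effect sizes below are illustrative assumptions.

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    # Two-sample z-test: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

print(n_per_group(delta=0.01, sigma=1.0))   # insensitive metric: ~157k users per group
print(n_per_group(delta=0.03, sigma=1.0))   # 3x the standardised effect: ~9x fewer users
```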

Evaluating Metrics and Ensuring Reliability

Newly learnt metrics are evaluated with careful attention to multiple hypothesis testing. The paper adopts conservative corrections such as the Bonferroni method to control type-I errors, and runs synthetic A/A tests to verify that reported confidence levels are accurate. When learnt metrics are used alongside the North Star and other validated proxies, false positives remain controlled while statistical power increases substantially, which in turn lowers experimentation costs.
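
The Bonferroni correction referenced above is standard procedure (this is not code from the paper): with m metrics tested jointly, each is tested at level alpha / m so the family-wise type-I error stays at or below alpha. The p-values below are made up for illustration.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    p = np.asarray(p_values)
    return p < alpha / p.size        # boolean mask of metrics that remain significant

print(bonferroni_reject([0.004, 0.03, 0.20]))   # only the first survives at m = 3
```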

Accelerating Convergence in Optimization

The paper further contributes a spherical regularisation technique that accelerates convergence for scale-free objectives, a property of the z-scores central to this work. Because z-scores are invariant to rescaling the metric weights, the unregularised loss surface is flat along rays from the origin; the regulariser reshapes the surface to give stronger gradients without moving the optima, and reduces the number of iterations needed for convergence by up to 40%.
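
The exact regulariser is not reproduced here; the following is a plausible minimal sketch, assuming "spherical" means softly pinning the weight vector to the unit sphere so the scale-free objective no longer has flat directions. The penalty form and coefficient are assumptions.

```python
import torch

def spherical_penalty(w, lam=1.0):
    # z-scores are invariant to rescaling w, so the unregularised surface is flat
    # along rays; penalising (||w||_2 - 1)^2 fixes the scale without moving optima.
    return lam * (w.norm() - 1.0).pow(2)

# e.g. total_loss = pvalue_loss(w, experiments) + spherical_penalty(w)
```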

Final Thoughts

In conclusion, this work presents a generalisable framework that improves both the efficacy and the efficiency of online experimentation. The findings have direct practical implications, enabling platforms such as ShareChat and Moj to make faster, more confident decisions with fewer resources, and shortening the path from experiment to product improvement.
