Learning Metrics that Maximise Power for Accelerated A/B-Tests (arXiv:2402.03915v2)
Abstract: Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.
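
To make the core idea concrete, below is a minimal sketch, not the authors' implementation: it learns a linear combination of per-user short-term signals by directly minimising the two-sided $z$-test $p$-values the combined metric would have produced on a log of past A/B pairs. The data layout (one `(treatment, control)` pair of per-user signal matrices per logged experiment) and the function names are illustrative assumptions; PyTorch and Adam are used because the paper references both, but the paper's handling of directional agreement with the North Star and its safeguards against overfitting are omitted here.

```python
import torch

def z_statistic(w, treat, ctrl):
    """Squared z-statistic of the learnt metric m = X @ w for one A/B pair.

    treat, ctrl: (n_users, n_signals) tensors of per-user short-term signals.
    """
    mt, mc = treat @ w, ctrl @ w                      # per-user metric values
    var = mt.var() / mt.numel() + mc.var() / mc.numel()
    return (mt.mean() - mc.mean()) ** 2 / (var + 1e-12)

def learn_metric_weights(experiments, n_signals, steps=2000, lr=1e-2):
    """Minimise the mean two-sided p-value 2 * (1 - Phi(|z|)) over a log of
    past A/B pairs -- a differentiable proxy for maximising statistical power.
    """
    w = torch.randn(n_signals, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)                # the paper cites Adam
    normal = torch.distributions.Normal(0.0, 1.0)
    for _ in range(steps):
        opt.zero_grad()
        # Asymptotic z-test p-value per experiment; differentiable through Phi
        p_vals = [2.0 * (1.0 - normal.cdf(z_statistic(w, t, c).sqrt()))
                  for t, c in experiments]
        loss = torch.stack(p_vals).mean()
        loss.backward()
        opt.step()
    return w.detach()
```

A call such as `learn_metric_weights([(treat_1, ctrl_1), (treat_2, ctrl_2)], n_signals=5)` would return the weights defining the learnt metric; in practice one would evaluate held-out experiments to check that lower training $p$-values actually translate into fewer type-II errors, which is precisely the overfitting failure mode the paper warns about for average-sensitivity objectives.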
- Susan Athey, Raj Chetty, Guido W. Imbens, and Hyunseung Kang. 2019. The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely. Working Paper 26463. National Bureau of Economic Research. https://doi.org/10.3386/w26463
- Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko, and Olivier Jeunen. 2024. Variance Reduction in Ratio Metrics for Efficient Online Experiments. In Proc. of the 46th European Conference on Information Retrieval (ECIR ’24). Springer.
- Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proc. of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, 55–63. https://doi.org/10.1145/3159652.3159699
- Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-Scale Validation and Analysis of Interleaved Search Evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (March 2012), 41 pages. https://doi.org/10.1145/2094072.2094078
- Ed H. Chi. 2020. From Missing Data to Boltzmann Distributions and Time Dynamics: The Statistical Physics of Recommendation. In Proc. of the 13th International Conference on Web Search and Data Mining (WSDM ’20). ACM, 1–2. https://doi.org/10.1145/3336191.3372193
- Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proc. of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, 191–198. https://doi.org/10.1145/2959100.2959190
- Alex Deng and Xiaolin Shi. 2016. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 77–86. https://doi.org/10.1145/2939672.2939700
- Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. In Proc. of the Sixth ACM International Conference on Web Search and Data Mining (WSDM ’13). ACM, 123–132. https://doi.org/10.1145/2433396.2433413
- Graham Van Goffrier, Lucas Maystre, and Ciarán Mark Gilligan-Lee. 2023. Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. In Proc. of the Second Conference on Causal Learning and Reasoning (Proc. of Machine Learning Research, Vol. 213), Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing (Eds.). PMLR, 791–813. https://proceedings.mlr.press/v213/goffrier23a.html
- Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. 2021. Machine Learning for Variance Reduction in Online Experiments. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 8637–8648.
- Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. 2021. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics 49, 2 (2021), 1055–1080. https://doi.org/10.1214/20-AOS1991
- Olivier Jeunen. 2019. Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems. In Proc. of the 13th ACM Conference on Recommender Systems (RecSys ’19). ACM, 596–600. https://doi.org/10.1145/3298689.3347069
- Olivier Jeunen. 2023. A Common Misassumption in Online Experiments with Machine Learning Models. SIGIR Forum 57, 1, Article 13 (December 2023), 9 pages. https://doi.org/10.1145/3636341.3636358
- Olivier Jeunen, Ivan Potapov, and Aleksei Ustimenko. 2023. On (Normalised) Discounted Cumulative Gain as an Offline Evaluation Metric for Top-$n$ Recommendation. arXiv:2307.15053 [cs.IR]
- Henry F. Kaiser. 1960. Directional statistical decisions. Psychological Review 67, 3 (1960), 160.
- Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In Proc. of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, 651–659. https://doi.org/10.1145/3018661.3018708
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of the 3rd International Conference on Learning Representations (ICLR ’15). arXiv:1412.6980 [cs.LG]
- Ron Kohavi, Alex Deng, and Lukas Vermeer. 2022. A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22). ACM, 3168–3177. https://doi.org/10.1145/3534678.3539160
- Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy online controlled experiments: A practical guide to A/B testing. Cambridge University Press.
- Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411. https://doi.org/10.1016/S0047-259X(03)00096-4
- Olivier Ledoit and Michael Wolf. 2020. The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics 20, 1 (2020), 187–218. https://doi.org/10.1093/jjfinec/nbaa007
- Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the Variance of the Adaptive Learning Rate and Beyond. In International Conference on Learning Representations (ICLR ’20). https://arxiv.org/abs/1908.03265
- Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proc. of The Web Conference 2020 (WWW ’20). ACM, 463–473.
- Frederick Mosteller. 1948. A k-Sample Slippage Test for an Extreme Population. The Annals of Mathematical Statistics 19, 1 (1948), 58–65. http://www.jstor.org/stable/2236056
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
- Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 235–244. https://doi.org/10.1145/2939672.2939688
- Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations (ICLR ’18). https://openreview.net/forum?id=ryQu7f-RZ
- Pareto optimal proxy metrics. arXiv:2307.01000 [stat.ME]
- Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
- Sven Schmit and Evan Miller. 2022. Sequential confidence intervals for relative lift with regression adjustments.
- Juliet Popper Shaffer. 1995. Multiple Hypothesis Testing. Annual Review of Psychology 46, 1 (1995), 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021
- Harald Steck. 2013. Evaluation of recommendations: rating-prediction and ranking. In Proc. of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, 213–220. https://doi.org/10.1145/2507157.2507160
- Estimating Long-Term Effects from Experimental Data. In Proc. of the 16th ACM Conference on Recommender Systems (RecSys ’22). ACM, 516–518. https://doi.org/10.1145/3523227.3547398
- Choosing a Proxy Metric from Past Experiments. arXiv:2309.07893 [stat.ME]
- Julián Urbano, Harlley Lima, and Alan Hanjalic. 2019. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. In Proc. of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19). ACM, 505–514. https://doi.org/10.1145/3331184.3331259
- Aleksei Ustimenko and Liudmila Prokhorenkova. 2020. StochasticRank: Global Optimization of Scale-Free Discrete Functions. In Proc. of the 37th International Conference on Machine Learning (ICML ’20, Vol. 119). PMLR, 9669–9679. https://proceedings.mlr.press/v119/ustimenko20a.html
- Abraham Wald. 1945. Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics 16, 2 (1945), 117–186. https://doi.org/10.1214/aoms/1177731118
- Surrogate for Long-Term User Experience in Recommender Systems. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22). ACM, 4100–4109. https://doi.org/10.1145/3534678.3539073
- Bernard Lewis Welch. 1947. The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika 34, 1–2 (1947), 28–35. https://doi.org/10.1093/biomet/34.1-2.28
- Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 645–654. https://doi.org/10.1145/2939672.2939733
- Yisong Yue, Yue Gao, Olivier Chapelle, Ya Zhang, and Thorsten Joachims. 2010. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation. In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10). ACM, 507–514. https://doi.org/10.1145/1835449.1835534