Computing Minimum Sample Size for A/B Tests in Statsmodels: How and Why (2024)

A deep-dive into how and why Statsmodels uses numerical optimization instead of closed-form formulas

Published in Towards Data Science

There is currently no good resource on how Statsmodels computes the minimum sample size.

It is critical to calculate the minimum sample size before conducting an A/B test. A popular way to do so is to call the tt_ind_solve_power function in Python’s Statsmodels package, but two gaps remain in understanding how it works:

1. There are many great articles (e.g. by Stan Nsky, TDS 2019) that explain what the parameters mean and provide examples of function calls. However, they do not explain how the function actually computes the sample size or why the procedure is correct.

2. There are also many great articles (e.g. by Mintao Wei, TDS 2023) that explain the statistical derivation based on a z-test for proportions such as conversion rates, which is also a popular choice for many online sample size calculators (e.g. Evan Miller’s Calculator). However, this is not the method used by Statsmodels and results can differ.

This is important for data scientists because Statsmodels is commonly used to compute sample size in Python.

Data scientists frequently use Statsmodels to get the minimum sample size, but may not be aware that it employs a different method from what most articles describe and what most online calculators use. It is essential to understand how the function works so that we can trust its results.

This article bridges the gap by explaining how Statsmodels actually works.

This article aims to make the novel contribution of explaining how tt_ind_solve_power actually computes the sample size, why the procedure is correct and what benefits it brings over closed-form solutions. [1]

Part 1: It will first explain how sample size is computed and why the procedure is correct in two steps:

Show the statistical derivation for sample size calculations.
Write a stripped-down version of tt_ind_solve_power that is an exact implementation of the statistical derivation and produces the same output as the original function

Part 2: It will then explain two benefits that this approach brings over closed-form solutions:

Benefits to generalizability
Benefits to statistical intuition

1.1. Showing the statistical derivation for sample size calculations

Core Idea

A general A/B test is an unpaired two-sample t-test. Rather than using a closed-form solution, Statsmodels obtains the minimum sample size in two steps:

For a given sample size, compute the associated power of the test.
Run a numerical optimization algorithm to find the sample size that returns the target power of the test.
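As a concrete reference point, the call below asks Statsmodels for the per-group sample size of a two-sided test. The parameter values (d = 0.5, 𝛼 = 0.05, power = 0.8) are illustrative assumptions, not values taken from this article.

```python
from statsmodels.stats.power import tt_ind_solve_power

# Illustrative inputs (assumed for this sketch): Cohen's d = 0.5,
# alpha = 0.05, target power = 0.8, equal allocation (ratio = 1).
# Passing nobs1=None tells the function to solve for the sample size.
n1 = float(tt_ind_solve_power(effect_size=0.5, nobs1=None, alpha=0.05,
                              power=0.8, ratio=1.0, alternative='two-sided'))

print(f"n_1 = {n1:.2f} per group, n = {2 * n1:.2f} in total")
```

The result is not an integer; in practice it is rounded up to the next whole observation per group.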

Notation and Concepts

These are some terms we will use throughout the article:

n: minimum required sample size. n = n_1 + n_2
n_1, n_2: minimum required sample size for the treatment and control group, respectively
ratio: n_2 = n_1 * ratio, where for a 50:50 allocation, ratio = 1
p: p-value
𝛼: significance level / type I error
𝛽: type II error; 1-𝛽 is power
μ_1, μ_2: means of treatment group and control group, respectively
X̄1, X̄2: sample means of treatment group and control group, respectively

t_(1-𝛼): critical value / t-score that cuts off the top 100𝛼(%) of the standard t distribution.
MDE: the minimum detectable effect, or the smallest true difference in means that the test is designed to detect given all other parameters (e.g. a base conversion rate of 10% and an expected relative uplift of 50% give an expected treatment conversion rate of 15%, so the MDE is 15% - 10% = 5 percentage points = 0.05)
𝜎: standard deviation of observations in each group, assumed to be the same
d: Cohen’s d / standardized effect size, given by MDE / 𝜎
H_0, H_1: null hypothesis, alternative hypothesis
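To make the link between MDE, 𝜎 and d concrete, the sketch below works through the conversion-rate example from the list above. Using the baseline Bernoulli standard deviation sqrt(p(1 - p)) as the common 𝜎 is an assumption made here for illustration, and the variable names are hypothetical.

```python
import math

base_rate = 0.10          # baseline conversion rate
relative_uplift = 0.50    # expected relative uplift
treatment_rate = base_rate * (1 + relative_uplift)

mde = treatment_rate - base_rate                 # absolute difference in means
sigma = math.sqrt(base_rate * (1 - base_rate))   # assumed common standard deviation
d = mde / sigma                                  # Cohen's d

print(f"MDE = {mde:.3f}, sigma = {sigma:.3f}, d = {d:.3f}")
# MDE = 0.050, sigma = 0.300, d = 0.167
```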

Derive the formula for power of a test

1. Define the null and alternative hypotheses:
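For a two-sided test, one standard way to write the hypotheses in the notation above is sketched below. This assumes the two-sided case; one-sided tests replace the inequality in H_1 with > or <.

```latex
H_0 : \mu_1 - \mu_2 = 0
\qquad \text{vs.} \qquad
H_1 : \mu_1 - \mu_2 \neq 0
```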

2. Derive the distribution of the test statistic under the null hypothesis (H_0):

We find that under the null hypothesis, the test statistic t follows a t-distribution with (n_1 + n_2 - 2) degrees of freedom.

This follows from the following:

Where the sample variance of X is computed as such:
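In the notation above, the standard pooled two-sample t statistic can be written as follows. The symbols S_p^2 (pooled sample variance), S_j^2 (sample variance of group j) and X_{ji} (observation i in group j) are introduced here for illustration.

```latex
t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
\sim t_{n_1 + n_2 - 2} \quad \text{under } H_0,
\qquad
S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2},
\qquad
S_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \left( X_{ji} - \bar{X}_j \right)^2 .
```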

3. Derive the distribution of the test statistic under the alternative hypothesis (H_1):

We find that under the alternative hypothesis, assuming that the difference in means is the MDE, the test statistic t follows a non-central t-distribution with non-centrality parameter θ = d * sqrt((n_1 * n_2) / (n_1 + n_2)) and (n_1 + n_2 - 2) degrees of freedom.

A non-central t-distribution (nct) with a positive non-centrality parameter can be roughly thought of as a standard t-distribution shifted to the right. [2] Intuitively, the standard t-distribution happens under the null when we expect 0 effect on average, while the non-central t-distribution happens under the alternative when we expect a positive effect that is on average roughly equal to the MDE.
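The "roughly shifted" intuition can be checked numerically. The sketch below uses the known formula for the mean of a non-central t-distribution, E[T] = θ * sqrt(ν/2) * Γ((ν - 1)/2) / Γ(ν/2) for ν > 1, and compares it with θ. The function name nct_mean and the inputs are illustrative assumptions.

```python
import math

def nct_mean(theta, nu):
    """Mean of a non-central t-distribution (requires nu > 1)."""
    # lgamma avoids overflow of the gamma function for large nu
    log_ratio = math.lgamma((nu - 1) / 2) - math.lgamma(nu / 2)
    return theta * math.sqrt(nu / 2) * math.exp(log_ratio)

d, n1, n2 = 0.5, 64, 64
theta = d * math.sqrt(n1 * n2 / (n1 + n2))   # non-centrality parameter

# Compare the mean with theta for a few degrees of freedom
for nu in (5, 30, n1 + n2 - 2):
    print(f"nu = {nu:3d}: theta = {theta:.3f}, E[T] = {nct_mean(theta, nu):.3f}")
```

The mean sits slightly above θ for small ν and approaches θ as ν grows, which is why the shifted-t picture is only approximate.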

Definition: A non-central t-distributed random variable T with non-centrality parameter θ and ν degrees of freedom is defined as:

T = (Z + θ) / sqrt(V / ν)

where Z is a standard normal random variable, and V is a chi-squared distributed random variable with ν degrees of freedom that is independent of Z.

The proof starts from the observation that under the alternative hypothesis, the true difference in means is MDE and so we can subtract MDE and divide by the population standard deviation to get a standard normal variable.
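This step can be sketched as follows, writing ν = n_1 + n_2 - 2 and using S_p for the pooled sample standard deviation (a symbol introduced here for illustration):

```latex
t
= \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
= \frac{\dfrac{\bar{X}_1 - \bar{X}_2 - \mathrm{MDE}}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
        + \dfrac{\mathrm{MDE}}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}}
       {\sqrt{\dfrac{\nu S_p^2 / \sigma^2}{\nu}}}
= \frac{Z + \theta}{\sqrt{V / \nu}},
\qquad
\theta = \frac{\mathrm{MDE}}{\sigma} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}
       = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}} .
```

Here Z is standard normal, V = ν S_p^2 / 𝜎^2 is chi-squared with ν degrees of freedom, and the two are independent when observations are normally distributed, which matches the definition of the non-central t-distribution.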

4. Compute the power

Since we know the distribution of the test statistic under the null and alternative hypotheses, and the CDFs of both distributions are known, we can calculate power easily given the level of significance and the type of test (two-tailed, greater, smaller). The diagram below visualizes how:

[Figure: how power is computed from the distributions of the test statistic under H_0 and H_1]

In Python, the implementation looks like this:

import numpy as np
from scipy import special, stats

# Method of the TTestIndPower class
def power(self, effect_size, nobs1, alpha, ratio=1, df=None,
          alternative='two-sided'):
    nobs2 = nobs1 * ratio
    if df is None:
        df = (nobs1 + nobs2 - 2)
    # Get non-centrality parameter
    nobs = nobs1 * nobs2 / (nobs1 + nobs2)
    d = effect_size
    nc_param = d * np.sqrt(nobs)
    # Get effective level of significance, alpha_
    if alternative in ['two-sided']:
        alpha_ = alpha / 2.
    elif alternative in ['smaller', 'larger']:
        alpha_ = alpha
    else:
        raise ValueError("alternative has to be 'two-sided', 'larger' " +
                         "or 'smaller'")
    # Compute power of a t-test
    power = 0
    if alternative in ['two-sided', 'larger']:
        crit_upp = stats.t.isf(alpha_, df)  # isf = inverse survival function = value where Pr(t > value) = alpha_
        power += 1 - special.nctdtr(df, nc_param, crit_upp)  # 1 - Pr(t < crit_upp) = Pr(t > crit_upp) for non-central t distribution
    if alternative in ['two-sided', 'smaller']:
        crit_low = stats.t.ppf(alpha_, df)  # ppf = percent point function = value where Pr(t < value) = alpha_
        power += special.nctdtr(df, nc_param, crit_low)  # Pr(t < crit_low) for non-central t distribution
    return power
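As a cross-check of the method above, the following standalone sketch computes the same quantity with scipy.stats.nct rather than scipy.special.nctdtr. The function name ttest_ind_power_sketch and the example inputs are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def ttest_ind_power_sketch(effect_size, nobs1, alpha, ratio=1.0,
                           alternative='two-sided'):
    """Power of an unpaired two-sample t-test (hypothetical helper)."""
    if alternative not in ('two-sided', 'larger', 'smaller'):
        raise ValueError("alternative has to be 'two-sided', 'larger' or 'smaller'")
    nobs2 = nobs1 * ratio
    df = nobs1 + nobs2 - 2
    # Non-centrality parameter: theta = d * sqrt(n1 * n2 / (n1 + n2))
    nc = effect_size * np.sqrt(nobs1 * nobs2 / (nobs1 + nobs2))
    alpha_ = alpha / 2 if alternative == 'two-sided' else alpha
    power = 0.0
    if alternative in ('two-sided', 'larger'):
        # Upper rejection region: Pr(T > crit_upp) under the alternative
        power += stats.nct.sf(stats.t.isf(alpha_, df), df, nc)
    if alternative in ('two-sided', 'smaller'):
        # Lower rejection region: Pr(T < crit_low) under the alternative
        power += stats.nct.cdf(stats.t.ppf(alpha_, df), df, nc)
    return float(power)

p64 = ttest_ind_power_sketch(effect_size=0.5, nobs1=64, alpha=0.05)
p32 = ttest_ind_power_sketch(effect_size=0.5, nobs1=32, alpha=0.05)
print(f"power at n_1 = 64: {p64:.3f}; power at n_1 = 32: {p32:.3f}")
```

With d = 0.5 and 𝛼 = 0.05, 64 observations per group give a power of roughly 0.8, while halving the sample size lowers it substantially.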

Obtain minimum sample size using numerical optimization

Given that we now know how to compute power for a given set of parameters, we can run a numerical optimization method to find the minimum sample size that achieves the target power. Since the total sample size is a function of the treatment sample size (n = n_1 + ratio * n_1), we will be finding n_1.

This works because power is monotonically increasing in the sample size n_1. Intuitively, more samples make the A/B test result more certain, so if the alternative hypothesis is true, a larger share of possible test statistics fall in the rejection region (see left subplot of figure below).

It follows that subtracting the target power gives a monotonically increasing function of n_1 that is negative at small sample sizes and positive at large ones, provided the target power is attainable. Since power is also continuous in n_1, the intermediate value theorem and monotonicity together imply a unique root, which corresponds to our minimum sample size (see right subplot of figure below).

[Figure: left subplot, power as a function of n_1; right subplot, power minus target power as a function of n_1]
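The root-finding argument can be illustrated without any external libraries. The sketch below makes two simplifications, both assumptions introduced here: a normal approximation to the power (in place of the non-central t-distribution) and plain bisection (in place of a more sophisticated root finder). The function names are hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()

def approx_power(n1, d=0.5, alpha=0.05, ratio=1.0):
    """Two-sided power under a normal approximation (not the exact nct power)."""
    n2 = n1 * ratio
    theta = d * (n1 * n2 / (n1 + n2)) ** 0.5
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    return (1 - _norm.cdf(z_crit - theta)) + _norm.cdf(-z_crit - theta)

def bisect_sample_size(target_power=0.8, low=2.0, upp=50.0, tol=1e-8):
    """Find n1 such that approx_power(n1) - target_power = 0."""
    def f(n1):
        return approx_power(n1) - target_power
    # Expand the upper bound until the root is bracketed: f(low) < 0 <= f(upp)
    while f(upp) < 0:
        upp *= 10
    # Halve the bracket, keeping the sign change inside it
    while upp - low > tol:
        mid = (low + upp) / 2
        if f(mid) < 0:
            low = mid
        else:
            upp = mid
    return (low + upp) / 2

n1 = bisect_sample_size()
print(f"approximate n_1 = {n1:.2f}")
```

The normal approximation gives a slightly smaller n_1 than the exact t-based calculation, because it ignores the extra uncertainty from estimating 𝜎.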

A popular, high-performing method for this kind of problem is Brent’s method, a root-finding algorithm that combines the bisection method, the secant method and inverse quadratic interpolation. Further details of its implementation in Statsmodels can be found here.

In Python, the implementation looks like this:

from statsmodels.tools.rootfinding import brentq_expanding

# Method of the TTestIndPower class
def solve_power(self, effect_size=None, nobs1=None, alpha=None, power=None,
                ratio=1., alternative='two-sided'):
    print('--- Arguments: ---')
    print('effect_size:', effect_size, 'nobs1:', nobs1, 'alpha:', alpha,
          'power:', power, 'ratio:', ratio, 'alternative:', alternative, '\n')
    # Check that only nobs1 is None
    kwds = dict(effect_size=effect_size, nobs1=nobs1, alpha=alpha,
                power=power, ratio=ratio, alternative=alternative)
    key = [k for k, v in kwds.items() if v is None]
    assert key == ['nobs1']
    # Check that the effect_size is not 0
    if kwds['effect_size'] == 0:
        raise ValueError('Cannot detect an effect-size of 0. Try changing your effect-size.')
    # Initialize the counter
    self._counter = 0
    # Define the function that we want to find the root of
    # We want to find nobs1 s.t. current power = target power, i.e. current power - target power = 0
    # So func = current power - target power
    def func(x):
        kwds['nobs1'] = x
        target_power = kwds.pop('power')  # always the same target power specified in keywords, e.g. 0.8
        current_power = self.power(**kwds)  # current power given the current nobs1, note that self.power does not have power as an argument
        kwds['power'] = target_power  # add back power to kwds
        fval = current_power - target_power
        print(f'Iteration {self._counter}: nobs1 = {x}, current power - target power = {fval}')
        self._counter += 1
        return fval
    # Get the starting values for nobs1, given the brentq_expanding algorithm
    # In the original code, this is the self.start_bqexp dictionary set up in the __init__ method
    bqexp_fit_kwds = {'low': 2., 'start_upp': 50.}
    # Solve for nobs1 using brentq_expanding
    print('--- Solving for optimal nobs1: ---')
    val, _ = brentq_expanding(func, full_output=True, **bqexp_fit_kwds)
    return val

1.2. Writing a stripped-down version of tt_ind_solve_power that is an exact implementation of the statistical derivation and produces the same output as the original function

The source file in Statsmodels is available here. While the original function is written to be more powerful, its generalizability also makes it harder to gain intuition on how the code works.

I thus looked through the source code line-by-line and simplified it down from 1,600 lines of code to 160, and from 10+ functions to just 2, while ensuring that the implementation remains identical.

The stripped-down code contains just two functions under the TTestIndPower class, exactly following the statistical derivation explained in Part 1:

power, which computes power given a sample size
solve_power, which finds the minimum sample size that achieves a target power using Brent’s method

This is the full code for the stripped-down version with a test to check that it produces the same output as the original function:
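A compact sketch along the same lines is shown below. It is not the author's code: it is a minimal reimplementation under the assumptions that scipy and statsmodels are installed, that only nobs1 is solved for, and that the starting bracket matches the values used above (low = 2, start_upp = 50). The class name MiniTTestIndPower is hypothetical.

```python
import numpy as np
from scipy import special, stats
from statsmodels.stats.power import tt_ind_solve_power
from statsmodels.tools.rootfinding import brentq_expanding


class MiniTTestIndPower:
    """Hypothetical two-method sketch: power() and solve_power() for nobs1."""

    def power(self, effect_size, nobs1, alpha, ratio=1.0,
              alternative='two-sided'):
        nobs2 = nobs1 * ratio
        df = nobs1 + nobs2 - 2
        nc_param = effect_size * np.sqrt(nobs1 * nobs2 / (nobs1 + nobs2))
        if alternative == 'two-sided':
            alpha_ = alpha / 2.0
        elif alternative in ('smaller', 'larger'):
            alpha_ = alpha
        else:
            raise ValueError("alternative has to be 'two-sided', 'larger' "
                             "or 'smaller'")
        power = 0.0
        if alternative in ('two-sided', 'larger'):
            crit_upp = stats.t.isf(alpha_, df)
            power += 1 - special.nctdtr(df, nc_param, crit_upp)
        if alternative in ('two-sided', 'smaller'):
            crit_low = stats.t.ppf(alpha_, df)
            power += special.nctdtr(df, nc_param, crit_low)
        return power

    def solve_power(self, effect_size, alpha, power, ratio=1.0,
                    alternative='two-sided'):
        if effect_size == 0:
            raise ValueError('Cannot detect an effect size of 0.')

        # Root of (current power - target power) as a function of nobs1
        def func(nobs1):
            current = self.power(effect_size, nobs1, alpha, ratio, alternative)
            return current - power

        return float(brentq_expanding(func, low=2.0, start_upp=50.0))


# Compare the sketch with the original Statsmodels function
mine = MiniTTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
reference = float(tt_ind_solve_power(effect_size=0.5, nobs1=None, alpha=0.05,
                                     power=0.8, ratio=1.0,
                                     alternative='two-sided'))
print(f"sketch: {mine:.4f}, statsmodels: {reference:.4f}")
```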

2.1. Benefits to generalizability

This approach can be easily generalized to finding other parameters of interest (e.g. finding the level of significance or minimum detectable effect instead of sample size).

Via the closed-form solution approach, we need to find an equation for each parameter, which can be complex or infeasible. In contrast, the same numerical optimization approach works for any parameter.
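For instance, the same Statsmodels function solves for whichever single argument is passed as None. The inputs below are illustrative assumptions.

```python
from statsmodels.stats.power import tt_ind_solve_power

common = dict(alpha=0.05, ratio=1.0, alternative='two-sided')

# Solve for the per-group sample size
n1 = float(tt_ind_solve_power(effect_size=0.5, nobs1=None, power=0.8, **common))

# Solve for the detectable effect size (Cohen's d) at a fixed sample size
d = float(tt_ind_solve_power(effect_size=None, nobs1=64, power=0.8, **common))

# Solve for the achieved power at a fixed sample size and effect size
achieved = float(tt_ind_solve_power(effect_size=0.5, nobs1=64, power=None, **common))

print(f"n_1 = {n1:.2f}, d = {d:.3f}, power = {achieved:.3f}")
```

No new formula is needed for any of these: only the unknown in the root-finding problem changes.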

2.2. Benefits to statistical intuition

This approach is arguably more intuitive because it is a natural extension of the concept of statistical power. Further, the concept of the non-central t-distribution offers clearer insights into how minimum sample size changes when other parameters change.

Case 1: Change in parameter leading to increase in θ and thus an increase in power

Recall that the non-centrality parameter is θ = d * sqrt((n_1 * n_2) / (n_1 + n_2)). An increase in θ effectively shifts the non-central t-distribution to the right, reducing the overlap between the distributions under the two hypotheses.

This can be caused by the following:

Increase in MDE which increases Cohen’s d and thus increases θ
Decrease in population standard deviation which increases Cohen’s d and thus increases θ

This increases power given the same sample size (see Case 1 in diagram below), and thus reduces the minimum sample size.

Case 2: Change in parameter leading directly to an increase in power without changing θ

An increase in the level of significance 𝛼 widens the rejection region, so more values of the test statistic lead to a rejection of the null. This directly increases power (see Case 2 in diagram below) and reduces the minimum sample size.

Case 3: Change in target power

An increase in target power means that the initial n will no longer meet the higher target power, thus requiring an increase in minimum sample size.
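The three cases can be checked numerically at a fixed sample size. To stay dependency-free, the sketch below approximates the non-central t-distribution by a normal distribution centered at θ; this approximation, the helper names and the example values are assumptions made here for illustration.

```python
from statistics import NormalDist

norm = NormalDist()

def theta(d, n1, n2):
    """Non-centrality parameter."""
    return d * (n1 * n2 / (n1 + n2)) ** 0.5

def approx_power(theta_, alpha):
    """Two-sided power, approximating the nct by a normal centered at theta_."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return (1 - norm.cdf(z_crit - theta_)) + norm.cdf(-z_crit - theta_)

n1 = n2 = 64
base = approx_power(theta(0.5, n1, n2), alpha=0.05)

# Case 1: larger d (larger MDE or smaller sigma) -> larger theta -> more power
case1 = approx_power(theta(0.6, n1, n2), alpha=0.05)

# Case 2: larger alpha, same theta -> more power
case2 = approx_power(theta(0.5, n1, n2), alpha=0.10)

print(f"base = {base:.3f}, case 1 = {case1:.3f}, case 2 = {case2:.3f}")

# Case 3: the same n meets a target of 0.8 but not a target of 0.9
print(f"meets 0.8: {base >= 0.8}, meets 0.9: {base >= 0.9}")
```

In Cases 1 and 2 the surplus power means a smaller sample would suffice, while in Case 3 the shortfall can only be closed by increasing n.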

[Figure: effect of Cases 1, 2 and 3 on power]

The function to solve for minimum sample size in Statsmodels is powerful and relies on numerical optimization. While different from standard closed-form solutions, this approach makes it easier to see the statistical intuition behind how sample size is computed, and to generalize to computing other parameters of interest. It is an approach worth understanding for data scientists interested in marketing and product analytics.

Deng, L. (2020). Required Sample Size for A/B Testing. https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge: Cambridge University Press.

Miller, E. (2013). Sample Size Calculator.

Nsky, S. (2019). Experiment sample size calculation using power analysis. https://towardsdatascience.com/experiment-sample-size-calculation-using-power-analysis-81cb1bc5f74b

Wei, M. (2023). Probing into Minimum Sample Size Formula: Derivation and Usage. https://towardsdatascience.com/probing-into-minimum-sample-size-formula-derivation-and-usage-8db9a556280b

[1] The equivalent function in R is pwr.t.test. We use the Python version because the open-source code is available for readers to view, compare and work through.

[2] The general non-central t distribution is not symmetric or centered exactly around the non-centrality parameter θ, but tends towards being so as the degrees of freedom increase.