A/B Test Significance

Calculate statistical significance and required sample size for A/B tests. Enter values for instant results with step-by-step formulas.

Formula

z = (p₂ - p₁) / √[p(1-p)(1/n₁ + 1/n₂)]

Where z is the test statistic, p₁ and p₂ are the conversion rates for control and variation, p is the pooled proportion (total conversions divided by total visitors across both groups), and n₁ and n₂ are the sample sizes. The p-value is calculated from the standard normal distribution. For significance at the 95% confidence level (two-tailed), |z| must exceed 1.96.
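As a minimal sketch of the formula above, the following Python snippet computes the z statistic and two-tailed p-value using only the standard library. The function names `normal_cdf` and `two_proportion_z_test` are illustrative choices, not part of the calculator, and the input counts are taken from Example 1 on this page.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF Φ(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-tailed p-value) for control A versus variation B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion: all conversions over all visitors.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null hypothesis.
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value


# Counts from Example 1: 1,750 of 50,000 versus 1,925 of 50,000.
z, p_value = two_proportion_z_test(1_750, 50_000, 1_925, 50_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z = 2.94, p = 0.0033
```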

Worked Examples

Example 1: E-commerce Checkout Button Color Test

Problem: An e-commerce site tests green vs. orange checkout buttons. Control (green): 50,000 visitors, 1,750 conversions. Variation (orange): 50,000 visitors, 1,925 conversions. Is the orange button significantly better at 95% confidence?

Solution:

Step 1: Calculate conversion rates
Control rate = 1,750/50,000 = 3.50%
Variation rate = 1,925/50,000 = 3.85%
Relative uplift = (3.85-3.50)/3.50 = 10.0%

Step 2: Calculate pooled proportion and standard error
Pooled p = (1,750+1,925)/(50,000+50,000) = 3.675%
SE = √(0.03675 × 0.96325 × (1/50,000 + 1/50,000)) = 0.001190

Step 3: Calculate z-score
z = (0.0385 - 0.0350) / 0.001190 = 2.94

Step 4: Calculate p-value (two-tailed)
p-value = 2 × (1 - Φ(2.94)) = 0.0033

Step 5: Determine significance
p-value (0.0033) < α (0.05) ✓

Step 6: Calculate 95% confidence interval for the absolute difference
CI = 0.35% ± 1.96 × 0.119%
CI = [0.12%, 0.58%]

Conclusion: Orange button shows a statistically significant 10% relative improvement.

Result: Significant: YES (p=0.003) | Uplift: +10.0% | 95% CI: [0.12%, 0.58%] | ~175 extra conversions
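The confidence interval in Step 6 can be reproduced with a short sketch. One assumption to note: this version uses the unpooled (Wald) standard error, which is the more common choice for an interval on the difference, whereas the worked example reuses the pooled value of 0.119%. With these counts the two agree to the precision shown. The function name `difference_ci` is hypothetical.

```python
import math


def difference_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, z_crit: float = 1.96):
    """Confidence interval for the absolute difference p_b - p_a.

    Assumption: unpooled (Wald) standard error, not the pooled one
    used for the z statistic.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


low, high = difference_ci(1_750, 50_000, 1_925, 50_000)
print(f"95% CI: [{low:.2%}, {high:.2%}]")  # 95% CI: [0.12%, 0.58%]
```

Because the interval excludes zero, it leads to the same conclusion as the p-value in Step 5.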

Example 2: Sample Size Calculation for Pricing Page Test

Problem: A SaaS company wants to test a new pricing page. Current conversion rate is 2.5%. They want to detect a 15% relative improvement with 80% power at 95% significance. How many visitors per variant are needed?

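Solution:

Step 1: Define parameters
Baseline rate (p₁) = 2.5% = 0.025
Minimum Detectable Effect = 15% relative
New rate (p₂) = 0.025 × 1.15 = 2.875% = 0.02875
Absolute effect = 0.02875 - 0.025 = 0.00375

Step 2: Get z-scores
α = 0.05, two-tailed: z_α/2 = 1.96
Power = 80%: z_β = 0.84

Step 3: Apply sample size formula used by this calculator
n = 2 × (z_α/2 + z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₂-p₁)²

n = 2 × (1.96 + 0.84)² × [0.025×0.975 + 0.02875×0.97125] / (0.00375)²
n ≈ 58,380 per variant (computed with unrounded z-scores 1.95996 and 0.84162; the rounded values shown give about 58,314)
Note: because of the leading factor of 2 applied to the summed variances, this formula returns roughly twice the figure given by the common textbook two-proportion formula, so treat it as a conservative estimate.

Step 4: Calculate test duration
At 2,000 visitors/day split 50/50, each variant receives 1,000 visitors/day:
Days needed = 58,380 / 1,000 = 58.4, rounded up to 59 days (about 2 months)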

Result: Required: 58,380 visitors per variant | 116,760 total | ~59 days at 2K visitors/day
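The calculation in Example 2 can be sketched in Python as follows. This is an illustration of the formula as stated in Step 3 (including its leading factor of 2), not the calculator's actual source code. The function name `sample_size_per_variant` and its parameter names are assumptions, and the z-scores are taken unrounded from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors per variant using the formula stated in Example 2, Step 3."""
    p1 = baseline
    p2 = baseline * (1.0 + relative_mde)
    # Unrounded critical values (about 1.95996 and 0.84162 for the defaults).
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = 2.0 * (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


n = sample_size_per_variant(0.025, 0.15)
# 2,000 visitors/day split 50/50 gives 1,000 visitors/day per variant.
days = math.ceil(n / 1_000)
print(n, days)  # 58380 59
```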

Example 3: Marginal Result Interpretation

Problem: A landing page test shows: Control: 8,000 visitors, 240 conversions. Variation: 8,000 visitors, 272 conversions. The uplift looks promising. How should this be interpreted?

Solution:

Step 1: Calculate metrics
Control rate = 240/8,000 = 3.00%
Variation rate = 272/8,000 = 3.40%
Relative uplift = 13.3%

Step 2: Assess statistical significance
p-value ≈ 0.151 > α = 0.05
Result is NOT statistically significant at 95% confidence
It is also not significant at 90% confidence

Step 3: Calculate confidence interval
95% CI for difference: [-0.15%, 0.95%]
The CI includes zero, confirming non-significance

Step 4: Interpret the situation
- Promising directional lift but still inconclusive
- Could be real effect or random variation
- Sample size is too small for confidence

Step 5: Recommendations
Option A: Continue test until the planned sample size is reached
Option B: Re-run with a larger traffic allocation
Option C: Treat this as exploratory evidence, not a

Result: Not significant at 95% (p=0.151) | Suggestive but inconclusive | Recommend: keep running or gather more traffic

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance indicates that an observed difference between variants is unlikely to have occurred by random chance alone. When we say a result is 'statistically significant at 95% confidence,' we mean that, if there were truly no difference between the variants, a difference at least as large as the one observed would occur less than 5% of the time (p-value < 0.05). However, statistical significance doesn't mean the result is practically important: a tiny difference can be statistically significant with large enough sample sizes.
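To illustrate the gap between statistical and practical significance, the sketch below applies the same two-proportion z-test to one fixed difference (5.00% versus 5.05%) at two sample sizes. The numbers are invented for illustration and the helper name `two_tailed_p_value` is hypothetical.

```python
import math


def two_tailed_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-tailed p-value from the pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Φ(|z|)) and stays accurate in the tail.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Same rates (5.00% vs 5.05%), very different sample sizes (illustrative data).
p_small = two_tailed_p_value(500, 10_000, 505, 10_000)
p_large = two_tailed_p_value(500_000, 10_000_000, 505_000, 10_000_000)
print(f"10,000 per variant:     p = {p_small:.3f}")
print(f"10,000,000 per variant: p = {p_large:.2e}")
```

The 0.05 percentage point difference is far from significant at 10,000 visitors per variant and highly significant at 10,000,000, even though its business value is identical in both cases.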

How do I calculate the sample size needed for an A/B test?

Sample size depends on: 1) Baseline conversion rate, 2) Minimum Detectable Effect (MDE), the smallest improvement worth detecting, 3) Statistical power (typically 80%), and 4) Significance level (typically α = 0.05, i.e. 95% confidence). The formula combines the z-scores for your desired power and significance level with the variance of each conversion rate. Generally, smaller effects and lower baseline rates require larger samples. A 10% relative lift from a 3% baseline, for example, requires roughly 53,000 visitors per variant under the standard two-proportion formula at 80% power and α = 0.05.

How long should I run an A/B test?

Run your test until you reach the required sample size calculated before the test starts. Never stop early just because you see significance: checking repeatedly and stopping at the first significant result inflates the false positive rate (the peeking problem). Also consider: 1) Run for at least one full week to capture day-of-week effects, 2) Account for seasonality, 3) Ensure you capture business cycles. Sequential testing and always-valid p-values are designed to permit early stopping, but they rely on adjusted significance thresholds rather than the fixed-sample test described on this page.
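The peeking problem can be demonstrated with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every significant result is a false positive) and compares testing once at the end against stopping at the first significant interim look. All parameters (400 runs, 2,000 visitors per arm, 10 looks, a 10% conversion rate, the seed) are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random


def is_significant(conv_a: int, conv_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    if se == 0.0:
        return False  # no variation observed yet, so nothing to test
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa_tests(runs: int = 400, visitors: int = 2_000, looks: int = 10,
                      rate: float = 0.10, seed: int = 1):
    """Return (fixed-horizon, peeking) false positive rates for A/A tests."""
    rng = random.Random(seed)
    step = visitors // looks  # new visitors per arm between looks
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        significant = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = is_significant(conv_a, conv_b, look * step)
            flagged_at_any_look = flagged_at_any_look or significant
        fixed_hits += significant           # result of the final look only
        peeking_hits += flagged_at_any_look  # would have stopped at some look
    return fixed_hits / runs, peeking_hits / runs


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"test once at the end:         {fixed_rate:.1%} false positives")
print(f"stop at first significance:   {peeking_rate:.1%} false positives")
```

By construction the peeking rate can never be lower than the fixed-horizon rate, since the final look is one of the looks. In practice it is typically several times higher than the nominal 5%.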

What inputs do I need to use A/B Test Significance accurately?

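For a significance test, you need the number of visitors and the number of conversions for both the control and the variation, plus the significance level you want to test at (typically α = 0.05). For a sample size calculation, you need the baseline conversion rate, the minimum detectable effect as a relative improvement, the desired statistical power, and the significance level. Enter counts exactly as recorded by your analytics tool. The formula section on this page lists every variable and explains what each represents.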

How accurate are the results from A/B Test Significance?

All calculations use established mathematical formulas and are performed with high-precision arithmetic. Results are accurate to the precision shown. For critical decisions in finance, medicine, or engineering, always verify results with a qualified professional.

Is my data stored or sent to a server?

No. All calculations run entirely in your browser using JavaScript. No data you enter is ever transmitted to any server or stored anywhere. Your inputs remain completely private.

Background & Theory

The A/B Test Significance Calculator applies the following established principles and formulas. Search engine optimisation and digital marketing performance is quantified through a hierarchy of interconnected metrics. Click-through rate (CTR) divides the number of clicks on a link by the number of times it was shown (impressions), expressing how compelling a headline, ad, or meta description is at a given position. Industry average organic CTR for the top Google result sits around 28 to 35 percent, declining sharply with rank. Cost-per-click (CPC) is the average amount paid each time a user clicks a paid advertisement, calculated by dividing total ad spend by total clicks. Return on ad spend (ROAS) divides total revenue attributed to advertising by total ad spend; a ROAS of 4 means $4 in revenue for every $1 spent. Conversion rate divides completed goal actions (purchases, sign-ups, downloads) by total sessions or unique visitors, bridging traffic metrics to business outcomes.

Keyword difficulty scores (typically 0 to 100) estimate how competitive it would be to rank organically for a given search term, based on the authority of pages currently ranking in the top results. PageRank, the algorithm Google was originally built on, modelled the web as a directed graph and assigned each page an authority score proportional to the number and quality of inbound links, treating a link as a vote of confidence weighted by the linking page's own authority. The Flesch Reading Ease formula scores text legibility on a 0 to 100 scale using sentence length and syllable count per word. Higher scores indicate easier reading; most consumer-oriented web content targets scores above 60.

Bounce rate measures the percentage of sessions in which a user leaves without triggering a second page view, though its interpretation depends heavily on page purpose. Email open rate benchmarks vary significantly by industry, averaging around 20 to 25 percent across sectors. Social media engagement rate divides total interactions (likes, comments, shares) by total reach or follower count, assessing content resonance beyond simple impression counts.

History

The history behind the A/B Test Significance Calculator traces back through the following developments. Before algorithmic search engines, web navigation relied on manually curated directories maintained by human editors. Yahoo launched its categorised directory in 1994 and briefly dominated web discovery by organising sites into a hierarchical taxonomy. Early automated search engines including AltaVista and Excite ranked pages using keyword frequency in on-page content, which immediately spawned keyword stuffing as the first widespread manipulation tactic: publishers repeated target phrases hundreds of times, sometimes rendered in white text on a white background to hide them from readers while remaining visible to crawlers.

Google's founding in 1998 by Larry Page and Sergey Brin at Stanford introduced PageRank, a link-graph authority algorithm that shifted ranking signals away from easily gamed on-page text toward the harder-to-fabricate structure of inbound links. This dramatically improved result quality and positioned Google as the dominant search engine within three years of launch. The growing commercial value of first-page rankings created a professional SEO industry that reverse-engineered ranking signals, built link farms, and pursued aggressive anchor text optimisation.

Google responded to systematic manipulation with major named algorithm updates: Panda in 2011 penalised low-quality, thin, and duplicate content; Penguin in 2012 targeted unnatural link patterns and link schemes; and Hummingbird in 2013 introduced deep semantic parsing to match query intent rather than literal keyword strings. These updates collectively shifted SEO best practice toward genuine content quality, topical depth, and user experience signals.

Facebook launched its self-service advertising platform in 2007, enabling granular demographic, interest, and behavioural targeting at scale for the first time. Social media marketing matured into a distinct professional discipline through the 2010s. Google formalised mobile-first indexing in 2016 and made Core Web Vitals official ranking signals in 2021. From 2023 onward, AI Overviews began surfacing synthesised answers atop search results, creating a zero-click environment that fundamentally challenged traffic-dependent content business models.
