Introduction

A/B testing (also called split testing) is a randomized experiment comparing two versions to determine which performs better. It's the gold standard for making data-driven decisions in product development, marketing, and UX design.


What is A/B Testing?

  • Control (A): Current version
  • Treatment (B): Modified version with one change
  • Random assignment: Users randomly see A or B
  • Compare outcomes: Which version performs better?

Example

Testing a new "Buy Now" button color:

  • Control: Green button (current)
  • Treatment: Orange button (new)
  • Metric: Click-through rate
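
In code, the random-assignment step is often implemented as a deterministic hash of the user ID, so each user keeps the same variant across sessions. Below is a minimal Python sketch for the button example; the user IDs, variant names, salt, and 50/50 split are illustrative assumptions, not any particular platform's API.

    # Minimal sketch: deterministic bucketing for the button-color test.
    # User IDs, variant names, and the 50/50 split are assumptions.
    import hashlib

    def assign_variant(user_id: str, experiment: str = "buy-now-color") -> str:
        """Deterministically bucket a user into control (A) or treatment (B)."""
        # Hash the user ID with an experiment-specific salt so the same user
        # always sees the same variant for this experiment.
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # 0..99
        return "control_green" if bucket < 50 else "treatment_orange"

    print(assign_variant("user_42"))  # same input -> same variant every time

Hashing with an experiment-specific salt keeps a user's assignment stable within one experiment while staying independent of assignments in other experiments.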


The A/B Testing Process

Step 1: Form Hypothesis

"Changing [X] will increase [metric] because [reason]"

Step 2: Determine Metrics

  • Primary metric: The main success measure (e.g., conversion rate)
  • Secondary metrics: Other outcomes worth tracking alongside the primary metric
  • Guardrail metrics: Metrics monitored to ensure the change causes no unintended harm

Step 3: Calculate Sample Size

Determine how many users are needed for a statistically valid result.

Step 4: Run Experiment

  • Random assignment
  • Run for predetermined duration
  • Don't peek and stop early!
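
A simple way to avoid peeking is to fix the duration before launch, based on the required sample size and expected traffic. A rough Python sketch, with the traffic and sample-size numbers assumed for illustration:

    # Sketch: fix the run length up front so there is no temptation to stop early.
    # Required sample (from Step 3) and daily traffic are assumed values.
    import math

    required_per_variant = 31_000    # assumed output of the Step 3 calculation
    variants = 2
    eligible_users_per_day = 4_000   # assumed daily traffic entering the test

    total_needed = required_per_variant * variants
    duration_days = math.ceil(total_needed / eligible_users_per_day)
    print(f"Run for at least {duration_days} days")  # 16 days with these numbers

Rounding the duration up to whole weeks also helps smooth out day-of-week effects.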

Step 5: Analyze Results

Calculate statistical significance and effect size.
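
A common choice for conversion-rate experiments is a two-proportion z-test. The sketch below computes the absolute lift, a two-sided p-value, and a 95% confidence interval; the conversion counts are made up for illustration.

    # Sketch of the analysis step: two-proportion z-test on conversion counts.
    # The counts passed in at the bottom are made-up example numbers.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        # Pooled proportion under the null hypothesis of no difference
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
        # 95% confidence interval for the absolute lift (unpooled SE)
        se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
        return p_b - p_a, z, p_value, ci

    lift, z, p, ci = two_proportion_ztest(conv_a=1500, n_a=30_000,
                                          conv_b=1650, n_b=30_000)
    print(f"absolute lift={lift:.4f}, z={z:.2f}, p={p:.4f}, 95% CI={ci}")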


Key Statistical Concepts

  • Significance level (α): Probability of a false positive; typically 5% (0.05)
  • Power (1−β): Probability of detecting a true effect; typically 80%
  • p-value: Probability of a result at least this extreme if there is no real difference; < 0.05 is conventionally significant
  • Confidence interval: Range of plausible effect sizes; a 95% CI is standard
  • Effect size: Magnitude of the difference; what counts as meaningful depends on context

Statistical Significance vs. Practical Significance:
A result can be statistically significant but practically insignificant. A 0.1% conversion lift might be "significant" with a large enough sample, yet still not worth implementing.
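
One way to make this concrete is to compare the confidence interval for the lift against the smallest lift worth shipping. The threshold and interval below are assumptions for illustration.

    # Sketch: separate statistical from practical significance.
    # Threshold and confidence-interval values are assumed for illustration.
    min_worthwhile_lift = 0.005               # only ship if lift >= 0.5 points
    observed_lift, ci_low, ci_high = 0.001, 0.0004, 0.0016  # tiny but nonzero

    statistically_significant = ci_low > 0                   # CI excludes zero
    practically_significant = ci_low >= min_worthwhile_lift  # clears the bar

    print(f"lift={observed_lift:.4f}: statistical={statistically_significant}, "
          f"practical={practically_significant}")  # True, False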

Sample Size Calculation

Sample size depends on:

  • Baseline conversion rate: Current performance
  • Minimum detectable effect: Smallest improvement you care about
  • Significance level: Usually 5%
  • Power: Usually 80%

Rule of thumb: Smaller effects require larger samples

To detect a 10% relative lift in a 5% conversion rate, you need ~30,000 users per variant.
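
That figure follows from the standard sample-size formula for comparing two proportions. The Python sketch below reproduces it, assuming a two-sided 5% significance level and 80% power.

    # Sketch of the standard sample-size formula for two proportions.
    from statistics import NormalDist

    def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
        p1 = baseline
        p2 = baseline * (1 + relative_lift)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
        z_beta = NormalDist().inv_cdf(power)           # ~0.84
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

    # ~31,000 per variant, consistent with the ~30,000 rule of thumb above
    print(round(sample_size_per_variant(baseline=0.05, relative_lift=0.10)))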


Common Pitfalls

  • Peeking: Stopping early when you see desired result
  • Multiple comparisons: Testing many variants without correcting for it (see the sketch after this list)
  • Novelty effect: Users react to newness, not improvement
  • Selection bias: Non-random assignment
  • Insufficient sample: Underpowered tests
  • Wrong metric: Optimizing proxy, not true goal
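
For the multiple-comparisons pitfall, the simplest fix is a Bonferroni correction: divide α by the number of comparisons. A minimal sketch with made-up p-values:

    # Sketch: Bonferroni correction when testing several variants at once.
    # The p-values are made up; only the adjustment logic matters.
    p_values = {"orange": 0.012, "blue": 0.030, "purple": 0.200}
    alpha = 0.05
    adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 ≈ 0.0167

    for variant, p in p_values.items():
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"{variant}: p={p:.3f} -> {verdict} at alpha {adjusted_alpha:.4f}")

Bonferroni is conservative; it controls the overall false-positive rate at the cost of requiring a stricter threshold per comparison.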

Conclusion

Key Takeaways

  • A/B testing compares control vs. treatment with random assignment
  • Hypothesis first: Know what you're testing and why
  • Calculate sample size before starting
  • Don't peek—run for full duration
  • p < 0.05 indicates statistical significance
  • Consider practical significance, not just statistical
  • A/B testing enables data-driven decisions