A/B Test Process: Hypothesis → Design → Run → Analyze
Introduction
A/B testing (also called split testing) is a randomized experiment comparing two versions to determine which performs better. It's the gold standard for making data-driven decisions in product development, marketing, and UX design.
What is A/B Testing?
- Control (A): Current version
- Treatment (B): Modified version with one change
- Random assignment: Users randomly see A or B
- Compare outcomes: Which version performs better?
Example
Testing a new "Buy Now" button color:
- Control: Green button (current)
- Treatment: Orange button (new)
- Metric: Click-through rate
The A/B Testing Process
Step 1: Form Hypothesis
"Changing [X] will increase [metric] because [reason]"
Step 2: Determine Metrics
- Primary metric: The main success measure (e.g., conversion rate)
- Secondary metrics: Other outcomes worth tracking alongside the primary metric (e.g., revenue per visitor)
- Guardrail metrics: Measures that must not regress (e.g., page load time)
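It can help to pin these down in code before launch. Below is a minimal, hypothetical sketch; the class and metric names are illustrative, not from any particular framework:

```python
# A hypothetical sketch of declaring metrics up front; the class and
# metric names are illustrative, not from any particular framework.
from dataclasses import dataclass, field

@dataclass
class ExperimentMetrics:
    primary: str                       # the single metric the decision rests on
    secondary: list[str] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)  # must not regress

buy_now_test = ExperimentMetrics(
    primary="click_through_rate",
    secondary=["revenue_per_visitor"],
    guardrails=["page_load_time_ms", "refund_rate"],
)
```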
Step 3: Calculate Sample Size
Determine how many users needed for statistical validity.
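A sketch of the standard two-proportion sample size formula (normal approximation, two-sided test); the function name is ours, and scipy is assumed to be available:

```python
# A sketch of the standard two-proportion sample size formula
# (normal approximation, two-sided test).
import math
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, relative_mde,
                            alpha=0.05, power=0.80):
    """Users needed in EACH variant to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```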
Step 4: Run Experiment
- Random assignment
- Run for predetermined duration
- Don't peek and stop early!
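One common way to implement stable random assignment is to hash the user ID with an experiment-specific salt, so each user always sees the same variant and different experiments bucket independently. A minimal sketch (names are illustrative):

```python
# A common pattern for stable assignment: hash the user ID with an
# experiment-specific salt, so each user always sees the same variant
# and different experiments bucket independently. Names are illustrative.
import hashlib

def assign_variant(user_id: str, experiment: str = "buy_now_color") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in 0-99
    return "treatment" if bucket < 50 else "control"  # 50/50 split
```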
Step 5: Analyze Results
Calculate statistical significance and effect size.
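A sketch of a two-proportion z-test for a completed experiment, returning the p-value, the absolute effect size, and a 95% confidence interval; the counts in the usage line are made up:

```python
# A sketch of a two-proportion z-test for a completed experiment.
import math
from scipy.stats import norm

def analyze(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (p_value, absolute_lift, confidence_interval)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate and SE under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_null
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    # Unpooled SE for the confidence interval around the observed lift.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = norm.ppf(1 - alpha / 2) * se
    lift = p_b - p_a
    return p_value, lift, (lift - margin, lift + margin)

# Hypothetical counts: 5.00% vs. 5.55% conversion at 30,000 users each.
p, lift, ci = analyze(conv_a=1500, n_a=30000, conv_b=1665, n_b=30000)
print(f"p = {p:.4f}, lift = {lift:.2%}")  # p ≈ 0.0026: significant
```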
Key Statistical Concepts
| Concept | Definition | Typical Value |
|---|---|---|
| Significance Level (α) | Probability of false positive | 5% (0.05) |
| Power (1-β) | Probability of detecting true effect | 80% |
| p-value | Probability of a result at least this extreme if there is no true effect | < 0.05 is significant |
| Confidence Interval | Range of plausible effect sizes | 95% CI |
| Effect Size | Magnitude of difference | Depends on context |
Statistical Significance vs. Practical Significance:
A result can be statistically significant but practically insignificant. A 0.1% conversion lift might be "significant" with large samples but not worth implementing.
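A quick check of that claim, reading the 0.1% as an absolute 0.1-percentage-point change on a 5% baseline, at one million users per variant:

```python
# With a million users per variant, a 5.0% -> 5.1% lift (0.1 percentage
# points) is statistically significant, yet may not justify shipping.
import math
from scipy.stats import norm

n = 1_000_000
p_a, p_b = 0.050, 0.051
se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
z = (p_b - p_a) / se
print(f"z = {z:.2f}, p = {2 * (1 - norm.cdf(z)):.4f}")  # z ≈ 3.23, p ≈ 0.0012
```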
Sample Size Calculation
Sample size depends on:
- Baseline conversion rate: Current performance
- Minimum detectable effect: Smallest improvement you care about
- Significance level: Usually 5%
- Power: Usually 80%
Rule of thumb: smaller effects require larger samples. Because required sample size scales with 1/(effect size)², halving the minimum detectable effect roughly quadruples the sample you need.
To detect a 10% relative lift on a 5% baseline conversion rate (i.e., 5.0% → 5.5%), you need roughly 31,000 users per variant at a two-sided α of 5% and 80% power.
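You can reproduce this figure with the sample_size_per_variant sketch from Step 3:

```python
# Reproducing the figure above with the sample_size_per_variant sketch
# from Step 3 (assumed to be in scope).
sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10)  # ~31,000
```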
Common Pitfalls
- Peeking: Stopping early when you see the desired result (see the simulation sketch after this list)
- Multiple comparisons: Testing many variants without correction
- Novelty effect: Users react to newness, not improvement
- Selection bias: Non-random assignment
- Insufficient sample: Underpowered tests
- Wrong metric: Optimizing proxy, not true goal
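To see why peeking is dangerous, here is a minimal simulation sketch: many A/A tests (no true difference between arms) are "analyzed" at ten interim looks, stopping at the first significant result. The false-positive rate climbs well above the nominal 5%:

```python
# A minimal simulation of the peeking pitfall: run many A/A tests (no true
# difference) and stop at the first significant interim look. The observed
# false-positive rate climbs well above the nominal 5%.
import numpy as np

rng = np.random.default_rng(0)
false_positives = 0
runs, n_total, looks = 2000, 20000, 10

for _ in range(runs):
    a = rng.random(n_total) < 0.05  # both arms share a 5% conversion rate
    b = rng.random(n_total) < 0.05
    for n in np.linspace(n_total // looks, n_total, looks, dtype=int):
        p_a, p_b = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0 and abs(p_b - p_a) / se > 1.96:
            false_positives += 1
            break  # "peek" and stop at first significance

print(f"false positive rate: {false_positives / runs:.1%}")  # well above 5%
```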
Conclusion
Key Takeaways
- A/B testing compares control vs. treatment with random assignment
- Hypothesis first: Know what you're testing and why
- Calculate sample size before starting
- Don't peek—run for full duration
- p < 0.05 indicates statistical significance
- Consider practical significance, not just statistical
- A/B testing enables data-driven decisions