
What Is A/B Testing? The Complete Guide for 2026

A/B testing explained from first principles — what it is, how it works, frequentist vs Bayesian stats, the PIE framework, client-side vs server-side testing, common mistakes, and tools. With real examples and everything the other guides skip.

Victor Ogonyo

·2026-06-07·20 min read

A/B testing is the most powerful optimisation tool available to a product or marketing team — and one of the most consistently misused. Most teams stop tests too early, test the wrong things, confuse statistical significance with business importance, and ship changes based on noise rather than signal.

This guide covers everything: what A/B testing actually is, how the statistics work (without requiring a PhD), what to test and in what order, the technical choices you need to make, the frameworks that separate mature experimentation programmes from chaotic ones, and the mistakes that invalidate most tests before they even start.

What Is A/B Testing?

A/B testing — also called split testing or AB testing — is a controlled experiment in which two versions of something are shown to randomly selected groups of users simultaneously, and a pre-defined metric is compared to determine which version performs better.

Version A (the control): What you currently have. The baseline.
Version B (the variant): The change you want to test.

Users are randomly assigned to see either A or B. After enough users have experienced each version, you measure your chosen metric — conversion rate, click-through rate, revenue per visitor — and determine whether the difference is statistically meaningful or just noise.

The core logic is ruthlessly simple: instead of debating what might work, you let user behaviour decide.

Split URL Testing vs. A/B Testing

A standard A/B test serves both variants on the same URL, with the variant applied by changing elements on the page. A split URL test redirects users to entirely different URLs — /pricing-v2 vs /pricing. Split URL tests are better for testing dramatically different layouts or designs. Standard A/B tests are better for element-level changes (headlines, buttons, copy).

A/B Testing vs. Multivariate Testing

A/B tests change one variable at a time. Multivariate tests (MVT) change multiple elements simultaneously and test all combinations. If you test 3 headlines × 2 button colours × 2 images, you have 12 combinations to test simultaneously. Because traffic is split across every combination, MVT requires far more traffic to reach significance on each one: with 12 combinations instead of 2, each receives roughly a sixth of the traffic it would get in a simple A/B test. Use A/B tests for most decisions. Reserve MVT for high-traffic pages where you have already found individual elements that move the needle.

A/A Testing

Before trusting your A/B testing tool, run an A/A test: show the exact same version to both groups and confirm that the tool reports no significant difference. At a 0.05 threshold, roughly 1 in 20 A/A tests will look significant by chance alone, so a single false alarm proves nothing. If A/A tests flag winners repeatedly, or the group sizes come out lopsided, your randomisation, tracking, or data pipeline has a bug. Fixing this before running real tests saves months of invalid data.

Why A/B Testing Matters: The Business Case

Conversion improvements compound permanently

A one-percentage-point improvement in conversion rate doesn't disappear next month. It applies to every visitor, every day, for as long as the change holds. A landing page converting at 4% instead of 3% on 10,000 monthly visitors produces 1,200 extra signups per year (100 per month × 12) at zero additional acquisition cost. Run 12 such tests in a year and the compounding effect of the winners is dramatic.
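The arithmetic behind that example can be checked in a few lines. This is a minimal sketch: the function names are illustrative, and the second function assumes, optimistically, that each winning test delivers an independent relative lift that stacks multiplicatively on the previous ones.

```python
def extra_conversions_per_year(monthly_visitors: int, old_rate: float, new_rate: float) -> float:
    """Additional conversions per year from moving old_rate to new_rate."""
    return monthly_visitors * (new_rate - old_rate) * 12


def compounded_rate(base_rate: float, relative_lifts: list[float]) -> float:
    """Conversion rate after applying a series of relative lifts in sequence."""
    rate = base_rate
    for lift in relative_lifts:
        rate *= 1 + lift
    return rate


if __name__ == "__main__":
    # The example from the text: 3% -> 4% on 10,000 monthly visitors.
    print(round(extra_conversions_per_year(10_000, 0.03, 0.04)))
    # Hypothetical: four winning tests in a year, each a 5% relative lift.
    print(round(compounded_rate(0.03, [0.05] * 4), 4))
```

In practice only a fraction of tests win, so the list of lifts passed to compounded_rate would be much shorter than the number of tests run.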

Opinion is the most expensive input

The designer thinks the button should be blue. The founder thinks orange. The VP of Marketing wants the headline rewritten entirely. Every opinion-driven decision is a coin flip. A/B testing replaces the highest-paid person's opinion with user behaviour.

Users behave in ways that consistently surprise experts

The change that seems obviously better — more copy explaining the benefits, a larger hero image, a more descriptive headline — frequently loses to the simpler alternative. The only way to know is to test. Gut instinct built on years of experience in one context fails constantly in new contexts.

It guides major redesigns, not just small tweaks

The most common objection to A/B testing is that you can't test a full redesign. You can. Test individual sections of the redesign sequentially, validate each before committing to the full version, and ship a redesign you know converts better rather than one that looks better to your internal team.

The Statistics Behind A/B Testing

You don't need to be a statistician to run good A/B tests — but you need to understand enough to avoid the most common statistical errors.

Frequentist Testing (p-values and statistical significance)

The dominant approach in A/B testing is frequentist. You run the test, calculate a p-value, and compare it to your significance threshold (almost always p < 0.05, meaning 95% confidence).

What the p-value actually means: The probability of observing a result at least as extreme as yours, assuming there is no real difference between A and B. A p-value of 0.03 means: if A and B were truly identical, there would only be a 3% chance of seeing a difference this large by random chance alone.

What statistical significance does not mean:

It does not mean the result is large or practically important
It does not guarantee the effect will hold on future traffic
It does not mean you are 95% certain the variant is better — that is a common and consequential misinterpretation

Confidence intervals matter more than p-values. A 95% confidence interval of [+0.1%, +0.3%] is statistically significant but almost certainly not worth shipping. A CI of [+8%, +24%] is both statistically and practically significant. Always report effect size with its confidence interval, not just the p-value.
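To make the p-value and the confidence interval concrete, here is a minimal sketch of a two-proportion z-test using only the Python standard library. It assumes a two-sided test and samples large enough for the normal approximation; the function name and the example counts are made up for illustration, and a real testing tool may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Return (absolute lift, two-sided p-value, confidence interval for the lift)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a

    # p-value: pooled standard error, because the null hypothesis says A and B share one rate.
    # (Raises ZeroDivisionError if there are no conversions at all, or everyone converts.)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = lift / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error, because it describes the observed difference.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return lift, p_value, (lift - z_crit * se_diff, lift + z_crit * se_diff)


if __name__ == "__main__":
    # Hypothetical counts: 300 of 10,000 convert on A, 360 of 10,000 on B.
    lift, p_value, ci = two_proportion_test(300, 10_000, 360, 10_000)
    print(f"lift={lift:.4f}  p={p_value:.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Reading the interval rather than the p-value alone shows how small the true lift might plausibly be, which is what decides whether the change is worth shipping.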

Bayesian Testing

Bayesian A/B testing takes a different approach. Instead of asking "what is the probability of seeing this result if there is no difference?", it asks "given the data I have, what is the probability that B is better than A?"

Bayesian methods:

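Are more tolerant of continuous monitoring than a fixed-sample frequentist test (see Mistakes section), although stopping the moment a probability threshold is crossed can still inflate error rates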
Express results in intuitive terms: "There is an 87% probability that the variant is better"
Incorporate prior beliefs (though priors are debated)
Are preferred by teams that need to make fast decisions and cannot wait for fixed sample sizes
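A statement such as "87% probability that the variant is better" can be produced with a simple Beta-Binomial model. The sketch below assumes a flat Beta(1, 1) prior on each conversion rate and estimates the probability by Monte Carlo sampling; the function name, the prior, and the number of draws are illustrative choices, not how any particular tool computes it.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 300 of 10,000 convert on A, 360 of 10,000 on B.
    print(prob_b_beats_a(300, 10_000, 360, 10_000))
```

A different prior would shift the answer, most noticeably when the sample is small, which is one reason priors are debated.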

Which to use: Frequentist is the standard and easiest to communicate to stakeholders. Bayesian is better for teams running many tests simultaneously and who need to iterate quickly. Tools like Optimizely, Statsig, and AB Tasty support both approaches.

Statistical Power and Sample Size

Statistical power is the probability of detecting a real effect if one exists. Standard power is set at 80% — meaning your test has a 20% chance of missing a real improvement (a false negative).

Minimum Detectable Effect (MDE): The smallest lift worth detecting. Setting MDE too small requires enormous samples. Setting it too large means you'll miss real but modest improvements.

Required sample size depends on:

Your current baseline conversion rate
Your MDE
Your significance level (usually 0.05)
Your statistical power (usually 0.80)

Use Evan Miller's sample size calculator or Optimizely's sample size tool before starting any test. This step is skipped by most teams and causes most failed tests.

Rough guide:

Baseline rate	MDE (relative)	Visitors needed per variant (two-sided test, p < 0.05, 80% power)
3%	20% lift	~13,900
3%	10% lift	~53,200
10%	20% lift	~3,800
10%	10% lift	~14,700
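For readers who want to see where such numbers come from, here is a minimal sketch of the usual normal-approximation formula for comparing two proportions. It assumes a two-sided test and equal group sizes. The function name is illustrative, and the result can differ noticeably from a table or calculator that makes other assumptions (one-sided tests, a different variance formula, sequential designs).

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_mde: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


if __name__ == "__main__":
    for baseline, mde in [(0.03, 0.20), (0.03, 0.10), (0.10, 0.20), (0.10, 0.10)]:
        print(f"{baseline:.0%} baseline, {mde:.0%} relative lift: "
              f"{visitors_per_variant(baseline, mde):,} per variant")
```

The denominator is the squared absolute difference, which is why halving the MDE roughly quadruples the required sample.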

A/B Testing Implementation: Client-Side vs. Server-Side

This is the technical decision most guides ignore. It matters significantly.

Client-Side Testing

The testing tool loads a JavaScript snippet in the browser. When a user lands on a page, the script fires, determines which variant to show, and modifies the page in real time.

Advantages:

No engineering required for most tests
Marketers and designers can create tests using visual editors
Fast iteration — tests can be live within hours

Disadvantages:

Flash of original content (FOOC/flicker): The original page renders briefly before the variant is applied. Users may notice the change, which introduces bias and degrades user experience.
Performance impact: The JavaScript tag adds page load time, and slower pages tend to convert worse.
Limited to front-end changes: Cannot test back-end logic, pricing logic, or algorithm changes
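Apple ITP impact: Apple's Intelligent Tracking Prevention (ITP) in Safari blocks third-party cookies and caps the lifetime of first-party cookies written by JavaScript (typically at 7 days). Returning visitors can lose their stored assignment and be re-bucketed, which corrupts user assignment and can produce Sample Ratio Mismatch in client-side tools. This affects a significant percentage of traffic on many sites.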

Best for: Landing page copy, headlines, CTA buttons, layout changes, image tests, email opt-in forms.

Server-Side Testing

The variant assignment happens on the server before the page is sent to the browser. The user always receives a fully rendered page with no flicker.
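One common way to make server-side assignment both random-looking and stable is to hash the user ID together with the experiment name. The sketch below is an assumption about how such a scheme might look, not the method of any specific tool; assign_variant, the experiment name, and the user IDs are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "variant")) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # The same user always gets the same answer, with no cookie or database lookup.
    print(assign_variant("user-42", "pricing-test"))
    print(assign_variant("user-42", "pricing-test"))
```

Because the assignment is a pure function of the inputs, it survives cleared cookies as long as the server can still identify the user.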

Advantages:

No flicker — cleaner user experience and no bias from loading artefacts
Can test back-end changes: pricing algorithms, recommendation systems, checkout logic, API responses
Largely unaffected by browser cookie restrictions or Apple ITP, provided assignment is keyed on a server-set cookie or a logged-in user ID
More reliable user assignment and tracking

Disadvantages:

Requires engineering involvement for every test
Slower iteration — tests take longer to set up
Higher technical complexity

Best for: Product feature tests, algorithm changes, checkout flow experiments, pricing tests, personalisation experiments.

Hybrid Approach

Most mature experimentation teams use both. Client-side for marketing and copy tests (fast, no engineering dependency). Server-side for product and algorithmic tests (reliable, unrestricted).

The A/B Testing Process: 7 Steps

Step 1: Secure organisational buy-in

A/B testing fails at the organisational level as often as the statistical level. If leaders override test results because they disagree with the outcome, or if teams run tests in silos without a shared methodology, you will produce noise, not insight.

Build a culture of experimentation: every significant product or marketing decision should be tested, results should be documented and shared, and everyone — including leadership — should commit to shipping what the data says, not what they prefer.

This requires explicit buy-in before the first test runs. Present the business case (compounding conversion improvements, reduced risk on big changes) and agree on the process (who can declare a winner, what happens with inconclusive tests, how results are shared).

Step 2: Measure baseline performance

Before testing anything, establish accurate baselines. Install proper analytics (GA4, Plausible, or PostHog), set up conversion goal tracking, and collect at least 2–4 weeks of baseline data.

Know these numbers for every page you plan to test:

Current conversion rate
Monthly visitor count
Conversion event definition (exactly what counts as a conversion)

Step 3: Research and identify what to test

Use data, not opinion, to choose what to test.

Quantitative signals: Analytics funnel reports showing where users drop off. Pages with high traffic but lower-than-expected conversion. Steps in checkout or signup that have high exit rates.

Qualitative signals: Session recordings (Microsoft Clarity, Hotjar) showing where users get confused or rage-click. Heatmaps showing which content gets attention. User interviews surfacing objections and confusion. Support tickets revealing friction points.

The goal is to find where users are failing — not where you think they might be failing.

Step 4: Form a specific hypothesis

A hypothesis is not "let's try a different headline." It is a testable, falsifiable statement:

"If we change [X], we expect [metric] to [increase/decrease] because [reasoning based on data]."

Example:

"If we change the CTA button copy from 'Get started' to 'Start my free 14-day trial', we expect the signup conversion rate to increase because the new copy explicitly communicates the risk-free nature of the offer, which removes a purchase objection we identified in exit survey responses."

The "because" clause is what makes results interpretable. If the variant wins but you don't know why, you cannot build on the result.

Step 5: Prioritise using the PIE Framework

When you have more hypotheses than capacity to test, use the PIE Framework to prioritise:

P — Potential: How much improvement is possible here? Pages with low conversion rates have more room to improve than those already performing well.
I — Importance: How much traffic or revenue does this page/element affect? A 10% improvement on a page with 50,000 monthly visitors matters more than on one with 500.
E — Ease: How difficult is this test to implement? Technical complexity, resource requirements, and dependencies all reduce ease.

Score each dimension 1–10 and average the scores. Test highest-PIE items first.
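The scoring rule is simple enough to keep in a spreadsheet, but a short script makes the ranking explicit. This is a minimal sketch; the Hypothesis class and the backlog entries, including their scores, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    potential: int   # 1-10: how much room for improvement
    importance: int  # 1-10: how much traffic or revenue is affected
    ease: int        # 1-10: how easy the test is to implement

    @property
    def pie_score(self) -> float:
        """Average of the three dimensions."""
        return (self.potential + self.importance + self.ease) / 3


def prioritise(backlog: list[Hypothesis]) -> list[Hypothesis]:
    """Return hypotheses sorted from highest to lowest PIE score."""
    return sorted(backlog, key=lambda h: h.pie_score, reverse=True)


# Hypothetical backlog; the scores are invented for the example.
backlog = [
    Hypothesis("Rewrite pricing page headline", potential=8, importance=9, ease=7),
    Hypothesis("Reorder footer links", potential=2, importance=3, ease=9),
    Hypothesis("Shorten signup form", potential=7, importance=8, ease=5),
]

for h in prioritise(backlog):
    print(f"{h.pie_score:.1f}  {h.name}")
```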

Step 6: Calculate sample size, then run the test

Before launching, calculate your required sample size (see statistics section above). Split traffic randomly and equally. Use your testing tool's built-in randomisation — never manually alternate users between versions.

Run the test for the duration required to reach your sample size. At minimum, run every test for at least one full week (all seven days) to capture day-of-week variation in user behaviour: traffic and conversion patterns on Monday are systematically different from those on Sunday.

Step 7: Analyse results, then document and iterate

When the test ends:

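Check sample ratio mismatch (SRM): Were the groups actually the sizes your split should produce? Judge the imbalance with a chi-squared test rather than by eye, because the same percentage gap can be harmless on a few hundred users and damning on a few hundred thousand. A statistically significant imbalance indicates a tracking or randomisation bug and invalidates results.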
Check statistical significance: Is p < 0.05? What is the confidence interval for the effect?
Segment results: Break down by device type, traffic source, new vs. returning visitors. A variant that wins overall may lose for mobile users — shipping it would hurt mobile conversions.
Check secondary metrics: Did the variant increase signups but decrease trial-to-paid conversion? A local improvement that creates a downstream problem is not a real win.
Document everything: The hypothesis, the result, the confidence interval, the segment breakdowns, and the conclusion. Maintain a shared test log. This library of knowledge is a durable competitive asset.

What to A/B Test: Priority Order

Tier 1 — Test these first

Value proposition and headline. Your homepage headline communicates your core value proposition. It determines whether a visitor continues reading or leaves. Even a 10% improvement in headline engagement cascades through your entire funnel. Test benefit-led vs. feature-led vs. outcome-led headlines. Test specificity ("Save 3 hours per week on reporting" vs. "Save time on reporting").

Call to action (CTA) copy and placement. "Get started" vs. "Start my free trial" vs. "Try it free — no credit card" can produce dramatically different click rates because each signals a different level of commitment. Test copy first, then colour (only if it improves contrast and visibility, not just aesthetics), then placement.

Pricing page structure. The pricing page is your highest-intent page. Test: number of plans (2 vs. 3), default billing period (monthly vs. annual), presence of a "most popular" badge, money-back guarantee copy, and plan feature emphasis.

Signup and checkout form. Each additional form field reduces completion rate by approximately 5–10%. Test removing optional fields, social login vs. email, single-step vs. multi-step flows, and field order.

Tier 2 — Test after Tier 1

Email subject lines. High-leverage, fast to test, low implementation cost. Even a 5-percentage-point improvement in open rate on a 10,000-subscriber list produces 500 more readers per send, on every future email that applies the same lesson.

Social proof placement and type. Moving testimonials directly beneath the CTA button typically outperforms testimonials lower on the page. Test quote testimonials vs. logo bars vs. case study summaries vs. star ratings.

Onboarding flow steps. The path from signup to first value delivery. Test removing steps, reordering steps, adding or removing a progress bar, and changing the first action you ask new users to take.

Landing page layout. Video vs. image vs. no media in the hero. One-column vs. two-column layout. Long-form vs. short-form pages for high-consideration products.

Tier 3 — Low priority

Button colour (unless accessibility is genuinely compromised), font size, footer content, navigation order, stock photos vs. illustration. These rarely move the needle compared to positioning and copy changes.

Real A/B Test Examples with Results

Button copy: "Your" vs. "My" — 90% CTR lift

Unbounce tested "Start your free 30-day trial" against "Start my free 30-day trial." Changing a single word increased click-through rate by 90%. The word "my" creates psychological ownership before the user commits — they mentally experience having the product before clicking.

Lesson: Micro-copy changes on high-intent elements can produce outsized results. Test pronouns, action verbs, and specificity before testing visual changes.

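Homepage hero: real person vs. feature layout, 102.5% more signups

Highrise (the CRM built by 37signals, the company behind Basecamp) tested 8 homepage variants. The winner used a photo of a real, smiling person with a brief testimonial quote, replacing a feature-led layout entirely. Signups increased by 102.5%.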

Lesson: Social proof from a real human face next to the primary CTA consistently outperforms feature lists. Trust beats specification for most B2C and SMB products.

Free trial length — shorter wins

Multiple SaaS companies have published results showing that 14-day trials outperform 30-day trials. Shorter trials attract more motivated users, create urgency, and require less ongoing nurturing. Users who need 30 days to decide rarely convert.

Lesson: Counterintuitive results are common. "More" (more time, more features, more copy) frequently loses to "less." Test the assumption before you commit to it.

Exit-intent popup: specific offers recover abandoning visitors

Adding an exit-intent popup with a specific offer (free consultation, extended trial, discount code) is often reported to recover 10–15% of users who were about to leave. The key is specificity: "Get 20% off today only" outperforms "Don't leave yet!"

Lesson: Recover abandoning users before investing in acquiring more of them. Exit-intent tests are low effort and high return.

The 7 Most Common A/B Testing Mistakes

1. Peeking (stopping the test early)

You check results after 3 days. The variant is winning at 91% confidence. You stop the test and ship the variant. Six months later you notice the metric has reverted.

This is the peeking problem — the most common and most consequential A/B testing mistake. The p-value fluctuates wildly before you reach the required sample size. Stopping when it crosses a threshold exploits this volatility and produces false positives at a rate far higher than 5%.

Fix: Calculate your required sample size before starting. Do not analyse results until you have reached it. If you must monitor, use sequential testing methods designed for continuous monitoring (the mSPRT, or Bayesian methods).
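The inflation is easy to see in a simulation. The sketch below runs A/A experiments, where there is no real difference, checks a two-sided z-test after every batch of users, and compares how often a "winner" appears at any look with how often one appears at the planned end. The batch size, number of looks, conversion rate, and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a: int, conv_b: int, n: int, alpha: float = 0.05) -> bool:
    """Two-sided pooled z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(experiments: int = 400, looks: int = 10, users_per_look: int = 200,
                      rate: float = 0.2, seed: int = 1) -> tuple[float, float]:
    """Return (false-positive rate with peeking, false-positive rate at the planned end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed_at_any_look = False
        significant_now = False
        for _ in range(looks):
            # Both groups share the same true rate, so any "winner" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            crossed_at_any_look = crossed_at_any_look or significant_now
        peeking_hits += crossed_at_any_look
        fixed_hits += significant_now  # result at the final, planned look only
    return peeking_hits / experiments, fixed_hits / experiments


if __name__ == "__main__":
    peeking, fixed = simulate_aa_tests()
    print(f"stop at first significant look: {peeking:.1%}   planned end only: {fixed:.1%}")
```

The any-look rate can never be lower than the planned-end rate, and with ten looks it typically lands well above the nominal 5%.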

2. Changing too many things at once

Redesigning the entire page layout, headline, CTA, image, and social proof simultaneously produces a result you cannot interpret. If the variant wins, you don't know which change drove it. If it loses, you don't know what to fix.

Fix: One variable per test. This is the foundational discipline of A/B testing.

3. Ignoring Sample Ratio Mismatch (SRM)

SRM occurs when the two test groups are not the sizes they should be — for example, 45% in group A and 55% in group B instead of 50/50. This indicates a bug in randomisation, tracking, or attribution. All results from a test with SRM are invalid.

Fix: Before declaring any result, verify the group sizes match the expected split. Most professional tools flag SRM automatically; check manually if yours does not.
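A manual SRM check is a chi-squared goodness-of-fit test on the group counts. The sketch below handles the two-group case with the standard library only; the function name is illustrative, and the 0.001 alert threshold mentioned afterwards is a common convention rather than a rule from this guide.

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-group split (1 degree of freedom)."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With one degree of freedom, chi2 is the square of a standard normal variable.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


if __name__ == "__main__":
    # The 45% / 55% example from the text, on 10,000 users.
    print(srm_p_value(4_500, 5_500))
    # A small imbalance that is well within chance on 10,000 users.
    print(srm_p_value(5_050, 4_950))
```

A very small p-value (many teams alert below 0.001) means the split is unlikely to be chance, so the assignment or tracking should be investigated before reading any results.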

4. Testing on insufficient traffic

A page with 200 monthly visitors cannot produce a statistically valid result in any reasonable timeframe for typical conversion metrics. Testing it produces noise. Observed differences will swing widely from week to week (because small samples are volatile), and the rare result that does cross the significance threshold will usually overstate the true effect.

Fix: Use a sample size calculator before starting. For low-traffic situations, pursue qualitative research (user interviews, session recordings, usability testing) rather than quantitative testing.

5. Ignoring Apple ITP (Intelligent Tracking Prevention)

Apple's ITP in Safari blocks third-party cookies outright and caps first-party cookies written by JavaScript at 7 days (less in some cases), which is how most client-side testing tools remember a visitor's assignment. On sites where 30–40% of traffic comes from Safari (common for consumer products and B2C), ITP can corrupt user assignment, create SRM, and produce misleading results.

Fix: Where possible, store the assignment in a first-party cookie set by your server rather than by script. Switch to server-side testing for experiments that run longer than a week. Audit your testing tool's ITP documentation and apply their recommended mitigations.

6. The multiple comparisons problem

If you run 20 A/B tests simultaneously and use p < 0.05 as your threshold, you expect approximately one false positive by chance even if none of the variants has any real effect (20 × 0.05 = 1): a test that appears to win but isn't actually better. The more tests you run simultaneously, the more false positives you accumulate.

Fix: Apply a Bonferroni correction when running multiple simultaneous tests on the same traffic. Or use a structured experimentation programme that limits simultaneous tests on the same user segment.
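The numbers behind this problem, and the Bonferroni fix, fit in three one-line functions. The sketch assumes the tests are independent and that no variant has a real effect; the function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of false positives when every null hypothesis is true."""
    return num_tests * alpha


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    m = 20
    print(expected_false_positives(m))                    # average false positives
    print(familywise_error_rate(m))                       # chance of at least one
    print(bonferroni_alpha(m))                            # corrected per-test threshold
    print(familywise_error_rate(m, bonferroni_alpha(m)))  # chance of at least one after correction
```

Bonferroni is deliberately conservative: it controls false positives at the cost of power, so each test needs a larger sample to detect the same lift.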

7. Shipping without segmenting results

A variant that wins overall may lose for mobile users, new visitors, or users from a specific traffic source. Shipping a global change that helps desktop users but hurts mobile users (who are often the majority of traffic) produces a net negative result despite the apparent "win."

Fix: Before shipping any winning variant, segment results by device type, traffic source, and new vs. returning users. Ship only if the variant wins across all critical segments, or ship conditionally (e.g., only for desktop) if the segment results differ.

A/B Testing Tools in 2026

Web and landing page testing

Optimizely — Enterprise standard. Supports both client-side and server-side, frequentist and Bayesian statistics, and full-stack feature experimentation. Expensive; best for large teams running many tests.

VWO (Visual Website Optimizer) — Strong mid-market option. Includes heatmaps, session recordings, and A/B testing in one platform. Better value than Optimizely for teams that don't need enterprise-scale infrastructure.

AB Tasty — European alternative with strong personalisation features alongside experimentation. Good privacy compliance track record.

Unbounce — Landing page builder with native A/B testing. Best for teams managing many landing page variants in a no-code environment.

Convert.com — Privacy-first A/B testing tool with GDPR compliance built in and strong ITP mitigation. Good mid-market option for European companies.

Product and feature flag experimentation

LaunchDarkly — Feature flag system with built-in experimentation. Allows server-side A/B tests with no page flicker. Widely used by engineering teams. Relatively expensive.

Statsig — Built by former Meta engineers. Strong statistical methodology (CUPED variance reduction), good developer experience, increasingly popular with engineering-led companies.

Split.io — Developer-focused feature flag and experimentation platform. Strong in enterprise engineering contexts.

Amplitude Experiment — Best for teams already using Amplitude for analytics. Native integration means no data pipeline to maintain.

GrowthBook — Open-source A/B testing platform. Free to self-host. Good for teams that want control and have engineering resources to manage infrastructure.

Email testing

Klaviyo — Best for ecommerce email experimentation. Tests revenue-based metrics, not just opens and clicks. Supports multivariate subject line testing.

Mailchimp — Built-in A/B testing for subject lines, send times, and content. Sufficient for basic email tests.

ConvertKit / Kit — A/B subject line testing on all paid plans. Clean interface, easy to use.

Research tools (to find what to test)

Microsoft Clarity — Free heatmaps, session recordings, and rage-click detection. Essential for identifying test opportunities without spending anything.

Hotjar — Heatmaps, session recordings, and on-site surveys. Free tier is sufficient for most early-stage teams.

Multi-Armed Bandit Testing: When to Use It

A multi-armed bandit is an alternative to A/B testing that continuously reallocates traffic toward better-performing variants in real time, rather than waiting for a fixed test to conclude.
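One widely used bandit strategy is Thompson sampling. The sketch below simulates it for conversion-style (yes/no) outcomes with flat Beta(1, 1) priors; the true conversion rates, visitor count, and seed are invented for the simulation, and production bandits usually add safeguards this example leaves out.

```python
import random


def thompson_sampling(true_rates: list[float], visitors: int = 5_000, seed: int = 0) -> list[int]:
    """Simulate Beta-Bernoulli Thompson sampling; return visitors served per variant."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible rate for each variant from its posterior, then serve the best draw.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        # Simulate whether this visitor converts on the variant they were shown.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


if __name__ == "__main__":
    # Hypothetical true rates: variant 0 converts at 4%, variant 1 at 12%.
    print(thompson_sampling([0.04, 0.12]))
```

Because traffic drifts toward the apparent leader, the weaker variant ends up with a small sample, which is why the final estimate of its conversion rate is much less precise than in a fixed 50/50 test.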

When bandit testing outperforms A/B testing:

Short-lived campaigns where you want to maximise conversions during the test period (e.g., a 2-week promotional landing page)
When testing many variants simultaneously (3+)
When the cost of showing the losing variant is high (revenue-per-visitor tests)

When bandit testing is worse than A/B testing:

When you want a clean, interpretable result to learn from and build on
When seasonal or weekly patterns affect results (bandits converge before accounting for these cycles)
When you need to document the result for organisational learning

Most product and marketing teams should use A/B testing for their experimentation programme and reserve bandit testing for short, high-stakes campaigns.

Building an Experimentation Programme

Running one A/B test is useful. Running a structured experimentation programme compounds over time into a genuine competitive advantage.

What a mature programme looks like

A shared test log documenting every experiment: hypothesis, result, confidence interval, segments, conclusion
A backlog of prioritised hypotheses (PIE-scored)
A clear owner for experimentation who ensures statistical rigour
A cadence: 2–4 tests per month on high-traffic properties
A review process: results are shared across product, marketing, and engineering teams
A norm: no significant product or marketing decision ships without a test plan

The organisational models for experimentation

Centralised: One experimentation team owns all tests company-wide. Ensures statistical rigour and consistent methodology. Slower — bottleneck on the central team.

Decentralised: Individual product and marketing teams run their own tests. Faster iteration. Risk of inconsistent methodology, invalid tests, and local optimisation that creates conflicts.

Hybrid (most common): A central experimentation team sets methodology, tools, and standards, and reviews results. Individual teams propose and run tests with support from the central team. This balances speed with rigour.

Where to start if you are early-stage

Run an A/A test on your highest-traffic page to verify your tracking is accurate
Identify your single highest-traffic, lowest-converting page
Run 3 user interviews to find the biggest objection or confusion point
Write a hypothesis that addresses that specific objection
Calculate your sample size
Run the test
Document the result — win or lose

Then repeat. The programme grows from there.

Key A/B Testing Terms: Complete Glossary

A/B test: A controlled experiment comparing two versions of a page, email, or product element to determine which performs better on a defined metric.

Control (A): The existing version. The baseline against which the variant is compared.

Variant (B): The modified version being tested.

Hypothesis: A specific, testable prediction in the form: "If we change X, we expect Y to increase because Z."

Conversion rate: The percentage of users who complete the desired action.

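Statistical significance: A result is statistically significant when its p-value falls below a pre-set threshold, typically 0.05 (often described as 95% confidence). It means a difference this large would be unlikely if the two versions were truly identical; it is not the probability that the variant is better.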

p-value: The probability of seeing the observed result (or more extreme) if the two versions were truly identical. p < 0.05 is the standard threshold for declaring a result significant.

Confidence interval: A range of effect sizes compatible with the observed data. A 95% CI of [+3%, +12%] comes from a procedure that captures the true lift in 95% of repeated experiments; in practice, read it as the plausible range for the lift.

Minimum Detectable Effect (MDE): The smallest lift worth detecting. Determines the required sample size.

Statistical power: The probability of detecting a real effect if one exists. Standard is 80%.

Sample Ratio Mismatch (SRM): When test groups are not the expected size, indicating a randomisation or tracking bug. Invalidates all results.

Frequentist testing: The standard p-value approach. Results are valid only once the predetermined sample size is reached.

Bayesian testing: An alternative approach that expresses results as probability statements ("87% chance the variant is better") and supports continuous monitoring.

Novelty effect: A temporary lift caused by users paying more attention to something new. Fades over time. Especially common in redesign tests.

Peeking: Checking results repeatedly and stopping the test as soon as they look significant. Invalidates statistical guarantees.

A/A test: Running an experiment with identical variants to verify that the testing tool's randomisation and tracking are working correctly.

Multi-armed bandit: A dynamic traffic allocation method that continuously shifts traffic toward better-performing variants. Best for short campaigns; worse for learning.

PIE Framework: A prioritisation framework scoring test candidates on Potential, Importance, and Ease.

Client-side testing: Variant delivery via JavaScript in the browser. Easy to implement; subject to flicker and ITP limitations.

Server-side testing: Variant delivery from the server before the page renders. No flicker; no ITP issues; requires engineering support.

Apple ITP (Intelligent Tracking Prevention): Apple's Safari browser feature that blocks third-party cookies, caps the lifetime of script-written first-party cookies, and blocks certain tracking mechanisms, affecting client-side testing tools.

FOOC / Flicker: Flash of Original Content — when a user briefly sees the control version before the variant is applied by a client-side testing tool.

CUPED (Controlled-experiment Using Pre-Experiment Data): A variance reduction technique that uses pre-experiment data to reduce noise and reach significance faster. Used by Statsig and other advanced tools.
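The CUPED entry is easier to follow with numbers. The sketch below applies the standard adjustment, subtracting theta × (pre-experiment value − its mean) from each user's metric, to synthetic data. The data, seed, and function name are invented for illustration, and this is not a description of how Statsig or any other tool implements it.

```python
import random
from statistics import mean, variance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Synthetic data: each user's in-experiment spend correlates with their pre-experiment spend.
rng = random.Random(42)
pre = [rng.gauss(50, 10) for _ in range(2_000)]
post = [0.8 * x + rng.gauss(5, 5) for x in pre]
adjusted = cuped_adjust(post, pre)

print(f"mean before/after adjustment: {mean(post):.2f} / {mean(adjusted):.2f}")
print(f"variance before/after adjustment: {variance(post):.1f} / {variance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged but removes the part of the variance explained by pre-experiment behaviour, so the same lift can be detected with fewer users. In a real experiment theta is estimated once on both groups combined and then applied to each.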

Frequently Asked Questions

How long should an A/B test run?

Until it reaches the pre-calculated sample size. At a minimum, run every test for at least one full week (all seven days) to account for day-of-week variation. Never stop because the variant appears to be winning: this is the peeking problem.

What is a good A/B test sample size?

Depends on your baseline conversion rate and the lift you want to detect. Use a sample size calculator. As a rough baseline: detecting a 20% relative lift from a 3% conversion rate with a two-sided test at p < 0.05 and 80% power requires approximately 13,900 visitors per variant (about 27,800 total).

Can I run multiple A/B tests at the same time?

Yes, if they test different pages or affect different user groups. Running two tests on the same page for the same users risks interference: if the changes interact, you cannot cleanly attribute an observed effect to either one.

What should I do with an inconclusive test?

An inconclusive result means you don't have sufficient evidence to prefer either version. Do not ship the variant on the strength of the test (no evidence it is better), and do not conclude the idea is harmful (no evidence it is worse). Form a new hypothesis and test something different. Inconclusive tests are not wasted: they remove a hypothesis from your backlog.

What is the difference between A/B testing and split testing?

They mean the same thing. "Split testing" is more common in email marketing contexts. "A/B testing" is more common in product and web optimisation. Both describe the same methodology: random assignment of users to two versions with comparison of a pre-defined metric.

Do I need a dedicated tool, or can I do this manually?

You need a proper tool for any test on a website or product. Manual implementation — showing version A on odd hours and version B on even hours, for example — introduces systematic bias that invalidates results. For email, most email service providers have built-in A/B testing that is sufficient.

What conversion rate lift should I aim to detect?

A 10–20% relative lift is a realistic target for most changes. Testing for smaller lifts requires very large sample sizes; testing only for larger lifts means you detect dramatic changes but miss real, modest improvements. For context: a change that lifts conversion from 3.0% to 3.3% is a 10% relative lift and would require approximately 53,000 visitors per variant to detect with a two-sided test at p < 0.05 and 80% power.

A/B testing done correctly is a compounding asset. Each test either produces a winner that improves conversion permanently, or eliminates a hypothesis and sharpens your understanding of what your users actually respond to.

The teams that build the largest advantages are not the ones who run the most tests — they are the ones who run tests correctly, document everything, and use each result to form a sharper hypothesis for the next one. Start with sample size calculation. Never peek. Segment before shipping. Document every result.

The testing infrastructure you build today will be worth more than any individual test result.
