Attribution data lies. It's not malicious; it's structurally biased. Last-click attribution gives all credit to the last touch. Multi-touch attribution distributes credit, but the credit allocation rests on assumptions that are often wrong. Both leave you with the same problem: you can't tell whether a channel is actually driving incremental revenue or just claiming credit for revenue you'd have earned anyway.
Incrementality testing fixes this by running a controlled experiment. Geo-holdout tests are the most accessible way to do it for Shopify brands without a dedicated data team. This guide walks through the methodology we use for client accounts.
What incrementality actually measures
A simple thought experiment: if you turned off your Meta retargeting campaign tomorrow, how much revenue would you actually lose?
Attribution data says you'd lose all the revenue Meta retargeting is currently credited with. Reality usually says you'd lose 20-40% of it — the rest were customers who would have bought through email, organic, or direct anyway.
The gap between "attributed revenue" and "incremental revenue" is what incrementality testing measures. The gap is often huge.
Why geo-holdouts work
A geo-holdout test pauses ads in some geographic regions while running them in matched regions. By comparing revenue in test versus holdout regions over a fixed period, you can isolate the channel's actual incremental impact — without needing complex statistical models.
The geo-test method works because:
- Geography is independent of marketing decisions
- Matched markets have similar baseline conversion behavior
- The intervention (ads on/off by geo) is clean and easy to execute
- The results are interpretable without statistical training
The methodology isn't perfect, but for the cost ($10-20K of test spend) and complexity (low), it gives Shopify operators a useful read on channel impact.
Setting up a geo-holdout test
Step 1: Pick the channel to test
Test one channel at a time. Pick the channel whose incremental impact you most doubt. Common starting points: Meta retargeting, branded search, or a recently added channel like AppLovin or Pinterest.
Step 2: Match your markets
Identify pairs of US states (or DMAs) with similar:
- Conversion rates from your store data
- Demographic profiles
- Baseline daily order volume
- Your existing customer mix
Common matched pairs we've used:
- Texas / Florida (large, similar buying patterns)
- Ohio / Pennsylvania (similar Midwest demos)
- Washington / Oregon (similar West Coast demos)
- North Carolina / Georgia (similar Southeast demos)
For smaller test budgets, you might match smaller states or DMAs (designated market areas — Nielsen-defined regions).
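If you want to pressure-test a candidate pairing against your own data, one rough approach is to compare each state's baseline profile from an order export. A minimal sketch in pandas, assuming an orders.csv export with created_at, total_price, and shipping_state columns (Shopify's actual export headers differ, so rename to match):

```python
import pandas as pd

# Load an order export; column names here are illustrative.
orders = pd.read_csv("orders.csv", parse_dates=["created_at"])

# Restrict to a baseline window, e.g. the 8 weeks before the planned test.
baseline = orders[orders["created_at"].between("2024-01-01", "2024-02-26")]

# Per-state baseline profile: average daily revenue, daily orders, and AOV.
daily = (baseline
         .groupby(["shipping_state", baseline["created_at"].dt.date])
         .agg(revenue=("total_price", "sum"), orders=("total_price", "count"))
         .groupby("shipping_state")
         .mean())
daily["aov"] = daily["revenue"] / daily["orders"]

def pair_distance(a: str, b: str) -> float:
    """Normalized distance between two states' baseline profiles.
    Lower is better; a crude similarity score, not a formal match."""
    pa, pb = daily.loc[a], daily.loc[b]
    return sum(abs(pa[m] - pb[m]) / max(pa[m], pb[m])
               for m in ["revenue", "orders", "aov"])

# Score the candidate pairs from the list above.
for a, b in [("TX", "FL"), ("OH", "PA"), ("WA", "OR"), ("NC", "GA")]:
    print(a, b, round(pair_distance(a, b), 3))
```

A low score only means the revenue baselines look alike; still sanity-check demographics and customer mix by hand.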
Step 3: Decide test design
Two common designs:
Holdout design. Run ads as normal in one set of states, completely turn them off in matched states. Cleanest signal. Requires real revenue at risk.
Spend-difference design. Reduce spend by 50% or 75% in one set of states, hold it steady in the others. Less revenue at risk, but a weaker signal that's harder to interpret.
For first-time testers, we recommend the full holdout design even though it's more aggressive. The signal is clearer and the test takes less time.
Step 4: Set the test duration
Three to four weeks is the minimum for a meaningful read. Shorter than that and you're mostly reading noise; longer than four weeks and carryover effects and seasonal shifts start to muddy the data.
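A useful gut check on duration: measure how much your weekly revenue naturally swings per state. If the week-to-week noise is larger than the lift you expect to detect, a three-week read will be mostly noise and you'll want longer windows or bigger geos. A rough sketch, same assumed columns as the matching sketch above:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["created_at"])

# Weekly revenue per state over recent history.
weekly = (orders
          .set_index("created_at")
          .groupby("shipping_state")["total_price"]
          .resample("W")
          .sum())

# Coefficient of variation = natural week-to-week noise per state.
stats = weekly.groupby(level="shipping_state").agg(["mean", "std"])
cv = stats["std"] / stats["mean"]
print(cv.sort_values())  # states with low CV give cleaner reads
```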
Step 5: Configure the campaign restrictions
In your ad platform, restrict the test channel's campaigns to run only in your test geos and pull all spend from the holdout geos. Make sure everything else (email, SMS, organic social) keeps running everywhere; you only want the one paid channel turned off.
Step 6: Track outcomes
You're measuring total revenue (not attributed revenue) by geography. In Shopify, segment orders by shipping state during the test period. Compare:
- Total revenue in test geos during test period
- Total revenue in holdout geos during test period
- Both compared to a baseline period (4 weeks before the test)
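Pulled from the same kind of order export, the totals you need reduce to one helper. A minimal sketch; the geo lists and dates are illustrative:

```python
import pandas as pd

TEST_GEOS = ["TX", "OH"]      # ads kept running (illustrative)
HOLDOUT_GEOS = ["FL", "PA"]   # ads paused (illustrative)

orders = pd.read_csv("orders.csv", parse_dates=["created_at"])

def revenue(geos, start, end):
    """Total (not attributed) revenue shipped to the given states in a window."""
    in_window = orders["created_at"].between(start, end)
    in_geo = orders["shipping_state"].isin(geos)
    return orders.loc[in_window & in_geo, "total_price"].sum()

# 4-week baseline, then a 3-week test (dates illustrative). Unequal window
# lengths are fine because both arms share the same windows, so the
# difference-in-growth comparison in the next section still holds.
baseline_test    = revenue(TEST_GEOS,    "2024-03-01", "2024-03-28")
baseline_holdout = revenue(HOLDOUT_GEOS, "2024-03-01", "2024-03-28")
test_test        = revenue(TEST_GEOS,    "2024-03-29", "2024-04-18")
test_holdout     = revenue(HOLDOUT_GEOS, "2024-03-29", "2024-04-18")
```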
Calculating the lift
The simple formula:
Lift = (test-geo revenue during the test / test-geo baseline revenue) - (holdout-geo revenue during the test / holdout-geo baseline revenue)
If revenue in your test geos grew 10% over baseline while your holdout geos grew 2%, the channel's incremental lift is roughly 8 percentage points.
Multiply that 8-point lift by your total revenue for the test period to estimate the dollar value of the channel's incremental contribution, then compare it to what you spent on the channel during the test.
If the channel was credited with $50K of attributed revenue but only $20K of incremental revenue, you've learned something important: the attribution is significantly overstating the channel's true value.
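Fed the four totals from Step 6, the whole calculation is a few lines. The sketch below reproduces the worked numbers from this section; every dollar figure is illustrative:

```python
def incremental_lift(test_rev, test_base, holdout_rev, holdout_base):
    """Difference in growth rates between test and holdout geos,
    as a fraction (0.08 = 8 percentage points)."""
    return test_rev / test_base - holdout_rev / holdout_base

# With the Step 6 totals: incremental_lift(test_test, baseline_test,
#                                          test_holdout, baseline_holdout)
# The worked example above: test geos grew 10%, holdout geos grew 2%.
lift = incremental_lift(110_000, 100_000, 102_000, 100_000)  # -> 0.08

# Dollar value of the channel's incremental contribution over the test period.
total_test_period_revenue = 250_000              # illustrative
incremental = lift * total_test_period_revenue   # -> ~$20K

attributed = 50_000  # what the platform claims, per the example above
print(f"lift {lift:.1%}; incremental ${incremental:,.0f} "
      f"vs attributed ${attributed:,.0f}")
```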
Common test design mistakes
Forgetting non-channel marketing. If your email program is sending different content to different states (rare, but possible), it'll contaminate the test.
Running during a sale or promotion. Either your test or your sale will get muddled. Schedule tests during steady-state periods.
Picking poorly matched geos. California and Mississippi aren't matched markets even if the audience sizes look similar. Make sure baseline conversion rates and demographics line up.
Cutting the test short. Two weeks of data is rarely enough. Hold to the full 3-4 weeks.
Testing multiple channels at once. You won't be able to attribute the lift difference to any specific channel. Test one at a time.
Not planning for the restart. When you turn the test campaigns back on after the holdout period, the algorithm has lost some of its learning. Keep the original campaign settings intact and plan for a 3-7 day re-stabilization period.
What to do with the results
Three common scenarios:
Scenario A: Incremental revenue exceeds attributed revenue. The channel is even more valuable than your dashboard says. Increase spend.
Scenario B: Incremental revenue roughly matches attributed revenue. Your attribution is reasonably accurate for this channel. Continue current spend.
Scenario C: Incremental revenue is significantly below attributed revenue. The channel is over-claiming credit. Consider:
- Reducing spend on this channel and reallocating
- Restructuring the channel (different audience focus, different campaign types)
- Continuing it but discounting the attributed ROAS in your decision-making
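If you want a mechanical starting point for reading results, you can bucket the incremental-to-attributed ratio into the three scenarios. The 20% tolerance band below is an arbitrary default, not a standard:

```python
def read_result(incremental_revenue: float, attributed_revenue: float,
                tolerance: float = 0.2) -> str:
    """Crude bucketing of a test result into the three scenarios above."""
    ratio = incremental_revenue / attributed_revenue
    if ratio > 1 + tolerance:
        return "A: channel is under-credited, consider increasing spend"
    if ratio < 1 - tolerance:
        return "C: channel is over-claiming, reduce, restructure, or discount its ROAS"
    return "B: attribution is roughly accurate, hold current spend"

print(read_result(20_000, 50_000))  # the $20K vs $50K example -> scenario C
```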
Don't immediately kill a channel that fails an incrementality test. Sometimes the test reveals that the channel is good at retention or LTV but not new customer acquisition. That's still a valuable function — just not what you thought it was doing.
A real example
A pet supplement client we worked with had Meta retargeting credited with $80K/month of attributed revenue at 5.2x ROAS. We ran a geo-holdout test pausing Meta retargeting in 5 matched states for 3 weeks.
Result: revenue in the holdout states dropped 3.5% against baseline while the test states held flat, a lift of roughly 3.5 percentage points. Applied to total revenue, the incremental contribution of Meta retargeting came out to roughly $28K/month, not $80K.
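For transparency, the arithmetic behind those numbers; the monthly revenue base is inferred from the reported figures rather than a number we stated directly:

```python
# Holdout dropped 3.5%, test held flat: lift = 0.0 - (-0.035) = 3.5 points.
lift = 0.0 - (-0.035)
incremental_monthly = 28_000                       # reported incremental $/month
implied_revenue_base = incremental_monthly / lift  # ~$800K/month, inferred
```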
Outcome: we reduced retargeting spend by 50%, reallocated to creative testing and new prospecting audiences. Six weeks later, blended ROAS was up 18% with similar total spend. The retargeting hadn't been worthless — but it was eating budget that worked harder elsewhere.
How often to run tests
A quarterly cadence is the minimum we recommend for accounts spending $50K+/month. Rotate the test through your channels:
- Q1: Meta retargeting incrementality
- Q2: Branded search incrementality
- Q3: New-channel incrementality (Pinterest, TikTok, etc.)
- Q4: Top-of-funnel awareness incrementality
After a year, you have a much clearer picture of which channels actually move revenue versus which are decorating your dashboards.
What to do this week
Pick the channel you're most uncertain about. Plan a 3-week geo-holdout test for next month. Identify your matched markets, calculate the test spend exposure, and put it on the calendar.
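"Test spend exposure" is just the spend you pause plus the worst-case revenue you put at risk. A back-of-envelope sketch, all inputs illustrative:

```python
# Rough exposure estimate for a 3-week holdout (all numbers illustrative).
weeks = 3
weekly_channel_spend_in_holdout = 3_000   # paid spend you'll pause
weekly_holdout_revenue = 60_000           # holdout geos' normal weekly revenue
worst_case_lift = 0.10                    # upper bound on what the channel might drive

spend_paused = weekly_channel_spend_in_holdout * weeks
revenue_at_risk = weekly_holdout_revenue * worst_case_lift * weeks
print(f"spend paused: ${spend_paused:,}, "
      f"worst-case revenue at risk: ${revenue_at_risk:,.0f}")
```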
For more on measurement, see our posts on MMM vs MTA vs GA4 attribution, first-party data strategy, and why ROAS is down but revenue is up.