Machine Learning for Marketing Mix Without Fitting Noise

11 min read · 20 June 2025

A $12 million skincare brand I worked with last year built a marketing mix model on 18 months of spend data. The model came back with a confident recommendation: shift 28 percent of paid social spend to paid search and YouTube. The CFO signed off. The CMO ran the reallocation through Q2. By the end of the quarter, contribution margin had dropped 22 percent. Revenue held roughly flat, but the mix of customers acquired had shifted toward higher-cost, lower-LTV cohorts. The team blamed the new creative. They blamed the agency. They blamed the seasonal shift. The actual culprit was the MMM itself.

The 18 months of training data spanned three pixel resets, the iOS 14.5 attribution break, the GA4 migration, and two creative-strategy changes. The model had not modelled the brand's marketing physics. It had modelled the brand's instrumentation chaos and called the result a budget recommendation.

The 18-Month Trap and What It Hides

Recast's write-up on MMM data needs documents the underlying problem with clinical precision. Their technical analysis shows that an MMM trained on under four weeks of data can return error rates indistinguishable from a longer run. The implication cuts the other way too: long historical windows do not automatically produce better models. They produce better models only when the underlying data-generating process has been stable across the window. Cross a pixel reset, an iOS update, a GA4 migration, or a major creative pivot, and the long window becomes a liability, not an asset. At that point the model is fitting noise from regime changes that have nothing to do with media physics.

The skincare brand was not unusual. Most DTC operators under $50 million in revenue who run MMMs feed them whatever data the team can pull together, on the assumption that more data is always better. It is not. The MMM literature is clear: clean stationary data beats long polluted data. A six-month clean window will out-predict an 18-month polluted window for budget allocation decisions. Operators rarely know this because the MMM vendor they use rarely volunteers it. The vendor wants the longest possible training window because that is what the marketing materials say is best practice. The science says otherwise.

The second thing the skincare case hides is the question MMM is not designed to answer. Operators routinely conflate marketing mix modelling with attribution. They are different tools answering different questions. Attribution asks: which channel deserves credit for this conversion? MMM asks: at the next dollar of spend, which channel produces the highest incremental revenue? The first is a backward-looking credit question. The second is a forward-looking allocation question. The Admetrics MMM DTC guide walks through the trade-off: click-based attribution models like last-click are about reporting, MMM is about decision-making, and they can produce contradictory recommendations on the same dataset.

The 22 percent margin loss the skincare brand absorbed was not a freak event. It was the predictable outcome of running a poorly-conditioned MMM in an attribution role. The model said "shift to paid search and YouTube" because, in the polluted data, those channels showed the strongest residual signal once iOS-disrupted Meta data got noisy. That residual signal was an artefact of the regime change, not a real incrementality finding. A geo-holdout test would have caught it before the budget moved. There was no geo-holdout test. The team trusted the model output because it came from a fancy Bayesian regression and the consultancy presenting it had impressive references.

Sellforte's MMM for ecommerce guide makes the same point from the operator side. DTC brands under $50 million typically lack the data depth and the experimentation budget to run MMM in isolation. They need the model paired with continuous geo-experiments to validate its outputs. Without that pairing, MMM is a confidence machine that produces plausibly-shaped channel-spend curves and very expensive mistakes.

Why the Math Doesn't Work: The Regime Change Tax

Run the math on what happened to the skincare brand. The 18-month window included approximately seven months of pre-iOS 14.5 data (where Meta pixel attribution was reliable), four months of post-iOS-14.5 transition (where pixel data was severely degraded but Meta had not yet rolled out Aggregated Event Measurement workarounds), and seven months of post-AEM data (where Meta attribution was partially recovered but on a different methodology). The model was trying to fit a single saturation curve across three different attribution regimes for the same channel.

The curve it produced was a weighted average that did not match any of the three regimes. The pre-iOS regime had higher Meta-attributed conversions for the same spend. The transition regime had lower attributed conversions because Meta was under-counting. The post-AEM regime had medium-attributed conversions on a different methodology. The model could not see the regime change because there was no flag in the input data telling it the methodology had shifted. It treated the under-counting as a saturation effect, concluded that Meta was less effective than it actually was, and recommended shifting spend away.
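A toy calculation makes the averaging problem concrete. The month counts below come from the case above; the attributed-conversion rates are invented for illustration, and the sketch assumes flat monthly spend so that each month carries equal weight.

```python
# Hypothetical attributed conversions per $1,000 of Meta spend in each regime.
# Month counts follow the case above; the rates are illustrative assumptions.
regimes = [
    ("pre-iOS 14.5", 7, 10.0),
    ("transition", 4, 4.0),
    ("post-AEM", 7, 7.0),
]

total_months = sum(months for _, months, _ in regimes)

# With no regime flag in the data, a single fit is pulled toward the
# month-weighted average of the three regimes.
pooled_rate = sum(months * rate for _, months, rate in regimes) / total_months

for name, months, rate in regimes:
    print(f"{name:>13}: {rate:4.1f} per $1k over {months} months")
print(f"{'pooled':>13}: {pooled_rate:4.1f} per $1k")
```

With these made-up rates the pooled figure is 7.5, which describes none of the three regimes the brand actually lived through.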

The reverse happened on paid search. The model saw consistent attribution across the window (Google Ads pixel was less affected by iOS 14.5) and treated paid search as the most reliable performer. In incrementality terms, paid search is mostly capturing demand the brand already had. Shifting more budget to it inflates the channel's reported ROAS without producing real incremental revenue. That is the trap. The model rewarded the channel that was best at taking credit, not the channel that was best at producing incremental customers.

The 22 percent margin loss came from acquiring customers through channels that were attribution-favoured but incrementality-poor. Each new customer cost more on a true-incremental basis even though the model said they were cheaper. By Q3 the team caught the bug, but the damage was done. Re-running the MMM on the previous 12 months of clean post-AEM data produced an entirely different recommendation, and the brand spent another quarter unwinding the reallocation.

The Clean-Window MMM Protocol

The fix is The Clean-Window MMM Protocol. It is a three-phase discipline that pairs a strict data-window standard with a continuous geo-holdout validation cadence. The protocol assumes that no MMM output is decision-grade until two conditions are met: the training window contains at least 24 weeks of data with no regime changes, and the model's recommendation has been validated against a live geo-experiment on the relevant channel. Without both, the MMM is producing confident-shaped curves that should not be trusted with real budget.

The 24-week floor comes from the practical math of MMM stability. With weekly granularity, you need enough variation in spend across channels to identify the saturation curves. Sub-24 weeks tends to produce models where the priors dominate the data. Above 24 weeks tends to be enough variation, provided no regime change has happened inside the window. If a regime change happens at week 18 of a 24-week window, you have a 6-week post-regime sample, which is too small. The window has to be both long enough and clean enough.
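As a minimal sketch, the "long enough and clean enough" test reduces to one check: count the weeks after the latest regime change and compare them against the floor. The function names are hypothetical, and weeks are counted from the start of the window.

```python
def post_regime_weeks(window_weeks, regime_change_week=None):
    """Weeks of data that sit after the latest regime change in the window."""
    if regime_change_week is None:
        return window_weeks  # no regime change: the whole window is usable
    return window_weeks - regime_change_week


def window_is_usable(window_weeks, regime_change_week=None, floor_weeks=24):
    """True only when the post-regime sample meets the 24-week floor."""
    return post_regime_weeks(window_weeks, regime_change_week) >= floor_weeks


print(post_regime_weeks(24, 18))   # the week-18 example from the text
print(window_is_usable(24, 18))    # too short once the break is respected
print(window_is_usable(30))        # 30 clean weeks clears the floor
```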

The geo-holdout layer is where most operators flinch. It feels expensive to deliberately cut spend in five test markets for six weeks. It is not. Haus's material on incrementality testing makes the case that geo-holdouts cost a single-digit percentage of channel spend and produce causal estimates that no MMM can produce alone. The Haus methodology pairs MMM with quarterly geo-experiments, treating the experiments as the truth-source and the MMM as the interpolation engine between them. That pairing is what makes MMM decision-grade.

Haus's analysis of whether Meta is incremental shows what the experiment output looks like in practice. Geo-tests on Meta produce specific incremental-lift numbers that can be compared directly against the MMM's predicted incrementality. If the two numbers agree within roughly 20 percent, the MMM is calibrated. If they diverge, the MMM has either a regime-change problem in the data or a saturation curve set with bad priors. Either way, the geo-test is the audit.
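One way to write the audit down, as a sketch. The 20 percent tolerance is the rule above; measuring the gap relative to the geo-test figure rather than the MMM figure is my assumption, on the grounds that the geo-test is the truth-source. Function names and numbers are illustrative.

```python
def relative_gap(mmm_incremental, geo_incremental):
    """Gap between MMM and geo-test estimates, as a share of the geo-test."""
    if geo_incremental == 0:
        raise ValueError("geo-test estimate is zero; relative gap is undefined")
    return abs(mmm_incremental - geo_incremental) / abs(geo_incremental)


def is_calibrated(mmm_incremental, geo_incremental, tolerance=0.20):
    """True when the two estimates agree within the tolerance."""
    return relative_gap(mmm_incremental, geo_incremental) <= tolerance


# Hypothetical incremental conversions for one channel over the test period.
print(is_calibrated(mmm_incremental=1150, geo_incremental=1000))
print(is_calibrated(mmm_incremental=600, geo_incremental=1000))
```

The first pair is 15 percent apart and passes; the second is 40 percent apart and sends the channel back for a data or curve review.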

The Clean-Window MMM Protocol does not replace existing tooling. Meta's Robyn project is the most-cited open-source MMM in DTC and works fine for the build layer. The protocol changes how you feed Robyn data and how you interpret what comes out. Same model, different discipline.

Phase 1: The Data Window Audit (Days 1-30)

Day 1 of The Clean-Window MMM Protocol is not a model build. It is a data audit. List every regime change in the brand's marketing measurement over the last three years. iOS 14.5 (April 2021). iOS 16 (September 2022). GA4 forced migration (July 2023). Pixel resets from platform changes. Major creative pivots. Major channel additions or removals. Major pricing changes. Each of these is a discontinuity in the data-generating process and each one cuts the usable training window.

Map the discontinuities on a timeline. The intervals between consecutive discontinuities are your candidate clean windows, and the one that matters for the next build is the most recent: the stretch from the last discontinuity to today. If that stretch is 14 months, your maximum clean window is 14 months. If it is six months, your training window is six months and you need to consider whether MMM is the right tool at all (very short windows favour incrementality testing as the primary tool, with MMM as a supplement).

By Day 10, the team has mapped the windows and chosen the training window for the next MMM build. The non-negotiable rule: no data older than the most recent regime change. If iOS 14.5 was a discontinuity for your Meta data, do not include pre-iOS 14.5 Meta data in the training set. If GA4 forced migration was a discontinuity for site data, do not include pre-GA4 data. The discipline is to drop data, not to add adjustment factors. Every adjustment factor is a guess. Dropping data is honest.
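The timeline mapping can be sketched in a few lines. The events are the three dated ones named above; the exact days of the month and the `as_of` date are filled in for illustration, and a real audit would add the brand's own pixel resets, creative pivots, and pricing changes. The sketch reports every gap between consecutive discontinuities and the stretch since the latest one, which is the stretch the Day 10 rule allows you to train on.

```python
from datetime import date

# Dated discontinuities from the audit list; exact days are illustrative.
discontinuities = {
    "iOS 14.5": date(2021, 4, 26),
    "iOS 16": date(2022, 9, 12),
    "GA4 forced migration": date(2023, 7, 1),
}


def gaps_in_weeks(events):
    """Whole weeks between each pair of consecutive discontinuities."""
    ordered = sorted(events.items(), key=lambda item: item[1])
    return [
        (earlier[0], later[0], (later[1] - earlier[1]).days // 7)
        for earlier, later in zip(ordered, ordered[1:])
    ]


def clean_window_weeks(events, as_of):
    """Latest discontinuity on or before `as_of`, and whole weeks since it."""
    past = [d for d in events.values() if d <= as_of]
    if not past:
        raise ValueError("no discontinuity on or before as_of")
    latest = max(past)
    return latest, (as_of - latest).days // 7


for start, end, weeks in gaps_in_weeks(discontinuities):
    print(f"{start} -> {end}: {weeks} weeks")

latest, weeks = clean_window_weeks(discontinuities, as_of=date(2024, 1, 1))
print(f"Train on data after {latest}: {weeks} clean weeks available")
```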

By Day 20, the team has built the cleaned dataset. By Day 30, the dataset has been reviewed by a senior data person who can sense-check the spend and conversion numbers against finance ledgers. This last step catches data-pipeline bugs that would otherwise look like channel performance issues. Pipeline bugs are common, easy to miss, and devastating to MMM output quality. Catch them at the audit stage.

Phase 2: The MMM Build with Robyn (Month 2-3)

Phase 2 is the model build. Most brands in the $1M-$10M range run Robyn rather than building from scratch. Recast's write-up on Robyn walks through the practical mechanics: ridge regression, adstock decay, saturation curves, and the hyperparameter ranges and experiment calibration that play the role of priors. The technical detail matters less than two operational decisions: how to set the priors and how to handle holdout validation.
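For readers who have not seen the mechanics, here is a minimal sketch of two of them: geometric adstock (spend carrying over with a fixed weekly decay) and a Hill-type saturation curve (diminishing returns). These are common textbook forms, not a claim about Robyn's exact parameterisation; the function and parameter names are illustrative.

```python
def geometric_adstock(spend, decay):
    """Carry a fraction `decay` of last week's adstocked spend into this week."""
    carried = 0.0
    adstocked = []
    for weekly_spend in spend:
        carried = weekly_spend + decay * carried
        adstocked.append(carried)
    return adstocked


def hill_saturation(x, half_saturation, shape=1.0):
    """Response in [0, 1); equals 0.5 when x equals `half_saturation`."""
    if x <= 0:
        return 0.0
    return x**shape / (x**shape + half_saturation**shape)


# One burst of spend keeps working, at a decaying rate, in later weeks.
print(geometric_adstock([100, 0, 0], decay=0.5))

# Doubling spend past the half-saturation point buys less than double response.
print(hill_saturation(100, half_saturation=100))
print(hill_saturation(200, half_saturation=100))
```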

Priors are where most MMM builds go wrong. Robyn is tuned through hyperparameter ranges, and left wide with no experiment calibration they are effectively non-informative, which means the model learns everything from the training data. With only 24 weeks of data, that is not enough. Set priors using your incrementality testing results from the previous quarter. If the last Meta geo-test showed a 15 percent incremental lift on a 25 percent spend cut, encode that as a calibration target for the Meta saturation curve. The model now starts from a known causal estimate and refines it with the observed data, rather than discovering the curve from scratch on a 24-week sample.

Holdout validation inside the model is non-negotiable. Reserve the last four to six weeks of the training data as a holdout, fit the model on the remainder, and check the holdout RMSE. If the holdout error is more than 1.3 times the training error, the model is overfitting and should not be trusted for decision-making. Adjust the regularisation parameter and rerun until the gap closes. Operators who skip this step ship overfit models and discover the problem only when the budget call goes bad.
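The holdout check itself is small enough to sketch. This is a generic version that works on any model's predictions, not Robyn output specifically; the function names and the sample numbers are made up for illustration.

```python
import math


def split_holdout(rows, holdout_weeks=4):
    """Reserve the last `holdout_weeks` rows; fit on everything before them."""
    return rows[:-holdout_weeks], rows[-holdout_weeks:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def holdout_check(train_actual, train_pred, hold_actual, hold_pred, max_ratio=1.3):
    """Return (ratio, passed) for holdout RMSE against training RMSE."""
    train_error = rmse(train_actual, train_pred)
    if train_error == 0:
        raise ValueError("training error is zero; the ratio is undefined")
    ratio = rmse(hold_actual, hold_pred) / train_error
    return ratio, ratio <= max_ratio


train_weeks, holdout_weeks = split_holdout(list(range(24)), holdout_weeks=4)
print(len(train_weeks), len(holdout_weeks))

# Hypothetical weekly revenue (in $k) and model predictions.
ratio, passed = holdout_check(
    train_actual=[100, 110, 120, 130], train_pred=[102, 108, 122, 128],
    hold_actual=[140, 150], hold_pred=[137, 153],
)
print(f"holdout/train RMSE ratio: {ratio:.2f}, passed: {passed}")
```

In this example the holdout error is 1.5 times the training error, so the model fails the 1.3 threshold and goes back for more regularisation.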

By the end of Month 3, the team has a calibrated Robyn model with informed priors, a clean 24-week training window, and a validated holdout error. The next step is not to act on the recommendations. It is to test them.

Phase 3: The Quarterly Geo-Holdout Loop (Ongoing)

Phase 3 is the validation cadence. Every quarter, the team runs a geo-holdout test on the channel where the MMM has the strongest reallocation recommendation. If the model says "shift 15 percent more spend to YouTube", the geo-test cuts YouTube spend by 30 percent in five matched markets for six weeks and measures the conversion delta against control markets. The test answers a single question: is the MMM's marginal incrementality estimate for YouTube within 20 percent of the geo-test result?

If yes, the MMM is calibrated for that channel and the reallocation is safe to roll out. If no, the channel's curve in the model is wrong and needs to be re-priored from the test result. Either way, the test is the truth-source. The MMM is never trusted to be right on its own.
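Measuring the conversion delta against control markets can be sketched as a control-scaled before/after comparison: use the control markets' growth to project what the test markets would have done without the cut, then compare with what they actually did. This is a deliberately simple stand-in for the matched-market or synthetic-control methods a testing vendor would use, and every name and number here is hypothetical.

```python
def geo_holdout_delta(test_pre, test_during, control_pre, control_during):
    """Estimate conversions lost to the spend cut in the test markets.

    Returns (lost_conversions, lost_share_of_expected).
    """
    if control_pre <= 0 or test_pre <= 0:
        raise ValueError("pre-period conversions must be positive")
    control_growth = control_during / control_pre
    # Counterfactual: test markets growing at the same rate as control markets.
    expected_test = test_pre * control_growth
    lost = expected_test - test_during
    return lost, lost / expected_test


# Six weeks before vs six weeks during the cut, summed across markets.
lost, lost_share = geo_holdout_delta(
    test_pre=5000, test_during=4750,
    control_pre=8000, control_during=8400,
)
print(f"conversions lost to the cut: {lost:.0f} ({lost_share:.1%} of expected)")
```

The lost-conversions figure is the number to set against the MMM's predicted loss for the same spend cut.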

Quarterly cadence sounds slow, but it matches how often DTC budgets actually get reset. A new geo-test every 90 days, rotating across the channels with the largest pending reallocation recommendations, gives the brand a continuous calibration loop. Over a year, every major channel gets a geo-test refresh. Over two years, the prior-setting discipline produces an MMM that is genuinely useful for budget decisions, not just for board-deck slides.

The cost of the geo-test is the spend cut in the holdout markets. Take a channel that is 10 percent of total spend, cut it by 30 percent in five markets for six weeks of a 13-week quarter, and assume those markets carry about 20 percent of the channel's spend: that works out to roughly 0.3 percent of quarterly media budget. Compared with the 22 percent margin loss the skincare brand took on a bad reallocation, the test is rounding error. Operators who skip it are saving pennies and risking the rent.
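The arithmetic is worth writing out, because one input has to be assumed: the share of the channel's spend that sits in the five test markets. The sketch below uses 20 percent for that share and a 13-week quarter, both of which are my assumptions; the other figures come from the paragraph above.

```python
def geo_test_spend_share(channel_share, market_share, cut, test_weeks,
                         quarter_weeks=13):
    """Spend withheld by the test, as a share of quarterly media budget."""
    return channel_share * market_share * cut * test_weeks / quarter_weeks


share = geo_test_spend_share(
    channel_share=0.10,  # channel is 10 percent of total spend
    market_share=0.20,   # assumed: five markets hold 20 percent of channel spend
    cut=0.30,            # 30 percent spend cut in those markets
    test_weeks=6,
)
print(f"spend withheld: {share:.2%} of quarterly media budget")
```

Under those assumptions the test withholds about 0.28 percent of the quarter's media budget. Larger test markets scale the figure up in proportion.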

From Confidence Theatre to Decision-Grade Allocation

The shift The Clean-Window MMM Protocol produces is not a fancier model. It is a different relationship between the team and the model output. Before, the MMM was a confidence machine that produced channel-shift recommendations the CFO signed off on without challenge. After, the MMM is a calibrated allocation tool whose recommendations are tested against geo-experiments before the budget moves. The model is treated as a hypothesis. The geo-test is the verdict.

The metric that signals success is the gap between MMM-predicted incrementality and geo-test-measured incrementality across all major channels. Track it quarterly. If the gap is above 30 percent on multiple channels, the model is mis-calibrated and Phase 1 has to rerun. If the gap is below 20 percent, the MMM is decision-grade and budget reallocations from it can be rolled out at scale. In between, re-prior the affected channel from its test result and retest before moving large budget. Aim for the 20 percent ceiling within four quarters of starting the protocol.
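A quarterly scorecard for that metric might look like the sketch below. The 20 and 30 percent thresholds come from the paragraph above. Reading "multiple channels" as two or more, and treating everything between the thresholds as a recalibrate-and-retest state, are my assumptions. Channel names and gap values are hypothetical.

```python
def calibration_status(gaps, pass_below=0.20, fail_above=0.30):
    """Classify the model from per-channel MMM-vs-geo-test gaps (as fractions)."""
    failing = [channel for channel, gap in gaps.items() if gap > fail_above]
    if len(failing) >= 2:
        return "rerun Phase 1"
    if all(gap < pass_below for gap in gaps.values()):
        return "decision-grade"
    return "recalibrate and retest"


quarters = {
    "Q1": {"meta": 0.35, "youtube": 0.41, "paid_search": 0.10},
    "Q2": {"meta": 0.25, "youtube": 0.18, "paid_search": 0.12},
    "Q3": {"meta": 0.12, "youtube": 0.18, "paid_search": 0.15},
}

for quarter, gaps in quarters.items():
    print(quarter, calibration_status(gaps))
```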

The brands that lose 20 percent of contribution margin on a bad MMM call are the brands that treat the model's confident-shaped curves as truth. The brands that pair the model with a quarterly geo-test loop are the ones whose allocation calls actually compound. The skincare brand I started with eventually rebuilt their MMM under this protocol. Two years later, they were running quarterly geo-tests on a 28-week clean window with informed priors. The CFO stopped getting blindsided by reallocations. The CMO stopped blaming the agency. The MMM became boring, which, in marketing science, is the highest compliment you can pay a model.
