Dharmateja Priyadarshi Uddandarao, Senior Statistician – Data Scientist, Amazon
Rethinking Incrementality in Causal Experiments
Incrementality experiments are often described as the best way to measure cause and effect. Run a clean experiment, compare the treatment and control groups, and the difference must be the causal effect. In practice, many teams find something disturbing: the lift looks great and the dashboards look confident, but the results fail to replicate or shrink when the change is scaled up. The experiment itself is usually not the problem. The problem is what happened before the experiment started.
Most causal workflows put a lot of work into balancing user traits like demographics, past engagement, device type, and geography, while quietly assuming that the outcome of interest will balance itself as a result. That assumption is often wrong.
Unbalanced outcomes, not unbalanced covariates, are what actually break incrementality.
A Quiet Source of Bias
Think about an online store that is testing a personalized recommendation widget. Users in the treatment group see a smarter ranking of products, while users in the control group see the normal layout. On paper, the assignment looks random, and covariates like browsing history, session frequency, and device type are well balanced. The baseline data, though, shows a pattern: users in the treatment group were already spending more before the experiment started.
This can’t be fixed completely by modeling after the fact. When baseline revenue differs meaningfully between the groups, the experiment is not just measuring pure incrementality; it is also measuring pre-existing demand.
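To make the mechanism concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the data-generating process, the effect size of 2.0, and the variable names are hypothetical, not taken from a real experiment). It shows that when assignment leans toward high-demand users, the naive difference in post-period means is roughly the true effect plus the gap that already existed at baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process; all numbers are illustrative assumptions.
demand = rng.normal(0.0, 1.0, n)                  # latent demand per user
y_pre = 50 + 10 * demand + rng.normal(0, 5, n)    # pre-period revenue

# Assignment leans toward high-demand users instead of being a fair coin flip.
p_treat = 1 / (1 + np.exp(-0.5 * demand))
t = rng.binomial(1, p_treat)

true_effect = 2.0
y_post = 50 + 10 * demand + true_effect * t + rng.normal(0, 5, n)

# Naive lift: plain difference in post-period means.
naive_lift = y_post[t == 1].mean() - y_post[t == 0].mean()
# Gap that already existed before anyone saw the treatment.
baseline_gap = y_pre[t == 1].mean() - y_pre[t == 0].mean()

print(f"true effect:         {true_effect:.2f}")
print(f"baseline gap:        {baseline_gap:.2f}")
print(f"naive lift:          {naive_lift:.2f}")
print(f"naive lift less gap: {naive_lift - baseline_gap:.2f}")
```

In this toy setup the naive lift is inflated by about the size of the baseline gap. Subtracting the gap recovers the effect here only because demand enters the pre-period and post-period revenue identically, which real data will not guarantee.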
This is where propensity pre-balancing on the target variable makes a difference.
Balancing the Outcome Before Measuring Its Change
This method brings the target variable into the balancing stage itself rather than treating the result as something to be examined only after treatment. The concept is straightforward but effective: make sure the pre-treatment outcome distributions of the treatment and control groups are similar before estimating lift.
In the recommendation widget experiment, this means balancing not only on user attributes but also on recent revenue, purchase frequency, or average order value prior to exposure. This makes the counterfactual plausible. The control group now approximates how the treated users would have behaved if they hadn’t seen the widget.
This change significantly stabilizes incrementality estimates, particularly for metrics like revenue or spend that have heavy-tailed distributions.
What Changes When Target Pre-Balancing Is Applied
Once the baseline outcomes are aligned, the experiment starts to tell a different story. Initial results that showed a revenue lift of almost double digits often shrink to something smaller and more believable. Variance goes down. Confidence intervals narrow. Replication improves.
But the most important change is interpretability. Stakeholders can be far more confident that the measured effect is caused by the intervention and not by who received it.
Balance Diagnostics from a Realistic Scenario
| Metric | Before Pre-Balancing | After Pre-Balancing |
|---|---|---|
| Avg baseline revenue (USD) | T: 52.3 / C: 44.1 | T: 48.9 / C: 49.1 |
| Standardized mean diff (baseline Y) | 0.41 | 0.03 |
| Estimated incremental lift | +9.8% | +4.1% |
| Lift stability across re-runs | Low | High |
What initially looked like a massive business win turns out to be a smaller effect.
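The standardized mean difference in the table can be computed along the lines of the sketch below. This is a minimal illustration under stated assumptions: the function name `standardized_mean_diff`, the synthetic data, and the use of the true simulated propensity as a stand-in for a fitted one are all hypothetical. The denominator is the pooled unweighted standard deviation, one common convention, so the before and after values share a scale.

```python
import numpy as np

def standardized_mean_diff(x, t, w=None):
    """SMD of x between treated (t == 1) and control (t == 0), optionally weighted."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    mean_treated = np.average(x[t == 1], weights=w[t == 1])
    mean_control = np.average(x[t == 0], weights=w[t == 0])
    # Pooled *unweighted* SD keeps the before/after numbers on one scale.
    pooled_sd = np.sqrt((x[t == 1].var(ddof=1) + x[t == 0].var(ddof=1)) / 2)
    return (mean_treated - mean_control) / pooled_sd

# Synthetic example (assumption): assignment probability rises with baseline revenue.
rng = np.random.default_rng(0)
n = 100_000
y_pre = rng.gamma(shape=2.0, scale=25.0, size=n)   # skewed baseline revenue
p = np.clip(1 / (1 + np.exp(-(y_pre - 50) / 40)), 0.1, 0.9)
t = rng.binomial(1, p)

# Inverse-propensity weights; here p is known because the data is simulated.
ipw = np.where(t == 1, 1 / p, 1 / (1 - p))

smd_before = standardized_mean_diff(y_pre, t)
smd_after = standardized_mean_diff(y_pre, t, ipw)
print(f"SMD on baseline Y before weighting: {smd_before:.2f}")
print(f"SMD on baseline Y after weighting:  {smd_after:.2f}")
```

A frequently used rule of thumb treats an absolute SMD below about 0.1 as acceptable balance, though the threshold is a convention rather than a guarantee.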
Visualizing the Shift
Before pre-balancing, treatment users are clustered toward the higher end of the revenue distribution. After pre-balancing, treatment and control overlap almost perfectly. This overlap is what makes causal claims defensible.
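Overlap can also be summarized as a single number rather than a picture. The sketch below is one possible way to do that, assuming a simple histogram-based overlap coefficient; the helper name `overlap_coefficient` and the synthetic gamma-distributed revenue are hypothetical choices for illustration, not part of the original analysis.

```python
import numpy as np

def overlap_coefficient(a, b, w_a=None, w_b=None, bins=50):
    """Share of binned probability mass that two (optionally weighted) samples have in common.

    1.0 means identical binned distributions; 0.0 means no shared bins.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Shared bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    h_a, _ = np.histogram(a, bins=edges, weights=w_a)
    h_b, _ = np.histogram(b, bins=edges, weights=w_b)
    p_a, p_b = h_a / h_a.sum(), h_b / h_b.sum()
    return float(np.minimum(p_a, p_b).sum())

# Synthetic baseline revenue (assumption): treated users skew higher.
rng = np.random.default_rng(1)
control_pre = rng.gamma(shape=2.0, scale=22.0, size=50_000)
treated_pre = rng.gamma(shape=2.0, scale=26.0, size=50_000)

shifted = overlap_coefficient(treated_pre, control_pre)
same = overlap_coefficient(control_pre, control_pre)
print(f"overlap, treated vs control: {shifted:.3f}")
print(f"overlap, control vs itself:  {same:.3f}")
```

Passing balancing weights through `w_a` and `w_b` gives the after-balancing version of the same number, so the two values can be tracked side by side.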

One Clean Implementation Example
Below is a single, consolidated code block showing how target pre-balancing fits naturally into a causal workflow. The example uses logistic propensity modeling with inverse-propensity weighting, but the same logic applies to matching or stratification.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Assumes df is an already-loaded DataFrame that contains:
    #   T      -> treatment indicator (0/1)
    #   Y_pre  -> pre-treatment target variable (e.g., past revenue)
    #   Y_post -> post-treatment outcome
    #   X_*    -> covariates

    # Key step: the pre-treatment outcome enters the propensity model as a feature.
    features = [c for c in df.columns if c.startswith("X_")] + ["Y_pre"]
    X_scaled = StandardScaler().fit_transform(df[features])
    T = df["T"]

    prop_model = LogisticRegression(max_iter=1000)
    prop_model.fit(X_scaled, T)
    propensity = prop_model.predict_proba(X_scaled)[:, 1]

    # Inverse-propensity weights: 1/e for treated users, 1/(1 - e) for control users.
    df["ipw"] = np.where(T == 1, 1 / propensity, 1 / (1 - propensity))

    # Incremental effect as a difference of weighted means of the post-period outcome
    treated, control = df[T == 1], df[T == 0]
    lift = (
        np.average(treated["Y_post"], weights=treated["ipw"])
        - np.average(control["Y_post"], weights=control["ipw"])
    )
    print(f"Estimated incremental lift: {lift:.2f}")
What matters here is not the specific estimator but the inclusion of the pre-treatment target inside the propensity model. That single modeling choice is what changes the quality of the causal estimate.
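The same idea carries over to stratification. The following is a minimal sketch on synthetic data, not a production estimator: the function name `stratified_lift`, the choice of 20 quantile strata, and the data-generating process are assumptions made for illustration. Users are grouped by their pre-treatment outcome, lift is measured within each group, and the group-level lifts are averaged by group size.

```python
import numpy as np

def stratified_lift(y_post, t, y_pre, n_strata=20):
    """Average of within-stratum treated-minus-control differences, weighted by stratum size."""
    edges = np.quantile(y_pre, np.linspace(0, 1, n_strata + 1))
    # Interior edges map each user to a stratum index in 0..n_strata-1.
    strata = np.searchsorted(edges[1:-1], y_pre, side="right")
    total, weight = 0.0, 0
    for s in range(n_strata):
        in_stratum = strata == s
        treated = in_stratum & (t == 1)
        control = in_stratum & (t == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # no within-stratum comparison is possible
        diff = y_post[treated].mean() - y_post[control].mean()
        total += in_stratum.sum() * diff
        weight += in_stratum.sum()
    return total / weight

# Synthetic example (assumption): higher baseline spenders are more likely to be treated.
rng = np.random.default_rng(2)
n = 100_000
y_pre = rng.gamma(shape=2.0, scale=25.0, size=n)
p = np.clip(1 / (1 + np.exp(-(y_pre - 50) / 40)), 0.1, 0.9)
t = rng.binomial(1, p)
true_effect = 2.0
y_post = 10 + 0.8 * y_pre + true_effect * t + rng.normal(0, 10, n)

naive = y_post[t == 1].mean() - y_post[t == 0].mean()
stratified = stratified_lift(y_post, t, y_pre)
print(f"true effect:     {true_effect:.2f}")
print(f"naive lift:      {naive:.2f}")
print(f"stratified lift: {stratified:.2f}")
```

Coarser strata leave more residual imbalance inside each group, while finer strata leave fewer users per group, so the number of strata is a bias-variance choice rather than a fixed rule.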
Why This Works
From a causal perspective, target pre-balancing makes the ignorability assumption more plausible by explicitly conditioning on the pre-treatment outcome, which acts as a proxy for latent demand. It reduces regression-to-the-mean effects, improves overlap of the baseline outcome distributions, and makes the counterfactual outcome more believable. From a business perspective, it prevents over-claiming impact and protects decision-makers from acting on inflated numbers. In short, it replaces optimistic lift with credible lift.
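Written out, the argument looks roughly like the following. The notation is an assumption introduced here for illustration: $Y(0)$ and $Y(1)$ are potential outcomes, $X$ the covariates, $Y_{\text{pre}}$ the pre-treatment outcome, and $\hat{e}_i$ the fitted propensity for user $i$. The last expression is the normalized inverse-propensity-weighted difference of means, which is the form that a difference of weighted averages takes.

```latex
% Conditional ignorability with the pre-treatment outcome in the conditioning set
\bigl(Y(0),\, Y(1)\bigr) \;\perp\; T \;\bigm|\; X,\; Y_{\text{pre}}

% Propensity score that includes the pre-treatment outcome
e(X, Y_{\text{pre}}) = \Pr\bigl(T = 1 \mid X, Y_{\text{pre}}\bigr)

% Normalized inverse-propensity-weighted estimate of the lift
\hat{\tau} =
  \frac{\sum_i T_i \, Y_{\text{post},i} / \hat{e}_i}{\sum_i T_i / \hat{e}_i}
  \;-\;
  \frac{\sum_i (1 - T_i) \, Y_{\text{post},i} / (1 - \hat{e}_i)}{\sum_i (1 - T_i) / (1 - \hat{e}_i)}
```

The first line is an assumption, not something the data can prove. Adding $Y_{\text{pre}}$ to the conditioning set only makes it easier to defend.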
Closing Thoughts
Most failed experiments do not fail loudly. They fail quietly by overstating impact, eroding trust, and producing results that don’t hold up outside the test window. Propensity pre-balancing on the target variable is a simple but underused technique that addresses this failure at its root.
If incrementality matters to your business, then balancing the outcome before measuring its change isn’t optional. It’s foundational.
About the Author: Dharmateja Priyadarshi Uddandarao
Dharmateja Priyadarshi Uddandarao is a distinguished data scientist and statistician whose work bridges the gap between advanced statistics and practical economic applications. He currently serves as a Senior Statistician at Amazon. He can be reached through LinkedIn or by email.