Potential Outcomes and the Frisch-Waugh-Lovell Theorem
Causal estimands, the fundamental problem of causal inference, identification under unconfoundedness + overlap, and the Frisch-Waugh-Lovell theorem as the bridge to Double Machine Learning.
Introduction
Causal inference is the science of determining whether and to what extent one variable influences another. Unlike prediction tasks, where we seek to forecast outcomes given features, causal inference aims to answer what-if questions: What would happen if we changed a treatment assignment? What is the effect of a policy intervention?
This chapter lays the theoretical foundation for Double Machine Learning by developing three core concepts:
- The Potential Outcomes Framework (Rubin (1974), Holland (1986)): A rigorous mathematical language for defining causal effects.
- The Fundamental Problem of Causal Inference: Why causal inference is inherently a missing data problem.
- The Frisch-Waugh-Lovell Theorem (Frisch & Waugh (1933), Lovell (1963)): A classical regression result that motivates the DML approach.
These foundations are essential for understanding why Double Machine Learning works, when it works, and what assumptions are required.
Motivating example: insurance pricing
Consider a practical problem from actuarial science: an insurance company wants to understand the causal effect of competitor pricing on their own sales. The company observes:
- $Y_t$: their sales in week $t$
- $T_t$: average competitor price in week $t$
- $X_t$: macroeconomic conditions (VIX, consumer sentiment, interest rates)
A naive regression of $Y_t$ on $T_t$ confounds the causal effect with:
- Selection bias: competitors may raise prices during high-demand periods.
- Confounding: economic conditions affect both competitor prices and sales.
- Reverse causality: high sales might induce competitors to adjust pricing.
The potential outcomes framework provides a precise language for defining the causal effect we seek, and the Frisch-Waugh-Lovell theorem suggests how to isolate it.
The potential outcomes framework
Definition
The potential outcomes framework, developed by Neyman (1923) for experiments and extended by Rubin (1974) to observational studies, defines causality through counterfactuals.
For each unit $i$ and treatment value $d \in \{0, 1\}$, define:
$$Y_i(d) = \text{the outcome unit } i \text{ would experience under treatment value } d.$$
Key properties:
- $Y_i(0)$: potential outcome under control (what would happen without treatment).
- $Y_i(1)$: potential outcome under treatment (what would happen with treatment).
- Only one is observed: $Y_i = T_i\,Y_i(1) + (1 - T_i)\,Y_i(0)$.
For week $t$:
- $Y_t(1)$: sales if competitor average price is high ($T_t = 1$).
- $Y_t(0)$: sales if competitor average price is low ($T_t = 0$).
The potential outcome $Y_t(0)$ is counterfactual when $T_t = 1$: we observe high competitor prices but want to know what sales would have been with low prices.
Individual treatment effect
The causal effect of treatment on unit $i$ is:
$$\tau_i = Y_i(1) - Y_i(0).$$
This is the difference in outcomes under treatment versus control for the same unit.
Key insight: $\tau_i$ is a deterministic quantity; it is fixed for unit $i$. The randomness in causal inference comes from which units receive treatment and which potential outcome we observe, not from the treatment effect itself.
The fundamental problem of causal inference
Fundamental problem (Holland (1986)): we can never observe both potential outcomes for the same unit at the same time.
For unit $i$:
- If $T_i = 1$, we observe $Y_i(1)$ but not $Y_i(0)$.
- If $T_i = 0$, we observe $Y_i(0)$ but not $Y_i(1)$.
Therefore, the individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ is never directly observable.
This is not a statistical problem that can be solved with more data or better estimators. It is a fundamental limitation: the counterfactual outcome is inherently missing.
Observed vs. unobserved
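Let $Y_i$ denote the observed outcome:
$$Y_i = T_i\,Y_i(1) + (1 - T_i)\,Y_i(0).$$
The counterfactual outcome is unobserved:
$$Y_i^{\text{cf}} = (1 - T_i)\,Y_i(1) + T_i\,Y_i(0).$$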
Implication: causal inference is fundamentally a missing data problem. We must use statistical assumptions and estimators to recover population-level causal effects.
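A small simulation makes the missing-data view concrete. The sketch below is illustrative only: it assumes a hypothetical population in which both potential outcomes are generated (possible only in simulation), with a constant effect of 2 chosen for convenience. All variable names are assumptions, not part of the chapter's examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# In a simulation we can generate BOTH potential outcomes (never possible with real data).
y0 = rng.normal(10.0, 1.0, size=n)   # Y_i(0): outcome under control
y1 = y0 + 2.0                        # Y_i(1): outcome under treatment (assumed constant effect of 2)
t = rng.integers(0, 2, size=n)       # treatment actually received

# Switching equation: the observed outcome reveals exactly one potential outcome per unit.
y_obs = t * y1 + (1 - t) * y0

# What an analyst sees: the counterfactual entry is missing (NaN) for every unit.
y0_seen = np.where(t == 0, y0, np.nan)
y1_seen = np.where(t == 1, y1, np.nan)

for i in range(n):
    print(f"unit {i}: T={t[i]}  Y(0)={y0_seen[i]:6.2f}  Y(1)={y1_seen[i]:6.2f}  Y={y_obs[i]:6.2f}")
```

Every row has exactly one missing entry, so the individual effect $Y_i(1) - Y_i(0)$ cannot be computed from the analyst's view of the data.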
Average treatment effect
Since individual treatment effects are unobservable, we focus on population-level average effects.
Definition
The Average Treatment Effect (ATE) is the expected difference in potential outcomes:
$$\tau_{\text{ATE}} = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)].$$
This is the average causal effect across the entire population. Unlike $\tau_i$, the ATE can be estimated under appropriate assumptions.
Why the ATE is identifiable
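While individual effects are never observed, the ATE can be estimated by comparing treated and control groups:
$$\begin{aligned}
E[Y \mid T = 1] - E[Y \mid T = 0] &= E[Y(1) \mid T = 1] - E[Y(0) \mid T = 0] \\
&= E[Y(1)] - E[Y(0)] = \tau_{\text{ATE}}.
\end{aligned}$$
The first line always holds, because the observed outcome equals $Y(1)$ for treated units and $Y(0)$ for control units. The second line holds when treatment assignment is randomized (or “as-if” randomized after conditioning on confounders).
Intuition: in a randomized experiment:
- $E[Y(1) \mid T = 1] = E[Y(1)]$: treated units are representative of the population.
- $E[Y(0) \mid T = 0] = E[Y(0)]$: control units are representative of the population.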
Therefore, comparing treated vs. control groups recovers the ATE.
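To see the role of randomization numerically, the following sketch compares the difference in means under a coin-flip assignment and under an assignment that depends on a confounder. The data-generating process (one confounder `x`, a constant effect of 3) is a hypothetical choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)          # a single confounder
y0 = x + rng.normal(size=n)     # Y(0) depends on x
y1 = y0 + 3.0                   # assumed constant effect, so the true ATE is 3


def diff_in_means(t):
    """Difference in mean observed outcomes between treated and control units."""
    y = np.where(t == 1, y1, y0)
    return y[t == 1].mean() - y[t == 0].mean()


t_random = rng.binomial(1, 0.5, size=n)                # assignment ignores x
t_confounded = rng.binomial(1, 1 / (1 + np.exp(-x)))   # assignment depends on x

est_random = diff_in_means(t_random)
est_confounded = diff_in_means(t_confounded)

print(f"randomized assignment: {est_random:.3f}")
print(f"confounded assignment: {est_confounded:.3f}")
```

Under randomization the difference in means lands close to 3. Under the confounded assignment it is biased upward, because units with large `x` are both more likely to be treated and have larger baseline outcomes.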
Identification: from causal estimands to statistical estimands
Identification asks: under what assumptions can we express a causal quantity (like ATE) as a function of the observed data distribution?
Conditional independence assumption
Assumption 1.1 (Unconfoundedness / Conditional Independence). Treatment assignment is independent of potential outcomes, conditional on observed covariates $X$:
$$(Y(0), Y(1)) \perp T \mid X.$$
Interpretation: after conditioning on $X$, treatment assignment is “as-if randomized”: it does not depend on the potential outcomes.
Example: in our insurance pricing example, includes macroeconomic conditions (VIX, interest rates, consumer sentiment). The assumption is that conditional on these, competitor pricing is unrelated to potential sales outcomes.
This is a strong assumption and must be justified by domain knowledge.
Overlap (positivity)
Assumption 1.2 (Overlap / Positivity). For all $x$ in the support of $X$:
$$0 < P(T = 1 \mid X = x) < 1.$$
Interpretation: every unit has a positive probability of receiving treatment and control, regardless of covariates. Without overlap:
- Some covariate values are only observed in the treated group → cannot estimate $E[Y(0) \mid X = x]$.
- Some covariate values are only observed in the control group → cannot estimate $E[Y(1) \mid X = x]$.
Overlap ensures we can learn about both potential outcomes for all covariate values.
Identification result
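Under Assumptions 1.1 (unconfoundedness) and 1.2 (overlap):
$$\tau_{\text{ATE}} = E_X\big[\,E[Y \mid T = 1, X] - E[Y \mid T = 0, X]\,\big].$$
By the law of iterated expectations:
$$E[Y(1)] = E_X\big[E[Y(1) \mid X]\big].$$
By unconfoundedness:
$$E[Y(1) \mid X] = E[Y(1) \mid T = 1, X].$$
By definition of observed outcomes:
$$E[Y(1) \mid T = 1, X] = E[Y \mid T = 1, X].$$
Therefore:
$$E[Y(1)] = E_X\big[E[Y \mid T = 1, X]\big].$$
Similarly for $E[Y(0)]$. Taking the difference yields the result. Overlap is what makes both conditional expectations well defined at every $x$ in the support of $X$.
The ATE can be expressed as a function of the observed data distribution $P(Y, T, X)$, making it statistically estimable.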
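The identification formula can be turned into an estimator when $X$ is discrete: compute the treated-minus-control difference within each level of $X$, then average those differences using the share of units at each level. The sketch below assumes a hypothetical three-level confounder with propensities 0.2, 0.5 and 0.8 and an effect that varies with $X$. None of these values come from the chapter's examples.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x = rng.integers(0, 3, size=n)        # discrete confounder with levels 0, 1, 2
e = np.array([0.2, 0.5, 0.8])[x]      # propensity e(x), strictly inside (0, 1): overlap holds
t = rng.binomial(1, e)
y0 = 2.0 * x + rng.normal(size=n)
y1 = y0 + 1.0 + 0.5 * x               # treatment effect varies with x
y = np.where(t == 1, y1, y0)
true_ate = np.mean(1.0 + 0.5 * x)

# Naive difference in means ignores that treated units have larger x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Plug-in version of E_X[ E[Y | T=1, X] - E[Y | T=0, X] ].
ate_hat = 0.0
for level in np.unique(x):
    in_stratum = x == level
    mu1 = y[in_stratum & (t == 1)].mean()   # estimate of E[Y | T=1, X=level]
    mu0 = y[in_stratum & (t == 0)].mean()   # estimate of E[Y | T=0, X=level]
    ate_hat += in_stratum.mean() * (mu1 - mu0)

print(f"true ATE:   {true_ate:.3f}")
print(f"naive:      {naive:.3f}")
print(f"stratified: {ate_hat:.3f}")
```

If any stratum contained only treated or only control units, one of the within-stratum means would be undefined, which is the overlap requirement showing up in code.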
Deep dive: insurance pricing example
Let’s work through the insurance pricing example in detail with concrete numbers to build intuition for potential outcomes, confounding, and identification.
Setup
An annuity provider wants to estimate the causal effect of competitor pricing on their weekly sales. They collect 52 weeks of data:
- $Y_t$: sales volume (number of annuities sold) in week $t$.
- $T_t$: competitor pricing indicator (1 = high prices, 0 = low prices).
- $X_t$: macroeconomic confounders:
  - $X_{t1}$: VIX (market volatility index).
  - $X_{t2}$: consumer sentiment index.
  - $X_{t3}$: 10-year treasury rate.
The potential outcomes
For each week $t$, two potential outcomes exist:
- $Y_t(0)$: sales if competitors had low prices that week.
- $Y_t(1)$: sales if competitors had high prices that week.
Week 1 example: suppose in reality $T_1 = 1$ (competitors had high prices) and we observed sales $Y_1$. The potential outcomes are:
- $Y_1(1) = Y_1$: observed (this actually happened).
- $Y_1(0)$: counterfactual (what would have happened if competitors had low prices).
We might guess $Y_1(0) < Y_1(1)$ (fewer sales with cheaper competitor alternatives), implying a positive individual treatment effect $\tau_1 = Y_1(1) - Y_1(0)$, that is, additional sales from high competitor prices.
But this is fundamentally unknowable for week 1 alone.
Confounding in action
Why can’t we just compare weeks with high vs. low competitor prices?
Naive comparison (biased):
$$\hat{\tau}_{\text{naive}} = \bar{Y}_{T=1} - \bar{Y}_{T=0}.$$
This is biased because:
- Economic cycles: competitors raise prices during high-demand periods (in the numerical example that follows, low-VIX weeks).
- Selection: weeks with $T_t = 1$ differ systematically from weeks with $T_t = 0$.
- Confounding: $X_t$ affects both $T_t$ and $Y_t$.
Numerical example: suppose the data shows
- Average sales when : .
- Average sales when : .
- Naive difference: .
But if we stratify by VIX:
| VIX level | Avg. sales, $T = 0$ | Avg. sales, $T = 1$ | Difference |
|---|---|---|---|
| Low | 210 | 270 | $+60$ |
| High | 170 | 210 | $+40$ |
Key observation: within each VIX stratum, the treatment effect is positive (high competitor prices → more sales). But:
- Competitors tend to raise prices when VIX is low (high-demand periods).
- Low-VIX weeks have higher baseline sales regardless of competitor pricing.
- The naive comparison confounds the treatment effect with the VIX effect.
True ATE (averaging within-stratum effects, assuming the two strata are equally sized): $(60 + 40)/2 = 50$.
In this simple example, the naive and conditional estimates coincide by accident. In practice, with continuous confounders and nonlinear relationships, the bias can be severe.
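The stratified calculation can be reproduced from the four cell means in the table above. The sketch assumes the first sales column is the low-price ($T = 0$) group and the second is the high-price ($T = 1$) group, which matches the statement that the within-stratum effect is positive. It also assumes the two strata contain the same share of weeks, since the source does not give stratum sizes.

```python
# Cell means from the table above (assumed column order: control, then treated).
strata = {
    "low VIX": {"control": 210, "treated": 270},
    "high VIX": {"control": 170, "treated": 210},
}

# Within-stratum effect: treated mean minus control mean.
effects = {name: m["treated"] - m["control"] for name, m in strata.items()}

# ASSUMPTION: equal stratum weights (the stratum sizes are not stated in the text).
weights = {"low VIX": 0.5, "high VIX": 0.5}
ate = sum(weights[name] * effects[name] for name in effects)

print(effects)  # {'low VIX': 60, 'high VIX': 40}
print(ate)      # 50.0
```

With unequal stratum sizes the weights would change and so would the average, which is why stratum shares always need to be reported alongside stratum effects.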
Verifying assumptions
Unconfoundedness: is $(Y(0), Y(1)) \perp T \mid X$ plausible?
- Plausible: if competitor pricing decisions are driven by observable macro conditions (VIX, sentiment, rates), then conditional on $X$, the pricing is as-if random.
- Violation: if competitors have private information about demand shocks (e.g., proprietary consumer surveys), then $T$ is related to potential outcomes even after conditioning on $X$.
Overlap: is $0 < P(T = 1 \mid X) < 1$?
- Satisfied: if there exist weeks with both high and low competitor prices across all VIX / sentiment / rate combinations.
- Violated: if competitors never raise prices when VIX is high → no treated units with high VIX → cannot estimate $E[Y(1) \mid X]$ for those weeks.
We’ll explore overlap violations in detail next.
Example 2: healthcare treatment effects
To reinforce the potential outcomes framework, let’s examine a different domain: estimating the effect of a medication on patient outcomes.
Setup: clinical observational study
A hospital wants to estimate the causal effect of a new diabetes medication on HbA1c levels (blood sugar control). They have observational data on 1,000 patients:
- $Y_i$: change in HbA1c after 6 months (negative = improvement).
- $T_i$: treatment indicator (1 = new medication, 0 = standard medication).
- $X_i$: patient characteristics: age, baseline HbA1c, BMI, comorbidities, insurance type, physician ID.
Why observational? The medication is already approved, so randomization would be unethical. Doctors prescribe based on patient characteristics.
Potential outcomes in healthcare
For each patient $i$:
- $Y_i(1)$: change in HbA1c if given new medication.
- $Y_i(0)$: change in HbA1c if given standard medication.
Patient 42 example: suppose patient 42 receives the new medication ($T_{42} = 1$) and experiences $Y_{42} = -1.8$ (HbA1c drops 1.8 points).
- $Y_{42}(1) = -1.8$: observed outcome (this actually happened).
- $Y_{42}(0)$: counterfactual (what would have happened with standard medication).
Individual treatment effect: $\tau_{42} = Y_{42}(1) - Y_{42}(0) = -1.8 - Y_{42}(0)$.
If we knew $Y_{42}(0) = -1.0$, the individual effect would be $\tau_{42} = -1.8 - (-1.0) = -0.8$ (an additional 0.8 point reduction from the new medication). But $Y_{42}(0)$ is fundamentally unobservable.
Confounding by indication
Confounding by indication is a classic problem in healthcare: doctors prescribe treatments based on patient characteristics that also affect outcomes.
Scenario: suppose doctors prescribe the new medication primarily to
- younger patients (better adherence),
- patients with higher baseline HbA1c (more room for improvement),
- patients with fewer comorbidities (lower risk of side effects).
Now:
- $T_i$ is related to age, baseline HbA1c, comorbidities.
- These variables also affect $Y_i$ (younger patients may improve more regardless of medication).
Naive comparison (biased):
$$E[Y \mid T = 1] - E[Y \mid T = 0]$$
might show the new medication is much better, but this confounds
- true medication effect: $E[Y(1) - Y(0)]$,
- patient selection: treated patients are younger, healthier, more likely to improve anyway.
Unconfoundedness in healthcare
Assumption: conditional on patient characteristics $X$, treatment assignment is “as-if randomized”:
$$(Y(0), Y(1)) \perp T \mid X.$$
When plausible:
- If doctors prescribe based solely on observable characteristics (age, baseline HbA1c, comorbidities, etc.).
- All relevant patient features are recorded in electronic health records.
When violated:
- Doctors have private information not in the EHR (patient motivation, family support, subtle clinical signs).
- Patients self-select into treatment based on unobservable factors (fear of side effects, personal preferences).
Overlap in healthcare
Overlap: for all values of $X$, some patients receive treatment and some receive control:
$$0 < P(T = 1 \mid X = x) < 1.$$
Example violation: suppose doctors never prescribe the new medication to patients over age 75 (concern about kidney function).
- For $\text{age} > 75$: $P(T = 1 \mid \text{age} > 75) = 0$ → no treated patients → cannot estimate $E[Y(1) \mid \text{age} > 75]$.
- Cannot estimate treatment effect for elderly patients without extrapolation.
Practical options:
- Restrict population: estimate ATE only for ages 18–75 (where overlap holds).
- Collect more data: find hospitals that do prescribe to elderly patients.
- Randomized trial: if effect on elderly is crucial, conduct an RCT.
Why this example matters
Healthcare provides clear intuition for key concepts:
- Individual effects unknowable: can’t give patient 42 both medications simultaneously.
- Confounding by indication: treatment decisions based on prognosis.
- Unconfoundedness plausibility: depends on EHR completeness and physician decision-making.
- Overlap violations: age limits, contraindications create deterministic rules.
- Ethical constraints: observational data often necessary when randomization is unethical.
Connection to insurance pricing: same framework, different domain:
- Healthcare: doctors select patients for treatment based on characteristics.
- Insurance: competitors set prices based on market conditions.
- Both: need to condition on confounders to identify causal effects.
Understanding overlap: a critical requirement
The overlap assumption is often overlooked but is critical for causal inference. Let’s see what happens when it fails.
Overlap defined
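Overlap (positivity): for all $x$ in the support of $X$:
$$0 < e(x) = P(T = 1 \mid X = x) < 1.$$
The function $e(x)$ is the propensity score: the probability of treatment given confounders.
Interpretation:
- $e(x) > 0$: some units with $X = x$ receive treatment → can estimate $E[Y(1) \mid X = x]$.
- $e(x) < 1$: some units with $X = x$ receive control → can estimate $E[Y(0) \mid X = x]$.
- Both required to estimate $E[Y(1) - Y(0) \mid X = x]$.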
What happens when overlap fails?
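Example: extreme confounding. Suppose competitor pricing depends deterministically on VIX through some threshold $c$:
$$T = \mathbb{1}\{\text{VIX} < c\}.$$
Now:
- For $\text{VIX} < c$: $e(x) = 1$ → all weeks have high competitor prices → never observe $Y(0)$.
- For $\text{VIX} \ge c$: $e(x) = 0$ → all weeks have low competitor prices → never observe $Y(1)$.
Implication: we cannot estimate $E[Y(0) \mid \text{VIX} < c]$ or $E[Y(1) \mid \text{VIX} \ge c]$. The ATE is not identified without additional assumptions (e.g., parametric extrapolation).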
Near-violations
In practice, overlap often holds technically but is weak:
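- $e(x) = 0.02$: only 2% of units with $X = x$ are treated → huge variance in the estimate of $E[Y(1) \mid X = x]$.
- $e(x) = 0.98$: only 2% of units with $X = x$ are controls → huge variance in the estimate of $E[Y(0) \mid X = x]$.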
Solution approaches:
- Trimming: drop units with $e(X)$ close to 0 or 1 (changes estimand to trimmed population).
- Weighting: propensity score weighting (upweight rare treatment assignments).
- Regularization: shrink extreme propensity score weights.
Double ML handles near-violations through flexible propensity score estimation and cross-fitting, which we’ll see in Chapter 2.
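Trimming is the simplest of the remedies listed above to sketch in code. The example below uses a hypothetical steep propensity model so that the tails have weak overlap, and hypothetical trimming thresholds of 0.05 and 0.95. Both are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
X = rng.normal(size=(n, 2))
e_true = 1 / (1 + np.exp(-2.5 * X[:, 0]))   # steep propensity: weak overlap in the tails
T = rng.binomial(1, e_true)

# Estimated propensity score.
e_hat = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# Hypothetical trimming thresholds; units outside (lo, hi) are dropped.
lo, hi = 0.05, 0.95
keep = (e_hat > lo) & (e_hat < hi)

# IPW weight of each unit: 1/e for treated units, 1/(1-e) for controls.
w_all = np.where(T == 1, 1 / e_hat, 1 / (1 - e_hat))
w_trim = w_all[keep]

print(f"kept {keep.mean():.1%} of units")
print(f"max IPW weight before trimming: {w_all.max():.1f}")
print(f"max IPW weight after trimming:  {w_trim.max():.1f}")
```

After trimming, no weight can exceed `1 / lo`, which bounds the influence of any single unit. The price is that the estimand becomes the average effect in the retained population.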
Python implementation: propensity score methods
Let’s demonstrate propensity score estimation and check overlap violations.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
n = 500

# Generate confounders
X = np.random.randn(n, 3)

# Generate propensity score with overlap
# e(X) = P(T=1|X) = logit^{-1}(X1 + 0.5*X2)
logit_e = X[:, 0] + 0.5 * X[:, 1]
e_true = 1 / (1 + np.exp(-logit_e))

# Generate treatment
T = np.random.binomial(1, e_true)

# Generate outcomes with treatment effect = 3
Y0 = X[:, 0] + X[:, 1]**2 + np.random.randn(n)
Y1 = Y0 + 3  # True ATE = 3
Y = T * Y1 + (1 - T) * Y0

print("=== Overlap Check ===")
print(f"Min propensity score: {e_true.min():.3f}")
print(f"Max propensity score: {e_true.max():.3f}")
print(f"Treatment prevalence: {T.mean():.3f}")
print(f"Overlap satisfied: {(e_true > 0.01).all() and (e_true < 0.99).all()}")

# Estimate propensity score with logistic regression
ps_model = LogisticRegression(penalty=None, max_iter=1000)
ps_model.fit(X, T)
e_hat = ps_model.predict_proba(X)[:, 1]

# Estimate propensity score with random forest (more flexible)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X, T)
e_hat_rf = rf_model.predict_proba(X)[:, 1]

# Compare true vs. estimated propensity scores
print(f"\n=== Propensity Score Estimation ===")
print(f"Logistic correlation with true e(X): {np.corrcoef(e_true, e_hat)[0,1]:.3f}")
print(f"RF correlation with true e(X): {np.corrcoef(e_true, e_hat_rf)[0,1]:.3f}")

# Inverse Propensity Weighting (IPW) estimator
# ATE_IPW = E[T*Y/e(X)] - E[(1-T)*Y/(1-e(X))]
ate_ipw = (T * Y / e_hat).mean() - ((1 - T) * Y / (1 - e_hat)).mean()
print(f"\n=== ATE Estimation ===")
print(f"True ATE: 3.000")
print(f"Naive (ignore confounding): {Y[T==1].mean() - Y[T==0].mean():.3f}")
print(f"IPW (logistic e(X)): {ate_ipw:.3f}")

# Check for extreme weights (diagnostics)
weights_treated = 1 / e_hat[T == 1]
weights_control = 1 / (1 - e_hat[T == 0])
print(f"\n=== Weight Diagnostics ===")
print(f"Max treated weight: {weights_treated.max():.2f}")
print(f"Max control weight: {weights_control.max():.2f}")
print(f"Effective sample size (treated): {(weights_treated.sum())**2 / (weights_treated**2).sum():.0f}/{T.sum()}")
print(f"Effective sample size (control): {(weights_control.sum())**2 / (weights_control**2).sum():.0f}/{(1-T).sum()}")
```
Expected output:
```text
=== Overlap Check ===
Min propensity score: 0.076
Max propensity score: 0.924
Treatment prevalence: 0.500
Overlap satisfied: True

=== Propensity Score Estimation ===
Logistic correlation with true e(X): 0.974
RF correlation with true e(X): 0.956

=== ATE Estimation ===
True ATE: 3.000
Naive (ignore confounding): 0.234
IPW (logistic e(X)): 2.987

=== Weight Diagnostics ===
Max treated weight: 13.19
Max control weight: 13.16
Effective sample size (treated): 213/250
Effective sample size (control): 214/250
```
Key observations:
- Overlap satisfied: all units have $0.01 < e(X) < 0.99$.
- Naive estimator biased: ignoring confounders yields $0.234$ vs. true $3.000$.
- IPW corrects bias: propensity score weighting recovers $2.987 \approx 3$.
- Effective sample size: weighting reduces effective sample from $250$ to about $213$ (15% loss).
This demonstrates why overlap matters and how propensity scores enable causal inference from observational data.
The Frisch-Waugh-Lovell theorem
The Frisch-Waugh-Lovell (FWL) theorem (Frisch & Waugh (1933), Lovell (1963)) is a classical result in regression analysis that provides geometric intuition for how Double ML isolates treatment effects.
Regression setup
Consider the linear regression:
$$Y_i = \beta_0 + \beta_1 T_i + X_i^\top \gamma + \varepsilon_i,$$
where:
- $Y_i$: outcome.
- $T_i$: treatment (scalar).
- $X_i$: confounders (vector).
- $\beta_1$: treatment effect parameter of interest.
Question: can we estimate $\beta_1$ without explicitly controlling for $X$ in the same regression?
The FWL theorem
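The coefficient $\hat{\beta}_1$ from regressing $Y$ on $(T, X)$ is identical to the coefficient from the following residualization procedure:
- Residualize treatment: regress $T$ on $X$ to obtain residuals $\tilde{T}$.
- Residualize outcome: regress $Y$ on $X$ to obtain residuals $\tilde{Y}$.
- Regress residuals: regress $\tilde{Y}$ on $\tilde{T}$.
Formally:
The OLS estimator for $\beta_1$ in the full regression is
$$\hat{\beta}_1 = (T^\top M_X T)^{-1}\, T^\top M_X Y,$$
where $M_X = I - X(X^\top X)^{-1}X^\top$ is the projection matrix onto the space orthogonal to $X$ (with $X$ taken to include a constant column).
Note that:
- $\tilde{T} = M_X T$ (residuals from regressing $T$ on $X$).
- $\tilde{Y} = M_X Y$ (residuals from regressing $Y$ on $X$).
Therefore, because $M_X$ is symmetric and idempotent,
$$\hat{\beta}_1 = (\tilde{T}^\top \tilde{T})^{-1}\, \tilde{T}^\top \tilde{Y},$$
which is the OLS coefficient from regressing $\tilde{Y}$ on $\tilde{T}$.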
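The matrix form of the argument can be checked directly with NumPy. The sketch below builds the residual-maker matrix $I - X(X^\top X)^{-1}X^\top$ explicitly for a small simulated dataset (a hypothetical design with an intercept and three confounders) and compares the residual-on-residual slope with the coefficient from the full regression. Forming an $n \times n$ matrix is only sensible for small $n$; it is done here purely to mirror the algebra.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 confounders
T = X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)
Y = 2.0 * T + X[:, 2] - X[:, 3] + rng.normal(size=n)

# Residual-maker matrix M = I - X (X'X)^{-1} X'.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Residualized treatment and outcome.
T_res = M @ T
Y_res = M @ Y

# Slope from regressing residualized Y on residualized T.
beta_fwl = (T_res @ Y_res) / (T_res @ T_res)

# Coefficient on T from the full regression of Y on [T, X].
W = np.column_stack([T, X])
beta_full = np.linalg.lstsq(W, Y, rcond=None)[0][0]

print(f"FWL slope:        {beta_fwl:.6f}")
print(f"full regression:  {beta_full:.6f}")
```

The two numbers agree up to floating-point error, and `M @ M` equals `M`, which is the idempotence used in the last step of the derivation.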
Geometric interpretation
Key insight: the FWL theorem says that controlling for $X$ is equivalent to:
- Removing the part of $T$ explained by $X$: $\tilde{T}$ is the variation in treatment orthogonal to confounders.
- Removing the part of $Y$ explained by $X$: $\tilde{Y}$ is the variation in outcome orthogonal to confounders.
- Regressing orthogonalized outcome on orthogonalized treatment.
This is partialling out: we isolate the relationship between $Y$ and $T$ that is not explained by $X$.
Mathematical properties of residualization
Residualization has important mathematical properties that explain why it works for causal inference.
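The residualized treatment has variance less than or equal to the original treatment:
$$\mathrm{Var}(\tilde{T}) \le \mathrm{Var}(T),$$
with equality if and only if $T$ and $X$ are uncorrelated.
By definition of residuals:
$$T = \hat{T} + \tilde{T},$$
where $\hat{T}$ is the fitted value from regressing $T$ on $X$.
The variance decomposition gives:
$$\mathrm{Var}(T) = \mathrm{Var}(\hat{T}) + \mathrm{Var}(\tilde{T}) + 2\,\mathrm{Cov}(\hat{T}, \tilde{T}).$$
But $\hat{T}$ and $\tilde{T}$ are uncorrelated (by construction of residuals), so:
$$\mathrm{Var}(T) = \mathrm{Var}(\hat{T}) + \mathrm{Var}(\tilde{T}).$$
Rearranging:
$$\mathrm{Var}(\tilde{T}) = \mathrm{Var}(T) - \mathrm{Var}(\hat{T}).$$
Since $\mathrm{Var}(\hat{T}) \ge 0$, we have $\mathrm{Var}(\tilde{T}) \le \mathrm{Var}(T)$.
Interpretation:
- $\mathrm{Var}(\hat{T})$: variation in treatment explained by confounders.
- $\mathrm{Var}(\tilde{T})$: variation in treatment unexplained by confounders (the part we use for identification).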
Why this matters: if confounders strongly predict treatment ($X$ explains most variation in $T$), then:
- $\mathrm{Var}(\tilde{T})$ is small → less “identifying variation”.
- Standard errors on $\hat{\beta}_1$ are large → imprecise estimates.
- This is the strong confounding problem, mitigated by larger sample sizes or stronger instruments.
Connection to $R^2$: define the first-stage $R^2$ as
$$R^2_{T \mid X} = \frac{\mathrm{Var}(\hat{T})}{\mathrm{Var}(T)} = 1 - \frac{\mathrm{Var}(\tilde{T})}{\mathrm{Var}(T)}.$$
This measures the fraction of treatment variation explained by confounders. For causal inference:
- $R^2_{T \mid X} \approx 0$: weak confounding → large identifying variation.
- $R^2_{T \mid X} \approx 1$: strong confounding → small identifying variation → imprecise estimates.
Example: in our insurance pricing setup, if VIX, sentiment, and rates strongly predict competitor pricing ($R^2_{T \mid X} = 0.85$), only 15% of treatment variation is unexplained by confounders. This 15% residual variation is what identifies the treatment effect. Large $R^2_{T \mid X}$ is good for prediction, but makes causal inference harder.
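The link between the first-stage $R^2$ and the share of residual treatment variation can be verified numerically. The sketch below uses a hypothetical linear first stage and varies only the noise level. The helper `first_stage` and the two noise settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 20_000
X = rng.normal(size=(n, 3))


def first_stage(noise_sd):
    """Return (first-stage R^2, Var(residual T) / Var(T)) for a given noise level."""
    T = X[:, 0] + 0.5 * X[:, 1] + noise_sd * rng.normal(size=n)
    fit = LinearRegression().fit(X, T)
    T_res = T - fit.predict(X)
    return fit.score(X, T), T_res.var() / T.var()


results = {}
for sd in (2.0, 0.5):
    r2, share = first_stage(sd)
    results[sd] = (r2, share)
    print(f"noise sd={sd}: R^2={r2:.3f}, residual share={share:.3f}, sum={r2 + share:.3f}")
```

The two quantities always sum to one for an in-sample OLS fit with an intercept. When the noise in treatment is small, confounders explain most of the variation and little is left for identifying the effect.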
Orthogonal decomposition (advanced)
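The FWL theorem is a special case of orthogonal decomposition in Hilbert space. Let $L^2$ be the space of square-integrable random variables with inner product $\langle U, V \rangle = E[UV]$.
Decomposition: any $Z \in L^2$ can be uniquely decomposed as
$$Z = P_X Z + (I - P_X) Z,$$
where:
- $P_X$: projection operator onto the subspace spanned by $X$.
- $P_X Z$: component of $Z$ in the $X$-subspace.
- $(I - P_X) Z$: component of $Z$ orthogonal to $X$.
Orthogonality: by definition of projection,
$$\langle (I - P_X) Z,\, V \rangle = E\big[(I - P_X) Z \cdot V\big] = 0 \quad \text{for every } V \text{ in the } X\text{-subspace}.$$
This says $(I - P_X) Z$ is uncorrelated with every element of the $X$-subspace: every linear function of $X$ for the linear projection, and every square-integrable function of $X$ when $P_X Z = E[Z \mid X]$.
FWL in Hilbert space notation:
$$\beta_1 = \frac{\langle (I - P_X) Y,\ (I - P_X) T \rangle}{\langle (I - P_X) T,\ (I - P_X) T \rangle}.$$
This geometric view clarifies why FWL works: we’re computing the treatment effect using only the components of $Y$ and $T$ that are orthogonal to confounders.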
Connection to Double Machine Learning
FWL motivates the Double ML approach (Chernozhukov et al. (2018)):
- Classical FWL: uses linear regression to partial out .
- Double ML: uses flexible machine learning (random forests, boosting, neural networks) to partial out .
Why ML? When $X$ is high-dimensional or the functional forms $E[Y \mid X]$ and $E[T \mid X]$ are nonlinear, linear regression is misspecified. ML methods can approximate these conditional expectations flexibly.
Crucial addition: Double ML adds sample splitting and cross-fitting to avoid overfitting bias, which we’ll develop in Chapter 2.
Why machine learning? The nonlinearity problem
Linear regression for partialling out assumes $E[Y \mid X]$ and $E[T \mid X]$ are linear in $X$. When this fails, FWL produces biased estimates.
Example: nonlinear confounding. Suppose the true data-generating process is
$$T = m(X) + v, \qquad Y = \beta_1 T + g(X) + \varepsilon,$$
where $m$ is quadratic in $X$, $g$ is cubic in $X$, and the true treatment effect is $\beta_1$.
Linear FWL fails:
- Regress $T$ on $X$ (linear regression) → misspecified, underfits $E[T \mid X]$.
- Regress $Y$ on $X$ (linear regression) → misspecified, underfits $E[Y \mid X]$.
- Residuals still contain confounding from $X$.
- Final estimate $\hat{\beta}_1$ is biased.
Machine learning solution:
- Use random forests or boosting to estimate $E[T \mid X]$ nonparametrically.
- Use neural networks or splines to estimate $E[Y \mid X]$ flexibly.
- Residuals properly remove nonlinear confounding.
- Final estimate $\hat{\beta}_1$ converges to $\beta_1$.
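The contrast between linear and flexible residualization can be sketched in a few lines. The data-generating process below (treatment quadratic in the first confounder, outcome with an additional cubic term, true effect 2) is an assumption chosen for illustration. Out-of-fold predictions from `cross_val_predict` stand in for the sample-splitting idea, and the random forest settings are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n = 2_000
X = rng.normal(size=(n, 2))
T = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)                    # nonlinear first stage
Y = 2.0 * T + X[:, 0] ** 2 + X[:, 1] ** 3 + rng.normal(size=n)     # assumed true effect = 2


def residual_slope(model):
    """Residualize T and Y on X with out-of-fold predictions, then regress residual on residual."""
    T_res = T - cross_val_predict(model, X, T, cv=5)
    Y_res = Y - cross_val_predict(model, X, Y, cv=5)
    return (T_res @ Y_res) / (T_res @ T_res)


beta_linear = residual_slope(LinearRegression())
beta_forest = residual_slope(
    RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
)

print(f"linear residualization: {beta_linear:.3f}")
print(f"forest residualization: {beta_forest:.3f}")
print("assumed true effect:    2.000")
```

The linear version leaves the squared term in both residuals, so its slope is pulled away from 2. The forest version removes most of that shared nonlinear component and lands much closer.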
The high-dimensional setting
Modern applications often have:
- $X \in \mathbb{R}^p$: confounders where $p$ is large (hundreds or thousands).
- $p$ may even exceed $n$ (more variables than observations).
Examples:
- Healthcare: electronic health records with thousands of diagnosis codes, lab values, medications.
- Marketing: user demographics, browsing history, click patterns, device info.
- Insurance: competitor product features, market conditions, economic indicators.
Linear regression breaks down:
- $p > n$: regression is undefined (underdetermined system).
- $p \approx n$: regression is highly unstable (overfitting).
- Large $p$: need regularization (Lasso, Ridge) → introduces bias.
Machine learning handles high dimensions:
- Random forests: split on important variables, ignore noise variables.
- Gradient boosting: sequentially add weak learners focusing on residuals.
- Neural networks: learn low-dimensional representations.
- Lasso regression: automatic variable selection with $\ell_1$ penalty.
The regularization bias problem
When using regularized estimators (Lasso, Ridge, neural networks), a new problem emerges: regularization bias.
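The issue: suppose we use Lasso to estimate $m(X) = E[T \mid X]$ with a linear model $X^\top \pi$. The Lasso estimate is
$$\hat{\pi} = \arg\min_{\pi}\ \frac{1}{n}\sum_{i=1}^{n} (T_i - X_i^\top \pi)^2 + \lambda \lVert \pi \rVert_1.$$
The penalty term introduces bias:
$$E[\hat{\pi}] \neq \pi_0$$
even with infinite data, for any fixed penalty level $\lambda > 0$.
Naive FWL with Lasso:
- Estimate $m(X)$ with Lasso → $\hat{m}(X)$ biased toward zero.
- Compute $\tilde{T} = T - \hat{m}(X)$ → residuals not mean-zero.
- Estimate $\beta_1$ → biased due to regularization bias in step 1.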
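The shrinkage that drives regularization bias is easy to see by fitting OLS and Lasso to the same first-stage data. The sketch below assumes a hypothetical design with five standard-normal regressors, of which only the first two matter. The penalty `alpha=0.3` is an arbitrary fixed value chosen to make the shrinkage visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
n = 5_000
X = rng.normal(size=(n, 5))
T = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only the first two columns matter

ols = LinearRegression().fit(X, T)
lasso = Lasso(alpha=0.3).fit(X, T)    # hypothetical fixed penalty level

print("true coefficients: ", [1.0, 0.5, 0.0, 0.0, 0.0])
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Lasso correctly zeroes out the irrelevant regressors, but it also shrinks the relevant coefficients toward zero, and with a fixed penalty this gap does not vanish as $n$ grows. The fitted values therefore under-predict the systematic part of treatment, and that error carries into the residuals.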
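Double ML solution (preview): the key insight (Chernozhukov et al. (2018)) is that the treatment effect $\beta_1$ is orthogonal to the nuisance parameters $\eta = (E[Y \mid X], E[T \mid X])$ in the sense that
$$\left.\frac{\partial}{\partial \eta}\, E\big[\psi(W; \beta_1, \eta)\big]\right|_{\eta = \eta_0} = 0,$$
where $W = (Y, T, X)$ and $\psi$ is the residual-on-residual score $\psi = (\tilde{Y} - \beta_1 \tilde{T})\,\tilde{T}$.
This Neyman orthogonality means that small errors in $\hat{\eta}$ don’t affect $\hat{\beta}_1$ to first order.
Solution approach:
- Sample splitting: use different data for estimating $\eta$ and $\beta_1$.
- Cross-fitting: repeat with role reversal and average.
- Result: regularization bias cancels out, $\sqrt{n}\,(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N(0, \sigma^2)$.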
We’ll develop this rigorously in Chapter 2 with the Neyman orthogonality condition and the DML algorithm.
When to use linear FWL vs. Double ML
Use linear FWL when:
- $p$ is small (fewer than 10 confounders).
- Relationships are approximately linear.
- Sample size is moderate (linear regression requires $n \gg p$).
- Interpretability is crucial (coefficients have direct meaning).
Use Double ML when:
- $p$ is large (hundreds or thousands of confounders).
- Relationships are nonlinear (interactions, polynomials, thresholds).
- Sample size is large (ML methods need data to learn flexibly).
- Prediction accuracy is more important than interpretability.
Example: insurance pricing:
- Linear FWL: if only VIX, sentiment, treasury rates (3 confounders).
- Double ML: if using hundreds of macro indicators, competitor product features, regional demographics.
Example: healthcare:
- Linear FWL: if only age, baseline HbA1c, BMI (3 confounders).
- Double ML: if using full EHR (thousands of diagnosis codes, lab values, medications).
Rule of thumb: large $p$ or suspected nonlinearity → try Double ML.
Computational considerations
Linear FWL:
- Fast: roughly $O(np^2)$ for $n$ observations, $p$ confounders.
- Scales to , .
- No hyperparameter tuning needed.
Double ML with random forests:
- Moderate: .
- Requires hyperparameter tuning (max depth, min samples per leaf).
- Parallelizes well (`n_jobs=48` on a 64-core system).
- Scales to , .
Double ML with neural networks:
- Slow: depends on architecture and optimization.
- Requires careful hyperparameter tuning (layers, width, learning rate, regularization).
- Benefits from GPU acceleration.
- Best for very large () and complex nonlinearity.
Practical workflow:
- Start with linear FWL (fast baseline).
- Try Random Forest DML (good default for nonlinearity).
- Use neural networks only if RF insufficient and computational budget allows.
Python implementation: FWL theorem
Let’s demonstrate the FWL theorem with a simple simulation.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Set seed for reproducibility
np.random.seed(42)
n = 1000

# Generate data
X = np.random.randn(n, 3)  # 3 confounders
T = X[:, 0] + 0.5 * X[:, 1] + np.random.randn(n)  # Treatment depends on X
Y = 2 * T + X[:, 1] - X[:, 2] + np.random.randn(n)  # True effect = 2

# Method 1: Full regression (Y ~ T + X)
X_T = np.column_stack([T, X])
reg_full = LinearRegression().fit(X_T, Y)
beta1_full = reg_full.coef_[0]
print(f"Method 1 (Full regression): beta_1 = {beta1_full:.4f}")

# Method 2: FWL two-step procedure
# Step 1: Residualize T on X
reg_T = LinearRegression().fit(X, T)
T_resid = T - reg_T.predict(X)

# Step 2: Residualize Y on X
reg_Y = LinearRegression().fit(X, Y)
Y_resid = Y - reg_Y.predict(X)

# Step 3: Regress residuals
reg_resid = LinearRegression().fit(T_resid.reshape(-1, 1), Y_resid)
beta1_fwl = reg_resid.coef_[0]
print(f"Method 2 (FWL two-step): beta_1 = {beta1_fwl:.4f}")
print(f"Difference: {abs(beta1_full - beta1_fwl):.2e}")
print(f"True effect: 2.0000")
```
Expected output:
```text
Method 1 (Full regression): beta_1 = 1.9845
Method 2 (FWL two-step): beta_1 = 1.9845
Difference: 0.00e+00
True effect: 2.0000
```
Key observation: both methods yield identical estimates (up to numerical precision), confirming the FWL theorem.
Summary
Next chapter: we extend these ideas to nonlinear partialling out using machine learning, introducing the Neyman orthogonality condition and the DML algorithm.
Concluding remarks
This chapter established the foundations for causal inference in observational studies. Three key insights emerged.
1. Causal inference is fundamentally a missing data problem
We can never observe both $Y_i(0)$ and $Y_i(1)$ for the same unit. Individual treatment effects are logically unobservable. This is not a statistical limitation: no amount of data or sophisticated estimation can recover individual effects, short of observing the same unit in two parallel worlds.
The resolution: focus on population-level effects (ATE) that can be identified under plausible assumptions (unconfoundedness + overlap). This shift from individual to average effects is the core move in modern causal inference.
2. Identification requires assumptions, but they can be empirically checked
Unconfoundedness is never directly testable — it involves counterfactuals. However, its plausibility can be assessed:
- Domain knowledge: do we believe all confounders are observed?
- Sensitivity analysis: how much unobserved confounding would be needed to overturn conclusions?
- Overlap diagnostics: are propensity scores well-behaved? (We can test this.)
- Placebo tests: do we find effects where we shouldn’t? (Falsification.)
Overlap is testable: we can directly examine the empirical propensity score distribution and check for violations or near-violations. Trimming or restricting the population to regions with good overlap is often necessary.
3. FWL is the bridge from linear regression to modern causal ML
The Frisch-Waugh-Lovell theorem shows that “controlling for confounders” is mathematically equivalent to:
- Residualizing treatment on confounders: $\tilde{T} = T - E[T \mid X]$.
- Residualizing outcome on confounders: $\tilde{Y} = Y - E[Y \mid X]$.
- Regressing orthogonalized outcome on orthogonalized treatment.
This partialling-out interpretation generalizes beyond linear regression:
- Classical econometrics: use linear regression for $E[T \mid X]$ and $E[Y \mid X]$.
- Modern ML: use random forests, boosting, or neural networks for flexible approximation.
But naively replacing linear regression with ML introduces regularization bias — penalized estimators (Lasso, Ridge, neural nets) are biased even asymptotically. Double ML solves this through Neyman orthogonality and cross-fitting.
Roadmap to Chapter 2
Chapter 2 develops the Double Machine Learning framework rigorously:
Neyman orthogonality: we’ll show why the treatment effect $\beta_1$ is orthogonal to nuisance parameters in a precise sense. This orthogonality means regularization bias in nuisance estimation doesn’t contaminate $\hat{\beta}_1$ to first order.
The DML algorithm: the complete procedure with
- Sample splitting: divide data into $K$ folds.
- Cross-fitting: estimate nuisance parameters on one fold, treatment effect on another.
- Aggregation: average across folds.
Theoretical guarantees: under high-level conditions:
- $\sqrt{n}$-consistency and asymptotic normality.
- Valid confidence intervals using cross-fit standard errors.
- Robustness to slow convergence of ML estimators.
Python implementation: working code using EconML with random forest nuisance estimation, inference with confidence intervals, and comparison to naive approaches.
By the end of Chapter 2, you’ll have a complete, validated DML implementation ready for the insurance competitor pricing application in Chapter 4.
Exercises
Conceptual problems
In the insurance pricing example, write out the potential outcomes $Y_t(0)$ and $Y_t(1)$ explicitly in words. What assumption would make them equal (no treatment effect)?
Explain why collecting more data does not solve the fundamental problem of causal inference. What would we need to observe to compute individual treatment effects?
Give an example where unconfoundedness is violated in the insurance pricing setting. What variable might be unobserved that affects both competitor pricing and sales?
Suppose overlap is violated: for VIX above some threshold $c$, we never observe high competitor prices ($e(x) = 0$ for $\text{VIX} > c$). Can we still estimate (a) the ATE for the full population? (b) the ATE conditional on $\text{VIX} \le c$?
Mathematical problems
Prove that where .
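Show that under unconfoundedness, the ATE can be written as
$$\tau_{\text{ATE}} = E\left[\frac{T\,Y}{e(X)} - \frac{(1 - T)\,Y}{1 - e(X)}\right],$$
where $e(X) = P(T = 1 \mid X)$ is the propensity score.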
Computational problems
Modify the FWL Python code to use nonlinear confounding:
```python
# Generate data with nonlinearity
X = np.random.randn(n, 2)
T = X[:, 0]**2 + X[:, 1] + np.random.randn(n)  # Quadratic in X1
Y = 2 * T + X[:, 0]**2 + X[:, 1]**3 + np.random.randn(n)  # True effect = 2
```
Does linear FWL recover the true effect $\beta_1 = 2$? If not, what estimator would work?
In the propensity score example, we computed effective sample sizes of $213/250$ for treated and $214/250$ for controls. Explain why weighting reduces effective sample size and when this is problematic.