Omitted Variable Bias

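[Interactive controls and contour plot not reproduced here.]

Model type: OLS or GLM

Slider settings shown:
TM: relation between Y* and X1, β1 = -1
TM: relation between Y* and X3, β3 = 1
M1: relation between X2 and X3, γ2 = 2
M1, M2: relation between X3 and X1, δ1 = 2
M1, M2: relation between X3 and X2, δ2 = 2

Plot axes: β2 and γ1. Plotted quantity: Δb(β11, β12).
Locator reading: β2 = 5, γ1 = 5, Δb(β11, β12) = 7.69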

In social science research, control variables are often included out of concern that omitting them would bias the coefficients of interest [1, 2]. However, unless the true data-generating process is known, which is unlikely, including even relevant controls may in fact aggravate the problem.

This is shown for linear (OLS) and logit (GLM) models in which the true model includes three covariates. The first misspecified model omits the second and third covariates; the second misspecified model omits only the third. According to the logic of including controls, the bias in the expected value of the coefficient on the first covariate should always be larger in the first misspecified model, unless the covariates are uncorrelated. That exception does not hold for many GLM link functions, where coefficients may be biased even if included and excluded covariates are uncorrelated [3, 4].

Along the red contour line, the bias is the same in the first and second misspecified models. In regions where dashed contour lines indicate positive values, including the control would indeed reduce bias; the lighter the region, the larger the reduction. In regions where solid contour lines indicate negative values, however, including the control would increase bias; the darker the region, the larger the increase. Hover the mouse over a contour line to see the tooltip. To identify exact coordinates, drag the cross-hairs locator to the desired position. The notation follows [1].
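The argument can be checked numerically. What follows is a minimal Python sketch, not the Demonstration's own code. The coefficient values, the covariance matrix, the sample size, and the helper names x1_bias and fit_logit are all hypothetical, chosen only to produce one case in which adding a control increases bias. The first part computes the population bias of the X1 coefficient in a linear model and confirms it by simulation. The second part fits a logit model in which the omitted covariate is independent of the included one.

```python
import numpy as np

def x1_bias(beta, cov, included):
    """Population bias of the X1 coefficient (index 0) in a linear model
    that keeps the covariates in `included` and omits the rest."""
    beta = np.asarray(beta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    omitted = [j for j in range(len(beta)) if j not in included]
    # Coefficients from regressing each omitted covariate on the included ones.
    aux = np.linalg.solve(cov[np.ix_(included, included)],
                          cov[np.ix_(included, omitted)])
    # Bias on X1 = sum over omitted j of beta_j * (X1 coefficient in aux regression j).
    return float(aux[included.index(0)] @ beta[omitted])

def fit_logit(X, y, iters=25):
    """Logit maximum likelihood by Newton's method (no intercept)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        w = p * (1.0 - p)
        b += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return b

# Hypothetical true model: Y = -1*X1 + 1*X2 - 1*X3 + noise.
beta = np.array([-1.0, 1.0, -1.0])
# Assumed covariance: X1 correlates with X2 and X3; X2 and X3 are uncorrelated.
cov = np.array([[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.0],
                [0.5, 0.0, 1.0]])

bias_m1 = x1_bias(beta, cov, included=[0])      # omits X2 and X3
bias_m2 = x1_bias(beta, cov, included=[0, 1])   # omits X3 only

# Simulation check of the linear case.
rng = np.random.default_rng(0)
n = 200_000
X = rng.multivariate_normal(np.zeros(3), cov, size=n)
y = X @ beta + rng.normal(size=n)
b11 = np.linalg.lstsq(X[:, [0]], y, rcond=None)[0][0]
b12 = np.linalg.lstsq(X[:, [0, 1]], y, rcond=None)[0][0]

# Logit case: two independent covariates, assumed coefficients 1 and 2.
Z = rng.normal(size=(n, 2))
prob = 1.0 / (1.0 + np.exp(-(Z @ np.array([1.0, 2.0]))))
d = (rng.random(n) < prob).astype(float)
g_full = fit_logit(Z, d)[0]            # both covariates included
g_short = fit_logit(Z[:, [0]], d)[0]   # uncorrelated covariate omitted

print("linear population bias  M1:", bias_m1, " M2:", bias_m2)
print("linear estimates        b11:", b11, " b12:", b12)
print("logit X1 coefficient    full:", g_full, " short:", g_short)
```

With these assumed values, the biases contributed by X2 and X3 cancel in the first misspecified model, so its population bias is 0. Controlling for X2 removes one of the two offsetting terms and leaves a population bias of -2/3, so the control makes the estimate worse. In the logit part, X1 and X2 are independent by construction, yet omitting X2 still shrinks the estimated X1 coefficient toward zero.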