Linear Regression with Outliers

Linear regression based on minimizing squared distances does not work well when some of the data points are outliers, meaning points far away from a presumptive well-fitting line. We discuss an easy modification that allows linear regression to work well even in the presence of outliers.
June 21, 2017—Jan Segert

Utilities

Utilities to plot a function and to plot a list of points:
In[]:=
functionPlot[fn_,style_]:=Block[{x},Plot[fn[x],{x,0,7},PlotStyle->style]]
listPlot[list_,style_]:=ListPlot[list,PlotStyle->style]
Utility to compute the slope and intercept of the linear function through two points:
In[]:=
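twoPointLine[{{x1_,y1_},{x2_,y2_}}]:=Module[{slope,intcpt},
 (* slope falls back to 0 when the two points share the same x *)
 slope=If[x2-x1≠0,(y2-y1)/(x2-x1),0];
 intcpt=y1-slope*x1;
 (* the original cell is cut off here; returning {slope,intcpt} is an assumption *)
 {slope,intcpt}]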

Standard Least-Squares

Classical least-squares fits a straight line to a set of data points by minimizing the sum of the squared vertical distances. We’ll demonstrate this with an example.
First generate a set of “inlier” points near a fixed straight line:
In[]:=
Clear[x];
linf[x_]=0.5*x+1.0;
inlierPoints=Table[{x,linf[x]+RandomReal[{-0.2,0.2}]},{x,1,6}];
Show[listPlot[inlierPoints,{Gray,PointSize[0.02]}],functionPlot[linf,{Gray,Dotted}]]
Out[]=
[Plot: inlier points (gray) and the original line (gray, dotted)]
The solid (red) line that minimizes the sum of squared vertical distances to each of the inlier points is close to the original (gray) dotted line:
In[]:=
pointSet=inlierPoints;
Clear[a,b,x,f,xp,yp]
{xp,yp}=Transpose[pointSet];
n=Length[pointSet];
f[x_]=b*x+a;
(* sum of squared vertical distances from the points to the line f *)
objective=Sum[(Indexed[yp,i]-f[Indexed[xp,i]])^2,{i,1,n}];
fitLine[x_]=a+b*x/.Last[Minimize[objective,{a,b}]]
Show[listPlot[pointSet,{Gray,PointSize[0.02]}],functionPlot[linf,{Gray,Dotted}],functionPlot[fitLine,Hue[1.0]]]
Out[]=
1.03039+0.503843x
Out[]=
[Plot: inlier points (gray), the original line (gray, dotted), and the least-squares fit (red)]
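As a cross-check on the minimization above, the same line can be obtained from the closed-form least-squares formulas or from the built-in Fit function. The following is only a sketch: samplePoints is hypothetical fixed data standing in for the randomly generated inlierPoints, and the names sx, sy, slopeLS, interceptLS and the variable t are introduced here for illustration.
In[]:=
Clear[t];
(* hypothetical sample data, not the notebook's random inlierPoints *)
samplePoints={{1,1.4},{2,2.1},{3,2.6},{4,2.9},{5,3.6},{6,4.1}};
{sx,sy}=Transpose[samplePoints];
(* closed-form least-squares slope and intercept *)
slopeLS=Covariance[sx,sy]/Variance[sx];
interceptLS=Mean[sy]-slopeLS*Mean[sx];
(* built-in linear least-squares fit on the basis {1,t}, shown next to the closed form *)
{interceptLS+slopeLS*t,Fit[samplePoints,{1,t},t]}
The two expressions in the resulting list should agree up to rounding, since both minimize the same sum of squared vertical distances.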
But including a few “outlier” points that are far away from the original line can have a large, unwanted effect on the fitted line.
The least-squares minimization now gives a line (blue) that is not near the original gray line:
In[]:=
outlierPoints={{1.5,4},{5.5,1}};
mixedPoints=Join[outlierPoints,inlierPoints];
pointSet=mixedPoints;
Clear[a,b,x,f,xp,yp]
{xp,yp}=Transpose[pointSet];
n=Length[pointSet];
f[x_]=b*x+a;
(* sum of squared vertical distances from the points to the line f *)
objective=Sum[(Indexed[yp,i]-f[Indexed[xp,i]])^2,{i,1,n}];
fitLine[x_]=a+b*x/.Last[Minimize[objective,{a,b}]]
Show[listPlot[pointSet,{Gray,PointSize[0.02]}],functionPlot[linf,{Gray,Dotted}],functionPlot[fitLine,Hue[0.65]]]
Out[]=
2.3337+0.110481x
Out[]=
[Plot: inlier and outlier points (gray), the original line (gray, dotted), and the least-squares fit (blue)]
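A short calculation suggests why two points are enough to pull the fit so far. The sketch below reuses the line 0.5*x+1.0 and the two outlier coordinates from the cells above, and computes the squared vertical distance of each outlier from that original line.
In[]:=
(* same line and outlier coordinates as in the cells above *)
Clear[x];
linf[x_]:=0.5*x+1.0;
outlierPoints={{1.5,4},{5.5,1}};
(* squared vertical distance of each outlier from the original line *)
(#[[2]]-linf[#[[1]]])^2&/@outlierPoints
Out[]=
{5.0625,7.5625}
Each inlier lies within 0.2 of the original line, so it contributes at most 0.04 to the objective, and the six inliers together contribute at most 0.24. Measured against the original line, the two outliers alone contribute 12.625, so the minimizer can lower the total substantially by moving the line toward them, at the inliers' expense.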

Modifying Least-Squares to Handle Outliers

Behind the Scenes

FURTHER EXPLORATIONS
Random Sample Consensus (RANSAC)
Hough Transform
AUTHORSHIP INFORMATION
Jan Segert
6/21/17