# Nonparametric Curve Estimation by Smoothing Splines: Unbiased-Risk-Estimate Selector and its Robust Version via Randomized Choices

This Demonstration considers a simple nonparametric regression problem: how to recover a function $f$ of one variable, here over $[0,1]$, when only $n$ pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, are known. These pairs satisfy the model $y_i = f(x_i) + \sigma \epsilon_i$, where $x_i \in [0,1]$ and the $\epsilon_i$ are independent, standard normal random variables. For simplicity, assume that the variance $\sigma^2$ is also known.

The setting is the same as in [1] except that the $x_i$ (the design) are not regularly spaced, and this Demonstration uses the well-known smoothing spline method instead of kernel smoothers (allowing fast computations; notably, see the recent forum [2] where useful code is provided). In place of a bandwidth value, a good value has to be chosen for the smoothing parameter, denoted by $\lambda$. Recall that a very small $\lambda$ produces a quasi-interpolation of the data, and a very large $\lambda$ yields the polynomial regression fit, here of degree 1 since classical cubic splines are considered. A very popular way to make a good choice is to try several values of $\lambda$, to compute Mallows's $C_L$ criterion for each one, and to retain the $\lambda$ that yields a minimal $C_L(\lambda)$ (as in [1], where $C_L$ is denoted as UBR since it is an unbiased risk estimate of the global prediction error).
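
As a rough illustration of the selection rule described above, the following Python sketch computes $C_L(\lambda)$ on a grid and keeps the minimizer. It is not the Demonstration's own code. It assumes the natural cubic smoothing spline that minimizes $\sum_i (y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt$, built from the standard band matrices $Q$ and $R$ (Reinsch form), and it takes $C_L(\lambda) = \frac{1}{n}\lVert y - A(\lambda) y\rVert^2 + \frac{2\sigma^2}{n}\operatorname{tr} A(\lambda) - \sigma^2$, one common normalization whose constant term does not affect the minimizer. The function names, the grid, the test function and the design are all illustrative assumptions, and the dense pure-Python solver is only suitable for small $n$.

```python
import math
import random

def solve(a, b):
    """Solve a x = b (b may have several columns) by Gaussian elimination
    with partial pivoting. Both arguments are lists of rows."""
    n = len(a)
    m = [ra[:] + rb[:] for ra, rb in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, len(m[r])):
                m[r][c] -= f * m[col][c]
    k = len(m[0]) - n
    x = [[0.0] * k for _ in range(n)]
    for r in range(n - 1, -1, -1):  # back substitution
        for j in range(k):
            s = sum(m[r][c] * x[c][j] for c in range(r + 1, n))
            x[r][j] = (m[r][n + j] - s) / m[r][r]
    return x

def spline_matrices(x):
    """Return Q^T ((n-2) x n) and R ((n-2) x (n-2)) for sorted, distinct x."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    qt = [[0.0] * n for _ in range(n - 2)]
    r = [[0.0] * (n - 2) for _ in range(n - 2)]
    for c in range(n - 2):
        j = c + 1  # index of the interior knot attached to column c of Q
        qt[c][j - 1] = 1.0 / h[j - 1]
        qt[c][j] = -1.0 / h[j - 1] - 1.0 / h[j]
        qt[c][j + 1] = 1.0 / h[j]
        r[c][c] = (h[j - 1] + h[j]) / 3.0
        if c + 1 < n - 2:
            r[c][c + 1] = r[c + 1][c] = h[j] / 6.0
    return qt, r

def fit_and_trace(x, y, lam):
    """Fitted values A(lam) y at the x_i, and tr A(lam), using
    A = I - lam * Q (R + lam Q^T Q)^{-1} Q^T."""
    n, k = len(x), len(x) - 2
    qt, r = spline_matrices(x)
    gram = [[sum(qt[a][i] * qt[b][i] for i in range(n)) for b in range(k)]
            for a in range(k)]
    m = [[r[a][b] + lam * gram[a][b] for b in range(k)] for a in range(k)]
    qty = [sum(qt[a][i] * y[i] for i in range(n)) for a in range(k)]
    # One elimination pass for both right-hand sides: [Q^T y | Q^T Q].
    sol = solve(m, [[qty[a]] + gram[a] for a in range(k)])
    gamma = [row[0] for row in sol]
    trace = n - lam * sum(sol[a][a + 1] for a in range(k))
    fitted = [y[i] - lam * sum(qt[a][i] * gamma[a] for a in range(k))
              for i in range(n)]
    return fitted, trace

def c_l(x, y, sigma, lam):
    """Mallows-type unbiased risk estimate for a known noise level sigma."""
    n = len(x)
    fitted, trace = fit_and_trace(x, y, lam)
    rss = sum((yi - gi) ** 2 for yi, gi in zip(y, fitted))
    return rss / n + 2.0 * sigma ** 2 * trace / n - sigma ** 2

def select_lambda(x, y, sigma, grid):
    """Return the grid value with the smallest C_L."""
    return min(grid, key=lambda lam: c_l(x, y, sigma, lam))

if __name__ == "__main__":
    rng = random.Random(0)
    n, sigma = 30, 0.2
    # Irregular design: one jittered point per cell of a regular grid.
    x = [(i + 0.8 * rng.random()) / n for i in range(n)]
    y = [math.sin(2 * math.pi * xi) + sigma * rng.gauss(0, 1) for xi in x]
    grid = [10.0 ** e for e in range(-8, 1)]
    print("selected lambda:", select_lambda(x, y, sigma, grid))
```

Printing `c_l` over the whole grid, rather than only the minimizer, shows how flat the criterion is near its minimum for a given data set.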

It is frequently observed that the $C_L$ criterion, as a function of the smoothing parameter, may be rather flat around its minimum (this is also true for the similar GCV criterion). In such a case, even if the global prediction error may itself be similarly flat (and thus the impact on the predictive quality of the fit may be weak), a $\lambda$ that is too small can then be produced by $C_L$, where "too small" means that spurious oscillations (which could be wrongly interpreted as real peaks) are present in the final estimate of $f$.

See [3] for a recent review of several approaches to remedy such troubles. Let us recall that Mallows emphasized, in his original paper, that a careful examination of the whole curve $C_L(\cdot)$ should be preferred to a blind minimization of the pure $C_L$ (or GCV) criterion.

In this Demonstration, we have implemented the randomization-based method introduced in [4, section 7.2], which permits computing a "more parsimonious yet 'near-optimal' fit". Such a fit is parameterized by a percentile $p > 50$, which determines an upward modification of the original $C_L$ choice.

As in [3], this parameterized modification is called the "robust $C_L$ choice corresponding to the percentile $p$".
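
The text does not spell out how [4] builds its randomized choices, so the following Python sketch covers only the percentile step, under an assumption: a population of randomized $C_L$ choices of $\lambda$ is already available, and the robust choice is read as the $p$-th percentile of that population. The function name `robust_choice`, the nearest-rank percentile convention and the sample values are hypothetical and are not taken from [3] or [4].

```python
import math

def robust_choice(randomized_choices, p):
    """Nearest-rank p-th percentile of a list of smoothing parameters.

    Assumption: `randomized_choices` holds lambda values produced by some
    randomized version of the C_L selector; p is a percentile in (0, 100].
    """
    if not randomized_choices or not 0 < p <= 100:
        raise ValueError("need a non-empty list and 0 < p <= 100")
    ordered = sorted(randomized_choices)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest-rank position
    return ordered[rank - 1]

if __name__ == "__main__":
    # Hypothetical randomized choices of lambda (illustrative values only).
    choices = [1e-5, 2e-5, 2e-5, 3e-5, 4e-5, 5e-5, 8e-5, 1e-4, 3e-4, 1e-3]
    for p in (75, 90, 95, 99):
        print(p, robust_choice(choices, p))
```

Because a larger $\lambda$ gives a smoother fit, moving from the middle of such a population to a high percentile is what makes the modification "upward".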

By playing with this Manipulate, with various underlying functions, you can observe that the results are often very satisfactory for a large range of values of $\sigma$ (the noise magnitude), in the sense that the mentioned spurious oscillations are almost always eliminated. Furthermore, it is rather easy to choose $p$ since, very often, all the values of $p$ chosen among 90, 95, or 99 (and even 75 in many cases) yield a quite similar (at least visually) final estimate of $f$.