WOLFRAM|DEMONSTRATIONS PROJECT

Nonparametric Curve Estimation by Smoothing Splines: Unbiased-Risk-Estimate Selector and its Robust Version via Randomized Choices

[Controls: function underlying data — (sin(2 π x) + 1) / 2; noise level σ; seed — 7016; trial λ parameter (log scale); percentile p ∈ {50, 75, 90, 95, 99}]
This Demonstration considers a simple nonparametric regression problem: how to recover a function f of one variable, here over [0, 1], when only n couples (x_i, y_i) are known for i = 1, 2, …, n that satisfy the model y_i = f(x_i) + σ ϵ_i, where x_i ∈ [0, 1] and the ϵ_i are independent, standard normal random variables. For simplicity, assume that the variance σ² is also known.
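The model above is easy to simulate. The following sketch generates data from it in Python (purely illustrative; the Demonstration itself runs in the Wolfram Language). The sample size n = 100 and the value σ = 0.2 are assumptions chosen for illustration; the underlying function and the seed 7016 are taken from the controls.

```python
import numpy as np

rng = np.random.default_rng(7016)  # seed value shown in the controls

n = 100        # sample size (assumed; not stated in the text)
sigma = 0.2    # noise level sigma (assumed value for illustration)

# Irregularly spaced design on [0, 1], matching the non-regular setting
x = np.sort(rng.uniform(0.0, 1.0, n))

# Underlying function from the controls: (sin(2 pi x) + 1) / 2
f = (np.sin(2 * np.pi * x) + 1) / 2

# Model: y_i = f(x_i) + sigma * eps_i, with standard normal eps_i
y = f + sigma * rng.standard_normal(n)
```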
The setting is the same as in [1] except that the x_i (the design) are not regularly spaced, and this Demonstration uses the well-known smoothing spline method instead of kernel smoothers (allowing fast computations; see notably the recent forum thread [2], where useful code is provided). Recall that, in place of a bandwidth value, a good value has to be chosen for the famous smoothing parameter, denoted λ. A very small λ produces a quasi-interpolation of the data, while a very large λ yields the well-known polynomial regression fit, here of degree 1 since classical cubic splines are considered. A very popular method for making a good choice is to try several values, compute Mallows's C_L criterion for each one, and retain the λ that yields a minimal C_L(λ) (as in [1], where C_L is denoted UBR, since it is an unbiased risk estimate of the global prediction error).
It is frequently observed that the C_L criterion, as a function of the smoothing parameter, may be rather flat around its minimum (this is also true for the similar GCV criterion). In such a case, even if the global prediction error is itself similarly flat (so that the impact on the predictive quality of the fit may be weak), C_L can then produce a λ that is too small, where "too small" means that spurious oscillations (which could be wrongly interpreted as real peaks) are present in the final estimate of f.
See [3] for a recent review of several approaches to remedy such troubles. Recall that Mallows emphasized, in his original paper, that a careful examination of the whole curve C_L(·) should be preferred to a blind minimization of the pure C_L (or GCV) criterion.
In this Demonstration, we have implemented the randomization-based method introduced in [4, section 7.2], which permits computing a "more parsimonious yet 'near-optimal'" fit. Such a fit is parameterized by a percentile p > 50, which determines an upward modification of the original C_L choice. As in [3], this parameterized modification is called the "robust C_L choice corresponding to the percentile p".
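The exact construction of [4, section 7.2] is not reproduced in this description; the following is only one plausible reading of the randomized idea, sketched under stated assumptions. Each replicate replaces tr(S_λ) in the UBR criterion by Hutchinson's randomized estimate wᵀS_λw with a Rademacher vector w, yielding a randomized criterion curve and its minimizing λ; the robust choice is then the p-th percentile (p > 50, hence an upward shift) of those minimizers on the log scale. The smoother, the replicate count B, and the grid are all illustrative choices.

```python
import numpy as np

def robust_lambda(y, sigma, lambdas, p=90, B=50, seed=0):
    """Robust C_L choice via randomization (illustrative sketch only;
    the construction in [4, section 7.2] may differ in its details).

    Each replicate uses a Rademacher probe w to randomize the trace
    term of UBR, collects the minimizing lambda, and the robust choice
    is the p-th percentile of those minimizers on the log10 scale.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator
    P = D.T @ D
    # Precompute hat matrices on the lambda grid
    hats = [np.linalg.solve(np.eye(n) + lam * P, np.eye(n))
            for lam in lambdas]
    picks = []
    for _ in range(B):
        w = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        scores = []
        for S in hats:
            resid = y - S @ y
            tr_est = w @ S @ w                # randomized tr(S) estimate
            scores.append(resid @ resid / n + 2 * sigma**2 * tr_est / n)
        picks.append(np.log10(lambdas[int(np.argmin(scores))]))
    return 10 ** np.percentile(picks, p)
```

Taking p among 90, 95, or 99 then corresponds to the larger (smoother) end of the randomized minimizers, which is the "upward modification" described above.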
By playing with this Manipulate, with various underlying functions, you can observe that the results are often very satisfactory over a large range of σ values (the noise magnitude), in the sense that the spurious oscillations mentioned above are almost always eliminated. Furthermore, it is rather easy to choose p since, very often, all values of p among 90, 95, and 99 (and even 75 in many cases) yield a quite similar (at least visually) final estimate of f.