NAG Library Routine Document
G10ACF estimates the values of the smoothing parameter and fits a cubic smoothing spline to a set of data.
|SUBROUTINE G10ACF (
||METHOD, WEIGHT, N, X, Y, WT, YHAT, C, LDC, RSS, DF, RES, H, CRIT, RHO, U, TOL, MAXCAL, WK, IFAIL)
||N, LDC, MAXCAL, IFAIL
||X(N), Y(N), WT(*), YHAT(N), C(LDC,3), RSS, DF, RES(N), H(N), CRIT, RHO, U, TOL, WK(7*(N+2))
For a set of observations , for , the spline provides a flexible smooth function for situations in which a simple polynomial or nonlinear regression model is not suitable.
Cubic smoothing splines arise as the unique real-valued solution function
, with absolutely continuous first derivative and squared-integrable second derivative, which minimizes
is the (optional) weight for the
th observation and
is the smoothing parameter. This criterion consists of two parts: the first measures the fit of the curve and the second the smoothness of the curve. The value of the smoothing parameter
weights these two aspects; larger values of
give a smoother fitted curve but, in general, a poorer fit. For details of how the cubic spline can be fitted see Hutchinson and de Hoog (1985)
and Reinsch (1967)
The fitted values,
, and weighted residuals,
, can be written as:
for a matrix
. The residual degrees of freedom for the spline is
and the diagonal elements of
are the leverages.
can be estimated in a number of ways.
||The degrees of freedom for the spline can be specified, i.e., find such that for given .
||Minimize the cross-validation (CV), i.e., find such that the CV is minimized, where
||Minimize the generalized cross-validation (GCV), i.e., find such that the GCV is minimized, where
G10ACF requires the
to be strictly increasing. If two or more observations have the same
value then they should be replaced by a single observation with
equal to the (weighted) mean of the
values and weight,
, equal to the sum of the weights. This operation can be performed by G10ZAF
The algorithm is based on Hutchinson (1986)
is used to solve for
and the method of E04ABF/E04ABA
is used to minimize the GCV or CV.
Hastie T J and Tibshirani R J (1990) Generalized Additive Models Chapman and Hall
Hutchinson M F (1986) Algorithm 642: A fast procedure for calculating minimum cross-validation cubic smoothing splines ACM Trans. Math. Software 12 150–153
Hutchinson M F and de Hoog F R (1985) Smoothing noisy data with spline functions Numer. Math. 47 99–106
Reinsch C H (1967) Smoothing by spline functions Numer. Math. 10 177–183
- 1: METHOD – CHARACTER(1)Input
: indicates whether the smoothing parameter is to be found by minimization of the CV or GCV functions, or by finding the smoothing parameter corresponding to a specified degrees of freedom value.
- Cross-validation is used.
- The degrees of freedom are specified.
- Generalized cross-validation is used.
, or .
- 2: WEIGHT – CHARACTER(1)Input
: indicates whether user-defined weights are to be used.
- User-defined weights should be supplied in WT.
- The data is treated as unweighted.
- 3: N – INTEGERInput
On entry: , the number of observations.
- 4: X(N) – REAL (KIND=nag_wp) arrayInput
On entry: the distinct and ordered values
, for .
, for .
- 5: Y(N) – REAL (KIND=nag_wp) arrayInput
On entry: the values
, for .
- 6: WT() – REAL (KIND=nag_wp) arrayInput
the dimension of the array WT
must be at least
and at least
must contain the
is not referenced and unit weights are assumed.
if , , for .
- 7: YHAT(N) – REAL (KIND=nag_wp) arrayOutput
On exit: the fitted values,
, for .
- 8: C(LDC,) – REAL (KIND=nag_wp) arrayOutput
On exit: the spline coefficients. More precisely, the value of the spline approximation at is given by , where and .
- 9: LDC – INTEGERInput
: the first dimension of the array C
as declared in the (sub)program from which G10ACF is called.
On exit: the (weighted) residual sum of squares.
- 11: DF – REAL (KIND=nag_wp)Output
On exit: the residual degrees of freedom. If this will be to the required accuracy.
- 12: RES(N) – REAL (KIND=nag_wp) arrayOutput
On exit: the (weighted) residuals,
, for .
- 13: H(N) – REAL (KIND=nag_wp) arrayOutput
On exit: the leverages,
, for .
- 14: CRIT – REAL (KIND=nag_wp)Input/Output
, the required degrees of freedom for the spline.
need not be set.
, the value of the cross-validation, or if
, the value of the generalized cross-validation function, evaluated at the value of
returned in RHO
- 15: RHO – REAL (KIND=nag_wp)Output
On exit: the smoothing parameter, .
- 16: U – REAL (KIND=nag_wp)Input
: the upper bound on the smoothing parameter. If
will be used instead. See Section 8
for details on how this parameter is used.
- 17: TOL – REAL (KIND=nag_wp)Input
: the accuracy to which the smoothing parameter RHO
is required. TOL
should preferably be not much less than
is the machine precision
will be used instead.
- 18: MAXCAL – INTEGERInput
On entry: the maximum number of spline evaluations to be used in finding the value of . If , will be used instead.
- 19: WK() – REAL (KIND=nag_wp) arrayWorkspace
- 20: IFAIL – INTEGERInput/Output
must be set to
. If you are unfamiliar with this parameter you should refer to Section 3.3
in the Essential Introduction for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value
is recommended. If the output of error messages is undesirable, then the value
is recommended. Otherwise, if you are not familiar with this parameter, the recommended value is
. When the value is used it is essential to test the value of IFAIL on exit.
unless the routine detects an error or a warning has been flagged (see Section 6
6 Error Indicators and Warnings
If on entry
, explanatory error messages are output on the current error message unit (as defined by X04AAF
Errors or warnings detected by the routine:
|or||, or ,|
|or|| or ,|
|or|| and ,|
|or|| and .|
|On entry,|| and at least one element of .|
|On entry,||, for some , .|
and the required value of
for specified degrees of
. Try a larger value of U
; see Section 8
and the accuracy given by TOL
cannot be achieved. Try increasing the value of TOL
A solution to the accuracy given by TOL
has not been achieved in MAXCAL
iterations. Try increasing the value of TOL
and the optimal value of
. Try a larger value of U
; see Section 8
When minimizing the cross-validation or generalized cross-validation, the error in the estimate of should be within . When finding for a fixed number of degrees of freedom the error in the estimate of should be within .
Given the value of , the accuracy of the fitted spline depends on the value of and the position of the values. The values of and are scaled and is transformed to avoid underflow and overflow problems.
The time to fit the spline for a given value of is of order .
When finding the value of
that gives the required degrees of freedom, the algorithm examines the interval
. For small degrees of freedom the value of
can be large, as in the theoretical case of two degrees of freedom when the spline reduces to a straight line and
is infinite. If the CV or GCV is to be minimized then the algorithm searches for the minimum value in the interval
. If the function is decreasing in that range then the boundary value of U
will be returned. In either case, the larger the value of U
the more likely is the interval to contain the required solution, but the process will be less efficient.
Regression splines with a small
number of knots can be fitted by E02BAF
This example uses the data given by Hastie and Tibshirani (1990)
, which consists of the age,
, and C-peptide concentration (pmol/ml),
, from a study of the factors affecting insulin-dependent diabetes mellitus in children. The data is input, reduced to a strictly ordered set by G10ZAF
and a spline with
degrees of freedom is fitted by G10ACF. The fitted values and residuals are printed.
9.1 Program Text
Program Text (g10acfe.f90)
9.2 Program Data
Program Data (g10acfe.d)
9.3 Program Results
Program Results (g10acfe.r)