hide long namesshow long names
hide short namesshow short names
Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox: nag_correg_linregm_fit_stepwise (g02ef)

Purpose

nag_correg_linregm_fit_stepwise (g02ef) calculates a full stepwise selection from pp variables by using Clarke's sweep algorithm on the correlation matrix of a design and data matrix, ZZ. The (weighted) variance-covariance, (weighted) means and sum of weights of ZZ must be supplied.

Syntax

[isx, b, se, rsq, rms, df, user, ifail] = g02ef(n, wmean, c, sw, isx, 'm', m, 'fin', fin, 'fout', fout, 'tau', tau, 'monfun', monfun, 'user', user)
[isx, b, se, rsq, rms, df, user, ifail] = nag_correg_linregm_fit_stepwise(n, wmean, c, sw, isx, 'm', m, 'fin', fin, 'fout', fout, 'tau', tau, 'monfun', monfun, 'user', user)
Note: the interface to this routine has changed since earlier releases of the toolbox:
Mark 23: monfun now optional, monlevel dropped from interface
.

Description

The general multiple linear regression model is defined by
y = β0 + Xβ + ε,
y = β0 +Xβ+ε,
where
nag_correg_linregm_fit_stepwise (g02ef) employs a full stepwise regression to select a subset of explanatory variables from the pp available variables (the intercept is included in the model) and computes regression coefficients and their standard errors, and various other statistical quantities, by minimizing the sum of squares of residuals. The method applies repeatedly a forward selection step followed by a backward elimination step and halts when neither step updates the current model.
The criterion used to update a current model is the variance ratio of residual sum of squares. Let s1s1 and s2s2 be the residual sum of squares of the current model and this model after undergoing a single update, with degrees of freedom q1q1 and q2q2, respectively. Then the condition:
( (s2s1) / (q2q1) )/( s1 / q1 ) > f1 ,
( s2 - s1 ) / ( q2 - q1 ) s1 / q1 > f1 ,
must be satisfied if a variable kk will be considered for entry to the current model, and the condition:
( (s1s2) / (q1q2) )/( s1 / q1 ) < f2 ,
( s1 - s2 ) / ( q1 - q2 ) s1 / q1 < f2 ,
must be satisfied if a variable kk will be considered for removal from the current model, where f1f1 and f2f2 are user-supplied values and f2f1f2f1.
In the entry step the entry statistic is computed for each variable not in the current model. If no variable is associated with a test value that exceeds f1f1 then this step is terminated; otherwise the variable associated with the largest value for the entry statistic is entered into the model.
In the removal step the removal statistic is computed for each variable in the current model. If no variable is associated with a test value less than f2f2 then this step is terminated; otherwise the variable associated with the smallest value for the removal statistic is removed from the model.
The data values XX and yy are not provided as input to the function. Instead, summary statistics of the design and data matrix Z = (Xy)Z=(Xy) are required.
Explanatory variables are entered into and removed from the current model by using sweep operations on the correlation matrix RR of ZZ, given by:
R =
  1 … r1p r1y ⋮ ⋱ ⋮ ⋮ rp1 … 1 rpy ry1 … ryp 1  
,
R = 1 r1p r1y rp1 1 rpy ry1 ryp 1 ,
where rijrij is the correlation between the explanatory variables ii and jj, for i = 1,2,,pi=1,2,,p and j = 1,2,,pj=1,2,,p, and ryiryi (and riyriy) is the correlation between the response variable yy and the iith explanatory variable, for i = 1,2,,pi=1,2,,p.
A sweep operation on the kkth row and column (kpkp) of RR replaces:
rkk ​ by ​ 1 / rkk ;
rik ​ by ​ rik / |rkk| ,  i = 1,2,,p + 1 ​ ​ (ik) ;
rkj ​ by ​ rkj / |rkk| ,  j = 1,2,,p + 1 ​ ​ (jk) ;
rij ​ by ​ rij rik rkj / |rkk| ,  ​ i = 1,2,,p + 1 ​ ​ (ik) ; ​ j = 1,2,,p + 1 ​ ​ (jk) .
rkk ​ by ​ -1 / rkk ; rik ​ by ​ rik / |rkk| ,  i=1,2,,p+1 ​ ​ (ik) ; rkj ​ by ​ rkj / |rkk| ,  j=1,2,,p+1 ​ ​ (jk) ; rij ​ by ​ rij - rik rkj / |rkk| ,  ​ i=1,2,,p+1 ​ ​ (ik) ; ​ j=1,2,,p+1 ​ ​ (jk) .
The kkth explanatory variable is eligible for entry into the current model if it satisfies the collinearity tests: rkk > τrkk>τ and
(rii( rik rki )/( rkk )) τ1 ,
( rii - rik rki rkk ) τ1 ,
for a user-supplied value ( > 0>0) of ττ and where the index ii runs over explanatory variables in the current model. The sweep operation is its own inverse, therefore pivoting on an explanatory variable kk in the current model has the effect of removing it from the model.
Once the stepwise model selection procedure is finished, the function calculates:
(a) the least squares estimate for the iith explanatory variable included in the fitted model;
(b) standard error estimates for each coefficient in the final model;
(c) the square root of the mean square of residuals and its degrees of freedom;
(d) the multiple correlation coefficient.
The function makes use of the symmetry of the sweep operations and correlation matrix which reduces by almost one half the storage and computation required by the sweep algorithm, see Clarke (1981) for details.

References

Clarke M R B (1981) Algorithm AS 178: the Gauss–Jordan sweep operator with detection of collinearity Applied Statistics 31 166–169
Dempster A P (1969) Elements of Continuous Multivariate Analysis Addison–Wesley
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

Parameters

Compulsory Input Parameters

1:     n – int64int32nag_int scalar
The number of observations used in the calculations.
Constraint: n > 1n>1.
2:     wmean(m + 1m+1) – double array
The mean of the design matrix, ZZ.
3:     c((m + 1) × (m + 2) / 2(m+1)×(m+2)/2) – double array
The upper-triangular variance-covariance matrix packed by column for the design matrix, ZZ. Because the function computes the correlation matrix RR from c, the variance-covariance matrix need only be supplied up to a scaling factor.
4:     sw – double scalar
If weights were used to calculate c then sw is the sum of positive weight values; otherwise sw is the number of observations used to calculate c.
Constraint: sw > 1.0sw>1.0.
5:     isx(m) – int64int32nag_int array
m, the dimension of the array, must satisfy the constraint m > 1m>1.
The value of isx(j)isxj determines the set of variables used to perform full stepwise model selection, for j = 1,2,,mj=1,2,,m.
isx(j) = 1isxj=-1
To exclude the variable corresponding to the jjth column of XX from the final model.
isx(j) = 1isxj=1
To consider the variable corresponding to the jjth column of XX for selection in the final model.
isx(j) = 2isxj=2
To force the inclusion of the variable corresponding to the jjth column of XX in the final model.
Constraint: isx(j) = 1,1​ or ​2isxj=-1,1​ or ​2, for j = 1,2,,mj=1,2,,m.

Optional Input Parameters

1:     m – int64int32nag_int scalar
Default: The dimension of the array isx.
The number of explanatory variables available in the design matrix, ZZ.
Constraint: m > 1m>1.
2:     fin – double scalar
The value of the variance ratio which an explanatory variable must exceed to be included in a model.
Default: 4.04.0
Constraint: fin > 0.0fin>0.0.
3:     fout – double scalar
The explanatory variable in a model with the lowest variance ratio value is removed from the model if its value is less than fout. fout is usually set equal to the value of fin; a value less than fin is occasionally preferred.
Default: finfin
Constraint: 0.0foutfin0.0foutfin.
4:     tau – double scalar
The tolerance, ττ, for detecting collinearities between variables when adding or removing an explanatory variable from a model. Explanatory variables deemed to be collinear are excluded from the final model.
Default: 0.0000010.000001
Constraint: tau > 0.0tau>0.0.
5:     monfun – function handle or string containing name of m-file
You may define your own function or specify the NAG defined default function nag_correg_linregm_fit_stepwise_sample_monfun (g02efh).
If monlev = 0monlev=0, monfun is not referenced; otherwise its specification is:
[user] = monfun(flag, var, val, user)

Input Parameters

1:     flag – string (length ≥ 1)
The value of flag indicates the stage of the stepwise selection of explanatory variables.
flag = 'A'flag='A'
Variable var was added to the current model.
flag = 'B'flag='B'
Beginning the backward elimination step.
flag = 'C'flag='C'
Variable var failed the collinearity test and is excluded from the model.
flag = 'D'flag='D'
Variable var was dropped from the current model.
flag = 'F'flag='F'
Beginning the forward selection step
flag = 'K'flag='K'
Backward elimination did not remove any variables from the current model.
flag = 'S'flag='S'
Starting stepwise selection procedure.
flag = 'V'flag='V'
The variance ratio for variable var takes the value val.
flag = 'X'flag='X'
Finished stepwise selection procedure.
2:     var – int64int32nag_int scalar
The index of the explanatory variable in the design matrix ZZ to which flag pertains.
3:     val – double scalar
If flag = 'V'flag='V', val is the variance ratio value for the coefficient associated with explanatory variable index var.
4:     user – Any MATLAB object
monfun is called from nag_correg_linregm_fit_stepwise (g02ef) with the object supplied to nag_correg_linregm_fit_stepwise (g02ef).

Output Parameters

1:     user – Any MATLAB object
6:     user – Any MATLAB object
user is not used by nag_correg_linregm_fit_stepwise (g02ef), but is passed to monfun. Note that for large objects it may be more efficient to use a global variable which is accessible from the m-files than to use user.

Input Parameters Omitted from the MATLAB Interface

monlev iuser ruser

Output Parameters

1:     isx(m) – int64int32nag_int array
The value of isx(j)isxj indicates the status of the jjth explanatory variable in the model.
isx(j) = 1isxj=-1
Forced exclusion.
isx(j) = 0isxj=0
Excluded.
isx(j) = 1isxj=1
Selected.
isx(j) = 2isxj=2
Forced selection.
2:     b(m + 1m+1) – double array
b(1)b1 contains the estimate for the intercept term in the fitted model. If isx(j)0isxj0 then b(j + 1)bj+1 contains the estimate for the jjth explanatory variable in the fitted model; otherwise b(j + 1) = 0bj+1=0.
3:     se(m + 1m+1) – double array
se(j)sej contains the standard error for the estimate of b(j)bj, for j = 1,2,,m + 1j=1,2,,m+1.
4:     rsq – double scalar
The R2R2-statistic for the fitted regression model.
5:     rms – double scalar
The mean square of residuals for the fitted regression model.
6:     df – int64int32nag_int scalar
The number of degrees of freedom for the sum of squares of residuals.
7:     user – Any MATLAB object
8:     ifail – int64int32nag_int scalar
ifail = 0ifail=0 unless the function detects an error (see [Error Indicators and Warnings]).

Error Indicators and Warnings

Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

  ifail = 1ifail=1
On entry,m1m1,
orn1n1,
orsw1.0sw1.0,
orfin0.0fin0.0,
orfout < 0.0fout<0.0,
orfout > finfout>fin,
ortau0.0tau0.0.
  ifail = 2ifail=2
On entry,at least one element of isx was set incorrectly,
orthere are no explanatory variables to select from isx(i)1isxi1, for i = 1,2,,mi=1,2,,m,
orinvalid value for monlev.
W ifail = 3ifail=3
Warning: the design and data matrix ZZ is not positive definite, results may be inaccurate.
  ifail = 4ifail=4
All variables are collinear, there is no model to select.

Accuracy

nag_correg_linregm_fit_stepwise (g02ef) returns a warning if the design and data matrix is not positive definite.

Further Comments

Although the condition for removing or adding a variable to the current model is based on a ratio of variances, these values should not be interpreted as FF-statistics with the usual interpretation of significance unless the probability levels are adjusted to account for correlations between variables under consideration and the number of possible updates (see, e.g., Draper and Smith (1985)).
nag_correg_linregm_fit_stepwise (g02ef) allocates internally O (4 × m + (m + 1) × (m + 2) / 2 + 2) O ( 4×m+ (m+1) × (m+2) /2+2 ) of double storage.

Example

function nag_correg_linregm_fit_stepwise_example
n = int64(13);
wmean = [7.461538461538462;
     48.15384615384615;
     11.76923076923077;
     30;
     95.42307692307693];
c = [415.2307692307692;
     251.0769230769231;
     2905.692307692308;
     -372.6153846153845;
     -166.5384615384615;
     492.3076923076923;
     -290;
     -3041;
     38;
     3362;
     775.9615384615385;
     2292.953846153846;
     -618.2307692307694;
     -2481.7;
     2715.763076923076];
sw = 13;
isx = [int64(1);1;1;1];
[isxOut, b, se, rsq, rms, df, user, ifail] = ...
    nag_correg_linregm_fit_stepwise(n, wmean, c, sw, isx, 'monfun', @monfun)

function [user] = monfun(flag, var, val, user)

  switch flag
    case 'C'
      fprintf('\nVariable %d aliased\n', var);
    case 'S'
      fprintf('\nStarting Stepwise Selection\n');
    case 'F'
      fprintf('\nForward Selection\n');
    case 'V'
      fprintf('Variable %d  Variance ratio = %12.3f\n', var, val);
    case 'A'
      fprintf('\nAdding variable %d to model\n', var);
    case 'B'
      fprintf('\nBackward Selection\n');
    case 'D'
      fprintf('\nDropping variable %d from model\n', var);
    case 'K'
      fprintf('\nKeeping all current variables\n');
    case 'X'
      fprintf('\nFinished Stepwise Selection\n');
  end;
 

Starting Stepwise Selection

Forward Selection
Variable 1  Variance ratio =       12.603
Variable 2  Variance ratio =       21.961
Variable 3  Variance ratio =        4.403
Variable 4  Variance ratio =       22.799

Adding variable 4 to model

Backward Selection
Variable 4  Variance ratio =       22.799

Keeping all current variables

Forward Selection
Variable 1  Variance ratio =      108.224
Variable 2  Variance ratio =        0.172
Variable 3  Variance ratio =       40.295

Adding variable 1 to model

Backward Selection
Variable 1  Variance ratio =      108.224
Variable 4  Variance ratio =      159.295

Keeping all current variables

Forward Selection
Variable 2  Variance ratio =        5.026
Variable 3  Variance ratio =        4.236

Adding variable 2 to model

Backward Selection
Variable 1  Variance ratio =      154.008
Variable 2  Variance ratio =        5.026
Variable 4  Variance ratio =        1.863

Dropping variable 4 from model

Forward Selection
Variable 3  Variance ratio =        1.832
Variable 4  Variance ratio =        1.863

Finished Stepwise Selection

isxOut =

                    1
                    1
                    0
                    0


b =

   52.5773
    1.4683
    0.6623
         0
         0


se =

    2.2943
    0.1213
    0.0459
         0
         0


rsq =

    0.9787


rms =

    5.7904


df =

                   10


user =

     0


ifail =

                    0


function g02ef_example
n = int64(13);
wmean = [7.461538461538462;
     48.15384615384615;
     11.76923076923077;
     30;
     95.42307692307693];
c = [415.2307692307692;
     251.0769230769231;
     2905.692307692308;
     -372.6153846153845;
     -166.5384615384615;
     492.3076923076923;
     -290;
     -3041;
     38;
     3362;
     775.9615384615385;
     2292.953846153846;
     -618.2307692307694;
     -2481.7;
     2715.763076923076];
sw = 13;
isx = [int64(1);1;1;1];
[isxOut, b, se, rsq, rms, df, user, ifail] = ...
    g02ef(n, wmean, c, sw, isx, 'monfun', @monfun)

function [user] = monfun(flag, var, val, user)

  switch flag
    case 'C'
      fprintf('\nVariable %d aliased\n', var);
    case 'S'
      fprintf('\nStarting Stepwise Selection\n');
    case 'F'
      fprintf('\nForward Selection\n');
    case 'V'
      fprintf('Variable %d  Variance ratio = %12.3f\n', var, val);
    case 'A'
      fprintf('\nAdding variable %d to model\n', var);
    case 'B'
      fprintf('\nBackward Selection\n');
    case 'D'
      fprintf('\nDropping variable %d from model\n', var);
    case 'K'
      fprintf('\nKeeping all current variables\n');
    case 'X'
      fprintf('\nFinished Stepwise Selection\n');
  end;
 

Starting Stepwise Selection

Forward Selection
Variable 1  Variance ratio =       12.603
Variable 2  Variance ratio =       21.961
Variable 3  Variance ratio =        4.403
Variable 4  Variance ratio =       22.799

Adding variable 4 to model

Backward Selection
Variable 4  Variance ratio =       22.799

Keeping all current variables

Forward Selection
Variable 1  Variance ratio =      108.224
Variable 2  Variance ratio =        0.172
Variable 3  Variance ratio =       40.295

Adding variable 1 to model

Backward Selection
Variable 1  Variance ratio =      108.224
Variable 4  Variance ratio =      159.295

Keeping all current variables

Forward Selection
Variable 2  Variance ratio =        5.026
Variable 3  Variance ratio =        4.236

Adding variable 2 to model

Backward Selection
Variable 1  Variance ratio =      154.008
Variable 2  Variance ratio =        5.026
Variable 4  Variance ratio =        1.863

Dropping variable 4 from model

Forward Selection
Variable 3  Variance ratio =        1.832
Variable 4  Variance ratio =        1.863

Finished Stepwise Selection

isxOut =

                    1
                    1
                    0
                    0


b =

   52.5773
    1.4683
    0.6623
         0
         0


se =

    2.2943
    0.1213
    0.0459
         0
         0


rsq =

    0.9787


rms =

    5.7904


df =

                   10


user =

     0


ifail =

                    0



PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013