hide long namesshow long names
hide short namesshow short names
Integer type:  int32  int64  nag_int  show int32  show int32  show int64  show int64  show nag_int  show nag_int

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

NAG Toolbox Chapter Introduction

G02 — Correlation and Regression Analysis

Scope of the Chapter

This chapter is concerned with two techniques – correlation analysis and regression modelling – both of which are concerned with determining the inter-relationships among two or more variables.
Other chapters of the NAG Toolbox which cover similar problems are Chapters E02 and E04. Chapter E02 functions may be used to fit linear models by criteria other than least squares, and also for polynomial regression; Chapter E04 functions may be used to fit nonlinear models and linearly constrained linear models.

Background to the Problems

Correlation

Aims of correlation analysis

Correlation analysis provides a single summary statistic – the correlation coefficient – describing the strength of the association between two variables. The most common types of association which are investigated by correlation analysis are linear relationships, and there are a number of forms of linear correlation coefficients for use with different types of data.

Correlation coefficients

The (Pearson) product-moment correlation coefficients measure a linear relationship, while Kendall's tau and Spearman's rank order correlation coefficients measure monotonicity only. All three coefficients range from 1.0-1.0 to + 1.0+1.0. A coefficient of zero always indicates that no linear relationship exists; a + 1.0+1.0 coefficient implies a ‘perfect’ positive relationship (i.e., an increase in one variable is always associated with a corresponding increase in the other variable); and a coefficient of 1.0-1.0 indicates a ‘perfect’ negative relationship (i.e., an increase in one variable is always associated with a corresponding decrease in the other variable).
Consider the bivariate scattergrams in Figure 1: (a) and (b) show strictly linear functions for which the values of the product-moment correlation coefficient, and (since a linear function is also monotonic) both Kendall's tau and Spearman's rank order coefficients, would be + 1.0+1.0 and 1.0-1.0 respectively. However, though the relationships in figures (c) and (d) are respectively monotonically increasing and monotonically decreasing, for which both Kendall's and Spearman's nonparametric coefficients would be + 1.0+1.0 (in (c)) and 1.0-1.0 (in (d)), the functions are nonlinear so that the product-moment coefficients would not take such ‘perfect’ extreme values. There is no obvious relationship between the variables in figure (e), so all three coefficients would assume values close to zero, while in figure (f) though there is an obvious parabolic relationship between the two variables, it would not be detected by any of the correlation coefficients which would again take values near to zero; it is important therefore to examine scattergrams as well as the correlation coefficients.
In order to decide which type of correlation is the most appropriate, it is necessary to appreciate the different groups into which variables may be classified. Variables are generally divided into four types of scales: the nominal scale, the ordinal scale, the interval scale, and the ratio scale. The nominal scale is used only to categorise data; for each category a name, perhaps numeric, is assigned so that two different categories will be identified by distinct names. The ordinal scale, as well as categorising the observations, orders the categories. Each category is assigned a distinct identifying symbol, in such a way that the order of the symbols corresponds to the order of the categories. (The most common system for ordinal variables is to assign numerical identifiers to the categories, though if they have previously been assigned alphabetic characters, these may be transformed to a numerical system by any convenient method which preserves the ordering of the categories.) The interval scale not only categorises and orders the observations, but also quantifies the comparison between categories; this necessitates a common unit of measurement and an arbitrary zero-point. Finally, the ratio scale is similar to the interval scale, except that it has an absolute (as opposed to arbitrary) zero-point.
For a more complete discussion of these four types of scales, and some examples, you are referred to Churchman and Ratoosh (1959) and Hays (1970).
Figure 1
Figure 1
Product-moment correlation coefficients are used with variables which are interval (or ratio) scales; these coefficients measure the amount of spread about the linear least squares equation. For a product-moment correlation coefficient, rr, based on nn pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, the statistic
r×sqrt((n2)/(1r2))
rn-2 1-r2
has a Student's tt-distribution with n2n-2 degrees of freedom; its significance can be tested accordingly.
Ranked and ordinal scale data are generally analysed by nonparametric methods – usually either Spearman's or Kendall's tau rank order correlation coefficients, which, as their names suggest, operate solely on the ranks, or relative orders, of the data values. Interval or ratio scale variables may also be validly analysed by nonparametric methods, but such techniques are statistically less powerful than a product-moment method. For a Spearman rank order correlation coefficient, RR, based on nn pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, for large samples the statistic
R×sqrt((n2)/(1R2))
Rn-2 1-R2
has approximately a Student's tt-distribution with n2n-2 degrees of freedom, and may be treated accordingly. (This is similar to the product-moment correlation coefficient, rr, see above.) Kendall's tau coefficient, based on nn pairs of observations, has, for large samples, an approximately Normal distribution with mean zero and standard deviation
sqrt((4n + 10)/(9n(n1)))
4n+10 9n(n-1)
when tested against the null hypothesis that there is no correlation between the two variables; the coefficient should therefore be divided by this standard deviation and tested against the standard Normal distribution, N(0,1)N(0,1).
When the number of ordinal categories a variable takes is large, and the number of ties is relatively small, Spearman's rank order correlation coefficients have advantages over Kendall's tau; conversely, when the number of categories is small, or there are a large number of ties, Kendall's tau is usually preferred. Thus when the ordinal scale is more or less continuous, Spearman's rank order coefficients are preferred, whereas Kendall's tau is used when the data is grouped into a smaller number of categories; both measures do however include corrections for the occurrence of ties, and the basic concepts underlying the two coefficients are quite similar. The absolute value of Kendall's tau coefficient tends to be slightly smaller than Spearman's coefficient for the same set of data.
There is no authoritative dictum on the selection of correlation coefficients – particularly on the advisability of using correlations with ordinal data. This is a matter of discretion for you.

Partial correlation

The correlation coefficients described above measure the association between two variables ignoring any other variables in the system. Suppose there are three variables X,YX,Y and ZZ as shown in the path diagram below.
Figure G02INT2
Figure 2
The association between YY and ZZ is made up of the direct association between YY and ZZ and the association caused by the path through XX, that is the association of both YY and ZZ with the third variable XX. For example if ZZ and YY were cholesterol level and blood pressure and XX were age since both blood pressure and cholesterol level may increase with age the correlation between blood pressure and cholesterol level eliminating the effect of age is required.
The correlation between two variables eliminating the effect of a third variable is known as the partial correlation. If ρzyρzy, ρzxρzx and ρxyρxy represent the correlations between xx, yy and zz then the partial correlation between ZZ and YY given XX is
(ρzyρzxρxy)/(sqrt((1ρzx2)(1ρxy2))).
ρzy-ρzxρxy(1-ρzx2)(1-ρxy2).
The partial correlation is then estimated by using product-moment correlation coefficients.
In general, let a set of variables be partitioned into two groups YY and XX with nyny variables in YY and nxnx variables in XX and let the variance-covariance matrix of all ny + nxny+nx variables be partitioned into
[ Σxx Σyx Σxy Σyy ]
.
[ Σxx Σyx Σxy Σyy ] .
Then the variance-covariance of YY conditional on fixed values of the XX variables is given by
Σyx = ΣyyΣyxΣxx1Σxy.
Σyx=Σyy-ΣyxΣxx -1Σxy.
The partial correlation matrix is then computed by standardizing ΣyxΣyx.

Robust estimation of correlation coefficients

The product-moment correlation coefficient can be greatly affected by the presence of a few extreme observations or outliers. There are robust estimation procedures which aim to decrease the effect of extreme values.
Mathematically these methods can be described as follows. A robust estimate of the variance-covariance matrix, CC, can be written as
C = τ2(ATA)1
C=τ2(ATA)-1
where τ2τ2 is a correction factor to give an unbiased estimator if the data is Normal and AA is a lower triangular matrix. Let xixi be the vector of values for the iith observation and let zi = A(xiθ)zi=A(xi-θ), θθ being a robust estimate of location, then θθ and AA are found as solutions to
n
1/nw(zi2)zi = 0
i = 1
1ni=1nw(zi2)zi=0
and
n
1/nw(zi2)ziziTv(zi2)I = 0,
i = 1
1n i= 1n w ( zi2 ) zi ziT -v ( zi2 ) I=0,
where w(t)w(t), u(t)u(t) and v(t)v(t) are functions such that they return a value of 11 for reasonable values of tt and decreasing values for large tt. The correlation matrix can then be calculated from the variance-covariance matrix. If ww, uu, and vv returned 11 for all values then the product-moment correlation coefficient would be calculated.

Missing values

When there are missing values in the data these may be handled in one of two ways. Firstly, if a case contains a missing observation for any variable, then that case is omitted in its entirety from all calculations; this may be termed casewise treatment of missing data. Secondly, if a case contains a missing observation for any variable, then the case is omitted from only those calculations involving the variable for which the value is missing; this may be called pairwise treatment of missing data. Pairwise deletion of missing data has the advantage of using as much of the data as possible in the computation of each coefficient. In extreme circumstances, however, it can have the disadvantage of producing coefficients which are based on a different number of cases, and even on different selections of cases or samples; furthermore, the ‘correlation’ matrices formed in this way need not necessarily be positive semidefinite, a requirement for a correlation matrix. Casewise deletion of missing data generally causes fewer cases to be used in the calculation of the coefficients than does pairwise deletion. How great this difference is will obviously depend on the distribution of the missing data, both among cases and among variables.
Pairwise treatment does therefore use more information from the sample, but should not be used without careful consideration of the location of the missing observations in the data matrix, and the consequent effect of processing the missing data in that fashion.

Nearest Correlation Matrix

A correlation matrix is, by definition, a symmetric, positive semidefinite matrix with unit diagonals and all elements in the range [1,1] [-1,1].
In practice, rather than having a true correlation matrix, you may find that you have a matrix of pairwise correlations. This usually occurs in the presence of missing values, when the missing values are treated in a pairwise fashion as discussed in Section [Missing values]. Matrices constructed in this way may not be not positive semidefinite, and therefore are not a valid correlation matrix. However, a valid correlation matrix can be calculated that is in some sense ‘close’ to the original.
Given an n × nn×n matrix, GG, there are a number of available ways of computing the ‘nearest’ correlation matrix, ΣΣ to GG:
(a) Frobenius Norm
Find ΣΣ such that
nn
(sijσij)2
i = 1j = 1
i=1 n j=1 n ( sij - σij ) 2
is minimized.
Where SS is the symmetric matrix defined as S = (1/2) (G + GT) S = 12 (G+GT)  and sijsij and σijσij denotes the elements of SS and ΣΣ respectively.
A weighted Frobenius norm can also be used. The term being summed across therefore becomes wi wj (sijσij)2 wi wj ( sij - σij ) 2  if row and column weights are being used or wij (sijσij)2 wij ( sij - σij ) 2  when element-wise weights are used.
(b) Factor Loading Method
This method is similar to (a) in that it finds a ΣΣ that is closest to SS in the Frobenius norm. However, it also ensures that ΣΣ has a kk-factor structure, that is ΣΣ can be written as
Σ = XXT + diag(IXXT)
Σ= XXT+ diag(I-XXT)
where II is the identity matrix and XX has nn rows and kk columns.
XX is often referred to as the factor loading matrix. This problem primarily arises when a factor model ξ = Xη + Dεξ=Xη+Dε is used to describe a multivariate time series or collateralized debt obligations. In this model ηkηk and ξnξn are vectors of independent random variables having zero mean and unit variance, with ηη and εε independent of each other, and Xn × kXn×k with Dn × nDn×n diagonal. In the case of modelling debt obligations ξξ can, for example, model the equity returns of nn different companies of a portfolio where ηη describes kk factors influencing all companies, in contrast to the elements of εε having only an effect on the equity of the corresponding company. With this model the complex behaviour of a portfolio, with potentially thousands of equities, is captured by looking at the major factors driving the behaviour.
The number of factors usually chosen is a lot smaller than nn, perhaps between 11 and 1010, yielding a large reduction in the complexity. The number of the factors, kk, which yields a matrix XX such that GXXT + diag(IXXT)FG-XXT+diag(I-XXT)F is within a required tolerance can also be determined, by experimenting with the input kk and comparing the norms.

Regression

Aims of regression modelling

In regression analysis the relationship between one specific random variable, the dependent or response variable, and one or more known variables, called the independent variables or covariates, is studied. This relationship is represented by a mathematical model, or an equation, which associates the dependent variable with the independent variables, together with a set of relevant assumptions. The independent variables are related to the dependent variable by a function, called the regression function, which involves a set of unknown parameters. Values of the parameters which give the best fit for a given set of data are obtained; these values are known as the estimates of the parameters.
The reasons for using a regression model are twofold. The first is to obtain a description of the relationship between the variables as an indicator of possible causality. The second reason is to predict the value of the dependent variable from a set of values of the independent variables. Accordingly, the most usual statistical problems involved in regression analysis are:
(i) to obtain best estimates of the unknown regression parameters;
(ii) to test hypotheses about these parameters;
(iii) to determine the adequacy of the assumed model; and
(iv) to verify the set of relevant assumptions.

Regression models and designed experiments

One application of regression models is in the analysis of experiments. In this case the model relates the dependent variable to qualitative independent variables known as factors. Factors may take a number of different values known as levels. For example, in an experiment in which one of four different treatments is applied, the model will have one factor with four levels. Each level of the factor can be represented by a dummy variable taking the values 00 or 11. So in the example there are four dummy variables xjxj, for j = 1,2,3,4j=1,2,3,4, such that:
xij = 1​ if the ​ith observation received the ​jth treatment
= 0​ otherwise,
xij =1​ if the ​ith observation received the ​jth treatment =0​ otherwise,
along with a variable for the mean x0x0:
xi0 = 1​ for all ​i.
xi0 =1​ for all ​i.
If there were 77 observations the data would be:
Treatment Y x0 x1 x2 x3 x4
1 y1 1 1 0 0 0
2 y2 1 0 1 0 0
2 y3 1 0 1 0 0
3 y4 1 0 0 1 0
3 y5 1 0 0 1 0
4 y6 1 0 0 0 1
4 y7 1 0 0 0 1
Treatment Y x0 x1 x2 x3 x4 1 y1 1 1 0 0 0 2 y2 1 0 1 0 0 2 y3 1 0 1 0 0 3 y4 1 0 0 1 0 3 y5 1 0 0 1 0 4 y6 1 0 0 0 1 4 y7 1 0 0 0 1
When dummy variables are used it is common for the model not to be of full rank. In the case above, the model would not be of full rank because
xi4 = xi0xi1xi2xi3,  i = 1,2,,7.
xi4=xi0-xi1-xi2-xi3,  i=1,2,,7.
This means that the effect of x4x4 cannot be distinguished from the combined effect of x0,x1,x2x0,x1,x2 and x3x3. This is known as aliasing. In this situation, the aliasing can be deduced from the experimental design and as a result the model to be fitted; in such situations it is known as intrinsic aliasing. In the example above no matter how many times each treatment is replicated (other than 00) the aliasing will still be present. If the aliasing is due to a particular dataset to which the model is to be fitted then it is known as extrinsic aliasing. If in the example above observation 11 was missing then the x1x1 term would also be aliased. In general intrinsic aliasing may be overcome by changing the model, e.g., remove x0x0 or x1x1 from the model, or by introducing constraints on the parameters, e.g., β1 + β2 + β3 + β4 = 0β1+β2+β3+β4=0.
If aliasing is present then there will no longer be a unique set of least squares estimates for the parameters of the model but the fitted values will still have a unique estimate. Some linear functions of the parameters will also have unique estimates; these are known as estimable functions. In the example given above the functions (β0 + β1β0+β1) and (β2β3β2-β3) are both estimable.

Selecting the regression model

In many situations there are several possible independent variables, not all of which may be needed in the model. In order to select a suitable set of independent variables, two basic approaches can be used.
(a) All possible regressions
In this case all the possible combinations of independent variables are fitted and the one considered the best selected. To choose the best, two conflicting criteria have to be balanced. One is the fit of the model which will improve as more variables are added to the model. The second criterion is the desire to have a model with a small number of significant terms. Depending on how the model is fit, statistics such as R2R2, which gives the proportion of variation explained by the model, and CpCp, which tries to balance the size of the residual sum of squares against the number of terms in the model, can be used to aid in the choice of model.
(b) Stepwise model building
In stepwise model building the regression model is constructed recursively, adding or deleting the independent variables one at a time. When the model is built up the procedure is known as forward selection. The first step is to choose the single variable which is the best predictor. The second independent variable to be added to the regression equation is that which provides the best fit in conjunction with the first variable. Further variables are then added in this recursive fashion, adding at each step the optimum variable, given the other variables already in the equation. Alternatively, backward elimination can be used. This is when all variables are added and then the variables dropped one at a time, the variable dropped being the one which has the least effect on the fit of the model at that stage. There are also hybrid techniques which combine forward selection with backward elimination.

Linear Regression Models

When the regression model is linear in the parameters (but not necessarily in the independent variables), then the regression model is said to be linear; otherwise the model is classified as nonlinear.
The most elementary form of regression model is the simple linear regression of the dependent variable, YY, on a single independent variable, xx, which takes the form
E(Y) = β0 + β1x
E(Y)=β0+β1x
(1)
where E(Y)E(Y) is the expected or average value of YY and β0β0 and β1β1 are the parameters whose values are to be estimated, or, if the regression is required to pass through the origin (i.e., no constant term),
E(Y) = β1x
E(Y)=β1x
(2)
where β1β1 is the only unknown parameter.
An extension of this is multiple linear regression in which the dependent variable, YY, is regressed on the pp (p > 1p>1) independent variables, x1,x2,,xpx1,x2,,xp, which takes the form
E(Y) = β0 + β1x1 + β2x2 + + βpxp
E(Y)=β0+β1x1+β2x2++βpxp
(3)
where β1,β2,,βpβ1,β2,,βp and β0β0 are the unknown parameters. Multiple linear regression models test include factors are sometimes known as General Linear (Regression) Models.
A special case of multiple linear regression is polynomial linear regression, in which the pp independent variables are in fact powers of the same single variable xx (i.e., xj = xjxj=xj, for j = 1,2,,pj=1,2,,p).
In this case, the model defined by (3) becomes
E(Y) = β0 + β1x + β2x2 + + βpxp.
E(Y)=β0+β1x+β2x2++βpxp.
(4)
There are a great variety of nonlinear regression models; one of the most common is exponential regression, in which the equation may take the form
E(Y) = a + becx.
E(Y)=a+becx.
(5)
It should be noted that equation (4) represents a linear regression, since even though the equation is not linear in the independent variable, xx, it is linear in the parameters β0,β1,β2, . ,βpβ0,β1,β2,.,βp, whereas the regression model of equation (5) is nonlinear, as it is nonlinear in the parameters (aa, bb and cc).

Fitting the regression model – least squares estimation

One method used to determine values for the parameters is, based on a given set of data, to minimize the sums of squares of the differences between the observed values of the dependent variable and the values predicted by the regression equation for that set of data – hence the term least squares estimation. For example, if a regression model of the type given by equation (3), namely
E(Y) = β0x0 + β1x1 + β2x2 + + βpxp,
E(Y)=β0x0+β1x1+β2x2++βpxp,
where x0 = 1x0=1 for all observations, is to be fitted to the nn data points
(x01,x11,x21,,xp1,y1)
(x02,x12,x22,,xp2,y2)
(x0n,x1n,x2n,,xpn,yn)
(x01,x11,x21,,xp1,y1) (x02,x12,x22,,xp2,y2) (x0n,x1n,x2n,,xpn,yn)
(6)
such that
yi = β0x0 + β1x1i + β2x2i + + βpxpi + ei,   i = 1,2,,n
yi=β0x0+β1x1i+β2x2i++βpxpi+ei,   i= 1,2,,n
where eiei are unknown independent random errors with E(ei) = 0E(ei)=0 and var(ei) = σ2var(ei)=σ2, σ2σ2 being a constant, then the method used is to calculate the estimates of the regression parameters β0,β1,β2,,βpβ0,β1,β2,,βp by minimizing
n
ei2.
i = 1
i=1nei2.
(7)
If the errors do not have constant variance, i.e.,
var(ei) = σi2 = (σ2)/(wi)
var(ei)=σi2=σ2wi
then weighted least squares estimation is used in which
n
wiei2
i = 1
i=1nwiei2
is minimized. For a more complete discussion of these least squares regression methods, and details of the mathematical techniques used, see Draper and Smith (1985) or Kendall and Stuart (1973).

Computational methods for least squares regression

Let XX be the nn by pp matrix of independent variables and yy be the vector of values for the dependent variable. To find the least squares estimates of the vector of parameters, β̂β^, the QRQR decomposition of XX is found, i.e.,
X = QR*
X=QR*
where R* =
(R)
0
R*= R 0 , RR being a pp by pp upper triangular matrix, and QQ an nn by nn orthogonal matrix. If RR is of full rank then β̂β^ is the solution to
Rβ̂ = c1
Rβ^=c1
where c = QTyc=QTy and c1c1 is the first pp rows of cc. If RR is not of full rank, a solution is obtained by means of a singular value decomposition (SVD) of RR,
R = Q*
(D0)
0 0
PT,
R=Q* D 0 0 0 PT,
where DD is a kk by kk diagonal matrix with nonzero diagonal elements, kk being the rank of RR, and Q*Q* and PP are pp by pp orthogonal matrices. This gives the solution
β̂ = P1D1 Q*1T c1,
β^=P1D-1 Q*1T c1,
P1P1 being the first kk columns of PP and Q*1Q*1 being the first kk columns of Q*Q*.
This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the parameters. If weighted regression with a vector of weights ww is required then both XX and yy are premultiplied by w1 / 2w1/2.
The method described above will, in general, be more accurate than methods based on forming (XTXXTX), (or a scaled version), and then solving the equations
(XTX)β̂ = XTy.
(XTX)β^=XTy.

Examining the fit of the model

Having fitted a model two questions need to be asked: first, ‘are all the terms in the model needed?’ and second, ‘is there some systematic lack of fit?’. To answer the first question either confidence intervals can be computed for the parameters or tt-tests can be calculated to test hypotheses about the regression parameters – for example, whether the value of the parameter, βkβk, is significantly different from a specified value, bkbk (often zero). If the estimate of βkβk is β̂kβ^k and its standard error is se(β̂k)se(β^k) then the tt-statistic is
(β̂kbk)/(sqrt(se(β̂k))).
β^k-bk se(β^k) .
It should be noted that both the tests and the confidence intervals may not be independent. Alternatively FF-tests based on the residual sums of squares for different models can also be used to test the significance of terms in the model. If model 11, giving residual sum of squares RSS1RSS1 with degrees of freedom ν1ν1, is a sub-model of model 22, giving residual sum of squares RSS2RSS2 with degrees of freedom ν2ν2, i.e., all terms in model 11 are also in model 22, then to test if the extra terms in model 22 are needed the FF-statistic
F = ((RSS1RSS2) / (ν1ν2))/(RSS2 / ν2)
F=(RSS1-RSS2)/(ν1-ν2) RSS2/ν2
may be used. These tests and confidence intervals require the additional assumption that the errors, eiei, are Normally distributed.
To check for systematic lack of fit the residuals, ri = yiiri=yi-y^i, where iy^i is the fitted value, should be examined. If the model is correct then they should be random with no discernible pattern. Due to the way they are calculated the residuals do not have constant variance. Now the vector of fitted values can be written as a linear combination of the vector of observations of the dependent variable, yy, = Hyy^=Hy. The variance-covariance matrix of the residuals is then (IH)σ2(I-H)σ2, II being the identity matrix. The diagonal elements of HH, hiihii, can therefore be used to standardize the residuals. The hiihii are a measure of the effect of the iith observation on the fitted model and are sometimes known as leverages.
If the observations were taken serially the residuals may also be used to test the assumption of the independence of the eiei and hence the independence of the observations.

Ridge regression

When data on predictor variables xx are multicollinear, ridge regression models provide an alternative to variable selection in the multiple regression model. In the ridge regression case, parameter estimates in the linear model are found by penalised least squares:
i = 1n [(j = 1pxijβ̂j)yi]2 + h j = 1p β̂j2 ,   h+ ,
i=1 n [ ( j=1 p xij β^j ) - yi ] 2 + h j=1 p β^j2 ,   h+ ,
where the value of the ridge parameter hh controls the trade-off between the goodness-of-fit and smoothness of a solution.

Robust Estimation

Least squares regression can be greatly affected by a small number of unusual, atypical, or extreme observations. To protect against such occurrences, robust regression methods have been developed. These methods aim to give less weight to an observation which seems to be out of line with the rest of the data given the model under consideration. That is to seek to bound the influence. For a discussion of influence in regression, see Hampel et al. (1986) and Huber (1981).
There are two ways in which an observation for a regression model can be considered atypical. The values of the independent variables for the observation may be atypical or the residual from the model may be large.
The first problem of atypical values of the independent variables can be tackled by calculating weights for each observation which reflect how atypical it is, i.e., a strongly atypical observation would have a low weight. There are several ways of finding suitable weights; some are discussed in Hampel et al. (1986).
The second problem is tackled by bounding the contribution of the individual eiei to the criterion to be minimized. When minimizing (7) a set of linear equations is formed, the solution of which gives the least squares estimates. The equations are
n
eixij = 0, j = 0,1,,k.
i = 1
i=1neixij=0, j=0,1,,k.
These equations are replaced by
n
ψ(ei / σ)xij = 0, j = 0,1,,k,
i = 1
i=1nψ(ei/σ)xij=0, j=0,1,,k,
(8)
where σ2σ2 is the variance of the eiei, and ψψ is a suitable function which down weights large values of the standardized residuals ei / σei/σ. There are several suggested forms for ψψ, one of which is Huber's function,
ψ(t) =
{ − c,t < c − t,|t| ≤ c − c,t > c
ψ(t)={ -c,t<c -t,|t|c -c,t>c
(9)
Figure 3
Figure 3
The solution to (8) gives the MM-estimates of the regression coefficients. The weights can be included in (8) to protect against both types of extreme value. The parameter σσ can be estimated by the median absolute deviations of the residuals or as a solution to, in the unweighted case,
n
χ(ei / σ̂) = (nk)β,
i = 1
i=1nχ(ei/σ^)=(n-k)β,
where χχ is a suitable function and ββ is a constant chosen to make the estimate unbiased. χχ is often chosen to be ψ2 / 2ψ2/2 where ψψ is given in (9). Another form of robust regression is to minimize the sum of absolute deviations, i.e.,
n
|ei|.
i = 1
i=1n|ei|.
For details of robust regression, see Hampel et al. (1986) and Huber (1981).
Robust regressions using least absolute deviations can be computed using functions in Chapter E02.

Generalized Linear Models

Generalized linear models are an extension of the general linear regression model discussed above. They allow a wide range of models to be fitted. These included certain nonlinear regression models, logistic and probit regression models for binary data, and log-linear models for contingency tables. A generalized linear model consists of three basic components:
(a) A suitable distribution for the dependent variable YY. The following distributions are common:
(i) Normal
(ii) binomial
(iii) Poisson
(iv) gamma
In addition to the obvious uses of models with these distributions it should be noted that the Poisson distribution can be used in the analysis of contingency tables while the gamma distribution can be used to model variance components. The effect of the choice of the distribution is to define the relationship between the expected value of YY, E(Y) = μE(Y)=μ, and its variance and so a generalized linear model with one of the above distributions may be used in a wider context when that relationship holds.
(b) A linear model η = βjxjη=βjxj, ηη is known as a linear predictor.
(c) A link function g( · )g(·) between the expected value of YY and the linear predictor, g(μ) = ηg(μ)=η. The following link functions are available:
For the binomial distribution εε, observing yy out of tt:
(i) logistic link: η = log(μ/(tμ)) η=log(μt-μ ) ;
(ii) probit link: η = Φ1 (μ/t) η=Φ-1 (μt) ;
(iii) complementary log-log: η = log(log(1μ/t)) η=log(-log(1-μt) ) .
For the Normal, Poisson, and gamma distributions:
(i) exponent link: η = μaη=μa, for a constant aa;
(ii) identity link: η = μη=μ;
(iii) log link: η = logμη=logμ;
(iv) square root link: η = sqrt(μ)η=μ;
(v) reciprocal link: η = 1/μ η=1μ .
For each distribution there is a canonical link. For the canonical link there exist sufficient statistics for the parameters. The canonical links are:
(i) Normal – identity;
(ii) binomial – logistic;
(iii) Poisson – logarithmic;
(iv) gamma – reciprocal.
For the general linear regression model described above the three components are:
(i) Distribution – Normal;
(ii) Linear model – βjxjβjxj;
(iii) Link – identity.
The model is fitted by maximum likelihood; this is equivalent to least squares in the case of the Normal distribution. The residual sums of squares used in regression models is generalized to the concept of deviance. The deviance is the logarithm of the ratio of the likelihood of the model to the full model in which μ̂i = yiμ^i=yi, where μ̂iμ^i is the estimated value of μiμi. For the Normal distribution the deviance is the residual sum of squares. Except for the case of the Normal distribution with the identity link, the χ2χ2 and FF-tests based on the deviance are only approximate; also the estimates of the parameters will only be approximately Normally distributed. Thus only approximate zz- or tt-tests may be performed on the parameter values and approximate confidence intervals computed.
The estimates are found by using an iterative weighted least squares procedure. This is equivalent to the Fisher scoring method in which the Hessian matrix used in the Newton–Raphson method is replaced by its expected value. In the case of canonical links the Fisher scoring method and the Newton–Raphson method are identical. Starting values for the iterative procedure are obtained by replacing the μiμi by yiyi in the appropriate equations.

Linear Mixed Effects Regression

In a standard linear model the independent (or explanatory) variables are assumed to take the same set of values for all units in the population of interest. This type of variable is called fixed. In contrast, an independent variable that fluctuates over the different units is said to be random. Modelling a variable as fixed allows conclusions to be drawn only about the particular set of values observed. Modelling a variable as random allows the results to be generalized to the different levels that may have been observed. In general, if the effects of the levels of a variable are thought of as being drawn from a probability distribution of such effects then the variable is random. If the levels are not a sample of possible levels then the variable is fixed. In practice many qualitative variables can be considered as having fixed effects and most blocking, sampling design, control and repeated measures as having random effects.
In a general linear regression model, defined by
y = Xβ + ε
y=Xβ+ε
where yy is a vector of nn observations on the dependent variable,
XX is an nn by pp design matrix of independent variables,
ββ is a vector of pp unknown parameters,
and εε is a vector of nn, independent and identically distributed, unknown errors, with ε ~ N (0,σ2) ε ~ N ( 0 , σ 2 ) ,
there are pp fixed effects (the ββ) and a single random effect (the error term εε).
An extension to the general linear regression model that allows for additional random effects is the linear mixed effects regression model, (sometimes called the variance components model). One parameterisation of a linear mixed effects model is
y = Xβ + Zν + ε
y=Xβ+Zν+ε
where yy is a vector of nn observations on the dependent variable,
XX is an nn by pp design matrix of fixed independent variables,
ββ is a vector of pp unknown fixed effects,
ZZ is an nn by qq design matrix of random independent variables,
νν is a vector of length qq of unknown random effects,
εε is a vector of length nn of unknown random errors,
and νν and εε are normally distributed with expectation zero and variance / covariance matrix defined by
Var
  ν ε  
=
  G 0 0 R  
.
Var( ν ε ) = ( G 0 0 R ) .
The functions currently available in this chapter are restricted to cases where R = σR2 I R= σ R 2 I , II is the n × nn×n identity matrix and GG is a diagonal matrix. Given this restriction the random variables, Z Z, can be subdivided into gqgq groups containing one or more variables. The variables in the iith group are identically distributed with expectation zero and variance σi2σi2. The model therefore contains three sets of unknowns, the fixed effects, ββ, the random effects, νν, and a vector of g + 1g+1 variance components, γγ, with γ = {σ12,σ22,,,,σg12,σg2,σR2} γ = {σ12,σ22,,,, σ g-1 2 ,σg2,σR2} . Rather than work directly with γγ and the full likelihood function, γγ is replaced by γ* = { σ12 / σR2 , σ22 / σR2 ,, σg12 / σR2 , σg2 / σR2 ,1} γ* = { σ12 / σR2 , σ22 / σR2 ,, σg-12 / σR2 , σg2 / σR2 ,1}  and the profiled likelihood function is used instead.
The model parameters are estimated using an iterative method based on maximizing either the restricted (profiled) likelihood function or the (profiled) likelihood functions. Fitting the model via restricted maximum likelihood involves maximizing the function
2 lR = log(|V|) + (np) log(rTV1r) + log|XTV1X| + (np) (1 + log(2π / (np))) + (np) .
-2 l R = log( |V| ) + ( n-p ) log( rT V-1 r ) + log| XT V-1 X | + ( n-p ) ( 1+ log( 2 π / ( n-p ) ) ) + ( n-p ) .
Whereas fitting the model via maximum likelihood involves maximizing
2 lR = log(|V|) + n log(rTV1r) + n log(2π / n) + n .
-2 l R = log( |V| ) + n log( rT V-1 r ) + n log( 2 π / n ) +n .
In both cases
V = ZG ZT + R,   r = yXb   and   b = (XTV1X)1 XT V1 y .
V = ZG ZT + R,   r=y-Xb   and   b = ( XT V-1 X )-1 XT V-1 y .
Once the final estimates for γ* γ *  have been obtained, the value of σR2 σR2  is given by
σR2 = (rTV1r) / (np) .
σR2 = ( rT V-1 r ) / ( n - p ) .
Case weights, Wc Wc , can be incorporated into the model by replacing XTX XTX  and ZTZ ZTZ  with XTWcX XTWcX  and ZTWcZ ZTWcZ  respectively, for a diagonal weight matrix Wc Wc .

Quantile Regression

Quantile regression is related to least squares regression in that both are interested in studying the relationship between a response variable and one or more independent or explanatory variables. However, whereas least squares regression is concerned with modelling the conditional mean of the dependent variable, quantile regression models the conditional ττth quantile of the dependent variable, for some value of τ(0,1)τ(0,1). So, for example, τ = 0.5τ=0.5 would be the median.
Throughout this section we will be making use of the following definitions:
(a) If ZZ is a real valued random variable with distribution function FF and density function ff, such that
α
F(α) = P(Zα) = f(z)dz
F ( α ) = P ( Z α ) = - α f ( z ) dz
then the ττth quantile, αα, can be defined as
α = F1 (τ) = inf{z : F(z)τ} , τ (0, 1 ) .
α = F-1 (τ) = inf{ z: F ( z ) τ } , τ (0,1) .
(b) I (L) I ( L )  denotes an indicator function taking the value 11 if the logical expression LL is true and 0 otherwise, e.g., I (z < 0) = 1 I ( z<0 ) = 1  if z < 0z<0 and 00 if z0z0.
(c) yy denotes a vector of nn observations on the dependent (or response) variable, y = {yi : i = 1,2,,n} y = { y i : i = 1, 2, , n } .
(d) XX denotes an n × pn×p matrix of explanatory or independent variables, often referred to as the design matrix, and xixi denotes a column vector of length pp which holds the iith row of XX.

Finding a sample quantile as an optimization problem

Consider the piecewise linear loss function
ρτ (z) = z (τI(z < 0))
ρ τ ( z ) = z ( τ - I ( z < 0 ) )
The minimum of the expectation
α
E(ρτ(zα)) = (τ1)(zα)f(z)dz + τ(zα)f(z)dz
α
E ( ρ τ ( z - α ) ) = ( τ - 1 ) - α ( z - α ) f ( z ) dz + τ α ( z - α ) f ( z ) dz
can be obtained by using the integral rule of Leibnitz to differentiate with respect to zz and then setting the result to zero, giving
α
(1τ)f(z)dzf(z)dz = F(α)τ = 0
α
( 1 - τ ) - α f ( z ) dz - α f ( z ) dz = F ( α ) - τ = 0
hence α = F 1 (τ) α = F - 1 ( τ )  when the solution is unique. If the solution is not unique then there exists a range of quantiles, each of which is equally valid. Taking the smallest value of such a range ensures that the empirical quantile function is left-continuous. Therefore obtaining the ττth quantile of a distribution FF can be achieved by minimizing the expected value of the loss function ρτρτ.
This idea of obtaining the quantile by solving an optimization problem can be extended to finding the ττth sample quantile. Given a vector of nn observed values, yy, from some distribution the empirical distribution function, Fn (α) = n1 i = 1n I (yiα) F n ( α ) = n-1 i=1 n I ( y i α )  provides an estimate of the unknown distribution function FF giving an expected loss of
n
E(ρτ(yα)) = n 1 ρτ (yiα)
i = 1
E ( ρ τ ( y - α ) ) = n - 1 i=1 n ρ τ ( y i - α )
and therefore the problem of finding the ττth sample quantile, α̂ (τ) α^ ( τ ) , can be expressed as finding the solution to the problem
n
minimize  ρτ (yiα)
α i = 1
minimize α i=1 n ρ τ ( y i - α )
effectively replacing the operation of sorting, usually required when obtaining a sample quantile, with an optimization.

From least squares to quantile regression

Given the vector yy it is a well known result that the sample mean, y^, solves the least squares problem
n
minimize  (yiμ)2 .
μ i = 1
minimize μ i=1 n ( y i - μ ) 2 .
This result leads to least squares regression where, given design matrix XX and defining the conditional mean of yy as μ (X) = X β μ ( X ) = X β , an estimate of ββ is obtained from the solution to
n
minimize  (yixiTβ)2 .
β p i = 1
minimize β p i=1 n ( y i - xiT β ) 2 .
Quantile regression can be derived in a similar manner by specifying the ττth conditional quantile as Qy (τ|X) = X β (τ) Q y ( τ | X ) = X β ( τ )  and estimating β (τ) β ( τ )  as the solution to
n
minimize  ρτ (yixiTβ) .
β p i = 1
minimize β p i=1 n ρ τ ( y i - xiT β ) .
(10)

Quantile regression as a linear programming problem

By introducing 2n2n slack variables, u = {ui : i = 1,2,,n} u = { u i : i = 1, 2, , n }  and v = {ui : i = 1,2,,n} v = { u i : i = 1, 2, , n } , the quantile regression minimization problem, (10), can be expressed as a linear programming (LP) problem, with primal and associated dual formulations
(a) Primal form
minimize τeTu + (1τ)eTv​   subject to  y = Xβ + uv
(u,v,β) + n × + n × p
minimize ( u, v, β ) + n × + n × p τ eT u + ( 1 - τ ) eT v ​   subject to   y = Xβ + u - v
(11)
where ee is a vector of length nn, where each element is 11.
If riri denotes the iith residual, ri = yi xiT β r i = y i - xiT β , then the slack variables, (u,v)(u,v), can be thought as corresponding to the absolute value of the positive and negative residuals respectively with
u i = { r i ​ if ​ r i > 0 0 ​ otherwise v i = { - r i ​ if ​ r i < 0 0 ​ otherwise
ui = { ri ​ if ​ ri > 0 0 ​ otherwise
vi = { − ri ​ if ​ ri < 0 0 ​ otherwise
(b) Dual form
The dual formulation of (11) is given by
maximize yTd​   subject to  XTd = 0, d [τ1,τ]n
d
maximize d yT d ​   subject to   XTd=0, d [τ-1,τ] n
which, on setting a = d + (1τ)ea=d+(1-τ)e, is equivalent to
maximize yTa​   subject to   XT a = (1τ) XT e , a [0,1]n
a
maximize a yT a ​   subject to   XT a = ( 1 - τ ) XT e , a [0,1] n
(12)
(c) Canonical form
Linear programming problems are often described in a standard way, called the canonical form. The canonical form of an LP problem is
minimize cTz​   subject to  ll{
z
Az
}
lu.
z
minimize z cT z ​   subject to   ll { z Az } lu .
Letting 0p0p denote a vector of pp zeros ± p±p denote a vector of pp arbitarily small or large values, In × nIn×n denote the n × nn×n identity matrix, c = {a,b}c={a,b} denote the row vector constructed by concatenating the elements of vector bb to the elements of vector aa and C = [A,B]C=[A,B] denote the matrix constructed by concatenating the columns of matrix BB onto the columns of matrix AA then setting
cT = {0p, τ eT , (1τ) eT } zT = {βT,uT,vT}
A = [X,In × n, In × n ] b = y
lu = { + p,n,n,y} ll = {p,0n,0n,y}
cT = { 0 p , τ eT , ( 1 - τ ) eT } zT = {βT,uT,vT} A = [X, I n×n , - I n×n ] b=y lu = { +p,n,n,y} ll = {-p,0n,0n,y}
gives the quantile regression LP problem as described in (11).
Once expressed as an LP problem the parameter estimates β̂(τ)β^(τ) can be obtained in a number of ways, for example via the inertia-controlling method of Gill and Murray (1978) (see nag_opt_lp_solve (e04mf)), the simplex method or an interior point method as used by nag_correg_quantile_linreg_easy (g02qf) and nag_correg_quantile_linreg (g02qg).

Estimation of the covariance matrix

Koenker (2005) shows that the limiting covariance matrix of sqrt(n) (β̂(τ)β(τ)) n ( β^ (τ) - β (τ) )  is of the form of a Huber Sandwich. Therefore, under the assumption of Normally distributed errors
sqrt(n) (β̂(τ)β(τ)) N (0, τ (1τ) Hn (τ) 1 Jn Hn (τ) 1 )
n ( β^ (τ) - β (τ) ) N (0, τ ( 1 - τ ) H n ( τ ) -1 J n H n ( τ ) -1 )
(13)
where
n
Jn = n1 xi xiT
i = 1
n
Hn(τ) = lim n1 xi xiT fi (Qyi(τ|xi))
n i = 1
Jn = n-1 i=1 n x i xiT H n ( τ ) = lim n n-1 i=1 n x i xiT f i ( Q y i ( τ | x i ) )
and fi (Qyi(τ|xi)) f i ( Q y i ( τ | x i ) )  denotes the conditional density of the response yy evaluated at the ττth conditional quantile.
More generally, the asymptotic covariance matrix for β̂ (τ1) , β̂ (τ1) ,, β̂ (τn) β^ ( τ 1 ) , β^ ( τ 1 ) ,, β^ ( τ n )  has blocks defined by
cov( sqrt(n) (β̂(τi)β(τi)) , sqrt(n) (β̂(τj)β(τj)) ) = (min (τi,τj)τiτj) Hn (τi) 1 Jn Hn (τj) 1
cov( n ( β^ ( τ i ) - β ( τ i ) ) , n ( β^ ( τ j ) - β ( τ j ) ) ) = ( min(τi,τj) - τi τj ) H n ( τi ) -1 Jn H n ( τj ) -1
(14)
Under the assumption of independent, identically distributed (iid) errors, (13) simplifies to
sqrt(n) (β̂(τ)β(τ)) N (0, τ(1τ) s(τ) 2 (XTX)1 )
n ( β^(τ) - β(τ) ) N (0, τ(1-τ) s(τ) 2 ( XT X )-1 )
where s(τ)s(τ) is the sparsity function, given by
s(τ) = 1/( f (F1(τ)) )
s(τ) = 1 f ( F-1(τ) )
a similar simplification occurs with (14).
In cases where the assumption of iid errors does not hold, Powell (1991) suggests using a kernel estimator of the form
n
n(τ) = (ncn)1 K(( yi xiT β̂ (τ) )/(cn)) xi xiT
i = 1
H ^ n (τ) = ( n cn )-1 i=1 n K( yi - xiT β^ ( τ ) cn ) xi xiT
for some bandwidth parameter cncn satisfying limn  cn 0 lim n c n 0  and limn  sqrt(n) cn lim n n c n  and Hendricks and Koenker (1991) suggest a method based on an extension of the idea of sparsity.
Rather than use an asymptotic estimate of the covariance matrix, it is also possible to use bootstrapping. Roughly speaking the original data is resampled and a set of parameter estimates obtained from each new sample. A sample covariance matrix is then constructed from the resulting matrix of parameter estimates.

Latent Variable Methods

Regression by means of projections to latent structures also known as partial least squares, is a latent variable linear model suited to data for which:
Latent variables are linear combinations of xx-variables that explain variance in xx and yy-variables. These latent variables, known as factors, are extracted iteratively from the data. A choice of the number of factors to include in a model can be made by considering diagnostic statistics such as the variable influence on projections (VIP).

Recommendations on Choice and Use of Available Functions

Correlation

Product-moment correlation

Let SSxSSx be the sum of squares of deviations from the mean, xx-, for the variable xx for a sample of size nn, i.e.,
n
SSx = (xix)2
i = 1
SSx=i=1n (xi-x-) 2
and let SCxySCxy be the cross-products of deviations from the means, xx- and yy-, for the variables xx and yy for a sample of size nn, i.e.,
n
SCxy = (xix)(yiy).
i = 1
SCxy=i=1n(xi-x-)(yi-y-).
Then the sample covariance of xx and yy is
cov(x,y) = (SCxy)/((n1))
cov(x,y)=SCxy (n-1)
and the product-moment correlation coefficient is
r = (cov(x,y))/(sqrt( var(x) var(y))) = (SCxy)/(sqrt(SSxSSy)).
r=cov(x,y) var(x) var(y) =SCxySSxSSy .
nag_correg_coeffs_pearson (g02ba) computes the product-moment correlation coefficients.
nag_correg_ssqmat_update (g02bt) updates the sample sums of squares and cross-products and deviations from the means by the addition/deletion of a (weighted) observation.
nag_correg_ssqmat (g02bu) computes the sample sums of squares and cross-products deviations from the means (optionally weighted). The output from multiple calls to nag_correg_ssqmat (g02bu) can be combined via a call to nag_correg_ssqmat_combine (g02bz), allowing large datasets to be summarised across multiple processing units.
nag_correg_ssqmat_update (g02bt) updates the sample sums of squares and cross-products and deviations from the means by the addition/deletion of a (weighted) observation.
nag_correg_ssqmat_to_corrmat (g02bw) computes the product-moment correlation coefficients from the sample sums of squares and cross-products of deviations from the means.
The three functions compute only the upper triangle of the correlation matrix which is stored in a one-dimensional array in packed form.
nag_correg_corrmat (g02bx) computes both the (optionally weighted) covariance matrix and the (optionally weighted) correlation matrix. These are returned in two-dimensional arrays. (Note that nag_correg_ssqmat_update (g02bt) and nag_correg_ssqmat (g02bu) can be used to compute the sums of squares from zero.)
nag_correg_coeffs_pearson_subset (g02bg) can be used to calculate the correlation coefficients for a subset of variables in the data matrix.

Product-moment correlation with missing values

If there are missing values then nag_correg_ssqmat (g02bu) and nag_correg_corrmat (g02bx), as described above, will allow casewise deletion by you giving the observation zero weight (compared with unit weight for an otherwise unweighted computation).
Other functions also handle missing values in the calculation of unweighted product-moment correlation coefficients. Casewise exclusion of missing values is provided by nag_correg_coeffs_pearson_miss_case (g02bb) while pairwise omission of missing values is carried out by nag_correg_coeffs_pearson_miss_pair (g02bc). These two functions calculate a correlation matrix for all the variables in the data matrix; similar output but for only a selected subset of variables is provided by functions nag_correg_coeffs_pearson_subset_miss_case (g02bh) and nag_correg_coeffs_pearson_subset_miss_pair (g02bj) respectively. As well as providing the Pearson product-moment correlation coefficients, these functions also calculate the means and standard deviations of the variables, and the matrix of sums of squares and cross-products of deviations from the means. For all four functions you are free to select appropriate values for consideration as missing values, bearing in mind the nature of the data and the possible range of valid values. The missing values for each variable may be either different or alike and it is not necessary to specify missing values for all the variables.

Nonparametric correlation

There are five functions which perform nonparametric correlations, each of which is capable of producing both Spearman's rank order and Kendall's tau correlation coefficients. The basic underlying concept of both these methods is to replace each observation by its corresponding rank or order within the observations on that variable, and the correlations are then calculated using these ranks.
It is obviously more convenient to order the observations and calculate the ranks for a particular variable just once, and to store these ranks for subsequent use in calculating all coefficients involving that variable; this does however require an amount of store of the same size as the original data matrix, which in some cases might be excessive. Accordingly, some functions calculate the ranks only once, and replace the input data matrix by the matrix of ranks, which are then also made available to you on exit from the function, while others preserve the data matrix and calculate the ranks a number of times within the function; the ranks of the observations are not provided as output by functions which work in the latter way. If it is possible to arrange the program in such a way that the first technique can be used, then efficiency of timing is achieved with no additional storage, whereas in the second case, it is necessary to have a second matrix of the same size as the data matrix, which may not be acceptable in certain circumstances; in this case it is necessary to reach a compromise between efficiency of time and of storage, and this may well be dependent upon local conditions.
Functions nag_correg_coeffs_kspearman_overwrite (g02bn) and nag_correg_coeffs_kspearman (g02bq) both calculate Kendall's tau and/or Spearman's rank order correlation coefficients taking no account of missing values; nag_correg_coeffs_kspearman_overwrite (g02bn) does so by calculating the ranks of each variable only once, and replacing the data matrix by the matrix of ranks, whereas nag_correg_coeffs_kspearman (g02bq) calculates the ranks of each variable several times. Functions nag_correg_coeffs_kspearman_miss_case_overwrite (g02bp) and nag_correg_coeffs_kspearman_miss_case (g02br) provide the same output, but treat missing values in a ‘casewise’ manner (see above); nag_correg_coeffs_kspearman_miss_case_overwrite (g02bp) calculates the ranks of each variable only once, and overwrites the data matrix of ranks, while nag_correg_coeffs_kspearman_miss_case (g02br) determines the ranks of each variable several times. For ‘pairwise’ omission of missing data (see above), the function nag_correg_coeffs_kspearman_miss_pair (g02bs) provides Kendall and/or Spearman coefficients.
Since nag_correg_coeffs_kspearman_overwrite (g02bn) and nag_correg_coeffs_kspearman_miss_case_overwrite (g02bp) order the observations and calculate the ranks of each variable only once, then if there are MM variables involved, there are only MM separate ‘ranking’ operations; this should be contrasted with the method used by functions nag_correg_coeffs_kspearman (g02bq) and nag_correg_coeffs_kspearman_miss_case (g02br) which perform M(M1) / 2 + 1M(M-1)/2+1 similar ranking operations. These ranking operations are by far the most time-consuming parts of these nonparametric functions, so for a matrix of as few as five variables, the time taken by one of the slower functions can be expected to be at least a factor of two slower than the corresponding efficient function; as the number of variables increases, so this relative efficiency factor increases. Only one function, nag_correg_coeffs_kspearman_miss_pair (g02bs), is provided for pairwise missing values, and this function carries out M(M1)M(M-1) separate rankings; since by the very nature of the pairwise method it is necessary to treat each pair of variables separately and rank them individually, it is impossible to reduce this number of operations, and so no alternative function is provided.

Partial correlation

nag_correg_corrmat_partial (g02by) computes a matrix of partial correlation coefficients from the correlation coefficients or variance-covariance matrix returned by nag_correg_corrmat (g02bx).

Robust correlation

nag_correg_robustm_corr_user_deriv (g02hl) and nag_correg_robustm_corr_user (g02hm) compute robust estimates of the variance-covariance matrix by solving the equations
n
1/nw(zi2)zi = 0
i = 1
1ni=1nw(zi2)zi=0
and
n
1/nu(zi2)ziziTv(zi2)I = 0,
i = 1
1n i= 1n u (zi2) zi ziT -v (zi2) I=0,
as described in Section [Robust estimation of correlation coefficients] for user-supplied functions ww and uu. Two options are available for vv, either v(t) = 1v(t)=1 for all tt or v(t) = u(t)v(t)=u(t).
nag_correg_robustm_corr_user (g02hm) requires only the function ww and uu to be supplied while nag_correg_robustm_corr_user_deriv (g02hl) also requires their derivatives.
In general nag_correg_robustm_corr_user_deriv (g02hl) will be considerably faster than nag_correg_robustm_corr_user (g02hm) and should be used if derivatives are available.
nag_correg_robustm_corr_huber (g02hk) computes a robust variance-covariance matrix for the following functions:
u(t) = au / t2​ if ​t < au2
u(t) = 1​ if ​au2tbu2
u(t) = bu / t2​ if ​t > bu2
u(t) =au/t2​ if ​t<au2 u(t) =1​ if ​au2tbu2 u(t) =bu/t2​ if ​t>bu2
and
w(t) = 1​ if ​tcw
w(t) = cw / t​ if ​t > cw
w(t) = 1​ if ​tcw w(t) =cw/t​ if ​t>cw
for constants auau, bubu and cwcw.
These functions solve a minimax space problem considered by Huber (1981). The values of auau, bubu and cwcw are calculated from the fraction of gross errors; see Hampel et al. (1986) and Huber (1981).
To compute a correlation matrix from the variance-covariance matrix nag_correg_ssqmat_to_corrmat (g02bw) may be used.

Nearest correlation matrix

Four functions are provided to calculate a nearest correlation matrix. The choice of function will depend on what definition of ‘nearest’ is required and whether there is any particular structure desired in the resulting correlation matrix.
nag_correg_corrmat_nearest (g02aa) computes the nearest correlation matrix in the Frobenius norm, using the method of Qi and Sun (2006).
nag_correg_corrmat_nearest_bounded (g02ab) uses an extension of the method implemented in nag_correg_corrmat_nearest (g02aa) allowing for the row and column weighted Frobenius norm to be used as well as bounds on the eigenvalues of the resulting correlation matrix to be specified.
nag_correg_corrmat_nearest_kfactor (g02ae) computes the factor loading matrix, allowing a correlation matrix with a kk-factor structure to be computed.
nag_nearest_correlation_h_weight (g02aj) again computes the nearest correlation matrix in the Frobenius norm, but allows for element-wise weighting as well as bounds on the eigenvalues.

Regression

Simple linear regression

Four functions are provided for simple linear regressions: nag_correg_linregs_const (g02ca) and nag_correg_linregs_const_miss (g02cc) perform the simple linear regression with a constant term (equation (1) above), while nag_correg_linregs_noconst (g02cb) and nag_correg_linregs_noconst_miss (g02cd) fit the simple linear regression with no constant term (equation (2) above). Two of these functions, nag_correg_linregs_const_miss (g02cc) and nag_correg_linregs_noconst_miss (g02cd), take account of missing values, which the others do not. In these two functions, an observation is omitted if it contains a missing value for either the dependent or the independent variable; this is equivalent to both the casewise and pairwise methods, since both are identical when there are only two variables involved. Input to these functions consists of the raw data, and output includes the coefficients, their standard errors and tt values for testing the significance of the coefficients; the FF value for testing the overall significance of the regression is also given.

Ridge regression

nag_correg_ridge_opt (g02ka) calculates a ridge regression, optimizing the ridge parameter according to one of four prediction error criteria.
nag_correg_ridge (g02kb) calculates ridge regressions for a given set of ridge parameters.

Polynomial regression and nonlinear regression

No functions are currently provided in this chapter for polynomial regression. If you wish to perform polynomial regressions you have three alternatives: you can use the multiple linear regression functions, nag_correg_linregm_fit (g02da), with a set of independent variables which are in fact simply the same single variable raised to different powers, or you can use the function nag_anova_dummyvars (g04ea) to compute orthogonal polynomials which can then be used with nag_correg_linregm_fit (g02da), or you can use the functions in Chapter E02 (Curve and Surface Fitting) which fit polynomials to sets of data points using the techniques of orthogonal polynomials. This latter course is to be preferred, since it is more efficient and liable to be more accurate, but in some cases more statistical information may be required than is provided by those functions, and it may be necessary to use the functions of this chapter.
More general nonlinear regression models may be fitted using the optimization functions in Chapter E04, which contains functions to minimize the function
n
ei2
i = 1
i=1nei2
where the regression parameters are the variables of the minimization problem.

Multiple linear regression – general linear model

nag_correg_linregm_fit (g02da) fits a general linear regression model using the QRQR method and an SVD if the model is not of full rank. The results returned include: residual sum of squares, parameter estimates, their standard errors and variance-covariance matrix, residuals and leverages. There are also several functions to modify the model fitted by nag_correg_linregm_fit (g02da) and to aid in the interpretation of the model.
nag_correg_linregm_obs_edit (g02dc) adds or deletes an observation from the model.
nag_correg_linregm_update (g02dd) computes the parameter estimates, and their standard errors and variance-covariance matrix for a model that is modified by nag_correg_linregm_obs_edit (g02dc), nag_correg_linregm_var_add (g02de) or nag_correg_linregm_var_del (g02df).
nag_correg_linregm_var_add (g02de) adds a new variable to a model.
nag_correg_linregm_var_del (g02df) drops a variable from a model.
nag_correg_linregm_fit_newvar (g02dg) fits the regression to a new dependent variable, i.e., keeping the same independent variables.
nag_correg_linregm_constrain (g02dk) calculates the estimates of the parameters for a given set of constraints, (e.g., parameters for the levels of a factor sum to zero) for a model which is not of full rank and the SVD has been used.
nag_correg_linregm_estfunc (g02dn) calculates the estimate of an estimable function and its standard error.
Note:  nag_correg_linregm_var_add (g02de) also allows you to initialize a model building process and then to build up the model by adding variables one at a time.
If you wish to use methods based on forming the cross-products/correlation matrix (i.e., (XTXXTX) matrix) rather than the recommended use of nag_correg_linregm_fit (g02da) then the following functions should be used.
For regression through the origin (i.e., no constant) nag_correg_linregm_coeffs_noconst (g02ch) preceded by:
For regression with intercept (i.e., with constant) nag_correg_linregm_coeffs_const (g02cg) preceded by:
Note that the four functions using pairwise deletion of missing value (marked with * *) should be used with great caution as the use of this method can lead to misleading results, particularly if a significant proportion of values are missing.
Both nag_correg_linregm_coeffs_const (g02cg) and nag_correg_linregm_coeffs_noconst (g02ch) require that the correlations/sums of squares involving the dependent variable must appear as the last row/column. Because the layout of the variables in your data array may not be arranged in this way, two functions, nag_correg_linregm_service_select (g02ce) and nag_correg_linregm_service_reorder (g02cf), are provided for rearranging the rows and columns of vectors and matrices. nag_correg_linregm_service_reorder (g02cf) simply reorders the rows and columns while nag_correg_linregm_service_select (g02ce) forms smaller vectors and matrices from larger ones.
Output from nag_correg_linregm_coeffs_const (g02cg) and nag_correg_linregm_coeffs_noconst (g02ch) consists of the coefficients, their standard errors, R2R2-values, tt and FF statistics.

Selecting regression models

To aid the selection of a regression model the following functions are available.
nag_correg_linregm_rssq (g02ea) computes the residual sums of squares for all possible regressions for a given set of dependent variables. The function allows some variables to be forced into all regressions.
nag_correg_linregm_rssq_stat (g02ec) computes the values of R2R2 and CpCp from the residual sums of squares as provided by nag_correg_linregm_rssq (g02ea).
nag_correg_linregm_fit_onestep (g02ee) enables you to fit a model by forward selection. You may call nag_correg_linregm_fit_onestep (g02ee) a number of times. At each call the function will calculate the changes in the residual sum of squares from adding each of the variables not already included in the model, select the variable which gives the largest change and then if the change in residual sum of squares meets the given criterion will add it to the model.
nag_correg_linregm_fit_stepwise (g02ef) uses a full stepwise selection to choose a subset of the explanatory variables. The method repeatedly applies a forward selection step followed by a backward elimination step until neither step updates the current model.

Residuals

nag_correg_linregm_stat_resinf (g02fa) computes the following standardized residuals and measures of influence for the residuals and leverages produced by nag_correg_linregm_fit (g02da):
(i) Internally studentized residual;
(ii) Externally studentized residual;
(iii) Cook's DD statistic;
(iv) Atkinson's TT statistic.
nag_correg_linregm_stat_durbwat (g02fc) computes the Durbin–Watson test statistic and bounds for its significance to test for serial correlation in the errors, eiei.

Robust regression

For robust regression using MM-estimates instead of least squares the function nag_correg_robustm (g02ha) will generally be suitable. nag_correg_robustm (g02ha) provides a choice of four ψψ-functions (Huber's, Hampel's, Andrew's and Tukey's) plus two different weighting methods and the option not to use weights. If other weights or different ψψ-functions are needed the function nag_correg_robustm_user (g02hd) may be used. nag_correg_robustm_user (g02hd) requires you to supply weights, if required, and also functions to calculate the ψψ-function and, optionally, the χχ-function. nag_correg_robustm_wts (g02hb) can be used in calculating suitable weights. The function nag_correg_robustm_user_varmat (g02hf) can be used after a call to nag_correg_robustm_user (g02hd) in order to calculate the variance-covariance estimate of the estimated regression coefficients.
For robust regression, using least absolute deviation, nag_fit_glin_l1sol (e02ga) can be used.

Generalized linear models

There are four functions for fitting generalized linear models. The output includes: the deviance, parameter estimates and their standard errors, fitted values, residuals and leverages.
nag_correg_glm_normal (g02ga) Normal distribution.
nag_correg_glm_binomial (g02gb) binomial distribution.
nag_correg_glm_poisson (g02gc) Poisson distribution.
nag_correg_glm_gamma (g02gd) gamma distribution.
While nag_correg_glm_normal (g02ga) can be used to fit linear regression models (i.e., by using an identity link) this is not recommended as nag_correg_linregm_fit (g02da) will fit these models more efficiently. nag_correg_glm_poisson (g02gc) can be used to fit log-linear models to contingency tables.
In addition to the functions to fit the models there is one function to predict from the fitted model and two functions to aid interpretation when the fitted model is not of full rank, i.e., aliasing is present.
nag_correg_glm_predict (g02gp) computes a predicted value and its associated standard error based on a previously fitted generalized linear model.
nag_correg_glm_constrain (g02gk) computes parameter estimates for a set of constraints, (e.g., sum of effects for a factor is zero), from the SVD solution provided by the fitting function.
nag_correg_glm_estfunc (g02gn) calculates an estimate of an estimable function along with its standard error.

Linear mixed effects regression

There are four functions for fitting linear mixed effects regression.
nag_correg_mixeff_reml (g02ja) and nag_correg_mixeff_hier_reml (g02jd) uses restricted maximum likelihood (REML) to fit the model.
nag_correg_mixeff_ml (g02jb) and nag_correg_mixeff_hier_ml (g02je) uses maximum likelihood to fit the model.
For all functions the output includes: either the maximum likelihood or restricted maximum likelihood and the fixed and random parameter estimates, along with their standard errors. Whilst it is possible to fit a hierachical model using nag_correg_mixeff_reml (g02ja) or nag_correg_mixeff_ml (g02jb), nag_correg_mixeff_hier_reml (g02jd) and nag_correg_mixeff_hier_ml (g02je) allow the model to be specified in a more intuitive way. nag_correg_mixeff_hier_init (g02jc) must be called prior to calling nag_correg_mixeff_hier_reml (g02jd) or nag_correg_mixeff_hier_ml (g02je).
As the estimates of the variance components are found using an iterative procedure initial values must be supplied for each σσ. In all four functions you can either specify these initial values, or allow the function to calculate them from the data using minimum variance quadratic unbiased estimation (MIVQUE0). Setting the maximum number of iterations to zero in any of the functions will return the corresponding likelihood, parameter estimates and standard errors based on these initial values.

Linear quantile regression

Two functions are provided for performing linear quantile regression, nag_correg_quantile_linreg_easy (g02qf) and nag_correg_quantile_linreg (g02qg). Of these, nag_correg_quantile_linreg_easy (g02qf) provides a simplified interface to nag_correg_quantile_linreg (g02qg), where many of the input parameters have been given default values and the amount of output available has been reduced.
Prior to calling nag_correg_quantile_linreg (g02qg) the optional argument array must be initialized by calling nag_correg_optset (g02zk) with optstr set to Initialize. Once these arrays have been initialized nag_correg_optget (g02zl) can be called to query the value of an optional argument.

Partial Least Squares (PLS)

nag_correg_pls_svd (g02la) calculates a nonlinear, iterative PLS by using singular value decomposition.
nag_correg_pls_wold (g02lb) calculates a nonlinear, iterative PLS by using Wold's method.
nag_correg_pls_fit (g02lc) calculates parameter estimates for a given number of PLS factors.
nag_correg_pls_pred (g02ld) calculates predictions given a PLS model.

Functionality Index

Computes the nearest correlation matrix using element-wise weights nag_nearest_correlation_h_weight (g02aj)
Computes the nearest correlation matrix using the k-factor model nag_correg_corrmat_nearest_kfactor (g02ae)
Computes the nearest correlation matrix using the method of Qi and Sun, 
    augmented nag_correg_corrmat_nearest (g02aa) to incorporate weights and bounds nag_correg_corrmat_nearest_bounded (g02ab)
    unweighted nag_correg_corrmat_nearest (g02aa)
Correlation-like coefficients, 
    all variables, 
        casewise treatment of missing values nag_correg_coeffs_zero_miss_case (g02be)
        no missing values nag_correg_coeffs_zero (g02bd)
        pairwise treatment of missing values nag_correg_coeffs_zero_miss_pair (g02bf)
    subset of variables, 
        casewise treatment of missing values nag_correg_coeffs_zero_subset_miss_case (g02bl)
        no missing values nag_correg_coeffs_zero_subset (g02bk)
        pairwise treatment of missing values nag_correg_coeffs_zero_subset_miss_pair (g02bm)
Generalized linear models, 
    binomial errors nag_correg_glm_binomial (g02gb)
    computes estimable function nag_correg_glm_estfunc (g02gn)
    gamma errors nag_correg_glm_gamma (g02gd)
    Normal errors nag_correg_glm_normal (g02ga)
    Poisson errors nag_correg_glm_poisson (g02gc)
    prediction nag_correg_glm_predict (g02gp)
    transform model parameters nag_correg_glm_constrain (g02gk)
Hierarchical mixed effects regression, 
    initiation nag_correg_mixeff_hier_init (g02jc)
    using maximum likelihood nag_correg_mixeff_hier_ml (g02je)
    using restricted maximum likelihood nag_correg_mixeff_hier_reml (g02jd)
Linear mixed effects regression, 
    via maximum likelihood (ML) nag_correg_mixeff_ml (g02jb)
    via restricted maximum likelihood (REML) nag_correg_mixeff_reml (g02ja)
Multiple linear regression, 
    from correlation coefficients nag_correg_linregm_coeffs_const (g02cg)
    from correlation-like coefficients nag_correg_linregm_coeffs_noconst (g02ch)
Multiple linear regression/General linear model, 
    add/delete observation from model nag_correg_linregm_obs_edit (g02dc)
    add independent variable to model nag_correg_linregm_var_add (g02de)
    computes estimable function nag_correg_linregm_estfunc (g02dn)
    delete independent variable from model nag_correg_linregm_var_del (g02df)
    general linear regression model nag_correg_linregm_fit (g02da)
    regression for new dependent variable nag_correg_linregm_fit_newvar (g02dg)
    regression parameters from updated model nag_correg_linregm_update (g02dd)
    transform model parameters nag_correg_linregm_constrain (g02dk)
Non-parametric rank correlation (Kendall and/or Spearman): 
    missing values, 
        casewise treatment of missing values, 
            overwriting input data nag_correg_coeffs_kspearman_miss_case_overwrite (g02bp)
            preserving input data nag_correg_coeffs_kspearman_miss_case (g02br)
        pairwise treatment of missing values nag_correg_coeffs_kspearman_miss_pair (g02bs)
    no missing values, 
        overwriting input data nag_correg_coeffs_kspearman_overwrite (g02bn)
        preserving input data nag_correg_coeffs_kspearman (g02bq)
Partial least squares, 
    calculates predictions given an estimated PLS model nag_correg_pls_pred (g02ld)
    fits a PLS model for a given number of factors nag_correg_pls_fit (g02lc)
    orthogonal scores using SVD nag_correg_pls_svd (g02la)
    orthogonal scores using Wold's method nag_correg_pls_wold (g02lb)
Product-moment correlation, 
    correlation coefficients, all variables, 
        casewise treatment of missing values nag_correg_coeffs_pearson_miss_case (g02bb)
        no missing values nag_correg_coeffs_pearson (g02ba)
        pairwise treatment of missing values nag_correg_coeffs_pearson_miss_pair (g02bc)
    correlation coefficients, subset of variables, 
        casewise treatment of missing values nag_correg_coeffs_pearson_subset_miss_case (g02bh)
        no missing values nag_correg_coeffs_pearson_subset (g02bg)
        pairwise treatment of missing values nag_correg_coeffs_pearson_subset_miss_pair (g02bj)
    correlation matrix, 
        compute correlation and covariance matrices nag_correg_corrmat (g02bx)
        compute from sum of squares matrix nag_correg_ssqmat_to_corrmat (g02bw)
        compute partial correlation and covariance matrices nag_correg_corrmat_partial (g02by)
    sum of squares matrix, 
        combine nag_correg_ssqmat_combine (g02bz)
        compute nag_correg_ssqmat (g02bu)
        update nag_correg_ssqmat_update (g02bt)
Quantile regression, 
    linear, 
        comprehensive nag_correg_quantile_linreg (g02qg)
        simple nag_correg_quantile_linreg_easy (g02qf)
Residuals, 
    Durbin–Watson test nag_correg_linregm_stat_durbwat (g02fc)
    standardized residuals and influence statistics nag_correg_linregm_stat_resinf (g02fa)
Ridge regression, 
    ridge parameter(s) supplied nag_correg_ridge (g02kb)
    ridge parameter optimized nag_correg_ridge_opt (g02ka)
Robust correlation, 
    Huber's method nag_correg_robustm_corr_huber (g02hk)
    user-supplied weight function only nag_correg_robustm_corr_user (g02hm)
    user-supplied weight function plus derivatives nag_correg_robustm_corr_user_deriv (g02hl)
Robust regression, 
    compute weights for use with nag_correg_robustm_user (g02hd) nag_correg_robustm_wts (g02hb)
    standard M-estimates nag_correg_robustm (g02ha)
    user-supplied weight functions nag_correg_robustm_user (g02hd)
    variance-covariance matrix following nag_correg_robustm_user (g02hd) nag_correg_robustm_user_varmat (g02hf)
Selecting regression model, 
    all possible regressions nag_correg_linregm_rssq (g02ea)
    forward selection nag_correg_linregm_fit_onestep (g02ee)
    R2 and Cp statistics nag_correg_linregm_rssq_stat (g02ec)
Service functions, 
    for multiple linear regression, 
        reorder elements from vectors and matrices nag_correg_linregm_service_reorder (g02cf)
        select elements from vectors and matrices nag_correg_linregm_service_select (g02ce)
    general option getting function nag_correg_optget (g02zl)
    general option setting function nag_correg_optset (g02zk)
Simple linear regression, 
    no intercept nag_correg_linregs_noconst (g02cb)
    no intercept with missing values nag_correg_linregs_noconst_miss (g02cd)
    with intercept nag_correg_linregs_const (g02ca)
    with intercept and with missing values nag_correg_linregs_const_miss (g02cc)
Stepwise linear regression, 
    Clarke's sweep algorithm nag_correg_linregm_fit_stepwise (g02ef)

References

Atkinson A C (1986) Plots, Transformations and Regressions Clarendon Press, Oxford
Churchman C W and Ratoosh P (1959) Measurement Definitions and Theory Wiley
Cook R D and Weisberg S (1982) Residuals and Influence in Regression Chapman and Hall
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Gill P E and Murray W (1978) Numerically stable methods for quadratic programming Math. Programming 14 349–372
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
Hampel F R, Ronchetti E M, Rousseeuw P J and Stahel W A (1986) Robust Statistics. The Approach Based on Influence Functions Wiley
Hays W L (1970) Statistics Holt, Rinehart and Winston
Hendricks W and Koenker R (1991) Hierarchical spline models for conditional quantiles and the demand for electricity Journal of the Maerican Statistical Association 87 58–68
Huber P J (1981) Robust Statistics Wiley
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
Koenker R (2005) Quantile Regression Econometric Society Monographs, Cambridge University Press, New York
McCullagh P and Nelder J A (1983) Generalized Linear Models Chapman and Hall
Powell J L (1991) Estimation of monotonic regression models under quantile restrictions Nonparametric and Semiparametric Methods in Econometrics Cambridge University Press, Cambridge
Qi H and Sun D (2006) A quadratically convergent Newton method for computing the nearest correlation matrix SIAM J. Matrix AnalAppl 29(2) 360–385
Searle S R (1971) Linear Models Wiley
Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill
Weisberg S (1985) Applied Linear Regression Wiley

PDF version (NAG web site, 64-bit version, 64-bit version)
Chapter Contents
Chapter Introduction
NAG Toolbox

© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2013