MORE INFORMATION
The LINEST(known_y's, known_x's, intercept, statistics) function performs linear regression: it finds the best fit under a least squares criterion. Known_y's represents data on the dependent variable, and known_x's represents data on one or more independent variables. The second argument is optional; if it is omitted, it is assumed to be an array of the same size as known_y's that contains the values {1, 2, 3, ...}.
The last argument is set to TRUE if you want additional statistics (for example, various sums of squares, r-squared, the f-statistic, and standard errors of the regression coefficients). In this case, LINEST must be entered as an array formula. The last argument is optional; if it is omitted, it is interpreted as FALSE. When the last argument is set to TRUE, the output array has five rows and a number of columns that is equal to the number of independent variables plus one; the extra column holds the intercept coefficient, which is 0 when the third argument is set to FALSE (as in the examples later in this article). Setting the third argument to FALSE in Excel 2002 and earlier requires a workaround. This workaround is discussed later in this article.
In the most common uses of LINEST, the intercept argument is set to TRUE. This setting means that you want the linear regression model to allow a non-zero intercept coefficient. If known_x's is represented in data columns, setting intercept to TRUE tells LINEST to add a data column that is filled with 1s as data on an additional independent variable. Set the intercept argument to FALSE only if you want to force the regression line to go through the origin. In Excel 2002 and earlier, setting this argument to FALSE always returns results that are not correct, at least in the detailed statistics that are available from LINEST. This article discusses this issue and provides a workaround; the problem has been corrected in Excel 2003. The third argument is optional; if it is omitted, it is interpreted as TRUE.
For ease of exposition in the remainder of
this article, assume that the data is arranged in columns, so that known_y's is
a column of y data and known_x's is one or more columns of x data. The
dimensions (or lengths) of each of these columns must be equal. All the
following observations are equally true if the data is not arranged in columns,
but it is easier to discuss this single (most frequently used)
case.
Another reason for setting the intercept argument to FALSE is if
you have already explicitly modeled the intercept in the data by including a
column of 1s. In Excel 2002 and earlier versions, the best solution is to
ignore the column of 1s and to call LINEST with this column missing from
known_x's and with the intercept argument set to TRUE. Excel 2002 and earlier
versions always return results that are not correct when the intercept argument
is set to FALSE. For Excel 2003, this approach is also preferred although the
formulas have been corrected for Excel 2003.
The performance of LINEST in earlier versions of Excel (or more precisely, the performance of the Analysis ToolPak's linear regression tool, which calls LINEST) has been justifiably criticized (see the "References" section in this article for more information). The main concern about Excel's linear regression tools is a lack of attention to collinear (or nearly collinear) predictor variables. Tests that used datasets provided by the National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards) to evaluate the effectiveness of statistical software found numeric inaccuracies in linear regression, analysis of variance, and non-linear regression. In Excel 2003, these problems have been addressed, except for non-linear regression, which is caused by an issue with the Solver add-in rather than with the statistical functions or the Analysis ToolPak. The RAND function in Excel was also put through standard tests of randomness and performed poorly; RAND has also been revised in Excel 2003.
Earlier versions of LINEST used the "Normal Equations" to find regression coefficients. This method is numerically less stable than Singular Value Decomposition or QR Decomposition; Excel 2003 implements QR Decomposition.
While this is a standard technique that is described in many texts, a small
example is discussed in this article. QR Decomposition effectively analyzes
collinearity issues and excludes any data column from the final model if that
column can be expressed as a sum of multiples of the included columns. Near
collinearity is treated in the same way; a set of columns is nearly collinear
if, when you try to express one data column as a sum of multiples of others,
the resulting fit is extremely close. For example, the sum of squared
differences between the data column and the fitted values is less than
10^(-12).
The LINEST Help file has been updated for Excel 2003. For more information about LINEST, click Microsoft Excel Help on the Help menu, type linest in the Search for box in the Assistance pane, and then click Start searching to view the topic.
In summary, the main changes are:
- The computational formulas for additional statistics (such
as r-squared and various sums of squares) that are used when intercept is set
to FALSE have been corrected.
- QR Decomposition has been implemented for solving all
cases, regardless of the settings of the third and fourth arguments.
Syntax
LINEST(known_y's, known_x's, intercept, statistics)
The arguments, known_y's and known_x's, must be arrays or cell ranges
that have related dimensions. If known_y's is one column by m rows, known_x's
should be c columns by m rows where c is greater than or equal to one; c is the
number of predictor variables; and m is the number of data points. (Similar
relationships must hold if known_y's is laid out in a single row; known_x's
should be in r rows where r is greater than or equal to one row, and
known_y's and known_x's should have the same number of columns.) The intercept
and statistics arguments must be set to TRUE or FALSE (or 0 or 1, which Excel
interprets as FALSE or TRUE, respectively). The last three arguments to LINEST
are all optional. If you omit the second argument, LINEST assumes a single
predictor that contains the entries {1, 2, 3, ...}. If the third argument is
omitted, it is interpreted as TRUE. If the fourth argument is omitted, it is
interpreted as FALSE.
The most common usage of LINEST includes two
ranges of cells that contain the data, such as LINEST(A1:A100, B1:F100, TRUE,
TRUE). Because there is typically more than one predictor variable, the second
argument in this example contains multiple columns. In this example, there are
one hundred subjects, one dependent variable value (known_y's) for each
subject, and five independent variable values (known_x's) for each subject.
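A rough numpy equivalent of such a call may help clarify what LINEST computes. This is only a sketch with made-up stand-in data, not Excel's internal code:

import numpy as np

# Stand-in data: 100 subjects, 5 predictors (like A1:A100 and B1:F100).
rng = np.random.default_rng(0)
X = rng.random((100, 5))                          # known_x's
y = X @ np.array([2.0, -1.0, 0.5, 3.0, 1.0]) + 4.0 + 0.1 * rng.standard_normal(100)

# intercept=TRUE is equivalent to appending a column of 1s to the predictors.
A = np.column_stack([X, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # least squares fit

slopes, intercept = coef[:-1], coef[-1]
# LINEST's first output row shows these slopes in reverse column order,
# followed by the intercept.
print(slopes[::-1], intercept)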
Example usage
Separate Excel worksheet examples are provided to illustrate
different key concepts.
To illustrate the negative sum of squares that occurs in Excel 2002 and earlier versions when the third argument is set to FALSE, follow these steps:
- Create a blank Excel worksheet, and then copy the following
table.
- Click cell A1 in your blank Excel worksheet, and then on
the Edit menu, click Paste so that the
entries in the table fill cells A1:H19 in your worksheet.
- After you paste the table into your new Excel worksheet,
click Paste Options, and then click Match Destination
Formatting.
- While the pasted range is still selected, on the
Format menu, point to Column, and then click
AutoFit Selection.
X | Y | | | | | | |
1 | 11 | | | | | | |
2 | 12 | | | | | | |
3 | 13 | | | | | | |
| | | Excel 2002 and earlier | | | Excel 2003 | |
LINEST OUTPUT: | | | LINEST OUTPUT: | | | LINEST OUTPUT: | |
5.285714286 | 0 | | 5.285714286 | 0 | | 5.285714286 | 0 |
1.237179148 | #N/A | | 1.237179148 | #N/A | | 1.237179148 | #N/A |
0.901250823 | 4.629100499 | | -20.42857143 | 4.629100499 | | 0.901250823 | 4.629100499 |
18.25333333 | 2 | | -1.906666667 | 2 | | 18.25333333 | 2 |
391.1428571 | 42.85714286 | | -40.85714286 | 42.85714286 | | 391.1428571 | 42.85714286 |
| | | | | | | |
2 | <--LINEST's total sum of squares | | | | | | |
42.85714286 | <--LINEST's correct residual sum of squares | | | | | | |
-40.85714286 | <--difference, LINEST's regression sum of squares | | | | | | |
| | | | | | | |
434 | <--Correct total sum of squares | | | | | | |
42.85714286 | <--LINEST's correct residual sum of squares | | | | | | |
391.1428571 | <--difference, correct regression sum of squares | | | | | | |
Entries in cells A7:B11 correspond to output in Excel 2003. To generate output that is appropriate for your version of Excel, click cell A7, select the cell range A7:B11, and then enter the following formula as an array formula by pressing CTRL+SHIFT+ENTER:
=LINEST(B2:B4, A2:A4, FALSE, TRUE)
This example focuses on a LINEST model that has the third argument set to FALSE. In this case, Excel 2002 and earlier versions of LINEST use a formula for the total sum of squares that is not correct. This formula underestimates the real total sum of squares and always leads to values of the regression sum of squares that are not correct; it sometimes yields a negative regression sum of squares and a negative r-squared value. Cells D6:E11 show the LINEST output in Excel 2002 and earlier. In those versions, LINEST computes the total sum of squares for the model that has the third argument set to FALSE as the sum of squared deviations of the y-values about the y column mean. This value is shown in cell A13 and is an appropriate computation only when the third argument is set to TRUE. When the third argument is set to FALSE, the correct total sum of squares is the sum of squares of the y-values, shown in cell A17. Use of the wrong formula for the total sum of squares leads to the negative regression sum of squares in cell A15. The correct output in Excel 2003 is shown in cells G6:H11.
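You can verify these sums of squares outside Excel. The following is a minimal Python sketch (using numpy) of the through-the-origin fit for the x and y values above:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([11.0, 12.0, 13.0])

b = np.sum(x * y) / np.sum(x * x)             # slope through the origin: 5.2857...
ss_resid = np.sum((y - b * x) ** 2)           # 42.857..., matches LINEST
ss_total_correct = np.sum(y ** 2)             # 434, the sum of squares of the y's
ss_total_wrong = np.sum((y - y.mean()) ** 2)  # 2, deviations about the mean

print(ss_total_correct - ss_resid)            # 391.14..., correct regression SS
print(ss_total_wrong - ss_resid)              # -40.857..., the negative value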
If you use an earlier version of Excel and you want to force the best fit linear regression through the origin, you must recompute some entries in the last three rows of the output array. To do so, use the following workaround. (A code sketch of the same corrections follows this list.)
Note You can refer to the previous worksheet.
- Call LINEST with the fourth argument set to TRUE to generate the detailed output array. Because you use Excel 2002 or earlier, assume that this output is in cells D7:E11. Only the following entries require modification: r squared, the f statistic, and the regression sum of squares. These entries appear in cells D9, D10, and D11, respectively.
- Compute the total sum of squares again as SUMSQ(known_y's). In this example, that is SUMSQ(B2:B4).
- The regression sum of squares (the value to replace the entry in cell D11) is SUMSQ(B2:B4) - E11, that is, the total sum of squares minus the residual sum of squares (which LINEST computes correctly).
- R squared (the value to replace the entry in cell D9) is the regression sum of squares divided by the total sum of squares.
- The f statistic (the value to replace the entry in cell D10) is LINEST's f statistic multiplied by the correct regression sum of squares, and then divided by LINEST's regression sum of squares (in cell D11).
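The same corrections, written as a short Python sketch. The function name corrected_stats is hypothetical; the inputs are values that LINEST already reports (the residual sum of squares, the f statistic, and the uncorrected regression sum of squares):

import numpy as np

def corrected_stats(known_ys, ss_resid, f_linest, ss_reg_linest):
    # Correct total SS for a through-the-origin model: SUMSQ(known_y's).
    ss_total = np.sum(np.square(known_ys))
    ss_reg = ss_total - ss_resid            # corrected regression SS (cell D11)
    r_squared = ss_reg / ss_total           # corrected r squared (cell D9)
    f = f_linest * ss_reg / ss_reg_linest   # corrected f statistic (cell D10)
    return ss_reg, r_squared, f

# Values from the worksheet above: y = {11, 12, 13}.
print(corrected_stats([11, 12, 13], 42.85714286, -1.906666667, -40.85714286))
# (391.14..., 0.90125..., 18.2533...), matching the Excel 2003 values in G9:G11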
This procedure corrects the formulas in Excel 2002 and earlier
versions but does not address collinearity. Therefore, the procedure works well
only without collinearity (the typical case in practice). Numeric problems can
be magnified when collinearity or near-collinearity exists, similar to what
occurs in the NIST datasets. Even simple cases can create problems, as
illustrated in the next example.
Predictor columns (known_x's) are
collinear if at least one column, c, can be expressed as a sum of multiples of
others (c1, c2, and perhaps additional columns). Column c is frequently called
redundant because the information that it contains can be constructed from the
columns c1, c2, and other columns. The fundamental principle in the presence of
collinearity is that results should not be affected by whether a redundant
column is included in the original data or removed from the original data.
Because Excel 2002 and earlier versions of LINEST did not look for
collinearity, this principle was easily violated. Predictor columns are nearly
collinear if at least one column, c, can be expressed as almost equal to a sum
of multiples of others (c1, c2, and others). In this case, "almost equal" means
a very small sum of squared deviations of entries in c from corresponding
entries in the weighted sum of c1, c2, and other columns; "very small" might be
less than 10^(-12), for example.
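The following Python sketch makes this test concrete. The helper name is hypothetical, and the tolerance follows the 10^(-12) figure given above; this is not Excel's internal code:

import numpy as np

def is_nearly_collinear(X, tol=1e-12):
    # Try to express each column as a weighted sum of the others; near
    # collinearity means the residual sum of squares is very small.
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        if np.sum(resid ** 2) < tol:
            return True
    return False

# Columns B, C, D from the worksheet that follows: column C equals B + D exactly.
X = np.array([[1.0, 2.0, 1.0], [3.0, 4.0, 1.0], [4.0, 5.0, 1.0],
              [6.0, 7.0, 1.0], [7.0, 8.0, 1.0]])
print(is_nearly_collinear(X))   # True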
To illustrate collinearity, follow
these steps:
- Create a blank Excel worksheet, and then copy the following
table.
- Click cell A1 in your blank Excel worksheet, and then on
the Edit menu, click Paste so that the
entries in the table fill cells A1:N27 in your worksheet.
- After you paste the table into your new Excel worksheet,
click Paste Options, and then click Match Destination
Formatting.
- While the pasted range is still selected, on the
Format menu, point to Column, and then click
AutoFit Selection.
y's: | x's: | | | | | | | | | | | | |
1 | 1 | 2 | 1 | | | | | | | | | | |
2 | 3 | 4 | 1 | | | | | | | | | | |
3 | 4 | 5 | 1 | | | | | | | | | | |
4 | 6 | 7 | 1 | | | | | | | | | | |
5 | 7 | 8 | 1 | | | | | | | | | | |
| | | | | | | | | | | | | |
LINEST using columns B,C: | | | | | Excel 2002 and earlier values: | | | | | Excel 2003 values: | | | |
| | | | | #NUM! | #NUM! | #NUM! | | | 0 | 0.657895 | 0.236842 | |
| | | | | #NUM! | #NUM! | #NUM! | | | 0 | 0.04386 | 0.206653 | |
| | | | | #NUM! | #NUM! | #NUM! | | | 0.986842 | 0.209427 | #N/A | |
| | | | | #NUM! | #NUM! | #NUM! | | | 225 | 3 | #N/A | |
| | | | | #NUM! | #NUM! | #NUM! | | | 9.868421 | 0.131579 | #N/A | |
| | | | | | | | | | | | | |
LINEST using columns B, C, D with FALSE 3rd arg: | | | | | | | | | | | | | |
| | | | | 0.403646 | -0.1668 | 0.824698 | 0 | | 0 | 0.236842 | 0.421053 | 0 |
| | | | | 2484491 | 2484491 | 2484491 | #N/A | | 0 | 0.206653 | 0.246552 | #N/A |
| | | | | 0.986842 | 0.256495 | #N/A | #N/A | | 0.997608 | 0.209427 | #N/A | #N/A |
| | | | | 50 | 2 | #N/A | #N/A | | 625.5 | 3 | #N/A | #N/A |
| | | | | 9.868421 | 0.131579 | #N/A | #N/A | | 54.86842 | 0.131579 | #N/A | #N/A |
| | | | | | | | | | | | | |
LINEST using column B only | | | | | | | | | | | | | |
| | | | | 0.657895 | 0.236842 | | | | 0.657895 | 0.236842 | | |
| | | | | 0.04386 | 0.206653 | | | | 0.04386 | 0.206653 | | |
| | | | | 0.986842 | 0.209427 | | | | 0.986842 | 0.209427 | | |
| | | | | 225 | 3 | | | | 225 | 3 | | |
| | | | | 9.868421 | 0.131579 | | | | 9.868421 | 0.131579 | | |
Data is included in cells A1:D6. Results of three
different calls to LINEST are shown for Excel 2002 and earlier in cells F8:I27,
and the results for Excel 2003 are in cells K8:N27. To verify that the results
in your version coincide with the results in cells F8:I27 or in cells K8:N27,
you can enter the following three array formulas:
- Select the cell range A9:C13, and then enter the following formula as an array formula by pressing CTRL+SHIFT+ENTER:
=LINEST(A2:A6,B2:C6,TRUE,TRUE)
- Select the cell range A16:D20, and then enter the following formula as an array formula by pressing CTRL+SHIFT+ENTER:
=LINEST(A2:A6,B2:D6,FALSE,TRUE)
- Select the cell range A23:B27, and then enter the following formula as an array formula by pressing CTRL+SHIFT+ENTER:
=LINEST(A2:A6,B2:B6,TRUE,TRUE)
The first model, in rows 8 to 13, uses columns B and C as predictors. Because the third argument is set to TRUE (the "normal" case; omitting it has the same effect), Excel models the intercept and effectively inserts an additional predictor column that looks just like cells D2:D6. Entries in column C in rows 2 to 6 are exactly equal to the sum of the corresponding entries in columns B and D. Therefore, collinearity is present because column C is a sum of multiples of column B and the additional column of 1s that LINEST inserts. The collinearity causes numeric problems: Excel 2002 and earlier cannot compute results, and the LINEST output table is filled with #NUM!.
The second model, in rows 15 to 20, uses columns B, C, and D as predictors but sets the third argument of LINEST to FALSE. Because the intercept was explicitly modeled through column D, you do not want Excel to separately model the intercept by building a second column of 1s. Again, collinearity is present because entries in column C in rows 2 to 6 are exactly equal to the sum of corresponding entries in columns B and D. The analysis of collinearity is the same whether column D is explicitly included (as in this model) or a similar column of 1s is created internally by Excel (as in the first model). In this case, values are computed for the LINEST output table, but some of the values are not appropriate.
Any version of Excel can handle the third model (in rows 22 to 27). There is no collinearity, and Excel models the intercept, which avoids setting the third argument to FALSE (a setting that, in versions of Excel earlier than Excel 2003, uses incorrect formulas to compute some statistics). This example is included in this article for the following reasons:
- This example is perhaps most typical of practical cases: no
collinearity is present and the third argument to LINEST is either TRUE or
omitted. All versions of Excel can handle these cases. If you use Excel 2002 or
earlier, numeric problems are not likely to occur in these cases.
- This example is used to compare behavior of Excel 2003 in
the three models. Most major statistical packages analyze collinearity, remove
a column that is a sum of multiples of others from the model, and alert you
with a message like "Column C is linearly dependent on other predictor columns
and has been removed from the analysis."
In Excel 2003, the message is conveyed in the LINEST output
table instead of in a text string. A regression coefficient that is zero and
whose standard error is also zero corresponds to a coefficient for a column
that has been removed from the model. The entries in cells K9:K10 show this. In
this case, LINEST removed column C (coefficients in cells K9, L9, M9 correspond
to columns C, B, and Excel's intercept column, respectively). When collinearity
is present, any one of the columns that are involved can be
removed.
The second model, in rows 15 to 20, sets the third argument of LINEST to FALSE, so the intercept is forced to be zero; the 0 and #N/A entries in cells N16:N17 are Excel's standard way of conveying this. Entries in cells K16:K17 show that LINEST removed one column (column D) from the model. Coefficients in columns L and M are for data columns C and B, respectively.
In the third model, in rows 22 to 27, no collinearity is present and no columns are removed. The predicted y values are the same in all three models, because explicitly modeling an intercept (as in the second model) provides exactly the same modeling capability as having Excel model it internally (as in the first model and the third model). Also, removing a redundant column that is a sum of multiples of others (as in the first model and the second model) does not reduce the goodness of fit of the resulting model. Such columns are removed precisely because they add no value in the search for the best least squares fit.
The following is a final example of collinearity. The data in this example is also used in the QR Decomposition example later in this article. To work through this example, follow these steps:
- Create a blank Excel worksheet, and then copy the following
table.
- Click cell A1 in your blank Excel worksheet, and then on
the Edit menu, click Paste so that the
entries in the table fill cells A1:D25 in your worksheet.
- After you paste the table into your new Excel worksheet,
click Paste Options, and then click Match Destination
Formatting.
- While the pasted range is still selected, on the
Format menu, point to Column, and then click
AutoFit Selection.
- Select cell A7 and the cell range A7:C11. The formula
editing bar should display the following information:
=LINEST(A2:A5,C2:D5,,TRUE)
- Enter the information from the formula editing bar as an
array formula by pressing CTRL+SHIFT+ENTER.
Cells A7:C11 show LINEST
results that match the values in cells A13:C18 or cells A20:C25, depending on
the version of Excel that you use.
Y | | X0 | X1 |
10 | | 1 | 11 |
20 | | 4 | 20 |
30 | | 8 | 32 |
40 | | 7 | 29 |
| | | |
=LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | |
=LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | |
=LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | |
=LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | |
=LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | =LINEST(A2:A5,C2:D5,,TRUE) | |
| | | |
Excel 2002 values: | | | |
-3.5 | 14.1666666666667 | 34.6666666666666 | |
0 | 0 | 0 | |
0.806666666666667 | 9.83192080250175 | #N/A | |
2.08620689655172 | 1 | #N/A | |
403.333333333333 | 96.6666666666666 | #N/A | |
| | | |
Excel 2003 values: | | | |
1.22222222222222 | 0 | -3.11111111111111 | |
0.423098505881328 | 0 | 10.3334826751454 | |
0.806666666666667 | 6.95221787153807 | #N/A | |
8.3448275862069 | 2 | #N/A | |
403.333333333333 | 96.6666666666667 | #N/A | |
This model illustrates that the presence of collinearity might not be easy to spot. When you examine cells C2:D5, you must be aware that LINEST models the intercept by providing a built-in column of 1s. If you call this column X2, you might notice that X1 can be represented as 3*X0 + 8*X2.
All versions of Excel provide the
same goodness of fit as measured by cell B18 and cell B25. However, Excel 2002
provides all zeros as the values for the standard errors of the regression
coefficients.
The entries for df in cell B17 and cell B24 differ, as do the f-statistics in cell A17 and cell A24. The df for Excel 2003 is correct for a model with two predictor columns, which is exactly what the model uses (Excel's built-in intercept column and X1). The df for Excel 2002 is appropriate for three predictor columns. However, because of collinearity, there are effectively only two predictor columns: after any two of the three columns are used, adding the third column contributes nothing to the fit. Therefore, because of collinearity, the entry in cell B17 is not correct and the entry in cell B24 is correct. The incorrect value of df affects statistics that depend on df: the f ratios in cells A17 and A24 and the standard error of y in cells B16 and B23. The entries in cell A17 and cell B16 are not correct; the entries in cell A24 and cell B23 are correct.
The following example illustrates the QR
Decomposition algorithm. It has two primary advantages over the algorithm that
uses the "Normal Equations." First, results are more stable numerically. When
collinearity is not an issue, results are typically accurate to more decimal
places with QR Decomposition. Second, QR Decomposition appropriately handles
collinearity. It can be thought of as "processing" columns one at a time, and
it does not process columns that are linearly dependent on previously processed
columns. The previous algorithm does not correctly handle collinearity. If
collinearity is present, the results from the previous algorithm are frequently
distorted, sometimes to the point of returning #NUM!.
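To see the difference, the following numpy sketch uses the same y, X0, and X1 data as the next worksheet (where X1 = 3*X0 + 8, so the predictors plus the intercept column are collinear). np.linalg.lstsq is SVD-based, a close cousin of the QR approach described here, and handles the rank deficiency gracefully; solving the Normal Equations directly fails because X'X is singular:

import numpy as np

y  = np.array([10.0, 20.0, 30.0, 40.0])
x0 = np.array([1.0, 4.0, 8.0, 7.0])
X  = np.column_stack([np.ones(4), x0, 3 * x0 + 8])   # intercept, X0, X1

# Normal Equations: X'X is singular here, so this raises an error.
# coef = np.linalg.solve(X.T @ X, X.T @ y)

# An orthogonal-decomposition solver detects the redundancy instead.
coef, resid, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)   # 2, not 3: one predictor column is redundant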
Y | | X0 | X1 | | | | | | | | |
10 | | 1 | 11 | | | | | | | | |
20 | | 4 | 20 | | | | | | | | |
30 | | 8 | 32 | | | | | | | | |
40 | | 7 | 29 | | | | | | | | |
col means: | | | | | | | | | | | |
=AVERAGE(A2:A5) | | =AVERAGE(C2:C5) | =AVERAGE(D2:D5) | | | | | | | | |
| | | | | | | | | | | |
centered data with added col of 1's, X2: | | | | | | | | | | | |
Y | | X0 | X1 | X2 | | | | | | | |
=A2-A$7 | | =C2-C$7 | =D2-D$7 | 1 | | | | | | | |
=A3-A$7 | | =C3-C$7 | =D3-D$7 | 1 | | | | | | | |
=A4-A$7 | | =C4-C$7 | =D4-D$7 | 1 | | | | | | | |
=A5-A$7 | | =C5-C$7 | =D5-D$7 | 1 | | | | | | | |
| | | | | | | | | | | |
TotalSS: | | | | | | | | | | | |
=SUMSQ(A11:A14) | | | | | | | | | | | |
X col squared lengths: | | | | | | | | | | | |
| | =SUMSQ(C11:C14) | =SUMSQ(D11:D14) | =SUMSQ(E11:E14) | | | | | | | |
| | | | | | | | | | | |
after swapping cols: | | | | | | | | | | | |
Y | | X1 | X0 | X2 | | | | | | | |
=A11 | | =D11 | =C11 | 1 | | | | | | | |
=A12 | | =D12 | =C12 | 1 | | | | | | | |
=A13 | | =D13 | =C13 | 1 | | | | | | | |
=A14 | | =D14 | =C14 | 1 | | | | | | | |
| | | | | | | | | | | |
compute V: | | | | V | | and VTV: | | and V times V transpose: | | | |
=C23 | =SQRT(D19) | 1 | | =A29+B$29*C29 | | =SUMSQ(E29:E32) | | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) |
=C24 | | 0 | | =A30+B$29*C30 | | | | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) |
=C25 | | 0 | | =A31+B$29*C31 | | | | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) |
=C26 | | 0 | | =A32+B$29*C32 | | | | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) | =MMULT(E29:E32,TRANSPOSE(E29:E32)) |
compute P = I - (2/VTV)*VVtranspose | | | | | | premultiply X by P: | | | | | and Y by P: |
| | | | | | X1 | X0 | X2 | | | Y |
=-(2/$G$29)*I29+1 | =-(2/$G$29)*J29 | =-(2/$G$29)*K29 | =-(2/$G$29)*L29 | | | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | | | =MMULT(A35:D38,A23:A26) |
=-(2/$G$29)*I30 | =-(2/$G$29)*J30+1 | =-(2/$G$29)*K30 | =-(2/$G$29)*L30 | | | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | | | =MMULT(A35:D38,A23:A26) |
=-(2/$G$29)*I31 | =-(2/$G$29)*J31 | =-(2/$G$29)*K31+1 | =-(2/$G$29)*L31 | | | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | | | =MMULT(A35:D38,A23:A26) |
=-(2/$G$29)*I32 | =-(2/$G$29)*J32 | =-(2/$G$29)*K32 | =-(2/$G$29)*L32+1 | | | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | =MMULT(A35:D38,C23:E26) | | | =MMULT(A35:D38,A23:A26) |
| | | | | | | | | | | |
Squared lengths of X, Y cols are unchanged after you premultiply by P: | | | | | | =SUMSQ(G35:G38) | =SUMSQ(H35:H38) | =SUMSQ(I35:I38) | | | =SUMSQ(L35:L38)
Algorithm continues with only bold portions of the revised X matrix and Y column | | | | | | | | | | | |
squared lengths of last 3 rows of X vectors: | | | | | | | =SUMSQ(H36:H38) | =SUMSQ(I36:I38) | | | |
| | | | | | | | | | | |
after swapping cols: | | | | | | | | | | | |
Y | | X1 | X2 | X0 | | | | | | | |
=L35 | | =G35 | =I35 | =H35 | | | | | | | |
=L36 | | =G36 | =I36 | =H36 | | | | | | | |
=L37 | | =G37 | =I37 | =H37 | | | | | | | |
=L38 | | =G38 | =I38 | =H38 | | | | | | | |
| | | | | | | | | | | |
compute V: | | | | V | | and VTV: | | and V times V transpose: | | | |
=D47 | =SQRT(I42) | 1 | | =A52+B$52*C52 | | =SUMSQ(E52:E54) | | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | |
=D48 | | 0 | | =A53+B$52*C53 | | | | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | |
=D49 | | 0 | | =A54+B$52*C54 | | | | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | =MMULT(E52:E54,TRANSPOSE(E52:E54)) | |
compute P = I - (2/VTV)*VVtranspose | | | | | | premultiply X by P: | | | | | and Y by P: |
| | | | | | X1 | X2 | X0 | | | Y |
=-(2/$G$52)*I52+1 | =-(2/$G$52)*J52 | =-(2/$G$52)*K52 | | | | =C46 | =D46 | =E46 | | | =L35 |
=-(2/$G$52)*I53 | =-(2/$G$52)*J53+1 | =-(2/$G$52)*K53 | | | | =G36 | =MMULT(A57:C59,D47:E49) | =MMULT(A57:C59,D47:E49) | | | =MMULT(A57:C59,A47:A49) |
=-(2/$G$52)*I54 | =-(2/$G$52)*J54 | =-(2/$G$52)*K54+1 | | | | =G37 | =MMULT(A57:C59,D47:E49) | =MMULT(A57:C59,D47:E49) | | | =MMULT(A57:C59,A47:A49) |
| | | | | | =G38 | =MMULT(A57:C59,D47:E49) | =MMULT(A57:C59,D47:E49) | | | =MMULT(A57:C59,A47:A49) |
| | | | | | | | | | | |
| | | | | | | | | | | |
Rewrite: effectively 0 --> 0: | | | | | | X1 | X2 | X0 | | | Y
| | | | | | =G35 | 0 | =I57 | | | =L35 |
| | | | | | 0 | =H58 | 0 | | | 0 |
| | | | | | 0 | 0 | 0 | | | =L59 |
| | | | | | 0 | 0 | 0 | | | =L60 |
| | | | | | | | | | | |
| | | | | | | | | | | |
| | | | | | | | | | | |
QR Decomposition main loop terminates because longest remaining sub-vector has length 0 | | | | | | | | | | | |
| | | | | | | | | | | |
regression coeffts by backsubstitution: | | | | | | =(L64-H64*H71)/G64 | =L65/H65 | 0 | | | |
residual SS from last 2 rows of Y: | | | | | | =SUMSQ(L66:L67) | | | | | |
| | | | | | | | Excel 2003 LINEST: | | | |
SSRegression = SSTotal - SSResidual: | | | | =A17-G72 | | | | 1.22222222222222 | 0 | -3.11111111111111 | |
R-squared = SSRegression / SSTotal | | | | =E74/A17 | | | | 0.423098505881328 | 0 | 10.3334826751454 | |
DF = 2 (see article) | | | | | | | | 0.806666666666667 | 6.95221787153807 | #N/A | |
stdErrorY = sqrt(SSResidual/DF) | | | | =SQRT(G72/2) | | | | 8.3448275862069 | 2 | #N/A | |
FStatistic = (SSRegression / (DF Regression)) / (SSResidual/DF): | | | | | | | =(E74/(0+1))/(G72/2) | 403.333333333333 | 96.6666666666667 | #N/A | |
| | | | | | | | | | | |
Intercept: | | =A7 - I71*C7 - G71*D7 | | | | | | | | |
This worksheet uses the same data as the previous
worksheet. QR Decomposition performs a sequence of orthogonal linear
transformations. The original data is in cells A1:D5. The first step is to
"center" the data for the original columns, and then explicitly add a column of
1s (it is assumed that the third argument to LINEST is TRUE or omitted). The
revised data is shown in cells A10:D14. Column X2 is the added column of 1s. To
center the data, find the mean in each column (shown in cells A7:D7) and then
subtract it from each observation in the respective column. Therefore, for the
Y, X0, and X1 columns, the original data has been replaced by deviations about
the column means. Centering is useful in minimizing round off errors. The value
in cell A17 is the sum of squares of the centered Y values, the Total Sum of
Squares for the regression analysis. Values in cells C19:E19 are sums of
squares of the centered X0 and X1 columns and the (non-centered) column of 1s
that is named X2. You can interchange columns X0, X1, and X2 so that the one
with the largest sum of squares comes first. These results are in cells
A21:D26. When you interchange columns, you must keep track of the location of
each original column.
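The centering step and the total sum of squares are easy to check in Python; a small sketch using the worksheet's data:

import numpy as np

# Columns Y, X0, X1 from cells A2:D5.
data = np.array([[10.0, 1.0, 11.0],
                 [20.0, 4.0, 20.0],
                 [30.0, 8.0, 32.0],
                 [40.0, 7.0, 29.0]])
centered = data - data.mean(axis=0)                  # subtract each column's mean
X = np.column_stack([centered[:, 1:], np.ones(4)])   # centered X0, X1, plus X2 = 1s
y = centered[:, 0]

print(np.sum(y ** 2))            # 500.0, the total sum of squares (cell A17)
print(np.sum(X ** 2, axis=0))    # 30, 270, 4: X1 is largest, so it is swapped first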
After these preliminary changes, you can run the main loop of the QR Decomposition algorithm. You want to find a 4x4 matrix (because there are 4 rows of data) that you can use to premultiply each column; this transformation must not change the squared length of any column. You first find the column vector V by taking the first column and adding the square root of the column's sum of squares (computed in cell B29) to its first entry. Other entries in the first column are not changed. This action yields the vector in cells E29:E32. The sum of squares of the entries in V (that is, V^T V) is in cell G29, and the 4x4 matrix V V^T is in cells I29:L32. Use this information to compute the 4x4 transformation matrix, P, by using the following formula:
P = I - (2/(V^T V)) * V V^T
The resulting matrix P is displayed in cells A35:D38. If you premultiply the revised X columns in cells C23:E26 by P, you obtain the results in cells G35:I38. Similarly, if you premultiply the revised Y column in cells A23:A26 by P, you obtain the results in cells L35:L38. The X1 column has been transformed so that it still has the same sum of squares as before, but all entries except the top entry in the column are 0. More precisely, entries in cells G36:G38 are "effectively 0" because they are zero to fifteen decimal places. In row 40, sums of squares are computed for all columns and are unchanged by the transformation.
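The Householder step that produces P can be sketched in a few lines of numpy. This uses the centered X1 column (the first column after the swap, cells C23:C26):

import numpy as np

x1 = np.array([-12.0, -3.0, 9.0, 6.0])   # centered X1: {11,20,32,29} minus mean 23

v = x1.copy()
v[0] += np.sqrt(np.sum(x1 ** 2))         # add the column's length to its first entry
P = np.eye(4) - (2.0 / np.dot(v, v)) * np.outer(v, v)   # P = I - (2/(V^T V)) * V V^T

print(P @ x1)   # [-16.4317..., ~0, ~0, ~0]: same squared length, zeros below the top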
The algorithm continues with a second iteration of the main loop, using only the X0 and X2 data in cells H36:I38 and the Y data in cells L36:L38. Because only three rows are now involved, the sums of squares are calculated over only the last three rows of the X0 and X2 columns. These values are displayed in cells H42:I42. The sum of squares of X0 is essentially 0. The X0 and X2 columns are swapped because X2 has the larger relevant sum of squares. After the columns are swapped, the revised columns are displayed in cells A45:E49. V is computed exactly as in the first iteration, except that now V has only three rows. Computations of V^T V, V V^T, and P continue exactly as before and are shown in rows 51-54 and cells A57:C59. You then premultiply only the last three rows of the X2, X0, and Y columns by P to yield the revised columns in cells G56:L60. To make this more readable, these columns are rewritten in cells G63:L67 with values that are effectively zero set to exactly zero.
The next iteration involves only the X0 column and its last two rows. Because the sum of squares of the entries in these rows is zero, the main loop of the algorithm terminates.
The regression coefficients can then be found by back substitution, using the fact that the matrix in cells G64:H65 is upper triangular. X2's coefficient is 0 because -2 * X2 = 0. X1's coefficient comes from -16.431677 * X1 + 0 * X2 = -20.08316; knowing that X2 = 0, this becomes -16.431677 * X1 = -20.08316, or X1 = 1.222222. These values are shown in row 71.
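Back substitution on this 2x2 upper triangular system is a two-line computation; a Python sketch with the values quoted above:

# Rows 64 and 65: -16.431677 * X1 + 0 * X2 = -20.08316 and -2 * X2 = 0.
c_x2 = 0.0 / -2.0                               # X2's coefficient: 0
c_x1 = (-20.08316 - 0.0 * c_x2) / -16.431677    # X1's coefficient
print(c_x1, c_x2)                               # 1.2222..., 0.0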
The residual sum of squares is the sum of squares of revised Y
vector entries below the second row. All the rows that were not processed at
the time the main loop of the QR Decomposition algorithm terminated are
included here. In this case, processing stopped because the last two rows in
the X0 column contained only zeros. The residual sum of squares is calculated
in cell G72. You can see from the entries in cells G63:L67 that any values for
the coefficients of the Xs leave a fitted value of zero for each of these last
two rows. The values of coefficients for X1 and X2 that have been found yield
an exact fit to Y values in the first two rows. Therefore, Y has been
transformed so that its total sum of squares is not changed, the residual sum
of squares is the sum of squares in the last two rows, and the regression sum
of squares is the sum of squares in the first two rows.
The algorithm
spotted collinearity when it noticed that the remaining entries in the X0
column were zero. At this point, no columns remain whose coefficients may
improve the fit. The X0 column does not contain any useful additional
information because X1 and X2 are already included in the model. Although X2
has a coefficient of zero, this does not make it a redundant column that is
eliminated as a result of collinearity.
At this point, you can extract
most of the summary statistics that LINEST provides. However, this article does
not discuss how to determine standard errors of the regression coefficients.
Values from LINEST output in Excel 2003 are shown in cells I74:K78 for
comparison. The regression sum of squares is calculated in cell E74 and
R-squared is calculated in cell E75; these values are displayed in the LINEST
output in cell I78 and cell I76, respectively. The residual sum of squares (or
error sum of squares) is calculated in cell G72 and displayed in the LINEST
output in cell J78.
Other entries in the LINEST output depend on the
degrees of freedom (DF). Many statistical packages report Regression DF, Error
DF, and Total DF. Excel reports only Error DF (in cell J77). Earlier versions
of Excel compute Error DF correctly in all cases except when there is
collinearity that should have eliminated one or more predictor columns. The
value of Error DF depends on the number of predictor columns that are actually
used. With collinearity, Excel 2003 handles this computation correctly, while
earlier versions count all predictor columns even though one or more should
have been eliminated by collinearity.
Degrees of freedom is examined
here in more detail. Assume that collinearity is not an issue. When the
intercept is fitted, in other words, the third argument to LINEST is missing or
true:
- Total DF equals the number of rows (or datapoints) minus
one.
- Regression DF equals the number of predictor columns (not
including the column for intercept).
- Error DF equals Total DF minus Regression DF.
When the intercept is not fitted, in other words, the third
argument to LINEST is false:
- Total DF equals the number of rows (or
datapoints).
- Regression DF equals the number of predictor
columns.
- Error DF equals Total DF minus Regression DF.
The only difference between these two cases is the "minus one"
in the formula for Total DF in the more common case where the intercept is
fitted.
Earlier versions of Excel use these formulas and compute DF correctly, except when collinearity is present, because they do not look for collinearity. Detecting collinearity is one of the reasons for using QR Decomposition for these computations.
The predictor columns form a matrix. If the intercept is
fitted, there is effectively an additional column of 1s that does not appear on
your spreadsheet. QR Decomposition determines the rank of this matrix. The
previous formulas for Regression DF should be changed to the following
formulas:
- For the "fitted" case: Regression DF equals the rank of the
matrix of predictor columns (including a column of 1s for intercept) minus one
(for the column for intercept)
- For the "not fitted" case: Regression DF equals the rank of
the matrix of predictor columns
Also, because Excel uses finite arithmetic, "rank" is really
"approximate rank", so a column is linearly dependent on a set of other columns
if there is a weighted sum of columns in the set, a vector, whose Euclidean
distance from the column is very close to zero.
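These rules are small enough to state as a Python sketch (the helper name linest_dfs is hypothetical); the usage line matches the worked example that follows:

def linest_dfs(n_rows, rank, intercept_fitted):
    # rank = approximate rank of the predictor matrix, including the
    # internal column of 1s when the intercept is fitted.
    total_df = n_rows - 1 if intercept_fitted else n_rows
    regression_df = rank - 1 if intercept_fitted else rank
    return total_df, regression_df, total_df - regression_df

print(linest_dfs(4, 2, True))   # (3, 1, 2): Total DF, Regression DF, Error DF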
In the example on the
worksheet, the intercept was fitted. Total DF is 4 - 1 = 3; Regression DF is 2
- 1 = 1; Error DF is Total DF - Regression DF = 3 - 1 = 2. For this example,
Excel 2002 and earlier calculated Regression DF as 3 - 1 = 2 and Error DF as 3
- 2 = 1. The difference comes from the failure to look for collinearity.
Earlier versions of Excel noted that there were three predictor columns; Excel
2003 examined these three columns and found that there were really only
two.
Standard error of Y is calculated in cell E77 and is shown in the
LINEST output in cell J76. The f statistic is calculated in cell H78 and in the
LINEST output in cell I77. The formula for the f statistic is:
(SSRegression / DF Regression) / (SSError / DF Error)
In this example, the f statistic is:
(403.333 / 1) / (96.667 / 2) = 8.345
The LINEST output in cells I74:K74 shows the regression coefficients for
X1, X0, and the fitted intercept. The coefficient for the intercept, -3.1111,
differs from the coefficient for the column X2. This difference occurs because
the data was centered to find the best fit linear regression model for this
data. Optimal regression coefficient values for X1 and X0 are not affected by
centering this data. Centering the data causes the best fit to pass through the
origin. Centering this data is the reason that an optimal coefficient of zero
for X2 (the column that was added to represent the intercept) was found.
Fortunately, you can recover the corresponding intercept coefficient for the original model with little additional effort. The intercept coefficient can be found by using the following formula:
Intercept = (Y column mean) - SUM over all X columns, except the intercept column, of (X column coefficient * X column mean)
This value is calculated in cell C80 and agrees with the LINEST output in cell K74.
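The same recovery, sketched in Python with the worksheet's values:

# Recover the intercept of the original (uncentered) model, as in cell C80.
y_mean = 25.0                   # AVERAGE(A2:A5)
x0_mean, x1_mean = 5.0, 23.0    # AVERAGE(C2:C5), AVERAGE(D2:D5)
coef_x0, coef_x1 = 0.0, 1.2222222222222222
intercept = y_mean - coef_x0 * x0_mean - coef_x1 * x1_mean
print(intercept)                # -3.1111..., matching cell K74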
Summary of results in earlier versions of Excel
LINEST used a formula that is not correct to find the total sum of squares when the third argument to LINEST was set to FALSE. This formula produced values that are not correct for the regression sum of squares and for the other output that depends on the regression sum of squares: r squared and the f statistic.
Regardless of the value of the third argument, LINEST was computed by using an approach that paid no attention to collinearity. The presence of collinearity caused round off errors, standard errors of the regression coefficients that are not appropriate, and degrees of freedom that are not appropriate. Sometimes, round off problems were sufficiently severe that LINEST filled its output table with #NUM!. LINEST generally provides acceptable results if the following conditions are true:
- There are no collinear (or nearly collinear) predictor
columns.
- The third argument to LINEST is TRUE or is
omitted.
However, solving for regression coefficients by using the "Normal Equations" is more prone to round off errors than the QR Decomposition approach that is used in Excel 2003. Even so, these round off errors are not likely to be problematic for most practical cases.
Summary of results in Excel 2003
Improvements include correcting the formula for total sum of
squares in the case where the third argument to LINEST was set to FALSE and
switching to the QR Decomposition method of determining the regression
coefficients. QR Decomposition has the following two advantages:
- Better numeric stability (generally smaller round off
errors)
- Analysis of collinearity issues
Collinearity is the main concern, particularly after tests were
performed on NIST datasets.
Conclusions
LINEST has been greatly improved for Excel 2003. If you use an
earlier version of Excel, verify that predictor columns are not collinear
before you use LINEST. Be careful to use the workaround in this article if the
third argument in LINEST is set to FALSE. Note that collinearity is only a
problem in a small percentage of cases and calls to LINEST with the third
argument set to FALSE are also relatively rare in practice. Earlier versions of
Excel give acceptable LINEST results when there is no collinearity and the
third argument in LINEST is TRUE or omitted. Improvements in LINEST affect the
Analysis ToolPak's linear regression tool that calls LINEST and the following
related functions: