Excel statistical functions: PEARSON (828129)



The information in this article applies to:

  • Microsoft Office Excel 2003

SUMMARY

This article describes the PEARSON function in Microsoft Excel. This article discusses how to use the function, and this article compares results of PEARSON in Excel 2003 with the results of PEARSON in earlier versions of Excel.

MORE INFORMATION

The PEARSON(array1, array2) function returns the Pearson product-moment correlation coefficient between two arrays of data.

Syntax

PEARSON(array1, array2)

The arguments, array1 and array2, must be either numbers, names, array constants, or references that contain numbers.

The most common use of PEARSON includes two ranges of cells that contain the data, such as PEARSON(A1:A100, B1:B100).

Example Usage

To illustrate the PEARSON function, follow these steps:
  1. Create a blank Excel worksheet, and then copy the following table.
    1= 3 + 10^$D$2Power of 10 to add to data
    2=4 + 10^$D$20
    3=2 + 10^$D$2
    4=5 + 10^$D$2
    5=4+10^$D$2
    6=7+10^$D$2pre-Excel 2003
    when D2 = 7.5
    =PEARSON(A1:A6,B1:B6)0.702038
    =CORREL(A1:A6,B1:B6)0.713772
    when D2 = 8
    #DIV/0!
    0.713772

  2. Select cell A1 in your blank Excel worksheet, and then click Paste on the Edit menu so that the entries in the table fill cells A1:D13 in your worksheet.
  3. Click the Paste Options button, and then click Match Destination Formatting. With the pasted range still selected, point to Column on the Format menu, and then click AutoFit Selection.

    Note You may want to format cells B1:B6 as Number with 0 decimal places.
Cells A1:A6 and B1:B6 contain the two data arrays that are used in this example to call PEARSON and the CORREL function in cells A8 and A9. PEARSON and CORREL both compute the Pearson product-moment correlation coefficient and their results should agree.

In versions of Excel that are earlier than Excel 2003, PEARSON may exhibit round-off errors. The behavior of PEARSON has been improved in Excel 2003. CORREL has always been implemented with the improved procedure that is now used in Excel 2003. Therefore, if you are using PEARSON for a version of Excel that is earlier than Excel 2003, Microsoft recommends that you use CORREL instead.

In versions of Excel that are earlier than Excel 2003, you can use the worksheet in this article to run an experiment and discover when round-off errors occur. If you add a constant to each of the observations in B1:B6, the value of PEARSON or CORREL should not be affected. If you increase the value in D2, a larger constant is added to B1:B6. If D2 is less than 7, there are no round-off errors that appear in the first six decimal places of PEARSON. Now change the value of D2 to 7.25, 7.5, 7.75, and then 8. Cells D6:D13 of the worksheet show values of PEARSON and CORREL when D2 = 7.5 and when D2 = 8, respectively.

CORREL is still accurate, but round-off errors in PEARSON have become so severe that division by 0 occurs when D2 = 8.

Earlier versions of Excel exhibit incorrect answers in these cases because the effects of round-off errors are more profound with the computational formula used by these versions. Still, the cases in this experiment can be viewed as extreme.

In Excel 2003, you should see no changes in values of PEARSON if you try the experiment. However, cells D6:D13 show the same round-off errors that you obtain in earlier versions of Excel.

Results in Earlier Versions of Excel

If you name the two data arrays X's and Y's, earlier versions of Excel use a single pass through the data to compute the sum of squares of X's, the sum of squares of Y's, the sum of X's, the sum of Y's, the sum of XY's, and the count of the number of observations in each array. These quantities are then combined in the computational formula in the Help file in earlier versions of Excel.

Results in Excel 2003

The procedure that is used in Excel 2003 uses a two-pass process through the data. First, the sums of X's and Y's and the count of the number of observations in each array are computed. From these, the means (averages) of X and Y observations can be computed. Then, on the second pass, the squared difference between each X and the X mean is found; these squared differences are summed. The squared difference between each Y and the Y mean is found; these squared differences are summed. Additionally, the products (X - X mean) * (Y - Y mean) are found for each pair of data points and are summed. These three sums are combined in the formula for PEARSON. None of these three sums are affected by adding a constant to each value in the Y array (or the X array), because that same value is added to the Y mean (or the X mean). In the numeric examples, even with a high power of 10 in cell D12, these three sums are not affected and the results of the second pass are independent of the entry in cell D2. Therefore, the results in Excel 2003 are more stable numerically.

Conclusions

A two-pass approach guarantees better numeric performance of PEARSON in Excel 2003 than the one-pass approach in earlier versions of Excel. Excel 2003 results will never be less accurate than results in earlier versions. CORREL has the same functionality and has always been implemented with the approach that is used for PEARSON in Excel 2003. Therefore, CORREL is a better choice for earlier versions of Excel.

In most practical examples, however, you are not likely to notice a difference between Excel 2003 results and the results in earlier versions of Excel. Typical data is unlikely to exhibit the kind of unusual behavior that this experiment illustrates. Numeric instability is most likely to appear in earlier versions of Excel when data contains both a high number of significant digits and relatively little variation between data values.

The procedure that finds the sum of squared deviations about a sample mean by
  1. finding the sample mean
  2. computing each squared deviation
  3. and then summing the squared deviations
is more accurate than the alternative procedure. (The alternative procedure is frequently referred to as the calculator formula, because it is suitable for use of a calculator on a small number of data points.) The alternative procedure is to:
  1. Find the sum of squares of all observations, the sample size, and the sum of all observations.
  2. Compute the sum of squares of all observations minus ((sum of all observations)^2)/sample size).
There are many other functions that have been improved in Excel 2003 by replacing this latter one-pass procedure with the two-pass procedure that finds the sample mean on the first pass and then computes the sum of squared deviations about it on the second pass.

A short list of such functions includes VAR, VARP, STDEV, STDEVP, DVAR, DVARP, DSTDEV, DSTDEVP, FORECAST, SLOPE, INTERCEPT, PEARSON, RSQ, and STEYX. Similar improvements were made in each of the three Analysis of Variance tools in the Analysis ToolPak.

For more information about PEARSON, click Microsoft Excel Help on the Help menu, type pearson in the Search for box in the Assistance pane, and then click Start searching to view the topic.

Modification Type:MinorLast Reviewed:3/14/2006
Keywords:kbinfo KB828129 kbAudEndUser