4.2 Covariance, Correlation, Regression

When doing science, we often want to study the relation between two observables A and B. Most of the time, we want to predict one thing from partial information about another:

Can we predict the behavior of A by knowing B? Or the other way around? When one grows, what happens to the other?

We will use the following data to illustrate these techniques:

1 Manual Experiment

The class will compile a list with the following information:

We will make a graph with this information in Excel. Which values seem to grow together? Which values seem to have no relation between them?

2 Correlation

Correlation is a measure of how closely two quantities vary together. For example:

These behaviors can be seen in a graph. For example, look at this graph:

As we can see, both values tend to grow together as we move from left to right. This is the sign of a positive correlation.

Mathematically, there is a formula for correlation, which takes values from -1 to 1. It goes from perfectly negatively correlated (-1), through uncorrelated (0), to perfectly positively correlated (1).

2.1 Formula

In order to calculate correlation, one needs a sequence of pairs of numbers. These pairs will be written as two sequences, x_1, ..., x_n and y_1, ..., y_n. Writing M_x for the average of x and M_y for the average of y, the formula for correlation is:

r_{xy} = \frac{\sum_{i=1}^n (x_i - M_x) (y_i - M_y)}{\sqrt{\sum_{i=1}^n (x_i - M_x)^2} \sqrt{\sum_{i=1}^n (y_i - M_y)^2}}
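As a minimal sketch, the formula can be written out step by step in Python. The function name `correlation` and the sample numbers are made up for illustration; they are not taken from the class data set.

```python
import math


def correlation(xs, ys):
    """Correlation r_xy of two equal-length sequences, following the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    m_x = sum(xs) / n  # M_x, the average of x
    m_y = sum(ys) / n  # M_y, the average of y
    # Numerator: sum of the products of the deviations from the averages
    numerator = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations
    denominator = (math.sqrt(sum((x - m_x) ** 2 for x in xs))
                   * math.sqrt(sum((y - m_y) ** 2 for y in ys)))
    if denominator == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return numerator / denominator


# Hypothetical measurements, invented for this example
heights = [50, 52, 54, 56, 58]
weights = [3.1, 3.4, 3.9, 4.1, 4.6]
r = correlation(heights, weights)
print(round(r, 3))
```

A result close to 1 means the two sequences grow together; a result close to -1 means one shrinks as the other grows.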

2.2 In Excel

In Excel, the correlation between two data sets can be calculated with the function CORREL(A1:A100,B1:B100), replacing the example ranges with the ranges that hold your data.

2.3 Exercises

In the babies data set, there are many measurements. For any pair of measurements, we can:

Some interesting pairs for doing it are:

3 Regression

In the section above we studied correlation. In some cases, such as weight and size, there is a very clear relation between the two variables. This raises a question: can we model this relation with a straight line?

Look at the graph above: each blue dot has coordinates (x, y). The straight line is also made up of points with coordinates (x, y), but these points follow a rule: y = \beta + \alpha x. In this formula, there are two coefficients: \beta, the intercept (the value of y when x = 0), and \alpha, the slope (how much y changes when x grows by 1).

How should we draw this line? After all, we can draw any line in the graph, and it will be a model – but its quality can be better or worse, depending on the coefficients.

In order to choose the coefficients, we evaluate the error term:

The error term measures the difference between the measured value (one of the y_k values) and the value predicted by the model (calculated as \beta + \alpha x_k). The best model is the one that makes the total error, measured by

\epsilon(\alpha,\beta) = \sum_{k=1}^n (y_k - \beta - \alpha x_k)^2

as small as possible. The square keeps positive and negative differences from cancelling each other out, and it also makes the calculations easier.
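To make the error measure concrete, here is a small Python sketch that evaluates it for two candidate lines. The function name `squared_error` and the four data points are invented for this example; the points were chosen to lie exactly on the line y = 1 + 2x.

```python
def squared_error(alpha, beta, xs, ys):
    """Sum of squared differences between each measured y_k and beta + alpha * x_k."""
    return sum((y - beta - alpha * x) ** 2 for x, y in zip(xs, ys))


# Made-up points that lie exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

good = squared_error(2.0, 1.0, xs, ys)  # the line that passes through every point
poor = squared_error(1.0, 1.0, xs, ys)  # a line with the wrong slope

print(good)  # 0.0
print(poor)  # 30.0
```

The line with the smaller error is the better model of these points. With real measurements the error is almost never exactly zero, so we look for the coefficients that make it smallest.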

If we use this criterion, we can calculate the regression coefficients as:

\beta = M_y - \alpha M_x

and

\alpha = r_{xy} \frac{\sigma_y}{\sigma_x}

where M represents the mean, r the correlation coefficient, and \sigma the standard deviation.
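The two formulas can be tried out with a short Python sketch. The function name `fit_line` and the data are assumptions made for illustration. The sketch uses the population standard deviation (`statistics.pstdev`); the sample standard deviation would give the same slope, as long as the same kind is used for both x and y.

```python
import statistics


def fit_line(xs, ys):
    """Return (alpha, beta) for the line y = beta + alpha * x, using the formulas above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    m_x = statistics.mean(xs)
    m_y = statistics.mean(ys)
    s_x = statistics.pstdev(xs)  # sigma_x (population standard deviation)
    s_y = statistics.pstdev(ys)  # sigma_y
    if s_x == 0 or s_y == 0:
        raise ValueError("the correlation is undefined when one sequence is constant")
    # Correlation r_xy: average product of deviations, divided by both standard deviations
    r = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / n / (s_x * s_y)
    alpha = r * s_y / s_x        # slope
    beta = m_y - alpha * m_x     # intercept
    return alpha, beta


# Made-up points that lie exactly on y = 1 + 2x
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(round(alpha, 6), round(beta, 6))
```

Because the example points lie exactly on a line, the sketch recovers that line's slope and intercept (up to rounding). With scattered data it returns the line with the smallest squared error.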

3.1 Practice

Choose a pair of measurements. For this pair we will create a table with:

Next to it, two columns:

4 Control

Download the following:

Then follow the instructions. We will work in rounds so that every student can do the control on their own.