4.2 Covariance, Correlation, Regression
When doing science, we often want to study the relation between two observables A
and B
. Most of the time, we want to foresee the future with partial information:
Can we foresee the behavior of
A
by knowingB
? Or maybe the opposite? When one grows, what happens to the other?
We will use the following data to exemplify the following techniques:
1 Manual Experiment
The class will compile a list with the following information:
- Height
- Shoe size
- Number of brothers and sisters
We will make a graph with this information in Excel. What values seem to grow together? Which values seem to have no relation between them.
2 Correlation
Correlation is a measure of how two things behave similarly. For example:
- People’s height is positively correlated with their weight – that’s because a bigger height in general means a bigger weight.
- In a house, the amount of people is negatively correlated with the amount of space per person – that’s because the more people you have, the less space each one can have for themselves.
These behaviors can be seen in a graph. For example, look at this graph:
As we can see, the values tend to grow from the left to the right, together. This is the sign of a positive correlation.
Mathematically, there is a formula for correlation, which takes value from -1 up to 1. It goes from negatively correlated (-1), to uncorrelated (0) to positively correlated (1).
2.1 Formula
In order to calculate correlation, one needs sequences pairs of numbers. These sequences will be written x_1, ..., x_n and y_1, ..., y_n. Writing M_x for the average of x and M_y for the correlation of y, then the formula for correlation is:
r\_{xy} = \frac{\sum\_{i=1}^n (x_i-M_x) (y_i-M_y)}{\sqrt{\sum\_{i=1}^n (x_i-M_x)^2} \sqrt{\sum\_{i=1}^n (y_i - M_y)^2}}
2.2 In Excel
In Excel, the correlation between two data sets can be calculated using the CORREL(A1:A100,B1:B100)
function, replacing the data ranges by the desired values.
2.3 Exercises
In the babies data set, there are many measurements. For any pair of measurements, we can:
- Create a X-Y scatter plot to visualize the data
- Analyze it, and try to infer if there is a positive, negative or no correlation.
- In a separate cell, measure the correlation using
CORREL
.
Some interesting pairs for doing it are:
- Size and weight
- Size and number of cigarettes smoked by the mother
- Weight and number of cigarettes smoked by the mother
- Size and father height
- Size and father years of education
3 Regression
In the section above we studied correlation. In some cases, such as weight and size, there is a very clear relation between the two variables. This deserves a question: can we model this relation with a straight line?
Look at the graph above: each blue dot has a coordinate (x,y). The straight line, is also composed by pairs of coordinates, such as the points, but its points follow a rule: y = \beta + \alpha x In this formula, there are two coefficients:
- \beta is the intersection coefficient: y = \beta when x = 0
- \alpha is the inclination coefficient: it measures the speed of increase or decrease of the variable
How should we draw this line? After all, we can draw any line in the graph, and it will be a model – but its quality can be better or worse, depending on the coefficients.
In order to choose the coefficients, we evaluate the error term:
The error term measures the difference between the measured value (one of the y_k values) and the value predicted by the model (calculated by \beta + \alpha x_k). The best model is the one which makes the error, measured by: \epsilon(\alpha,\beta) = \sum_{k = 1}^n (y_k - \beta - \alpha x_k)^2 as small as possible. The square (²) is there to make calculations easier.
If we use this criterion, we can calculate the regression coefficients as:
\beta = M_y − \alpha M_x
and
\alpha = r_{xy} \frac{\sigma_y}{\sigma_x}
where M represents the mean, r the regression coefficient, and \sigma the standard deviation.
3.1 Practice
Choose a pair of measurements. For this pair we will create a table with:
- Average of x (M_x)
- Average of y (M_y)
- Standard deviation of x (\sigma_x)
- Standard deviation of y (\sigma_y)
- Correlation of x and y (r_{xy})
- The \beta coefficient
- The \alpha coefficient
- The optimal \beta
- The optimal \alpha
Next to it, two columns:
- One with the prediction \beta + \alpha x
- Another with the error y − \beta − \alpha x
4 Control
Download the following:
And follow the instructions. We will make rounds so all students can do the control by themselves.