Back of the Envelope

Observations on the Theory and Empirics of Mathematical Finance

[Stats] Results involving correlation


I have noticed that many of our students are not quite aware of the mathematical proofs behind some very basic identities involving correlation (standard books on probability – including that one – of course have the necessary proofs, but old-fashioned textbooks aren’t exactly all that popular, are they?).

I have in mind these three:

  1. The correlation coefficient always lies between -1 and +1
  2. Perfect correlation between two variables implies a linear relationship between them
  3. The relationship between R^2 in a simple linear regression and the correlation coefficient

1. Proof that \mathbf{|\rho| \le 1}

(This pretty much follows Feller, Vol. 1, Chapter 9)

Normalize the random variables X and Y to have mean 0 and standard deviation 1, as:

\begin{aligned} x &= \frac{X - \mu_X}{\sigma_X} \hspace{0.5pc} \mbox{and}\hspace{0.5pc} y=\frac{Y - \mu_Y}{\sigma_Y} \end{aligned}

Covariance, \mbox{Cov}(x, y), between the standardized random variables x and y is:

\begin{aligned} \mbox{Cov}(x, y) &= \mbox{Cov}(\frac{X - \mu_X}{\sigma_X}, \frac{Y - \mu_Y}{\sigma_Y}) \\&=\mathbb{E}\big[\frac{(X - \mu_X)(Y - \mu_Y)}{\sigma_X \sigma_Y}\big] - \mathbb{E}\big[\frac{X - \mu_X}{\sigma_X}\big] \mathbb{E}\big[\frac{Y - \mu_Y}{\sigma_Y}\big] \\&=\frac{\mbox{Cov}(X, Y)}{\sigma_X \sigma_Y} \\&= \rho \end{aligned}

i.e. the covariance between the standardized random variables equals the correlation between the original random variables (the second term in the second line above is zero because \mathbb{E}[X] = \mu_X and \mathbb{E}[Y] = \mu_Y).
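As a quick numerical sanity check of this identity, here is a minimal sketch in plain Python. The simulated data and the helper names (`mean`, `cov`, `standardize`) are illustrative assumptions, not part of the derivation; population (divide-by-n) moments are used throughout.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Population covariance: average product of deviations from the means
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def standardize(v):
    # Subtract the mean and divide by the (population) standard deviation
    m = mean(v)
    s = cov(v, v) ** 0.5
    return [(a - m) / s for a in v]

# Simulated data, chosen arbitrarily for illustration
rng = random.Random(0)
X = [rng.gauss(5.0, 2.0) for _ in range(1000)]
Y = [0.6 * a + rng.gauss(0.0, 1.5) for a in X]

# Correlation of the original variables
rho = cov(X, Y) / (cov(X, X) ** 0.5 * cov(Y, Y) ** 0.5)

# Covariance of the standardized variables
cov_std = cov(standardize(X), standardize(Y))

print(rho, cov_std)  # the two numbers should agree up to rounding error
```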

The trick now is to calculate the variance of the sum and of the difference of the standardized random variables:

\begin{aligned} \mbox{Var}(x \pm y) &=\mbox{Var}(x) +\mbox{Var}(y) \pm 2\mbox{Cov}(x, y) \\&= 1 + 1 \pm 2\rho \\&= 2(1 \pm \rho) \end{aligned}

Since a variance is always non-negative, 2(1 \pm \rho) \ge 0 for both signs, so we must have -1 \le \rho \le 1, i.e. |\rho| \le 1.
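The variance identity used in this proof can also be checked numerically. The sketch below uses simulated data and illustrative helper names (both are assumptions made for the example) and compares the variance of x + y and x - y with 2(1 + \rho) and 2(1 - \rho).

```python
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    # Population variance
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def standardize(v):
    m, s = mean(v), var(v) ** 0.5
    return [(a - m) / s for a in v]

# Simulated, negatively related data (arbitrary choice for illustration)
rng = random.Random(1)
X = [rng.gauss(0.0, 3.0) for _ in range(500)]
Y = [-0.8 * a + rng.gauss(0.0, 2.0) for a in X]

x, y = standardize(X), standardize(Y)

# x and y have mean 0, so their covariance is just the mean of the products
rho = mean([a * b for a, b in zip(x, y)])

var_sum = var([a + b for a, b in zip(x, y)])
var_diff = var([a - b for a, b in zip(x, y)])

print(var_sum, 2 * (1 + rho))
print(var_diff, 2 * (1 - rho))
```

Both variances are non-negative by construction, which is exactly what pins \rho inside [-1, 1].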

2. A special case: \rho = \pm 1 

When \rho = \pm 1, the last equation gives \mbox{Var}(x \mp y) = 0 (the difference has zero variance when \rho = 1, the sum when \rho = -1), meaning x \mp y = constant with probability one. Converting back to the original variables, this implies:

\begin{aligned} \frac{X - \mu_X}{\sigma_X} \mp \frac{Y - \mu_Y}{\sigma_Y} &=\mbox{constant} \\ \Rightarrow Y &=c\pm \frac{\sigma_Y}{\sigma_X} X \end{aligned}

i.e. when the two random variables are perfectly positively or negatively correlated, each can be written as a linear function of the other. Put differently, correlation measures the strength of the linear relationship between two random variables.
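To make the special case concrete, here is a small sketch with an exactly linear, made-up data set Y = 3 - 2X (the numbers and helper names are assumptions for illustration only). The correlation comes out as -1, the sum of the standardized variables is constant, and the implied slope is -\sigma_Y / \sigma_X.

```python
def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Population standard deviation
    m = mean(v)
    return (sum((a - m) ** 2 for a in v) / len(v)) ** 0.5

# Hypothetical data with an exact linear relationship Y = 3 - 2X
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [3.0 - 2.0 * a for a in X]

x = [(a - mean(X)) / sd(X) for a in X]
y = [(b - mean(Y)) / sd(Y) for b in Y]

# Correlation is the covariance of the standardized variables
rho = mean([a * b for a, b in zip(x, y)])

# With rho = -1 it is the sum x + y that has zero variance
sums = [a + b for a, b in zip(x, y)]

# Slope and intercept implied by Y = c - (sd_Y / sd_X) X
slope = -sd(Y) / sd(X)
intercept = mean(Y) - slope * mean(X)

print(rho, sums, slope, intercept)
```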

3. \mathbf{\mbox{R}^2 = \rho^2}

The least-squares slope coefficient b in a simple linear regression y = a + bx + \epsilon is given by:

\begin{aligned} b &=\dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \end{aligned}

with the \mbox{R}^2 given as:

\begin{aligned} \mbox{R}^2 &=\dfrac{\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2}{\sum (y - \bar{y})^2}\end{aligned}

The correlation coefficient between x and y is given by:

\begin{aligned} \rho &=\dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}\end{aligned}

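Here \mathbb{E}[y] = a + bx denotes the fitted value, with the least-squares intercept a = \bar{y} - b\bar{x}, so that y - \mathbb{E}[y] = (y - \bar{y}) - b(x - \bar{x}). Squaring and summing, it is a matter of simple algebra to verify that: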

\begin{aligned} \sum (y - \mathbb{E}[y])^2 &=\sum (y - \bar{y})^2 + b^2 \sum (x - \bar{x})^2 - 2b \sum (x - \bar{x})(y - \bar{y})\end{aligned}

Putting the two together, and using the formula for the regression coefficient, b, we have that:

\begin{aligned}\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2 &=2b \sum (x - \bar{x})(y - \bar{y}) - b^2 \sum (x - \bar{x})^2\\&= 2 \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \times \sum (x - \bar{x})(y - \bar{y}) \\ &\qquad - \dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\big(\sum (x - \bar{x})^2\big)^2}\times \sum (x - \bar{x})^2 \\&= \dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\sum (x - \bar{x})^2}\end{aligned}

(where, in the last term of the second step above, one factor of \sum (x - \bar{x})^2 cancels between numerator and denominator, so the two terms combine into a single one)

Finally, dividing both sides by \sum (y - \bar{y})^2 gives the definition of \mbox{R}^2 on the LHS and \rho^2 on the RHS, i.e.:

\begin{aligned}\dfrac{\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2}{\sum (y - \bar{y})^2} &=\dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\sum (x - \bar{x})^2\sum (y - \bar{y})^2} \\ \\ \Rightarrow \mbox{R}^2 &= \rho^2 \end{aligned}
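The result is easy to confirm numerically. The following sketch fits a simple regression by hand on simulated data (the data-generating numbers and variable names are assumptions for illustration) and compares R^2, computed from the residual sum of squares, with the squared correlation coefficient.

```python
import random

# Simulated data for illustration: y = 1.5 + 0.7 x + noise
rng = random.Random(2)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [1.5 + 0.7 * xi + rng.gauss(0.0, 2.0) for xi in x]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

# Least-squares slope and intercept
b = sxy / sxx
a = ybar - b * xbar

# Residual sum of squares around the fitted values a + b x
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (syy - sse) / syy
rho = sxy / (sxx * syy) ** 0.5

print(r_squared, rho ** 2)  # should agree up to rounding error
```

Note that the equality relies on the regression including an intercept and having a single regressor, as in the derivation above.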

(This result also has a more concise and general proof, but uses linear algebra. See, for example, here.)


Written by Vineet

February 15, 2015 at 7:47 pm

Posted in Math: Useful
