Back of the Envelope

Observations on the Theory and Empirics of Mathematical Finance

[Stats] Results involving correlation

I have noticed that many of our students are not familiar with the mathematical proofs behind some very basic identities involving correlation (standard books on probability, including that one, of course have the necessary proofs, but old-fashioned textbooks aren’t exactly all that popular, are they?).

I have in mind these three:

1. The correlation coefficient always lies between $-1$ and $+1$
2. Perfect correlation between two variables implies a linear relationship between them
3. The relationship between $R^2$ in a simple linear regression and the correlation coefficient

1. Proof that $\mathbf{|\rho| \le 1}$

(This pretty much follows Feller, Vol. 1, Chapter 9)

Normalize the random variables $X$ and $Y$ to have mean 0 and standard deviation 1, as:

\begin{aligned} x &= \frac{X - \mu_X}{\sigma_X} \hspace{0.5pc} \mbox{and}\hspace{0.5pc} y=\frac{Y - \mu_Y}{\sigma_Y} \end{aligned}

Covariance, $\mbox{Cov}(x, y)$, between the standardized random variables $x$ and $y$ is:

\begin{aligned} \mbox{Cov}(x, y) &= \mbox{Cov}(\frac{X - \mu_X}{\sigma_X}, \frac{Y - \mu_Y}{\sigma_Y}) \\&=\mathbb{E}\big[\frac{(X - \mu_X)(Y - \mu_Y)}{\sigma_X \sigma_Y}\big] - \mathbb{E}\big[\frac{X - \mu_X}{\sigma_X}\big] \mathbb{E}\big[\frac{Y - \mu_Y}{\sigma_Y}\big] \\&=\frac{\mbox{Cov}(X, Y)}{\sigma_X \sigma_Y} \\&= \rho \end{aligned}

i.e. the covariance between the standardized random variables equals the correlation between the original random variables (the second term in the second line of the derivation above is zero because $\mathbb{E}[X] = \mu_X$ and $\mathbb{E}[Y] = \mu_Y$).
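For readers who prefer to see this numerically, here is a minimal sketch in plain Python. The data and the helper names (`mean`, `cov`, `standardize`) are made up for illustration, and population moments (dividing by $n$) are assumed throughout.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Population covariance (divides by n); cov(v, v) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def standardize(v):
    """Rescale v to mean 0 and standard deviation 1."""
    m, s = mean(v), sqrt(cov(v, v))
    return [(a - m) / s for a in v]

# Made-up sample, purely illustrative
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 1.0, 5.0, 6.0, 12.0]

# Correlation of the original variables
rho = cov(X, Y) / sqrt(cov(X, X) * cov(Y, Y))

# Covariance of the standardized variables
x, y = standardize(X), standardize(Y)
cov_xy = cov(x, y)

print(rho, cov_xy)  # the two numbers should agree up to rounding
```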

The trick now is to calculate the variance of the sum (and of the difference) of the standardized random variables:

\begin{aligned} \mbox{Var}(x \pm y) &=\mbox{Var}(x) +\mbox{Var}(y) \pm 2\mbox{Cov}(x, y) \\&= 1 + 1 \pm 2\rho \\&= 2(1 \pm \rho) \end{aligned}

Since variance is always non-negative, we need both $1 + \rho \ge 0$ and $1 - \rho \ge 0$, so we must have that $-1 \le \rho \le 1$ or $|\rho| \le 1$.
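The identity $\mbox{Var}(x \pm y) = 2(1 \pm \rho)$ is easy to check on a small sample. The sketch below assumes made-up data and population variances; the helper names are illustrative only.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Population variance (divides by n)."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def standardize(v):
    m, s = mean(v), sqrt(var(v))
    return [(a - m) / s for a in v]

# Made-up sample, purely illustrative
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 1.0, 5.0, 6.0, 12.0]

x, y = standardize(X), standardize(Y)

# For standardized variables, Cov(x, y) = E[xy] = rho
rho = mean([a * b for a, b in zip(x, y)])

var_sum = var([a + b for a, b in zip(x, y)])    # should equal 2(1 + rho)
var_diff = var([a - b for a, b in zip(x, y)])   # should equal 2(1 - rho)

print(var_sum, 2 * (1 + rho))
print(var_diff, 2 * (1 - rho))
```

Both variances are sums of squares, so neither can be negative, which is exactly what pins $\rho$ between $-1$ and $1$.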

2. A special case: $\rho = \pm 1$

When $\rho = \pm 1$, the last equation gives $\mbox{Var}(x \mp y) = 0$ (the difference has zero variance when $\rho = 1$, the sum when $\rho = -1$), meaning $x \mp y$ = constant (with probability 1). Converting back to the original variables, this implies:

\begin{aligned} \frac{X - \mu_X}{\sigma_X} \mp \frac{Y - \mu_Y}{\sigma_Y} &=\mbox{constant} \\ \Rightarrow Y &=c\pm \frac{\sigma_Y}{\sigma_X} X \end{aligned}

i.e. when the two random variables are perfectly positively or negatively correlated, each can be written as a linear function of the other. Put differently, correlation captures the linear relationship between two random variables.
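The converse direction is easy to illustrate: if $Y$ is built as an exact linear function of $X$, the correlation comes out as $+1$ or $-1$ depending on the sign of the slope. A minimal sketch, with made-up data and an arbitrary intercept of 3 and slope of 2 chosen for illustration:

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Population variance (divides by n)."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def corr(u, v):
    """Population correlation coefficient."""
    mu, mv = mean(u), mean(v)
    c = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return c / sqrt(var(u) * var(v))

# Made-up sample, purely illustrative
X = [1.0, 2.0, 4.0, 7.0, 11.0]

Y_up = [3.0 + 2.0 * a for a in X]     # upward-sloping line
Y_down = [3.0 - 2.0 * a for a in X]   # downward-sloping line

rho_up = corr(X, Y_up)
rho_down = corr(X, Y_down)

# The magnitude of the slope is sigma_Y / sigma_X, as in the formula above
slope_ratio = sqrt(var(Y_up) / var(X))

print(rho_up, rho_down, slope_ratio)
```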

3. $\mathbf{\mbox{R}^2 = \rho^2}$

The regression coefficient $b$ in a simple linear regression $y = a + bx + \epsilon$ is given by:

\begin{aligned} b &=\dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \end{aligned}

with the $\mbox{R}^2$ given as (writing $\mathbb{E}[y]$ for the fitted value of $y$):

\begin{aligned} \mbox{R}^2 &=\dfrac{\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2}{\sum (y - \bar{y})^2}\end{aligned}

The correlation coefficient between $x$ and $y$ is given by:

\begin{aligned} \rho &=\dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}\end{aligned}

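Noting that $\mathbb{E}[y] = a + bx$ with intercept $a = \bar{y} - b\bar{x}$, so that $y - \mathbb{E}[y] = (y - \bar{y}) - b(x - \bar{x})$, it is a matter of simple algebra to verify that: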

\begin{aligned} \sum (y - \mathbb{E}[y])^2 &=\sum (y - \bar{y})^2 + b^2 \sum (x - \bar{x})^2 - 2b \sum (x - \bar{x})(y - \bar{y})\end{aligned}

Putting the two together, and using the formula for the regression coefficient, $b$, we have that:

\begin{aligned}\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2 &=2b \sum (x - \bar{x})(y - \bar{y}) - b^2 \sum (x - \bar{x})^2\\&= 2 \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \times \sum (x - \bar{x})(y - \bar{y}) \\ &\qquad - \dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\big(\sum (x - \bar{x})^2\big)^2}\times \sum (x - \bar{x})^2 \\&= \dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\sum (x - \bar{x})^2}\end{aligned}

(where, in the last term of the second step above, one factor of $\sum (x - \bar{x})^2$ cancels between numerator and denominator, simplifying the expression)

Finally, dividing both sides by $\sum (y - \bar{y})^2$ gives the definition of $\mbox{R}^2$ on the LHS and $\rho^2$ on the RHS, i.e.:

\begin{aligned}\dfrac{\sum (y - \bar{y})^2 - \sum (y - \mathbb{E}[y])^2}{\sum (y - \bar{y})^2} &=\dfrac{\big(\sum (x - \bar{x})(y - \bar{y})\big)^2}{\sum (x - \bar{x})^2\sum (y - \bar{y})^2} \\ \\ \Rightarrow \mbox{R}^2 &= \rho^2 \end{aligned}
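As a sanity check, the sketch below fits the regression by the formulas above on a made-up sample and compares $\mbox{R}^2$ with $\rho^2$. The data and the variable names (`s_xy`, `s_xx`, `s_yy`, `ss_res`) are illustrative assumptions, not part of the derivation.

```python
from math import sqrt

# Made-up sample, purely illustrative
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Sums of squares and cross-products about the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares slope and intercept
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual sum of squares around the fitted values a + b*x
fitted = [a + b * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r_squared = (s_yy - ss_res) / s_yy
rho = s_xy / sqrt(s_xx * s_yy)

print(r_squared, rho ** 2)  # the two numbers should agree up to rounding
```

Note that the equality relies on the regression including an intercept and having a single regressor, which is the setting of the derivation above.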

(This result also has a more concise and general proof, but uses linear algebra. See, for example, here.)