Variance, correlation, and covariance review

Variance

\[Variance = \frac{\sum (x-\bar{x})^2}{n-1} = \sigma^2\]

library(UsingR)
x <- father.son$fheight
y <- father.son$sheight
v <- sum((x-mean(x))^2)/(length(x)-1) #sum of squared deviations divided by n-1
v

## [1] 7.534

v == sd(x)^2 & #equivalent to the squared standard deviation
  v == var(x) #equivalent to R's built-in var()

## [1] TRUE
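Exact `==` comparisons of floating-point results can fail even when two formulas agree mathematically. A minimal sketch of a tolerance-based check follows, assuming simulated heights (`h` is hypothetical data, not the `father.son` data set) so that it runs with base R alone:

```r
# Simulated heights (hypothetical data, used only for illustration)
set.seed(1)
h <- rnorm(100, mean = 68, sd = 2.7)

# Sample variance from the definition, with the n - 1 denominator
v_manual <- sum((h - mean(h))^2) / (length(h) - 1)

# all.equal() compares within a small numerical tolerance;
# isTRUE() turns its result into a single TRUE/FALSE
same_as_var <- isTRUE(all.equal(v_manual, var(h)))
same_as_sd  <- isTRUE(all.equal(v_manual, sd(h)^2))

same_as_var
same_as_sd
```

Both checks should report agreement, and they do not depend on the last few bits of the floating-point result.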

Covariance

\[Covariance = \frac{\sum (x-\bar{x})(y-\bar{y})}{n-1}\]

cv <- sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
cv

## [1] 3.873333

cv == cov(x,y) #equivalent to R's built-in cov()

## [1] TRUE

Correlation

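\[Correlation = \frac{Cov(X,Y)}{\sigma_x\sigma_y} = \pm\sqrt{R^2}\]

The sign of the correlation matches the sign of the covariance; for the positively correlated heights used here it equals \(+\sqrt{R^2}\).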

cr <- cov(x,y)/(sd(x)*sd(y))
cr

## [1] 0.5013383

#round() because the different calculations differ slightly in floating point
round(cr,10) == round(sqrt(summary(lm(x~y))$r.squared),10)  &  #equivalency of alternative definitions
  round(cr,10) == round(cor(x,y),10)  #equivalency of R code cor()

## [1] TRUE
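The square root of \(R^2\) is never negative, so it recovers only the magnitude of the correlation. The sketch below illustrates this with the built-in `mtcars` data set (an assumption made here for illustration; it is not part of the father and son example), where weight and fuel economy are negatively correlated:

```r
# mtcars ships with base R; heavier cars tend to have lower mpg
fit   <- lm(mpg ~ wt, data = mtcars)
r_neg <- cor(mtcars$wt, mtcars$mpg)
r2    <- summary(fit)$r.squared
slope <- unname(coef(fit)["wt"])

# The correlation is negative, so sqrt(R^2) matches only its absolute value
r_is_negative   <- r_neg < 0
magnitude_match <- isTRUE(all.equal(sqrt(r2), abs(r_neg)))

# The sign can be recovered from the regression slope
signed_match <- isTRUE(all.equal(sign(slope) * sqrt(r2), r_neg))

r_is_negative
magnitude_match
signed_match
```

For a simple regression with an intercept, multiplying \(\sqrt{R^2}\) by the sign of the slope gives the correlation with its sign.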

Calculating covariance matrix?

The covariance matrix is calculated from a data matrix in which each column contains the values for a different variable and each row represents an individual
The covariance matrix contains a row for each variable and a column for each variable
The intersection of a row and a column represents the covariance between the row variable and the column variable
The diagonal represents the intersection of each variable with itself and therefore represents the variance
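These properties can be checked directly. The following is a small sketch that assumes three columns of the built-in `mtcars` data set stand in for the variables (`dat` and `S` are names chosen here for illustration):

```r
# Data matrix: one row per car (individual), one column per variable
dat <- as.matrix(mtcars[, c("mpg", "wt", "hp")])

# cov() on a matrix returns the full covariance matrix
S <- cov(dat)

# One row and one column per variable
dims_ok <- identical(dim(S), c(3L, 3L))

# cov(a, b) equals cov(b, a), so the matrix is symmetric
symmetric_ok <- isSymmetric(S)

# The diagonal holds each variable's variance
diag_ok <- isTRUE(all.equal(unname(diag(S)), unname(apply(dat, 2, var))))

# An off-diagonal cell holds the covariance of the row and column variables
cell_ok <- isTRUE(all.equal(S["mpg", "wt"], cov(dat[, "mpg"], dat[, "wt"])))

c(dims_ok, symmetric_ok, diag_ok, cell_ok)
```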

Matrix operations allow covariance to be calculated for a whole matrix rather than for pairs of samples:

Obtain the vector of means by matrix operation: \[M = \frac{1}{N}\left(\begin{array}{c}1\\1\\\vdots\\1\end{array}\right)^{T}\left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) = \frac{1}{N}A^{T}Y\]
- \(A\) is a column vector of \(N\) ones and \(Y\) is the data matrix with \(N\) rows
- \(Y_1\) is the row of values for the first individual (data point); each entry is that individual's value for a particular variable (e.g. height, weight, etc.)
- Similarly for each \(Y_i\) through \(Y_N\)
Calculate the residual matrix \(R\) by subtracting the mean of each variable from each instance of that variable, where \(AM\) repeats the row of means \(M\) once for each of the \(N\) rows \[R = \left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) - AM\]
Calculate the covariance matrix by the crossproduct of \(R\) divided by \(N-1\) (see Examples of Matrix Algebra for matrix calculation of variance) \[S = \frac{1}{N-1}R^TR\]

z <- cbind(x,y) #creating the data matrix
head(z)

##             x        y
## [1,] 65.04851 59.77827
## [2,] 63.25094 63.21404
## [3,] 64.95532 63.34242
## [4,] 65.75250 62.79238
## [5,] 61.13723 64.28113
## [6,] 63.02254 64.24221

a <- matrix(rep(1, length(x)))
data.frame("dimension" = c("rows", "columns"), "Y" = dim(z), "A" = dim(a))

##   dimension    Y    A
## 1      rows 1078 1078
## 2   columns    2    1

m <- (t(a) %*% z)/length(x) #calculating the mean matrix
m

##            x        y
## [1,] 67.6871 68.68407

n <- matrix(rep(m,length(x)), nrow = length(x), byrow = TRUE) #repeating the row of means once for each row of the data matrix
r <- z - n
head(r)

##              x        y
## [1,] -2.638587 -8.90580
## [2,] -4.436157 -5.47003
## [3,] -2.731777 -5.34165
## [4,] -1.934597 -5.89169
## [5,] -6.549867 -4.40294
## [6,] -4.664557 -4.44186

v <- crossprod(r)/(length(x)-1) #calculating the covariance matrix
v

##          x        y
## x 7.534303 3.873333
## y 3.873333 7.922545

#round() because the different calculations differ slightly in floating point
#all() checks that every cell in the covariance matrix is equivalent
all(round(v,10) == round(cov(cbind(x,y)),10)) #equivalency of R code cov() on a single matrix with variables x and y, compared to using cov() on the separate vectors x and y above

## [1] TRUE
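The centering step can also be written without building a repeated matrix of means. The sketch below uses the base R functions `scale()` and `sweep()` on simulated data (the matrix `Y` and its column names are hypothetical, chosen only for illustration):

```r
# Simulated data matrix: 50 individuals (rows), 3 variables (columns)
set.seed(42)
Y <- cbind(a = rnorm(50), b = rnorm(50, mean = 10, sd = 3), c = runif(50))
N <- nrow(Y)

# Two equivalent ways to subtract each column's mean from that column
R_scale <- scale(Y, center = TRUE, scale = FALSE)
R_sweep <- sweep(Y, 2, colMeans(Y))  # default FUN is "-"

# Same residuals either way (ignoring the extra attribute scale() attaches)
centering_match <- isTRUE(all.equal(R_scale, R_sweep, check.attributes = FALSE))

# Covariance matrix as the crossproduct of residuals divided by N - 1
S_manual  <- crossprod(R_scale) / (N - 1)
cov_match <- isTRUE(all.equal(S_manual, cov(Y)))

centering_match
cov_match
```

`crossprod(R)` computes `t(R) %*% R`, so this follows the same formula as the explicit construction.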

#diagonal cells hold the variances, off-diagonal cells hold the covariance
all(c(
  v[1,1] == cov(x,x) & v[1,1] == var(x),
  v[2,2] == cov(y,y) & v[2,2] == var(y),
  v[1,2] == cov(x,y)))

## [1] TRUE
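The correlation formula from the review carries over to the matrix case: dividing each covariance by the product of the two standard deviations turns a covariance matrix into a correlation matrix. A minimal sketch on simulated data follows (`u` and `w` are hypothetical variables, not the height data):

```r
# Two simulated, positively related variables
set.seed(7)
u <- rnorm(200)
w <- 0.5 * u + rnorm(200)

S <- cov(cbind(u, w))

# Standard deviations are the square roots of the diagonal (the variances)
d <- sqrt(diag(S))

# Divide cell (i, j) by sd_i * sd_j
C_manual <- S / outer(d, d)

# Compare with the built-in conversions
cov2cor_match <- isTRUE(all.equal(C_manual, cov2cor(S)))
cor_match     <- isTRUE(all.equal(C_manual, cor(cbind(u, w))))

cov2cor_match
cor_match
```

The diagonal of the resulting matrix is 1, because each variable's variance is divided by the square of its own standard deviation.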
