Variance and Covariance
Variance, correlation, and covariance review
Variance
\[Variance = \frac{\sum (x-\bar{x})^2}{n-1} = \sigma^2\]
library(UsingR)
x <- father.son$fheight
y <- father.son$sheight
v <- sum((x-mean(x))^2/(length(x)-1))
v
## [1] 7.534
v == sd(x)^2 & #equivalency of alternative definitions
  v == var(x) #equivalency of r code var()
## [1] TRUE
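As a hand check of the formula, here is a minimal sketch on a made-up three-value vector. The vector `w` and the other names are hypothetical and not part of the `father.son` data.

```r
# Hypothetical toy data, chosen so the arithmetic is easy to follow
w <- c(2, 4, 6)
dev_w <- w - mean(w)                    # deviations from the mean: -2, 0, 2
sq_dev_w <- dev_w^2                     # squared deviations: 4, 0, 4
v_w <- sum(sq_dev_w) / (length(w) - 1)  # 8 / 2
v_w
# all.equal() compares within floating point tolerance
isTRUE(all.equal(v_w, var(w)))
```

The sum of squared deviations is 8 and n - 1 is 2, so the variance is 4, matching `var(w)`.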
Covariance
\[Covariance = \frac{\sum (x-\bar{x})(y-\bar{y})}{n-1}\]
cv <- sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
cv
## [1] 3.873333
cv == cov(x,y) #equivalency of r code cov()
## [1] TRUE
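The same kind of hand check works for covariance. This is a sketch on made-up vectors `w` and `u` (hypothetical values, not from the data set).

```r
# Hypothetical toy data
w <- c(2, 4, 6)
u <- c(1, 3, 8)
dev_w <- w - mean(w)     # -2, 0, 2
dev_u <- u - mean(u)     # -3, -1, 4
prod_dev <- dev_w * dev_u            # products of paired deviations: 6, 0, 8
cv_toy <- sum(prod_dev) / (length(w) - 1)   # 14 / 2
cv_toy
isTRUE(all.equal(cv_toy, cov(w, u)))
```

The products sum to 14 and n - 1 is 2, so the covariance is 7, matching `cov(w, u)`.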
Correlation
\[Correlation = \frac{Cov(X,Y)}{\sigma_x\sigma_y}, \qquad |Correlation| = \sqrt{R^2}\]
cr <- cov(x,y)/(sd(x)*sd(y))
cr
## [1] 0.5013383
#round() b/c floating point issue in diff calculations
round(cr,10) == round(sqrt(summary(lm(x~y))$r.squared),10) & #equivalency of alternative definitions
  round(cr,10) == round(cor(x,y),10) #equivalency of r code
## [1] TRUE
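The comparison with \(\sqrt{R^2}\) works here because father and son heights are positively correlated. The square root is never negative, so for a negative relationship it only recovers the magnitude of the correlation. A sketch on made-up data (the vectors `p` and `q` are hypothetical):

```r
# Hypothetical data with a decreasing relationship
p <- c(1, 2, 3, 4, 5)
q <- c(10, 8, 7, 4, 1)
r_pq <- cov(p, q) / (sd(p) * sd(q))             # negative correlation
root_r2 <- sqrt(summary(lm(q ~ p))$r.squared)   # always non-negative
r_pq < 0
isTRUE(all.equal(root_r2, abs(r_pq)))           # matches in magnitude only
# the sign can be recovered from the covariance
isTRUE(all.equal(sign(cov(p, q)) * root_r2, r_pq))
```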
Calculating the covariance matrix
- Covariance matrix is calculated from a matrix where each column contains the values for a different variable and each row represents an individual
- The covariance matrix contains a row for each variable and a column for each variable
- The intersection of a row and a column represents the covariance between the row variable and the column variable
- The diagonal represents the intersection of each variable with itself and therefore represents the variance
- Matrix operations allow covariance to be calculated for whole matrix rather than pairs of samples:
- Obtain vector of means by matrix operation: \[M = \frac{1}{N}\left(\begin{array}{c}1\\1\\\vdots\\1\end{array}\right)^{T}\left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) = \frac{1}{N}A^{T}Y\]
- \(Y_1\) is the row vector of values for one individual (data point); each entry is that individual's value for a particular variable (e.g. height, weight, etc.)
- Similarly for \(Y_2\) through \(Y_N\)
- Calculate residual matrix \(R\) by subtracting the mean of each variable from each instance of that variable; multiplying by \(A\) repeats the row of means \(M\) once per individual so the dimensions match \[R = \left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) - AM\]
- Calculate covariance matrix by crossproduct of \(R\) divided by \(N-1\) (see Examples of Matrix Algebra for matrix calculation of variance) \[S = \frac{1}{N-1}R^TR\]
z <- cbind(x,y) #creating the data matrix
head(z)
##             x        y
## [1,] 65.04851 59.77827
## [2,] 63.25094 63.21404
## [3,] 64.95532 63.34242
## [4,] 65.75250 62.79238
## [5,] 61.13723 64.28113
## [6,] 63.02254 64.24221
a <- matrix(rep(1, length(x)))
data.frame("dimension" = c("rows", "columns"), "Y" = dim(z), "A" = dim(a))
##   dimension    Y    A
## 1      rows 1078 1078
## 2   columns    2    1
m <- (t(a) %*% z)/length(x) #calculating the mean matrix
m
##            x        y
## [1,] 67.6871 68.68407
#creating a matrix that repeats the mean matrix for the same number of rows as the data matrix
n <- matrix(rep(m,length(x)), nrow = length(x), byrow = TRUE)
r <- z - n
head(r)
##              x        y
## [1,] -2.638587 -8.90580
## [2,] -4.436157 -5.47003
## [3,] -2.731777 -5.34165
## [4,] -1.934597 -5.89169
## [5,] -6.549867 -4.40294
## [6,] -4.664557 -4.44186
v <- crossprod(r)/(length(x)-1) #calculating the covariance matrix
v
##          x        y
## x 7.534303 3.873333
## y 3.873333 7.922545
#round() b/c floating point issue in diff calculations
#all() checks if every cell in covariance matrix is equivalent
#equivalency of r code cov() on a single matrix with variables x and y compared to above using cov() on separate vectors x and y
all(round(v,10) == round(cov(cbind(x,y)),10))
## [1] TRUE
#c() rather than list() so all() receives a logical vector
all(c(
  v[1,1] == cov(x,x) & v[1,1] == var(x),
  v[2,2] == cov(y,y) & v[2,2] == var(y),
  v[1,2] == cov(x,y)))
## [1] TRUE
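The same centering and crossproduct steps can also be written with base R helpers instead of an explicit matrix of ones. This is a sketch on a made-up 4 x 2 matrix; the matrix `d`, its values, and its column names are hypothetical. `colMeans()` plays the role of \(M\), and `sweep()` subtracts each column's mean from that column.

```r
# Hypothetical data matrix: rows are individuals, columns are variables
d <- cbind(height = c(60, 62, 67, 71), weight = c(110, 125, 150, 175))
means <- colMeans(d)                 # vector of column means
res <- sweep(d, 2, means)            # residuals; sweep() subtracts by default
s <- crossprod(res) / (nrow(d) - 1)  # t(res) %*% res / (N - 1)
s
isTRUE(all.equal(s, cov(d)))
# cov2cor() rescales a covariance matrix into a correlation matrix
isTRUE(all.equal(cov2cor(s), cor(d)))
```

Using `all.equal()` here avoids the rounding step, since it compares within a floating point tolerance.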