### Variance, correlation, and covariance review

#### Variance

•   $Variance = \frac{\sum (x-\bar{x})^2}{n-1} = \sigma^2$

library(UsingR)
x <- father.son$fheight
y <- father.son$sheight
v <- sum((x-mean(x))^2/(length(x)-1))
v
## [1] 7.534
v == sd(x)^2 & #equivalency of alternative definitions
v == var(x) #equivalency of r code var()
## [1] TRUE
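
Exact `==` comparisons of floating-point results can fail when two algebraically equivalent calculations round differently. A minimal sketch of a tolerance-based check follows, using made-up heights rather than the `father.son` data so it runs without `UsingR`:

```r
# Hypothetical toy data; any numeric vector would do
x <- c(65.0, 63.3, 65.0, 65.8, 61.1, 63.0)

# Sample variance from the definition (n - 1 in the denominator)
v <- sum((x - mean(x))^2) / (length(x) - 1)

# all.equal() compares within a small numerical tolerance;
# isTRUE() turns its result into a plain TRUE/FALSE
isTRUE(all.equal(v, var(x)))
isTRUE(all.equal(v, sd(x)^2))
```

Both checks should return `TRUE` even if the last few digits of the two calculations differ.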

#### Covariance

• $Covariance = \frac{\sum (x-\bar{x})(y-\bar{y})}{n-1}$

cv <- sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
cv
## [1] 3.873333
cv == cov(x,y) #equivalency of r code cov()
## [1] TRUE

#### Correlation

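• $Correlation = \frac{Cov(X,Y)}{\sigma_x\sigma_y}$, and $|Correlation| = \sqrt{R^2}$, where $R^2$ comes from the simple linear regression of one variable on the other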

cr <- cov(x,y)/(sd(x)*sd(y))
cr
## [1] 0.5013383
#round() because floating-point error differs between the calculations
round(cr,10) == round(sqrt(summary(lm(x~y))$r.squared),10)  &  #equivalency of alternative definitions
round(cr,10) == round(cor(x,y),10)  #equivalency of r code
## [1] TRUE
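
The match with $\sqrt{R^2}$ works here because fathers' and sons' heights are positively correlated. The square root only recovers the magnitude, so for a negative relationship the sign has to come from the regression slope. A small sketch on made-up data with a decreasing trend:

```r
# Hypothetical toy data with a decreasing trend
x <- c(1, 2, 3, 4, 5)
y <- c(10, 8, 7, 4, 1)

fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared

cor(x, y)                           # negative
sqrt(r2)                            # positive: the magnitude only
sign(coef(fit)[["x"]]) * sqrt(r2)   # sign restored from the slope
```

The first and third values should agree, while the second differs from them only in sign.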

#### Calculating the covariance matrix

• The covariance matrix is calculated from a data matrix in which each column contains the values of a different variable and each row represents an individual
• The covariance matrix contains a row for each variable and a column for each variable
• The intersection of a row and a column represents the covariance between the row variable and the column variable
• The diagonal represents the intersection of each variable with itself and therefore represents the variance
• Matrix operations allow covariance to be calculated for whole matrix rather than pairs of samples:
• Obtain the vector of means by a matrix operation: $M = \frac{1}{N}\left(\begin{array}{c}1\\1\\\vdots\\1\end{array}\right)^{T}\left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) = \frac{1}{N}A^{T}Y$, where $A$ is a column vector of $N$ ones
• $Y_1$ is a row vector holding the values of every variable (e.g. height, weight, etc.) for the first individual (data point)
• Similarly for each $Y_i$ through $Y_N$, so $Y$ has $N$ rows (individuals) and one column per variable
• Calculate the residual matrix $R$ by subtracting the mean of each variable from each instance of that variable: $R = Y - AM$ (the product $AM$ repeats the row of means $N$ times)
• Calculate the covariance matrix as the crossproduct of $R$ divided by $N-1$ (see Examples of Matrix Algebra for matrix calculation of variance): $S = \frac{1}{N-1}R^TR$
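
As a hand check of these steps, consider a hypothetical data set of $N = 3$ individuals and two variables (made-up numbers, not taken from `father.son`). Here $A$ is the column of three ones, and $AM$ repeats the row of means once per individual:

$$Y = \left(\begin{array}{cc}1 & 2\\3 & 6\\5 & 10\end{array}\right), \quad M = \frac{1}{3}A^{T}Y = \left(\begin{array}{cc}3 & 6\end{array}\right)$$

$$R = Y - AM = \left(\begin{array}{cc}-2 & -4\\0 & 0\\2 & 4\end{array}\right), \quad S = \frac{1}{3-1}R^{T}R = \frac{1}{2}\left(\begin{array}{cc}8 & 16\\16 & 32\end{array}\right) = \left(\begin{array}{cc}4 & 8\\8 & 16\end{array}\right)$$

The diagonal entries 4 and 16 are the variances of the two columns, and the off-diagonal entry 8 is their covariance.
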
z <- cbind(x,y) #creating the data matrix
head(z)
##             x        y
## [1,] 65.04851 59.77827
## [2,] 63.25094 63.21404
## [3,] 64.95532 63.34242
## [4,] 65.75250 62.79238
## [5,] 61.13723 64.28113
## [6,] 63.02254 64.24221
a <- matrix(rep(1, length(x)))
data.frame("dimension" = c("rows", "columns"), "Y" = dim(z), "A" = dim(a))
##   dimension    Y    A
## 1      rows 1078 1078
## 2   columns    2    1
m <- (t(a) %*% z)/length(x) #calculating the mean matrix
m
##            x        y
## [1,] 67.6871 68.68407
n <- matrix(rep(m,length(x)), nrow = length(x), byrow = TRUE) #repeating the row of means once for every row of the data matrix
r <- z - n
head(r)
##              x        y
## [1,] -2.638587 -8.90580
## [2,] -4.436157 -5.47003
## [3,] -2.731777 -5.34165
## [4,] -1.934597 -5.89169
## [5,] -6.549867 -4.40294
## [6,] -4.664557 -4.44186
v <- crossprod(r)/(length(x)-1) #calculating the covariance matrix
v
##          x        y
## x 7.534303 3.873333
## y 3.873333 7.922545
#round() because floating-point error differs between the calculations
#all() checks that every cell in the covariance matrix is equivalent
all(round(v,10) == round(cov(cbind(x,y)),10)) #equivalency of r code cov() on a single matrix with variables x and y, compared to cov() on separate vectors x and y above
## [1] TRUE
all(list(
v[1,1] == cov(x,x) & v[1,1] == var(x),
v[2,2] == cov(y,y) & v[2,2] == var(y),
v[1,2] == cov(x,y)))
## [1] TRUE
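
The same calculation can be written more compactly with base R helpers. The sketch below is one possible shortcut rather than the method used above: it swaps in `colMeans()` for the product with the column of ones and `sweep()` for the repeated mean matrix, and it uses a made-up 3 x 2 matrix so the result can be checked by hand.

```r
# Hypothetical toy data: 3 individuals (rows), 2 variables (columns)
Y <- cbind(x = c(1, 3, 5), y = c(2, 6, 10))
N <- nrow(Y)

# Column means; same values as (t(A) %*% Y) / N with A a column of ones
M <- colMeans(Y)

# Subtract each column's mean from that column to get the residual matrix
R <- sweep(Y, 2, M)

# Covariance matrix: crossproduct of the residuals divided by N - 1
S <- crossprod(R) / (N - 1)
S
##   x  y
## x 4  8
## y 8 16

isTRUE(all.equal(S, cov(Y)))
## [1] TRUE
```

Because `sweep()` subtracts by column directly, the intermediate matrix of repeated means never has to be built, which matters little here but saves memory on large data.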