Variance, correlation, and covariance review

 

Variance

  •   \[Variance = \frac{\sum (x-\bar{x})^2}{n-1} = \sigma^2\]

    library(UsingR)
    x <- father.son$fheight
    y <- father.son$sheight
    v <- sum((x-mean(x))^2/(length(x)-1))
    v
    ## [1] 7.534
    v == sd(x)^2 & #equivalency of alternative definitions
      v == var(x) #equivalency of r code var()
    ## [1] TRUE
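
As a small worked check of the formula above, take three made-up values \(x = (2, 4, 6)\) (illustrative only, not taken from father.son), so that \(\bar{x} = 4\) and \(n - 1 = 2\):

\[Variance = \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3-1} = \frac{4 + 0 + 4}{2} = 4\]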

     

Covariance

  • \[Covariance = \frac{\sum (x-\bar{x})(y-\bar{y})}{n-1}\]

    cv <- sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
    cv
    ## [1] 3.873333
    cv == cov(x,y) #equivalency of r code cov()
    ## [1] TRUE
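
The same calculation can be followed by hand on a tiny example. The sketch below uses made-up vectors `x_toy` and `y_toy` (hypothetical names and values, not part of father.son) and also illustrates two properties that are used later: covariance is symmetric, and the covariance of a variable with itself is its variance. `all.equal()` is used here as a tolerance-based comparison instead of `==`.

```r
# Hypothetical toy data, chosen only so the arithmetic is easy to follow
x_toy <- c(2, 4, 6)
y_toy <- c(1, 3, 8)

# Covariance from the definition: sum of cross-deviations over n - 1
# deviations of x: -2, 0, 2; deviations of y: -3, -1, 4; products: 6, 0, 8
cv_toy <- sum((x_toy - mean(x_toy)) * (y_toy - mean(y_toy))) / (length(x_toy) - 1)
cv_toy
## [1] 7

# Same value from cov(), and the order of the arguments does not matter
isTRUE(all.equal(cv_toy, cov(x_toy, y_toy)))
isTRUE(all.equal(cov(x_toy, y_toy), cov(y_toy, x_toy)))

# Covariance of a variable with itself is its variance
isTRUE(all.equal(cov(x_toy, x_toy), var(x_toy)))
```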

     

Correlation

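  • \[Correlation = \frac{Cov(X,Y)}{\sigma_x\sigma_y}, \qquad |Correlation| = \sqrt{R^2}\]
  • \(R^2\) is taken from the simple linear regression of one variable on the other; its square root gives only the magnitude of the correlation, and the sign is that of the covariance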

    cr <- cov(x,y)/(sd(x)*sd(y))
    cr
    ## [1] 0.5013383
    #round() b/c of floating point differences between the calculations
    round(cr,10) == round(sqrt(summary(lm(x~y))$r.squared),10)  &  #equivalency of alternative definitions
      round(cr,10) == round(cor(x,y),10)  #equivalency of r code cor()
    ## [1] TRUE
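
The check above works because the father and son heights are positively correlated. As a hedged illustration of why the sign matters, the sketch below uses made-up data with a decreasing trend (`x_neg` and `y_neg` are hypothetical names and values): the square root of \(R^2\) matches the absolute value of the correlation but not the correlation itself. It also uses `all.equal()` as one possible alternative to rounding both sides before comparing.

```r
# Hypothetical toy data with a decreasing trend (not from father.son)
x_neg <- c(1, 2, 3, 4, 5)
y_neg <- c(10, 8, 7, 4, 1)

r_neg  <- cor(x_neg, y_neg)                      # signed correlation
r2_neg <- summary(lm(y_neg ~ x_neg))$r.squared   # R^2 from the regression

r_neg < 0                                        # the trend is negative
isTRUE(all.equal(sqrt(r2_neg), abs(r_neg)))      # sqrt(R^2) matches |r|
isTRUE(all.equal(sqrt(r2_neg), r_neg))           # FALSE: it does not match r itself
```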

Calculating the covariance matrix

  • The covariance matrix is calculated from a data matrix in which each column contains the values for a different variable and each row represents an individual
  • The covariance matrix contains a row for each variable and a column for each variable
  • The intersection of a row and a column represents the covariance between the row variable and the column variable
  • The diagonal represents the intersection of each variable with itself and therefore holds the variances
  • Matrix operations allow covariance to be calculated for whole matrix rather than pairs of samples:
    • Obtain vector of means by matrix operation: \[M = \frac{1}{N}\left(\begin{array}{c}1\\1\\\vdots\\1\end{array}\right)^{T}\left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) = \frac{1}{N}A^{T}Y\]
      • \(Y_1\) is the row of values for one particular individual (data point); each entry is that individual’s value for a particular variable (e.g. height, weight, etc.)
      • Similarly for \(Y_2\) through \(Y_N\), so \(Y\) has one row per individual and one column per variable; \(A\) is a column of \(N\) ones, so \(M\) is a single row holding the mean of each variable
    • Calculate residual matrix \(R\) by subtracting the mean of each variable from each instance of that variable; because \(M\) is a single row, it is repeated for every individual by multiplying by \(A\) \[R = \left(\begin{array}{c}Y_{1}\\Y_{2}\\\vdots\\Y_{N}\end{array}\right) - AM\]
    • Calculate covariance matrix by crossproduct of \(R\) divided by \(N-1\) (see Examples of Matrix Algebra for matrix calculation of variance) \[S = \frac{1}{N-1}R^TR\]

    z <- cbind(x,y) #creating the data matrix
    head(z)
    ##             x        y
    ## [1,] 65.04851 59.77827
    ## [2,] 63.25094 63.21404
    ## [3,] 64.95532 63.34242
    ## [4,] 65.75250 62.79238
    ## [5,] 61.13723 64.28113
    ## [6,] 63.02254 64.24221
    a <- matrix(rep(1, length(x)))
    data.frame("dimension" = c("rows", "columns"), "Y" = dim(z), "A" = dim(a))
    ##   dimension    Y    A
    ## 1      rows 1078 1078
    ## 2   columns    2    1
    m <- (t(a) %*% z)/length(x) #calculating the mean matrix
    m
    ##            x        y
    ## [1,] 67.6871 68.68407
    #creating a matrix that repeats the mean matrix once for each row of the data matrix
    n <- matrix(rep(m,length(x)), nrow = length(x), byrow = TRUE)
    r <- z - n
    head(r)
    ##              x        y
    ## [1,] -2.638587 -8.90580
    ## [2,] -4.436157 -5.47003
    ## [3,] -2.731777 -5.34165
    ## [4,] -1.934597 -5.89169
    ## [5,] -6.549867 -4.40294
    ## [6,] -4.664557 -4.44186
    v <- crossprod(r)/(length(x)-1) #calculating the covariance matrix
    v
    ##          x        y
    ## x 7.534303 3.873333
    ## y 3.873333 7.922545
    #round() b/c of floating point differences between the calculations
    #all() checks that every cell in the covariance matrix is equivalent
    all(round(v,10) == round(cov(cbind(x,y)),10)) #equivalency of r code cov() on a single matrix with variables x and y, compared to cov() on separate vectors x and y above
    ## [1] TRUE
    all(c(
      v[1,1] == cov(x,x) & v[1,1] == var(x), #diagonal: variance of x
      v[2,2] == cov(y,y) & v[2,2] == var(y), #diagonal: variance of y
      v[1,2] == cov(x,y))) #off-diagonal: covariance of x and y
    ## [1] TRUE
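
The steps above build the repeated mean matrix explicitly. As a minimal sketch of a shorter route, assuming base R's `sweep()` and `colMeans()` are acceptable substitutes for the matrix of ones, the columns can be centered in one call. The data matrix `z_toy` below is made up for illustration (hypothetical names and values). The last line uses `cov2cor()` to rescale the covariance matrix into a correlation matrix, which ties the result back to the correlation formula.

```r
# Hypothetical toy data matrix: 5 individuals (rows), 2 variables (columns)
z_toy <- cbind(x = c(1, 2, 3, 4, 5),
               y = c(2, 1, 4, 3, 6))

# Center each column by subtracting its mean; sweep() applies the
# vector of column means down every row (default FUN is "-")
r_toy <- sweep(z_toy, 2, colMeans(z_toy))

# Covariance matrix: crossproduct of the residuals divided by n - 1
s_toy <- crossprod(r_toy) / (nrow(z_toy) - 1)

isTRUE(all.equal(s_toy, cov(z_toy)))            # matches cov() on the matrix
isTRUE(all.equal(cov2cor(s_toy), cor(z_toy)))   # rescaled to correlations
```

Comparing with `all.equal()` avoids rounding both sides by hand; it tests equality up to a small numerical tolerance rather than exact equality.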