Differential gene expression with RNA-seq data

Modeling raw counts

A note on normalizing data

Data normalization is done to determine is differences between data is due to reasons beyond biological and technical variance
In previous notes, data was normalized to size factors in order to equalize the effects of high count and low count genes for analysis at the genome/transcriptome level
- When comparing two whole transcription profiles, want to make sure genes with particular levels of expression aren’t dominating the comparison
For analysis at the gene level we do not need to normalize the same way, since the expression levels of other genes do not directly affect the comparison between samples for a single gene
For gene level comparisons, we normalize by modeling feature counts from raw counts using the negative binomial distribution
- The negative binomial distribution is a dispersed variation of a Poisson distribution

Poisson distribution

[Reminder] - The Poisson distribution is a discrete probability distribution modeling the probability of obtaining a certain “count” given an “average count” or “rate”
The law of small numbers says that the sum of binary outcomes with a large \(N\) and small \(p\) can be modeled by a Poisson distribution
- RNAseq data fits this description since there are many outcomes (genes) and a small probability for each (% of individual gene out of all genes)
- For a single sample, the genes in the sample all follow a Poisson distribution
So, in a pool of transcripts, every gene follows a Poisson distribution with a rate relative to how much of that specific transcript is available in the sample
A result of this is that, for low-expression genes, the ratios of expression between samples are going to be much more variable than ratios of high expression genes
However, this variability is due only to sampling variablity and does not take into account experimental or biological variability
- This can be seen when comparing mean expression to expression variance
- For a true poisson variable, variance is equal to the mean
```
x <- rpois(10000,lambda=100)
mean(x); sd(x)^2
```
```
## [1] 100.0772
```
```
## [1] 99.617
```
- However, with biological data, the variance usually exceeds the mean, indicating additional sources of variability
- When comparing ACROSS samples, genes in the samples do not follow a simple Poisson distribution

Negative binomial distribution

The negative binomial distribution can be formalized as \(K_{ij} \sim \text{NB}(\mu_{ij}, \alpha_i)\)
For RNA-seq data, more specifically formalized as \(K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i )\)
- \(K_{ij}\) is the raw count of a particular gene \(i\) for a particular sample \(j\)
- \(\text{NB}\) is the negative binomial distribution
- \(s_{ij}\) is a normalization factor (e.g. size factor) for gene \(i\) in sample \(j\)
- \(q_{ij}\) is the modeled count to be used for comparison for gene \(i\) in sample \(j\)
  - \(s_{ij}q_{ij}\) is equivalent the the NB mean \(\mu_{ij}\)
- \(\alpha_i\) is the dispersion factor

The dispersion factor \(\alpha_i\) determines how far the variation of the negative binomial distirbution exceeds that of the equivalent Poisson distribution

library(rafalib)
mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,lambda=100),
     main="Poisson / NB, disp=0",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.01), #size = 1/dispersion
     main="NB, disp = 0.01",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.1),
     main="NB, disp = 0.1",xlab="",breaks=brks,col="black")

The variance of the negative binomial is equivalent to \(\mu + \mu^2*\alpha\)

x <- rpois(10000,lambda =100) #Equivalent to NB with dispersion = 0
#mean; dispersion; variance
mean(x); (sd(x)/mean(x))^2; mean(x) + (mean(x)^2)*(0)

## [1] 99.9688

## [1] 0.01006731

## [1] 99.9688

x <- rnbinom(10000,mu=100,size=1/.1)
#mean; dispersion; variance
mean(x); (sd(x)/mean(x))^2; mean(x) + (mean(x)^2)*(0.1)

## [1] 99.6975

## [1] 0.1114373

## [1] 1093.657

Why model \(q_{ij}\) instead of just normalize it to some standardization factor?
- Alternatives:
  1. Negative binomial model = \(K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i )\)
  2. Basic standardization = \(\frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij})\)
    - Here \(q_{ij}\) is just the result of dividing raw counts (\(K_{ij}\)) by a standardization factor (\(s_{ij}\))
- When a Poisson variable is scaled directly by a factor, it changes both the mean and the variance
  - So while scaling by a size factor my normalize the mean, it will artificially skew the variance
    - Variables that are scaled up will have increased variance and variables that are scaled down will have reduced variance
```
mypar(3,1)
n <- 10000
brks <- 0:300
###All three distributions have the same mean
hist(rpois(n,100),main="",xlab="",breaks=brks,col="black") #Base 100
hist(rpois(n,1000)/10,main="",xlab="",breaks=brks,col="black") #1000 transformed to 100
hist(rpois(n,10)*10,main="",xlab="",breaks=brks,col="black") #10 transformed to 100
```
```
data.frame("Base.100" = c(mean(rpois(n,100)), sd(rpois(n,100))),
           "1000.to.100" = c(mean(rpois(n,1000)/10), sd(rpois(n,1000)/10)),
           "10.to.100" = c(mean(rpois(n,10)*10), sd(rpois(n,10)*10)),
           row.names = c("mean", "sd"))
```
```
##        Base.100 X1000.to.100 X10.to.100
## mean 100.038300    99.981190   99.76400
## sd     9.943501     3.150362   31.86749
```
- Once you introduce the possibility of scaled counts/variance, it becomes unclear what the true distribution of a given variable is
  - I.e. a gene with a mean of 100 could derive from any of the three plots above (or more), even though only one would be the true distribution

Differential gene expression with DESeq2

Initializing DESeq2 workflow

library(airway)
library(DESeq2)
data("airway")
dds <- DESeqDataSet(airway, design= ~ cell + dex)

Differential expression output

library(DESeq2)
dds$dex <- relevel(dds$dex, "untrt") #Ensures that "untrt" is the base level for expression
dds <- DESeq(dds)
res <- results(dds)
head(res)

## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
##                    baseMean log2FoldChange      lfcSE       stat
##                   <numeric>      <numeric>  <numeric>  <numeric>
## ENSG00000000003 708.6021697    -0.37415246 0.09884435 -3.7852692
## ENSG00000000005   0.0000000             NA         NA         NA
## ENSG00000000419 520.2979006     0.20206175 0.10974241  1.8412367
## ENSG00000000457 237.1630368     0.03616686 0.13834540  0.2614244
## ENSG00000000460  57.9326331    -0.08445399 0.24990710 -0.3379415
## ENSG00000000938   0.3180984    -0.08413903 0.15133427 -0.5559813
##                       pvalue        padj
##                    <numeric>   <numeric>
## ENSG00000000003 0.0001535422 0.001289269
## ENSG00000000005           NA          NA
## ENSG00000000419 0.0655868755 0.197066699
## ENSG00000000457 0.7937652378 0.913855995
## ENSG00000000460 0.7354072485 0.884141561
## ENSG00000000938 0.5782236283          NA

MA plots

MA plots show each gene by the mean of its counts across all sample vs. it’s fold change between conditions
```
plotMA(res, ylim=c(-4,4))
```
Red dots indicate genes with an adjusted p-value that has acheived significance
- The default alpha is 0.01
- Alpha can be changes in the results function with the argument alpha
```
library(rafalib)
mypar(1,2)
res2 <- results(dds, alpha=0.05)
plotMA(res, ylim=c(-4,4), main = "alpha=0.1")
plotMA(res2, ylim=c(-4,4), main = "alpha=0.05")
```
MA plots can also be adjusted to only display significiant differential expression that reaches a fold-change threshold
- Normally, all fold changes which are significant are indicated
- lfcThreshold argument in results can specify a cut off +“lfc” stands for log fold change
```
library(rafalib)
mypar(1,2)
res3 <- results(dds, lfcThreshold = 1)
plotMA(res, ylim=c(-4,4), main = "no thresh")
plotMA(res3, ylim=c(-4,4), main = "lfcTresh=1")
```

Individual genes

*Sorting genes by p-value can highlight important genes

```r
resSort <- res[order(res$padj),]
head(resSort)
```

```
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat
##                  <numeric>      <numeric> <numeric> <numeric>
## ENSG00000152583   997.4398       4.313961 0.1721375  25.06114
## ENSG00000165995   495.0929       3.186823 0.1281565  24.86665
## ENSG00000101347 12703.3871       3.618734 0.1489434  24.29604
## ENSG00000120129  3409.0294       2.871488 0.1182491  24.28337
## ENSG00000189221  2341.7673       3.230395 0.1366746  23.63567
## ENSG00000211445 12285.6151       3.553360 0.1579821  22.49216
##                        pvalue          padj
##                     <numeric>     <numeric>
## ENSG00000152583 1.320082e-138 2.379844e-134
## ENSG00000165995 1.708459e-136 1.540005e-132
## ENSG00000101347 2.158519e-130 1.297126e-126
## ENSG00000120129 2.937761e-130 1.324049e-126
## ENSG00000189221 1.657244e-123 5.975359e-120
## ENSG00000211445 4.952520e-112 1.488067e-108
```

Can plot expression levels between groups for individual genes

library(ggplot2)
top.gene  <- which.min(res$padj) #selects gene with lowest adjusted pvalue
data <- plotCounts(dds, gene=top.gene, intgroup=c("dex","cell"), returnData=TRUE) #extracts expression data from top.gene
    #If returnData=F, would generate a base graphics plot itself
ggplot(data, aes(x=dex, y=count, col=cell, group=cell)) +
  geom_point() + geom_line() + scale_y_log10()

Heatmap of top genes

```r
# library(pheatmap)
# topgenes <- head(rownames(resSort),20)
# mat <- assay(rld)[topgenes,]
# mat <- mat - rowMeans(mat)
# df <- as.data.frame(colData(dds)[,c("dex","cell")])
# pheatmap(mat, annotation_col=df)
```

4 - Differential gene expression with RNA-seq data

Modeling raw counts

Differential gene expression with DESeq2

Finding hidden structure