4 - Differential gene expression with RNA-seq data
Modeling raw counts
A note on normalizing data
- Data normalization is done to determine is differences between data is due to reasons beyond biological and technical variance
- In previous notes, data was normalized to
size factors
in order to equalize the effects of high count and low count genes for analysis at the genome/transcriptome level- When comparing two whole transcription profiles, want to make sure genes with particular levels of expression aren’t dominating the comparison
- For analysis at the gene level we do not need to normalize the same way, since the expression levels of other genes do not directly affect the comparison between samples for a single gene
- For gene level comparisons, we normalize by modeling feature counts from raw counts using the negative binomial distribution
- The negative binomial distribution is a dispersed variation of a Poisson distribution
Poisson distribution
- [Reminder] - The Poisson distribution is a discrete probability distribution modeling the probability of obtaining a certain “count” given an “average count” or “rate”
- The law of small numbers says that the sum of binary outcomes with a large \(N\) and small \(p\) can be modeled by a Poisson distribution
- RNAseq data fits this description since there are many outcomes (genes) and a small probability for each (% of individual gene out of all genes)
- For a single sample, the genes in the sample all follow a Poisson distribution
- So, in a pool of transcripts, every gene follows a Poisson distribution with a rate relative to how much of that specific transcript is available in the sample
- A result of this is that, for low-expression genes, the ratios of expression between samples are going to be much more variable than ratios of high expression genes
- However, this variability is due only to sampling variablity and does not take into account experimental or biological variability
- This can be seen when comparing mean expression to expression variance
For a true poisson variable, variance is equal to the mean
x <- rpois(10000,lambda=100) mean(x); sd(x)^2
## [1] 100.0772
## [1] 99.617
- However, with biological data, the variance usually exceeds the mean, indicating additional sources of variability
When comparing ACROSS samples, genes in the samples do not follow a simple Poisson distribution
Negative binomial distribution
- The negative binomial distribution can be formalized as \(K_{ij} \sim \text{NB}(\mu_{ij}, \alpha_i)\)
- For RNA-seq data, more specifically formalized as \(K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i )\)
- \(K_{ij}\) is the raw count of a particular gene \(i\) for a particular sample \(j\)
- \(\text{NB}\) is the negative binomial distribution
- \(s_{ij}\) is a normalization factor (e.g. size factor) for gene \(i\) in sample \(j\)
- \(q_{ij}\) is the modeled count to be used for comparison for gene \(i\) in sample \(j\)
- \(s_{ij}q_{ij}\) is equivalent the the NB mean \(\mu_{ij}\)
- \(\alpha_i\) is the dispersion factor
The dispersion factor \(\alpha_i\) determines how far the variation of the negative binomial distirbution exceeds that of the equivalent Poisson distribution
library(rafalib) mypar(3,1) n <- 10000 brks <- 0:400 hist(rpois(n,lambda=100), main="Poisson / NB, disp=0",xlab="",breaks=brks,col="black") hist(rnbinom(n,mu=100,size=1/.01), #size = 1/dispersion main="NB, disp = 0.01",xlab="",breaks=brks,col="black") hist(rnbinom(n,mu=100,size=1/.1), main="NB, disp = 0.1",xlab="",breaks=brks,col="black")
The variance of the negative binomial is equivalent to \(\mu + \mu^2*\alpha\)
x <- rpois(10000,lambda =100) #Equivalent to NB with dispersion = 0 #mean; dispersion; variance mean(x); (sd(x)/mean(x))^2; mean(x) + (mean(x)^2)*(0)
## [1] 99.9688
## [1] 0.01006731
## [1] 99.9688
x <- rnbinom(10000,mu=100,size=1/.1) #mean; dispersion; variance mean(x); (sd(x)/mean(x))^2; mean(x) + (mean(x)^2)*(0.1)
## [1] 99.6975
## [1] 0.1114373
## [1] 1093.657
- Why model \(q_{ij}\) instead of just normalize it to some standardization factor?
- Alternatives:
- Negative binomial model = \(K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i )\)
- Basic standardization = \(\frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij})\)
- Here \(q_{ij}\) is just the result of dividing raw counts (\(K_{ij}\)) by a standardization factor (\(s_{ij}\))
- When a Poisson variable is scaled directly by a factor, it changes both the mean and the variance
- So while scaling by a size factor my normalize the mean, it will artificially skew the variance
- Variables that are scaled up will have increased variance and variables that are scaled down will have reduced variance
- So while scaling by a size factor my normalize the mean, it will artificially skew the variance
mypar(3,1) n <- 10000 brks <- 0:300 ###All three distributions have the same mean hist(rpois(n,100),main="",xlab="",breaks=brks,col="black") #Base 100 hist(rpois(n,1000)/10,main="",xlab="",breaks=brks,col="black") #1000 transformed to 100 hist(rpois(n,10)*10,main="",xlab="",breaks=brks,col="black") #10 transformed to 100
data.frame("Base.100" = c(mean(rpois(n,100)), sd(rpois(n,100))), "1000.to.100" = c(mean(rpois(n,1000)/10), sd(rpois(n,1000)/10)), "10.to.100" = c(mean(rpois(n,10)*10), sd(rpois(n,10)*10)), row.names = c("mean", "sd"))
## Base.100 X1000.to.100 X10.to.100 ## mean 100.038300 99.981190 99.76400 ## sd 9.943501 3.150362 31.86749
- Once you introduce the possibility of scaled counts/variance, it becomes unclear what the true distribution of a given variable is
- I.e. a gene with a mean of 100 could derive from any of the three plots above (or more), even though only one would be the true distribution
- Alternatives:
Differential gene expression with DESeq2
Initializing DESeq2 workflow
library(airway)
library(DESeq2)
data("airway")
dds <- DESeqDataSet(airway, design= ~ cell + dex)
Differential expression output
library(DESeq2)
dds$dex <- relevel(dds$dex, "untrt") #Ensures that "untrt" is the base level for expression
dds <- DESeq(dds)
res <- results(dds)
head(res)
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.6021697 -0.37415246 0.09884435 -3.7852692
## ENSG00000000005 0.0000000 NA NA NA
## ENSG00000000419 520.2979006 0.20206175 0.10974241 1.8412367
## ENSG00000000457 237.1630368 0.03616686 0.13834540 0.2614244
## ENSG00000000460 57.9326331 -0.08445399 0.24990710 -0.3379415
## ENSG00000000938 0.3180984 -0.08413903 0.15133427 -0.5559813
## pvalue padj
## <numeric> <numeric>
## ENSG00000000003 0.0001535422 0.001289269
## ENSG00000000005 NA NA
## ENSG00000000419 0.0655868755 0.197066699
## ENSG00000000457 0.7937652378 0.913855995
## ENSG00000000460 0.7354072485 0.884141561
## ENSG00000000938 0.5782236283 NA
MA plots
MA plots show each gene by the mean of its counts across all sample vs. it’s fold change between conditions
plotMA(res, ylim=c(-4,4))
- Red dots indicate genes with an adjusted p-value that has acheived significance
- The default alpha is 0.01
- Alpha can be changes in the
results
function with the argumentalpha
library(rafalib) mypar(1,2) res2 <- results(dds, alpha=0.05) plotMA(res, ylim=c(-4,4), main = "alpha=0.1") plotMA(res2, ylim=c(-4,4), main = "alpha=0.05")
- MA plots can also be adjusted to only display significiant differential expression that reaches a fold-change threshold
- Normally, all fold changes which are significant are indicated
lfcThreshold
argument inresults
can specify a cut off +“lfc” stands for log fold change
library(rafalib) mypar(1,2) res3 <- results(dds, lfcThreshold = 1) plotMA(res, ylim=c(-4,4), main = "no thresh") plotMA(res3, ylim=c(-4,4), main = "lfcTresh=1")
Individual genes
*Sorting genes by p-value can highlight important genes
```r
resSort <- res[order(res$padj),]
head(resSort)
```
```
## log2 fold change (MAP): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000152583 997.4398 4.313961 0.1721375 25.06114
## ENSG00000165995 495.0929 3.186823 0.1281565 24.86665
## ENSG00000101347 12703.3871 3.618734 0.1489434 24.29604
## ENSG00000120129 3409.0294 2.871488 0.1182491 24.28337
## ENSG00000189221 2341.7673 3.230395 0.1366746 23.63567
## ENSG00000211445 12285.6151 3.553360 0.1579821 22.49216
## pvalue padj
## <numeric> <numeric>
## ENSG00000152583 1.320082e-138 2.379844e-134
## ENSG00000165995 1.708459e-136 1.540005e-132
## ENSG00000101347 2.158519e-130 1.297126e-126
## ENSG00000120129 2.937761e-130 1.324049e-126
## ENSG00000189221 1.657244e-123 5.975359e-120
## ENSG00000211445 4.952520e-112 1.488067e-108
```
Can plot expression levels between groups for individual genes
library(ggplot2) top.gene <- which.min(res$padj) #selects gene with lowest adjusted pvalue data <- plotCounts(dds, gene=top.gene, intgroup=c("dex","cell"), returnData=TRUE) #extracts expression data from top.gene #If returnData=F, would generate a base graphics plot itself ggplot(data, aes(x=dex, y=count, col=cell, group=cell)) + geom_point() + geom_line() + scale_y_log10()
Heatmap of top genes
```r
# library(pheatmap)
# topgenes <- head(rownames(resSort),20)
# mat <- assay(rld)[topgenes,]
# mat <- mat - rowMeans(mat)
# df <- as.data.frame(colData(dds)[,c("dex","cell")])
# pheatmap(mat, annotation_col=df)
```