Introduction to RNA-seq analysis
Data generation
Basics
- RNAseq data in generated from short fragments (~60bp) of cDNA
- These fragments are then aligned and mapped to a genome
- The number of fragments aligned to a gene can be quantified and used to assess transcription levels
The problem of exons
- Unlike the genome, cDNA transcripts lack introns
- So cDNA fragments spanning an exon junction site present a challenge when mapping to a genome since they “skip” the intron region
- Some analysis approaches map fragments to a transcriptome to avoid this problem
- e.g. Trinity, Velvet/Oases
- Others use a reference genome which exon annotation to map cDNA fragments
- e.g. Cufflinks, Scripture
The problem of transcript isoforms
- Many genes have multiple ways in which their exons can be assembled
- Determining which potential transcript isoform a short RNAseq fragment derived from is another challenge
Model quantification
Quantifying transcripts of different lengths
- Necessary to distinguish between copy number and length when quantifying transcripts
- When broken into fragments, a longer transcript will generate more transcripts than a shorter transcript, even if the number of copies of each is the same
The number of copies of a transcript is quantified using the poisson distribtion
- The proportion of sequence material in an RNAseq pool derived from a given transcript can be stated as \(p_f= \theta_fl_f\)
- \(l_f\) is the length of transcript \(f\)
- \(\theta_f\) is the proportion of transcripts in the RNAseq pool derived from transcript \(f\)
- This can be defined as \(\theta_f = \frac{k_f}{\Sigma k_il_i}\)
- \(k_f\) is the total amount of sequence material from transcript \(f\)
- \(\Sigma k_il_i\) represents the total amount of sequence in the RNAseq pool
- \(\theta_f\) is the parameter of interest for RNAseq analysis
- This can be defined as \(\theta_f = \frac{k_f}{\Sigma k_il_i}\)
- \(Y_f\) ~ \(\text{Poisson}(\theta_fl_f)\)
- \(Y_f\) is the number of reads of transcript \(f\)
- [Reminder] - The poisson distribution is a discrete probability distribution modeling the probability of obtaining a certain “count” given an “average count” or “rate”
- The law of small numbers says that the sum of binary outcomes with a large \(N\) and small \(p\) can be modeled by a poisson distribution +For RNAseq, outcomes are binary (each transcript either is or isnt \(f\)), \(N\) (number of transcripts) is high, and \(p\) (frequency of transcript \(f\)) is small, so count of \(f\) can be modeled by a poisson
- The number of counts for a transcripts follows a poisson distribution centered on the proportion of all sequence material derived from \(f\)
- References for use of poisson distribution in RNA seq:
- Marioni (2008) Genome Research
- Jiang and Wong (2009) Bioinformatics
Maximum likelihood quantification
- Example - A three exon (A,B,C) gene with three possible transcripts 1) A-B, 2) A-B-C, 3) B-C
- How do we quantify to proportion of each transcript from fragments derived from all three?
- There are five possible fragment sequences given the size of a fragment and exon
- A only, B only, C only, A-B junction, B-C junction
- These possibilities can be descriped by a linear model \[\begin{align} E \left( \begin{array}{c} Y_A/l_A \\ Y_B/l_B \\ Y_C/l_C \\ Y_{A,B}/l_{A,B} \\ Y_{B,C}/l_{B,C} \end{array} \right) & = & \left( \begin{array}{rrr} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{array} \right) & \left( \begin{array}{r} \theta_A \\ \theta_B \\ \theta_C \end{array} \right) \end{align}\]
- Where: \[\begin{align} \left( \begin{array}{c} Y_A \\ Y_B \\ Y_C \\ Y_{A,B} \\ Y_{B,C} \end{array} \right) & = & \left( \begin{array}{c} \text{Exon A count} \\ \text{Exon B count} \\ \text{Exon C count} \\ \text{Junction A-B count} \\ \text{Junction B-C count} \end{array} \right) \end{align}\]
- And, for example, \(Y_A\) ~ \(\text{Poisson}(\theta_1l_1 + \theta_2l_2)\)
- The read count for exon A follows a poisson distribution based on the proportion of fragments from isoforms 1 and 2, since both of these can potentially include exon A
- Solving this maximum likelihood is used if Cufflinks algorithm
- Further discussed in Jiang and Wong (2009) Bioinformatics
- Example
- Suppose \[\begin{align} \text{lenghts} = \left( \begin{array}{c} l_A \\ l_B \\ l_C \\ l_{A,B} \\ l_{B,C} \end{array} \right) = \left( \begin{array}{c} 100 \\ 200 \\ 300 \\ 100 \\ 100 \end{array} \right) \end{align}\]
- The length of the exons are simply the actual length of the exons
- The length of the junctions represent the total range of sequence that can be counted as a junctional sequence
- Any fragment that touches a junction is considered a junctional sequence, so this ends up representing 2x the size of a fragment (i.e fragments can touch the junction from the 5’ or 3’ side). In this example the fragment is ~50bp so the junction lengths are 100bp
- Note that realistically, computation of lengths and counts is more sophisticated
- Suppose \[\begin{align} \text{lenghts} = \left( \begin{array}{c} l_A \\ l_B \\ l_C \\ l_{A,B} \\ l_{B,C} \end{array} \right) = \left( \begin{array}{c} 100 \\ 200 \\ 300 \\ 100 \\ 100 \end{array} \right) \end{align}\]
exon.lengths <- c(100,200,300,100,100)
mat <- cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
(transcript.lengths <- exon.lengths %*% mat)
## [,1] [,2] [,3]
## [1,] 400 800 600
total.reads <- 1000
exon.counts <- c(125,350,300,125,100)
LHS <- exon.counts/(exon.lengths * total.reads)
lm.fit(mat, LHS)$coefficients
## x1 x2 x3
## 0.00075 0.00050 0.00050