### Data generation

Basics

• RNAseq data in generated from short fragments (~60bp) of cDNA
• These fragments are then aligned and mapped to a genome
• The number of fragments aligned to a gene can be quantified and used to assess transcription levels

The problem of exons

• Unlike the genome, cDNA transcripts lack introns
• So cDNA fragments spanning an exon junction site present a challenge when mapping to a genome since they “skip” the intron region
• Some analysis approaches map fragments to a transcriptome to avoid this problem
• e.g. Trinity, Velvet/Oases
• Others use a reference genome which exon annotation to map cDNA fragments

The problem of transcript isoforms

• Many genes have multiple ways in which their exons can be assembled
• Determining which potential transcript isoform a short RNAseq fragment derived from is another challenge

### Model quantification

Quantifying transcripts of different lengths

• Necessary to distinguish between copy number and length when quantifying transcripts
• When broken into fragments, a longer transcript will generate more transcripts than a shorter transcript, even if the number of copies of each is the same

The number of copies of a transcript is quantified using the poisson distribtion

• The proportion of sequence material in an RNAseq pool derived from a given transcript can be stated as $$p_f= \theta_fl_f$$
• $$l_f$$ is the length of transcript $$f$$
• $$\theta_f$$ is the proportion of transcripts in the RNAseq pool derived from transcript $$f$$
• This can be defined as $$\theta_f = \frac{k_f}{\Sigma k_il_i}$$
• $$k_f$$ is the total amount of sequence material from transcript $$f$$
• $$\Sigma k_il_i$$ represents the total amount of sequence in the RNAseq pool
• $$\theta_f$$ is the parameter of interest for RNAseq analysis
• $$Y_f$$ ~ $$\text{Poisson}(\theta_fl_f)$$
• $$Y_f$$ is the number of reads of transcript $$f$$
• [Reminder] - The poisson distribution is a discrete probability distribution modeling the probability of obtaining a certain “count” given an “average count” or “rate”
• The law of small numbers says that the sum of binary outcomes with a large $$N$$ and small $$p$$ can be modeled by a poisson distribution +For RNAseq, outcomes are binary (each transcript either is or isnt $$f$$), $$N$$ (number of transcripts) is high, and $$p$$ (frequency of transcript $$f$$) is small, so count of $$f$$ can be modeled by a poisson
• The number of counts for a transcripts follows a poisson distribution centered on the proportion of all sequence material derived from $$f$$
• References for use of poisson distribution in RNA seq:
• Marioni (2008) Genome Research
• Jiang and Wong (2009) Bioinformatics

Maximum likelihood quantification

• Example - A three exon (A,B,C) gene with three possible transcripts 1) A-B, 2) A-B-C, 3) B-C
• How do we quantify to proportion of each transcript from fragments derived from all three?
• There are five possible fragment sequences given the size of a fragment and exon
• A only, B only, C only, A-B junction, B-C junction
• These possibilities can be descriped by a linear model \begin{align} E \left( \begin{array}{c} Y_A/l_A \\ Y_B/l_B \\ Y_C/l_C \\ Y_{A,B}/l_{A,B} \\ Y_{B,C}/l_{B,C} \end{array} \right) & = & \left( \begin{array}{rrr} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{array} \right) & \left( \begin{array}{r} \theta_A \\ \theta_B \\ \theta_C \end{array} \right) \end{align}
• Where: \begin{align} \left( \begin{array}{c} Y_A \\ Y_B \\ Y_C \\ Y_{A,B} \\ Y_{B,C} \end{array} \right) & = & \left( \begin{array}{c} \text{Exon A count} \\ \text{Exon B count} \\ \text{Exon C count} \\ \text{Junction A-B count} \\ \text{Junction B-C count} \end{array} \right) \end{align}
• And, for example, $$Y_A$$ ~ $$\text{Poisson}(\theta_1l_1 + \theta_2l_2)$$
• The read count for exon A follows a poisson distribution based on the proportion of fragments from isoforms 1 and 2, since both of these can potentially include exon A
• Solving this maximum likelihood is used if Cufflinks algorithm
• Further discussed in Jiang and Wong (2009) Bioinformatics
• Example
• Suppose \begin{align} \text{lenghts} = \left( \begin{array}{c} l_A \\ l_B \\ l_C \\ l_{A,B} \\ l_{B,C} \end{array} \right) = \left( \begin{array}{c} 100 \\ 200 \\ 300 \\ 100 \\ 100 \end{array} \right) \end{align}
• The length of the exons are simply the actual length of the exons
• The length of the junctions represent the total range of sequence that can be counted as a junctional sequence
• Any fragment that touches a junction is considered a junctional sequence, so this ends up representing 2x the size of a fragment (i.e fragments can touch the junction from the 5’ or 3’ side). In this example the fragment is ~50bp so the junction lengths are 100bp
• Note that realistically, computation of lengths and counts is more sophisticated
exon.lengths <- c(100,200,300,100,100)
mat <- cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
(transcript.lengths <- exon.lengths %*% mat)
##      [,1] [,2] [,3]
## [1,]  400  800  600
total.reads <- 1000
exon.counts <- c(125,350,300,125,100)
lm.fit(mat, LHS)\$coefficients
##      x1      x2      x3
## 0.00075 0.00050 0.00050