RNA-seq alignment - Hello. I'm Ben Solomon.

RNA-seq alignment

Sample data information

TITLE: “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.”
PAPER: http://www.ncbi.nlm.nih.gov/pubmed/24926665?holding=wustlmlib

DATA: http://www.ncbi.nlm.nih.gov/Traces/sra/?study=SRP033351

##            X SampleName    cell   dex albut        Run avgLength
## 1 SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126
## 2 SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126
## 3 SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126
## 4 SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87
## 5 SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120
## 6 SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126
## 7 SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101
## 8 SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98
##   Experiment    Sample    BioSample
## 1  SRX384345 SRS508568 SAMN02422669
## 2  SRX384346 SRS508567 SAMN02422675
## 3  SRX384349 SRS508571 SAMN02422678
## 4  SRX384350 SRS508572 SAMN02422670
## 5  SRX384353 SRS508575 SAMN02422682
## 6  SRX384354 SRS508576 SAMN02422673
## 7  SRX384357 SRS508579 SAMN02422683
## 8  SRX384358 SRS508580 SAMN02422677

Information summary of each sample in study
SRR numbers are reference numbers for data held in the Sequence Read Archive
GSM numbers are reference numbers for data held in the Gene Expression Omnibus

Downloading and working with FASTQ files

Finding public fastq files * Sequence Read Archive hosts compressed FASTQ files + FTP download may not be available? * European Nucleotide Archive hosts uncompressed FASTQ files

Downloading FASTQ files

###Terminal commands
> cd [drive path]
> wget [url of read 1 FASTQ FTP]
> wget [url of read 2 FASTQ FTP]

###Terminal commands
> alias ll='ls -lGh'
#Creates alias ll that calls custom ls variation
#-l shows long format
#-G witholds group information
#-h shows file size in conventional units

Uncompressing files

###Terminal commands
> gunzip *.fastq.gz
#* is wildcard character allowing for all .fastq.gz files to be selected

Evaluating files

###Terminal commands
> less [file] #Shows subset of lines
> wc -l [file] #Shows number of lines in file

Example of FASTQ data

@SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130/1
CATTGCTGATACCAANNNNNNNNGCATTCCTCAAGGTCTTCCTCCTTCCCTTACGGAATTACA
+
HJJJJJJJJJJJJJJ########00?GHIJJJJJJJIJJJJJJJJJJJJJJJJJHHHFFFFFD
@SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131/1
CCCTGGACTGCTTCTTGAAAAGTGCCATCCAAACTCTATCTTTGGGGAGAGTATGATAGAGAT
+
HJJJJJJJJJJJJJJJJJIIJIGHIJJJJJJJJJJJJJJJJJJJJJJGHHIDHIJJHHHHHHF
@SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163/1
TCGATCCATCGATTGGAAGGCACTGATCTGGACTGTCAGGTTGGTGGTCTTATTTGCAAGTCC
+
HJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJIGHIJJBGIJCGIAHIJHHHHHHHFFFFF

Line 1 - information about the read, including which strand of paired end (if applicable) with /1 or /2
Line 2 - the sequence itself
Line 4 - quality code with each letter representing a quality code for each nucleotide

FASTQC quality control

###Terminal commands
> fastqc --noextract [file1] [file2] ...
#--noextract will output files in .zip format

Output includes an .html file encoding the fastqc report

Per base quality

Distribution of quality at each position for all reads
Tends to trail off towards end of read length

Per sequence quality score

Frequency of each level of quality score (highest quality is 40)
Want most reads to be high quality

Per base sequence content

Proportion of each nucleotide at each position across all reads
Each nucleotide should represent ~25% of nucleotides at each position

Genome alignment with STAR

File housekeeping

###Terminal commands
> mkdir fastq
> mv *.fastq fastq #moves all fastq files to new fastq folder
> mkdir fastqc
> mv *.zip fastqc

Generating genome indexes

Need to provide STAR with reference genome and infomation to parse exons and introns
- Reference genome is FASTA format
- Intron/exon annotation in GTF format

Can download reference genome from Ensemble or NCBI

Ensemble –> Downloads –> Download data by FTP –> copy links for fasta and GTF

###Terminal commands
> wget [url of FASTA FTP]
> wget [url of GTF FTP]
> gunzip *.zip
> mkdir genome
> mv [genome fasta file] genome
> mkdir gtf
> mv [gtf file] gtf

May want to subset the reference genome based if only a subset of genome is needed
```
###Terminal commands
> grep -P '^1\t' Homo_sapiens.GRCh38.79.gtf > Homo_sapiens.GRCh38.79.chrom1.gtf
```
- This command subsets only the genome sequences located on chr1

Run STAR command to generate reference genome inexes

###Terminal commands
> mkdir STARoutput
> STAR --runMode genomeGenerate \
> --genomeDir STARoutput \
> --genomeFASTAFiles genome/[genome filename]
> --sjdbGTFfile gtf/[gtf filename]
> --sjdbOverhand [read length - 1]

Running STAR

Write a bash script for STAR command

Running alignment typically done on cluster due to necessary computing power

Command is run on a cluster by submitting a bash script

###Bash script "run_star.sh"
> STAR --runThreadN 12 \ #number of threads to use
> --genomeDire STARoutput \
> --readFilesIn fastq/[read 1 filename] fastq/[read 2 filename]

Sample output

STAR output contained in .sam files or compressed .bam files

@PG     ID:STAR PN:STAR VN:STAR_2.3.1z_r395     CL:STAR   --runThreadN 12   --genomeDir GRCh38.79.chrom1   --readFilesIn fastq/SRR1039508_1.fastq   fastq/SRR1039508_2.fastq    cl:STAR --runThreadN 12 --genomeDir GRCh38.79.chrom1 --readFilesIn fastq/SRR1039508_1.fastq fastq/SRR1039508_2.fastq
@SQ     SN:1    LN:248956422
SRR1039508.7    163     1       96446974        255     63M     =       96446997        86      CGTAGATTCGGGCAAGTCCACCACTACTGGCCATCTGATCTATAAATGCGGTGGCATCGACAA HJHIJJJJJJJJJJJJJJJIJJJJJJJJIJJJJJJJJJIIJJJJJJJJJJJEHHFFFFDEDDD NH:i:1  HI:i:1  AS:i:110        nM:i:7
SRR1039508.7    83      1       96446997        255     63M     =       96446974        -86     CTACTGGCCATCTGATCTATAAATGCGGTGGCATCGACAAAAGAACCATTGAAAAATTTGAGA FHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJH NH:i:1  HI:i:1  AS:i:110        nM:i:7
SRR1039508.16   163     1       243426450       255     63M     =       243426495       108     AAGAAAAAAGGTATACATATGATAAATTGGGAAAGTTACAGAGAAGAAATGAAGAATTGGAGG HJJJJJJJJJJAGIJJJIJJJJJJJJJJIJJIIJJHIJFHJIJJIJIJJJJJJJJIJJHHHHG NH:i:1  HI:i:1  AS:i:124        nM:i:0
SRR1039508.16   83      1       243426495       255     63M     =       243426450       -108    GAAATGAAGAATTGGAGGAACAGTGTGTCCAGCATGGGAGAGTACATGAGACGATGAAGCAAA IJJIGJJJJJIIIJJJJJIIJJJJJJJIJJJJJIGJJIJJJJJIJJJJIJJJJJJJJJJJJJH NH:i:1  HI:i:1  AS:i:124        nM:i:0

First column (e.g. SRR1039508.7) is read identifier
Second column (e.g 163) is the SAM Flag code which can be looked up
- Can include info on if read was paired and which strand in pair
Third column is the length of the corresponding reference sequence
Fourth column is the CIGAR string which presents coded output of how well a read aligned to the genome
- “63M” is the CIGAR string for all 4 reads above, indicating a 63bp (M)atch
- Other codes include (I)nsertion, (D)eletion, and (N)Skipped

Transcriptome alignment with RSEM

Preparing reference transcriptome

###Terminal commands
> mkdir rsemGenome
> rsem-prepare-reference --gtf gtf/[gft filename] genome/[genome fasta filename] rsemGenome/[output filename]

Calculating expression

###Bash script "run_rsem.sh"
> rsem-calculate-expression -p 12 --paired-end fastq/[read 1 filename] fastq/[read 2 filename] rsemGenome/[reference filepath] [sample name]

Sample .isoform.results

transcript_id   gene_id length  effective_length    expected_count  TPM FPKM    IsoPct
ENST00000367770 ENSG00000000457 2916    2880.32 17.48   2.26    1.92    7.12
ENST00000367771 ENSG00000000457 6364    6328.32 234.41  13.75   11.67   43.37
ENST00000367772 ENSG00000000457 3090    3054.32 77.27   9.41    7.99    29.68
ENST00000423670 ENSG00000000457 2077    2041.32 20.75   3.79    3.22    11.95
ENST00000470238 ENSG00000000457 1538    1502.32 10.06   2.50    2.12    7.89

All of the above transcript isoforms (ENSTs) belong to a single gene (ENSG)
effective_length - transcript length - mean fragment length + 1
- Not positions on a transcript can generate a fragment due to the length of a fragment itself
expected_count - probabilistic estimate for transcripts as discussed in (1)
TPM - transcripts per million out of all transcripts
- Sum of all transcript TPMs will be one million
FPKM - fragments per kilobase of transcript per million
IsoPct - percentage of all gene transcripts represented by particular isoform
- Will total to 100% for all isoforms of a gene

Sample .genes.results

gene_id transcript_id(s)    length  effective_length    expected_count  TPM FPKM
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 4254.03 4218.34 359.98  31.70   26.91
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 3195.09 3159.41 116.58  13.72   11.65

Combines all isoforms of a gene into a single line
Stats (TPM, FPKM, etc) apply to the gene here, not an isoform