RNA-seq alignment
Sample data information
- TITLE: “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.”
- PAPER: http://www.ncbi.nlm.nih.gov/pubmed/24926665?holding=wustlmlib
DATA: http://www.ncbi.nlm.nih.gov/Traces/sra/?study=SRP033351
## X SampleName cell dex albut Run avgLength ## 1 SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 ## 2 SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 ## 3 SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 ## 4 SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 ## 5 SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 ## 6 SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 ## 7 SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 ## 8 SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 ## Experiment Sample BioSample ## 1 SRX384345 SRS508568 SAMN02422669 ## 2 SRX384346 SRS508567 SAMN02422675 ## 3 SRX384349 SRS508571 SAMN02422678 ## 4 SRX384350 SRS508572 SAMN02422670 ## 5 SRX384353 SRS508575 SAMN02422682 ## 6 SRX384354 SRS508576 SAMN02422673 ## 7 SRX384357 SRS508579 SAMN02422683 ## 8 SRX384358 SRS508580 SAMN02422677
- Information summary of each sample in study
- SRR numbers are reference numbers for data held in the Sequence Read Archive
- GSM numbers are reference numbers for data held in the Gene Expression Omnibus
Downloading and working with FASTQ files
Finding public fastq files * Sequence Read Archive hosts compressed FASTQ files + FTP download may not be available? * European Nucleotide Archive hosts uncompressed FASTQ files
Downloading FASTQ files
###Terminal commands
> cd [drive path]
> wget [url of read 1 FASTQ FTP]
> wget [url of read 2 FASTQ FTP]
###Terminal commands
> alias ll='ls -lGh'
#Creates alias ll that calls custom ls variation
#-l shows long format
#-G witholds group information
#-h shows file size in conventional units
Uncompressing files
###Terminal commands
> gunzip *.fastq.gz
#* is wildcard character allowing for all .fastq.gz files to be selected
Evaluating files
###Terminal commands
> less [file] #Shows subset of lines
> wc -l [file] #Shows number of lines in file
Example of FASTQ data
@SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130/1
CATTGCTGATACCAANNNNNNNNGCATTCCTCAAGGTCTTCCTCCTTCCCTTACGGAATTACA
+
HJJJJJJJJJJJJJJ########00?GHIJJJJJJJIJJJJJJJJJJJJJJJJJHHHFFFFFD
@SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131/1
CCCTGGACTGCTTCTTGAAAAGTGCCATCCAAACTCTATCTTTGGGGAGAGTATGATAGAGAT
+
HJJJJJJJJJJJJJJJJJIIJIGHIJJJJJJJJJJJJJJJJJJJJJJGHHIDHIJJHHHHHHF
@SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163/1
TCGATCCATCGATTGGAAGGCACTGATCTGGACTGTCAGGTTGGTGGTCTTATTTGCAAGTCC
+
HJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJIGHIJJBGIJCGIAHIJHHHHHHHFFFFF
- Line 1 - information about the read, including which strand of paired end (if applicable) with /1 or /2
- Line 2 - the sequence itself
- Line 4 - quality code with each letter representing a quality code for each nucleotide
FASTQC quality control
###Terminal commands
> fastqc --noextract [file1] [file2] ...
#--noextract will output files in .zip format
- Output includes an .html file encoding the fastqc report
Per base quality
- Distribution of quality at each position for all reads
- Tends to trail off towards end of read length
Per sequence quality score
- Frequency of each level of quality score (highest quality is 40)
- Want most reads to be high quality
Per base sequence content
- Proportion of each nucleotide at each position across all reads
- Each nucleotide should represent ~25% of nucleotides at each position
Genome alignment with STAR
File housekeeping
###Terminal commands
> mkdir fastq
> mv *.fastq fastq #moves all fastq files to new fastq folder
> mkdir fastqc
> mv *.zip fastqc
Generating genome indexes
- Need to provide STAR with reference genome and infomation to parse exons and introns
- Reference genome is FASTA format
- Intron/exon annotation in GTF format
- Can download reference genome from Ensemble or NCBI
- Ensemble –> Downloads –> Download data by FTP –> copy links for fasta and GTF
###Terminal commands > wget [url of FASTA FTP] > wget [url of GTF FTP] > gunzip *.zip > mkdir genome > mv [genome fasta file] genome > mkdir gtf > mv [gtf file] gtf
May want to subset the reference genome based if only a subset of genome is needed
###Terminal commands > grep -P '^1\t' Homo_sapiens.GRCh38.79.gtf > Homo_sapiens.GRCh38.79.chrom1.gtf
- This command subsets only the genome sequences located on chr1
Run STAR command to generate reference genome inexes
###Terminal commands > mkdir STARoutput > STAR --runMode genomeGenerate \ > --genomeDir STARoutput \ > --genomeFASTAFiles genome/[genome filename] > --sjdbGTFfile gtf/[gtf filename] > --sjdbOverhand [read length - 1]
Running STAR
Write a bash script for STAR command
- Running alignment typically done on cluster due to necessary computing power
Command is run on a cluster by submitting a bash script
###Bash script "run_star.sh" > STAR --runThreadN 12 \ #number of threads to use > --genomeDire STARoutput \ > --readFilesIn fastq/[read 1 filename] fastq/[read 2 filename]
Sample output
- STAR output contained in .sam files or compressed .bam files
@PG ID:STAR PN:STAR VN:STAR_2.3.1z_r395 CL:STAR --runThreadN 12 --genomeDir GRCh38.79.chrom1 --readFilesIn fastq/SRR1039508_1.fastq fastq/SRR1039508_2.fastq cl:STAR --runThreadN 12 --genomeDir GRCh38.79.chrom1 --readFilesIn fastq/SRR1039508_1.fastq fastq/SRR1039508_2.fastq
@SQ SN:1 LN:248956422
SRR1039508.7 163 1 96446974 255 63M = 96446997 86 CGTAGATTCGGGCAAGTCCACCACTACTGGCCATCTGATCTATAAATGCGGTGGCATCGACAA HJHIJJJJJJJJJJJJJJJIJJJJJJJJIJJJJJJJJJIIJJJJJJJJJJJEHHFFFFDEDDD NH:i:1 HI:i:1 AS:i:110 nM:i:7
SRR1039508.7 83 1 96446997 255 63M = 96446974 -86 CTACTGGCCATCTGATCTATAAATGCGGTGGCATCGACAAAAGAACCATTGAAAAATTTGAGA FHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJH NH:i:1 HI:i:1 AS:i:110 nM:i:7
SRR1039508.16 163 1 243426450 255 63M = 243426495 108 AAGAAAAAAGGTATACATATGATAAATTGGGAAAGTTACAGAGAAGAAATGAAGAATTGGAGG HJJJJJJJJJJAGIJJJIJJJJJJJJJJIJJIIJJHIJFHJIJJIJIJJJJJJJJIJJHHHHG NH:i:1 HI:i:1 AS:i:124 nM:i:0
SRR1039508.16 83 1 243426495 255 63M = 243426450 -108 GAAATGAAGAATTGGAGGAACAGTGTGTCCAGCATGGGAGAGTACATGAGACGATGAAGCAAA IJJIGJJJJJIIIJJJJJIIJJJJJJJIJJJJJIGJJIJJJJJIJJJJIJJJJJJJJJJJJJH NH:i:1 HI:i:1 AS:i:124 nM:i:0
- First column (e.g. SRR1039508.7) is read identifier
- Second column (e.g 163) is the SAM Flag code which can be looked up
- Can include info on if read was paired and which strand in pair
- Third column is the length of the corresponding reference sequence
- Fourth column is the CIGAR string which presents coded output of how well a read aligned to the genome
- “63M” is the CIGAR string for all 4 reads above, indicating a 63bp (M)atch
- Other codes include (I)nsertion, (D)eletion, and (N)Skipped
Transcriptome alignment with RSEM
Preparing reference transcriptome
###Terminal commands
> mkdir rsemGenome
> rsem-prepare-reference --gtf gtf/[gft filename] genome/[genome fasta filename] rsemGenome/[output filename]
Calculating expression
###Bash script "run_rsem.sh"
> rsem-calculate-expression -p 12 --paired-end fastq/[read 1 filename] fastq/[read 2 filename] rsemGenome/[reference filepath] [sample name]
Sample .isoform.results
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
ENST00000367770 ENSG00000000457 2916 2880.32 17.48 2.26 1.92 7.12
ENST00000367771 ENSG00000000457 6364 6328.32 234.41 13.75 11.67 43.37
ENST00000367772 ENSG00000000457 3090 3054.32 77.27 9.41 7.99 29.68
ENST00000423670 ENSG00000000457 2077 2041.32 20.75 3.79 3.22 11.95
ENST00000470238 ENSG00000000457 1538 1502.32 10.06 2.50 2.12 7.89
- All of the above transcript isoforms (ENSTs) belong to a single gene (ENSG)
- effective_length - transcript length - mean fragment length + 1
- Not positions on a transcript can generate a fragment due to the length of a fragment itself
- expected_count - probabilistic estimate for transcripts as discussed in (1)
- TPM - transcripts per million out of all transcripts
- Sum of all transcript TPMs will be one million
- FPKM - fragments per kilobase of transcript per million
- IsoPct - percentage of all gene transcripts represented by particular isoform
- Will total to 100% for all isoforms of a gene
Sample .genes.results
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 4254.03 4218.34 359.98 31.70 26.91
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 3195.09 3159.41 116.58 13.72 11.65
- Combines all isoforms of a gene into a single line
- Stats (TPM, FPKM, etc) apply to the gene here, not an isoform