Analyzing ImmGen Data in R
Reading data from GEO
Gene Expression Omnibus (GEO) database
- GEO is a database for public hosting of gene expression data sets
- There are 4 types of GEO data formats
- GEO Platform (GPL) - annotation files describing a particular type of microarray
- GEO Sample (GSM) - data from a single microarray chip
- GEO Series (GSE) - list of GSM files from a single experiment
- GEO DataSet (GDS) - summarized combination of GSE and GSM files, with normalized expression levels
ImmGenn on GEO
- All ImmGen data is available on GEO at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907
- The entire ImmGen project is represented as GSE15907
- GSM files representing individual microarray data sets are also hosted and are linked on the GSE page
library(Biobase); library(GEOquery); library(org.Mm.eg.db) gse <- getGEO("GSE15907") gseES <- gse[[1]] #getGEO actually reads in data as a one element list that must be subsetted gseES
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 24922 features, 653 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: GSM399362 GSM399363 ... GSM2176423 (653 total) ## varLabels: title geo_accession ... relation.1 (39 total) ## varMetadata: labelDescription ## featureData ## featureNames: 10344620 10344622 ... 10608630 (24922 total) ## fvarLabels: ID GB_LIST ... category (12 total) ## fvarMetadata: Column Description labelDescription ## experimentData: use 'experimentData(object)' ## Annotation: GPL6246
getGEO
default is to read GSE in as anExpressionSet
- This can be changed to native GEOquery class with
GSEMatrix=FALSE
argument
- This can be changed to native GEOquery class with
ExpressionSet object in Bioconductor
ExpressionSets
are a general class in Bioconductor used to link many aspects of expression data together- There are several broad components of an
ExpressionSet
assayData
- the matrix of expression data, can be accessed byexprs()
exprs(gseES)[1:5,1:5]
## GSM399362 GSM399363 GSM399364 GSM399365 GSM399366 ## 10344620 19.8684 18.1856 18.2727 18.3073 17.7811 ## 10344622 313.1524 249.7509 356.3565 223.6505 278.3954 ## 10344624 321.6448 288.0079 254.8980 293.6604 376.0210 ## 10344633 843.8224 1133.3558 731.9925 666.4584 778.6256 ## 10344637 201.9093 229.7096 163.1982 154.7398 178.3030
phenoData
- anAnnotatedDataFrame
of variables describing the phenotype of each sample, can be accessed bypData()
names(pData(gseES)) #also varLabels(gseES)
## [1] "title" "geo_accession" ## [3] "status" "submission_date" ## [5] "last_update_date" "type" ## [7] "channel_count" "source_name_ch1" ## [9] "organism_ch1" "characteristics_ch1" ## [11] "characteristics_ch1.1" "characteristics_ch1.2" ## [13] "characteristics_ch1.3" "characteristics_ch1.4" ## [15] "characteristics_ch1.5" "treatment_protocol_ch1" ## [17] "molecule_ch1" "extract_protocol_ch1" ## [19] "label_ch1" "label_protocol_ch1" ## [21] "taxid_ch1" "hyb_protocol" ## [23] "scan_protocol" "description" ## [25] "data_processing" "platform_id" ## [27] "contact_name" "contact_laboratory" ## [29] "contact_department" "contact_institute" ## [31] "contact_address" "contact_city" ## [33] "contact_state" "contact_zip/postal_code" ## [35] "contact_country" "supplementary_file" ## [37] "data_row_count" "relation" ## [39] "relation.1"
sampleNames(gseES)[1:5] #equivalent to geo_accession column of pData
## [1] "GSM399362" "GSM399363" "GSM399364" "GSM399365" "GSM399366"
featureData
- anAnnotatedDataFrame
of variables describing each probe, can be accessed byfData()
names(fData(gseES))
## [1] "ID" "GB_LIST" "SPOT_ID" ## [4] "seqname" "RANGE_GB" "RANGE_STRAND" ## [7] "RANGE_START" "RANGE_STOP" "total_probes" ## [10] "gene_assignment" "mrna_assignment" "category"
featureNames(gseES)[1:5] #equivalent to ID column of fData
## [1] "10344620" "10344622" "10344624" "10344633" "10344637"
annotation
- information about the expression platform used (i.e. how the probe set can be annotated), typically a GPL file nameexperimentData
- text description of the expression set, typically in “MIAME” format
Annotating ImmGen data
- Many GEO databases contain annotation where probeIDs are clearly correlated to individual genes
- However in ImmGen GSE15907, annotation data is in a more complicated format
- Accession numbers are contained in
GB_LIST
column offeatureData
- For each probeID,
GB_LIST
has a string of accession numbers corresponding to each database that the associated gene has an entry in - The first entry is typically for NCBI’s GeneBank DB (e.g “NM_…”)
- However, some probeIDs will even have multiple accession numbers from the same database
- This usually indicates alternative splice forms that a given probeID can’t distinguish between
as.character(head(fData(gseES)$GB_LIST))[1:3]
## [1] "" "NM_024221,BC094468" "NM_008866,BC013536"
- For each probeID,
gene_assignment
contains string of annotation for each accession number available for a given probe- However, this string can be hard to parse
as.character(head(fData(gseES)$gene_assignment))[1:3]
## [1] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431" ## [2] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431" ## [3] "NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777"
Alternative strategy would be to first available GeneBank accession number for a probeID and use that as a key to query and annotation database
library(org.Mm.eg.db) #Separate each GB_LIST entry into a vector of accession numbers, contained as elements in a list gbList <- strsplit(as.character(fData(gseES)$GB_LIST), ",") #Find the first GeneBank ID in each vector element of the list gb <- sapply(gbList, function(x) (x[grepl("NM", x)])[1]) #Uses the org.Mm.eg.db to lookup probes by GeneBank accession number and return a gene symbol sym <- select(org.Mm.eg.db, keys=gb, columns=c("SYMBOL"), keytype=c("ACCNUM")) #Adds the gene symbol to the phenoData fData(gseES) <- cbind(fData(gseES), sym) head(fData(gseES)[,c("GB_LIST", "ACCNUM", "SYMBOL")])
## GB_LIST ACCNUM SYMBOL ## 10344620 <NA> <NA> ## 10344622 NM_024221,BC094468 NM_024221 Pdhb ## 10344624 NM_008866,BC013536 NM_008866 Lypla1 ## 10344633 NM_011541,NM_001159750,NM_001159751,BC083127 NM_011541 Tcea1 ## 10344637 NM_133826,BC009154 NM_133826 Atp6v1h ## 10344653 NM_011011,L11065 NM_011011 Oprk1
Pre-processing ImmGen Data
The expression data in the ImmGen GSE file needs to be log-transformed
exprs(gseES) <- log(exprs(gseES))
However, ImmGen GSE file is already pre-normalized
library(lumi) boxplot(gseES)
Differential gene expression
Theory of the Limma package
- Differential gene expression is a multiple testing problem
- The more genes you test for differential expression, the more likely you are to find a gene that meets a significance threshold by chance
- Normal approach would be to control the family wise error rate with a false discovery rate correction
- E.g. Bonferonni or Benjamini-Hochberg
- However a problem still remains that genes with smaller expression levels will have smaller standard errors and a a higher chance of acheiving significance
- Two fundamental components of the Limma methodology
- Use a linear model to describe the expression of each gene in terms of all different sample conditions
- \(expression_{gene} = /alpha_0 * condition_1 + ... /alpha_n * condition_n + error\)
- Use empirical Bayesian analysis to “shrink” standard error of any gene toward the mean of all gene variance
- Having a very small or very large SE is unlikely, so correct the SE based on its likelihood
- Use a linear model to describe the expression of each gene in terms of all different sample conditions
- Thus, Limma “borrows” power from the population of genes in the statistical test of a single gene
Analysis of Two-Group comparisons
- Subsetting ImmGen expressionSet to just to types of samples
- Comparing the splenic CD4+Foxp3- and CD4+Foxp3+CD25+ populations
gseES$population <- sapply(strsplit(as.character(gseES$title),"#"), function(x) x[1]) gseES.sub <- gseES[, gseES$population == "T.4FP3-.Sp" | gseES$population == "T.4FP3+25+.Sp"] show(gseES.sub)
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 24922 features, 5 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: GSM399365 GSM399366 ... GSM605768 (5 total) ## varLabels: title geo_accession ... population (40 total) ## varMetadata: labelDescription ## featureData ## featureNames: 10344620 10344622 ... 10608630 (24922 total) ## fvarLabels: ID GB_LIST ... SYMBOL (14 total) ## fvarMetadata: Column Description labelDescription ## experimentData: use 'experimentData(object)' ## Annotation: GPL6246
Creating the model matrix
samples <- gseES.sub$population design <- model.matrix(~factor(samples)) lv <- levels(factor(samples)) colnames(design) <- c(lv[1], paste0(lv[1],"_",lv[2])) design
## T.4FP3-.Sp T.4FP3-.Sp_T.4FP3+25+.Sp ## 1 1 1 ## 2 1 1 ## 3 1 0 ## 4 1 0 ## 5 1 0 ## attr(,"assign") ## [1] 0 1 ## attr(,"contrasts") ## attr(,"contrasts")$`factor(samples)` ## [1] "contr.treatment"
Using
limma
to calculate differential gene expressionlibrary(limma) fit <- lmFit(gseES.sub, design) fit <- eBayes(fit)
Listing top genes
topTable(fit, coef = 2, n=10, adjust="BH", genelist = fit$genes$SYMBOL, sort.by="logFC", p.value = 0.05)
## ID logFC AveExpr t P.Value adj.P.Val ## 10469278 Il2ra 4.206739 5.335222 40.02674 1.806074e-14 4.501097e-10 ## 10556528 Pde3b -3.598917 5.276919 -28.72939 1.050463e-12 6.544912e-09 ## 10555174 Lrrc32 3.388296 5.738926 26.28288 3.108976e-12 1.549638e-08 ## 10430344 Il2rb 3.347476 6.325675 29.11076 8.942440e-13 6.544912e-09 ## 10346790 Ctla4 3.284915 5.252286 22.81147 1.736545e-11 4.857526e-08 ## 10355312 Ikzf2 3.180292 5.496507 30.22917 5.641863e-13 6.544912e-09 ## 10598292 Foxp3 3.092576 5.595011 23.17289 1.435440e-11 4.857526e-08 ## 10504757 <NA> 2.929929 5.200110 24.70969 6.585971e-12 2.735593e-08 ## 10576639 Nrp1 2.894591 5.002522 19.96351 8.686259e-11 1.546278e-07 ## 10358408 Rgs1 2.852070 4.332895 22.72988 1.813551e-11 4.857526e-08 ## B ## 10469278 21.21208 ## 10556528 18.56722 ## 10555174 17.75013 ## 10430344 18.68474 ## 10346790 16.37101 ## 10355312 19.01542 ## 10598292 16.52840 ## 10504757 17.16037 ## 10576639 14.99778 ## 10358408 16.33499
Heatmap of top genes
library(gplots) selected <- p.adjust(fit$p.value[, 2]) <0.05 gseES.sub.selected <- gseES.sub[selected, ] heatmap.2(exprs(gseES.sub.selected), trace = "none", margins = c(10,10), labCol = gseES.sub.selected$population, cexCol=1.5)
Volcano plot of differential gene expression
volcanoplot(fit, coef=2, highlight=10, names=fit$genes$SYMBOL)
Finding results for a specific gene
allGenes <- topTable(fit, coef = 2, n=Inf, adjust="BH", genelist = fit$genes$SYMBOL) allGenes[(allGenes$ID == "Foxp3") & !is.na(allGenes$ID),]
## ID logFC AveExpr t P.Value adj.P.Val B ## 10598292 Foxp3 3.092576 5.595011 23.17289 1.43544e-11 4.857526e-08 16.5284
Analysis of Many-Group comparisons
Working with raw ImmGen sample data
Getting GSM files
- The ImmGen GSE file contains expression data from all ImmGen experiments
- However, this means that all individual samples have already been normalized prior to inclusion in the GSE
- You may want to apply your own pre-processing in order to best compare ImmGen data to your own expression data
- This will require downloading the component GSM files from the ImmGen GSE
- Getting all GSM files
- From the GSE file, it is simple to download all associated GSM files
- However, for the ImmGen data base, this is a lot of data and will take a while to download
- May not be necessary if you are only interested in a few particular samples
getGEOSuppFiles("GSE15907") #2.7Gb zipped file
- Getting specific GSM files
- Alternative, you can determine which specific GSM files you want and download them specifically
sample <- "T.4+8int.Th#1" gsm <- as.character(pData(gseES)$geo_accession[pData(gseES)$title == sample]) #Finds the gsm for the indicated sample gsm <- getGEO(gsm) #Downloads gsm from GEO
Setting up GSM files
If all supporting GSM files were downloaded
untar("GSE15907/GSE15907_RAW.tar", exdir="data")
#Required packages
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("GEOquery", "org.Mm.eg.db", "limma"))
Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments
https://www.bioconductor.org/help/course-materials/2009/BioC2009/labs/limma/limma.pdf