Analyzing ImmGen Data in R - Hello. I'm Ben Solomon.

Reading data from GEO

Gene Expression Omnibus (GEO) database

GEO is a database for public hosting of gene expression data sets
There are 4 types of GEO data formats
- GEO Platform (GPL) - annotation files describing a particular type of microarray
- GEO Sample (GSM) - data from a single microarray chip
- GEO Series (GSE) - list of GSM files from a single experiment
- GEO DataSet (GDS) - summarized combination of GSE and GSM files, with normalized expression levels

ImmGenn on GEO

All ImmGen data is available on GEO at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907

The entire ImmGen project is represented as GSE15907

GSM files representing individual microarray data sets are also hosted and are linked on the GSE page

library(Biobase); library(GEOquery); library(org.Mm.eg.db)
gse <- getGEO("GSE15907")
gseES <- gse[[1]] #getGEO actually reads in data as a one element list that must be subsetted
gseES

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 24922 features, 653 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM399362 GSM399363 ... GSM2176423 (653 total)
##   varLabels: title geo_accession ... relation.1 (39 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 10344620 10344622 ... 10608630 (24922 total)
##   fvarLabels: ID GB_LIST ... category (12 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6246

getGEO default is to read GSE in as an ExpressionSet
- This can be changed to native GEOquery class with GSEMatrix=FALSE argument

ExpressionSet object in Bioconductor

ExpressionSets are a general class in Bioconductor used to link many aspects of expression data together

There are several broad components of an ExpressionSet

assayData - the matrix of expression data, can be accessed by exprs()

exprs(gseES)[1:5,1:5]

##          GSM399362 GSM399363 GSM399364 GSM399365 GSM399366
## 10344620   19.8684   18.1856   18.2727   18.3073   17.7811
## 10344622  313.1524  249.7509  356.3565  223.6505  278.3954
## 10344624  321.6448  288.0079  254.8980  293.6604  376.0210
## 10344633  843.8224 1133.3558  731.9925  666.4584  778.6256
## 10344637  201.9093  229.7096  163.1982  154.7398  178.3030

phenoData - an AnnotatedDataFrame of variables describing the phenotype of each sample, can be accessed by pData()

names(pData(gseES))  #also varLabels(gseES)

##  [1] "title"                   "geo_accession"
##  [3] "status"                  "submission_date"
##  [5] "last_update_date"        "type"
##  [7] "channel_count"           "source_name_ch1"
##  [9] "organism_ch1"            "characteristics_ch1"
## [11] "characteristics_ch1.1"   "characteristics_ch1.2"
## [13] "characteristics_ch1.3"   "characteristics_ch1.4"
## [15] "characteristics_ch1.5"   "treatment_protocol_ch1"
## [17] "molecule_ch1"            "extract_protocol_ch1"
## [19] "label_ch1"               "label_protocol_ch1"
## [21] "taxid_ch1"               "hyb_protocol"
## [23] "scan_protocol"           "description"
## [25] "data_processing"         "platform_id"
## [27] "contact_name"            "contact_laboratory"
## [29] "contact_department"      "contact_institute"
## [31] "contact_address"         "contact_city"
## [33] "contact_state"           "contact_zip/postal_code"
## [35] "contact_country"         "supplementary_file"
## [37] "data_row_count"          "relation"
## [39] "relation.1"

sampleNames(gseES)[1:5] #equivalent to geo_accession column of pData

## [1] "GSM399362" "GSM399363" "GSM399364" "GSM399365" "GSM399366"

featureData - an AnnotatedDataFrame of variables describing each probe, can be accessed by fData()

names(fData(gseES))

##  [1] "ID"              "GB_LIST"         "SPOT_ID"
##  [4] "seqname"         "RANGE_GB"        "RANGE_STRAND"
##  [7] "RANGE_START"     "RANGE_STOP"      "total_probes"
## [10] "gene_assignment" "mrna_assignment" "category"

featureNames(gseES)[1:5] #equivalent to ID column of fData

## [1] "10344620" "10344622" "10344624" "10344633" "10344637"

annotation - information about the expression platform used (i.e. how the probe set can be annotated), typically a GPL file name
experimentData - text description of the expression set, typically in “MIAME” format

Annotating ImmGen data

Many GEO databases contain annotation where probeIDs are clearly correlated to individual genes
However in ImmGen GSE15907, annotation data is in a more complicated format
Accession numbers are contained in GB_LIST column of featureData
- For each probeID, GB_LIST has a string of accession numbers corresponding to each database that the associated gene has an entry in
- The first entry is typically for NCBI’s GeneBank DB (e.g “NM_…”)
- However, some probeIDs will even have multiple accession numbers from the same database
  - This usually indicates alternative splice forms that a given probeID can’t distinguish between
```
as.character(head(fData(gseES)$GB_LIST))[1:3]
```
```
## [1] ""                   "NM_024221,BC094468" "NM_008866,BC013536"
```

gene_assignment contains string of annotation for each accession number available for a given probe

However, this string can be hard to parse

as.character(head(fData(gseES)$gene_assignment))[1:3]

## [1] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431"
## [2] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431"
## [3] "NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777"

Alternative strategy would be to first available GeneBank accession number for a probeID and use that as a key to query and annotation database

library(org.Mm.eg.db)
#Separate each GB_LIST entry into a vector of accession numbers, contained as elements in a list
gbList <- strsplit(as.character(fData(gseES)$GB_LIST), ",")
#Find the first GeneBank ID in each vector element of the list
gb <- sapply(gbList, function(x) (x[grepl("NM", x)])[1])
#Uses the org.Mm.eg.db to lookup probes by GeneBank accession number and return a gene symbol
sym <- select(org.Mm.eg.db, keys=gb, columns=c("SYMBOL"), keytype=c("ACCNUM"))
#Adds the gene symbol to the phenoData
fData(gseES) <- cbind(fData(gseES), sym)
head(fData(gseES)[,c("GB_LIST", "ACCNUM", "SYMBOL")])

##                                               GB_LIST    ACCNUM  SYMBOL
## 10344620                                                   <NA>    <NA>
## 10344622                           NM_024221,BC094468 NM_024221    Pdhb
## 10344624                           NM_008866,BC013536 NM_008866  Lypla1
## 10344633 NM_011541,NM_001159750,NM_001159751,BC083127 NM_011541   Tcea1
## 10344637                           NM_133826,BC009154 NM_133826 Atp6v1h
## 10344653                             NM_011011,L11065 NM_011011   Oprk1

Pre-processing ImmGen Data

The expression data in the ImmGen GSE file needs to be log-transformed
```
exprs(gseES) <- log(exprs(gseES))
```
However, ImmGen GSE file is already pre-normalized
```
library(lumi)
boxplot(gseES)
```

Differential gene expression

Theory of the Limma package

Differential gene expression is a multiple testing problem
- The more genes you test for differential expression, the more likely you are to find a gene that meets a significance threshold by chance
- Normal approach would be to control the family wise error rate with a false discovery rate correction
  - E.g. Bonferonni or Benjamini-Hochberg
- However a problem still remains that genes with smaller expression levels will have smaller standard errors and a a higher chance of acheiving significance
Two fundamental components of the Limma methodology
1. Use a linear model to describe the expression of each gene in terms of all different sample conditions
  - $expression_{gene} = /alpha_0 * condition_1 + ... /alpha_n * condition_n + error$
2. Use empirical Bayesian analysis to “shrink” standard error of any gene toward the mean of all gene variance
  - Having a very small or very large SE is unlikely, so correct the SE based on its likelihood
Thus, Limma “borrows” power from the population of genes in the statistical test of a single gene

Analysis of Two-Group comparisons

Subsetting ImmGen expressionSet to just to types of samples

Comparing the splenic CD4+Foxp3- and CD4+Foxp3+CD25+ populations

gseES$population <- sapply(strsplit(as.character(gseES$title),"#"), function(x) x[1])
gseES.sub <- gseES[, gseES$population == "T.4FP3-.Sp" | gseES$population == "T.4FP3+25+.Sp"]
show(gseES.sub)

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 24922 features, 5 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM399365 GSM399366 ... GSM605768 (5 total)
##   varLabels: title geo_accession ... population (40 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 10344620 10344622 ... 10608630 (24922 total)
##   fvarLabels: ID GB_LIST ... SYMBOL (14 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6246

Creating the model matrix

samples <- gseES.sub$population
design <- model.matrix(~factor(samples))
lv <- levels(factor(samples))
colnames(design) <- c(lv[1], paste0(lv[1],"_",lv[2]))
design

##   T.4FP3-.Sp T.4FP3-.Sp_T.4FP3+25+.Sp
## 1          1                        1
## 2          1                        1
## 3          1                        0
## 4          1                        0
## 5          1                        0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(samples)`
## [1] "contr.treatment"

Using limma to calculate differential gene expression

library(limma)
fit <- lmFit(gseES.sub, design)
fit <- eBayes(fit)

Listing top genes

topTable(fit, coef = 2, n=10, adjust="BH", genelist = fit$genes$SYMBOL, sort.by="logFC", p.value = 0.05)

##              ID     logFC  AveExpr         t      P.Value    adj.P.Val
## 10469278  Il2ra  4.206739 5.335222  40.02674 1.806074e-14 4.501097e-10
## 10556528  Pde3b -3.598917 5.276919 -28.72939 1.050463e-12 6.544912e-09
## 10555174 Lrrc32  3.388296 5.738926  26.28288 3.108976e-12 1.549638e-08
## 10430344  Il2rb  3.347476 6.325675  29.11076 8.942440e-13 6.544912e-09
## 10346790  Ctla4  3.284915 5.252286  22.81147 1.736545e-11 4.857526e-08
## 10355312  Ikzf2  3.180292 5.496507  30.22917 5.641863e-13 6.544912e-09
## 10598292  Foxp3  3.092576 5.595011  23.17289 1.435440e-11 4.857526e-08
## 10504757   <NA>  2.929929 5.200110  24.70969 6.585971e-12 2.735593e-08
## 10576639   Nrp1  2.894591 5.002522  19.96351 8.686259e-11 1.546278e-07
## 10358408   Rgs1  2.852070 4.332895  22.72988 1.813551e-11 4.857526e-08
##                 B
## 10469278 21.21208
## 10556528 18.56722
## 10555174 17.75013
## 10430344 18.68474
## 10346790 16.37101
## 10355312 19.01542
## 10598292 16.52840
## 10504757 17.16037
## 10576639 14.99778
## 10358408 16.33499

Heatmap of top genes

library(gplots)
selected <- p.adjust(fit$p.value[, 2]) <0.05
gseES.sub.selected <- gseES.sub[selected, ]
heatmap.2(exprs(gseES.sub.selected),
          trace = "none",
          margins = c(10,10),
          labCol = gseES.sub.selected$population,
          cexCol=1.5)

Volcano plot of differential gene expression

volcanoplot(fit, coef=2, highlight=10, names=fit$genes$SYMBOL)

Finding results for a specific gene

allGenes <- topTable(fit, coef = 2, n=Inf, adjust="BH", genelist = fit$genes$SYMBOL)
allGenes[(allGenes$ID == "Foxp3") & !is.na(allGenes$ID),]

##             ID    logFC  AveExpr        t     P.Value    adj.P.Val       B
## 10598292 Foxp3 3.092576 5.595011 23.17289 1.43544e-11 4.857526e-08 16.5284

Analysis of Many-Group comparisons

Working with raw ImmGen sample data

Getting GSM files

The ImmGen GSE file contains expression data from all ImmGen experiments
However, this means that all individual samples have already been normalized prior to inclusion in the GSE
You may want to apply your own pre-processing in order to best compare ImmGen data to your own expression data
This will require downloading the component GSM files from the ImmGen GSE
Getting all GSM files
- From the GSE file, it is simple to download all associated GSM files
- However, for the ImmGen data base, this is a lot of data and will take a while to download
- May not be necessary if you are only interested in a few particular samples
```
getGEOSuppFiles("GSE15907") #2.7Gb zipped file
```

Getting specific GSM files

Alternative, you can determine which specific GSM files you want and download them specifically

sample <- "T.4+8int.Th#1"
gsm <- as.character(pData(gseES)$geo_accession[pData(gseES)$title == sample]) #Finds the gsm for the indicated sample
gsm <- getGEO(gsm) #Downloads gsm from GEO

Setting up GSM files

If all supporting GSM files were downloaded

untar("GSE15907/GSE15907_RAW.tar", exdir="data")

#Required packages
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("GEOquery", "org.Mm.eg.db", "limma"))

Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

https://www.bioconductor.org/help/course-materials/2009/BioC2009/labs/limma/limma.pdf