Reading data from GEO

Gene Expression Omnibus (GEO) database

  • GEO is a database for public hosting of gene expression data sets
  • There are 4 types of GEO data formats
    • GEO Platform (GPL) - annotation files describing a particular type of microarray
    • GEO Sample (GSM) - data from a single microarray chip
    • GEO Series (GSE) - list of GSM files from a single experiment
    • GEO DataSet (GDS) - summarized combination of GSE and GSM files, with normalized expression levels

ImmGenn on GEO

  • All ImmGen data is available on GEO at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907
  • The entire ImmGen project is represented as GSE15907
    • GSM files representing individual microarray data sets are also hosted and are linked on the GSE page
    library(Biobase); library(GEOquery); library(org.Mm.eg.db)
    gse <- getGEO("GSE15907")
    gseES <- gse[[1]] #getGEO actually reads in data as a one element list that must be subsetted
    gseES
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 24922 features, 653 samples
    ##   element names: exprs
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM399362 GSM399363 ... GSM2176423 (653 total)
    ##   varLabels: title geo_accession ... relation.1 (39 total)
    ##   varMetadata: labelDescription
    ## featureData
    ##   featureNames: 10344620 10344622 ... 10608630 (24922 total)
    ##   fvarLabels: ID GB_LIST ... category (12 total)
    ##   fvarMetadata: Column Description labelDescription
    ## experimentData: use 'experimentData(object)'
    ## Annotation: GPL6246
    • getGEO default is to read GSE in as an ExpressionSet
      • This can be changed to native GEOquery class with GSEMatrix=FALSE argument

ExpressionSet object in Bioconductor

  • ExpressionSets are a general class in Bioconductor used to link many aspects of expression data together
  • There are several broad components of an ExpressionSet
    • assayData - the matrix of expression data, can be accessed by exprs()

      exprs(gseES)[1:5,1:5]
      ##          GSM399362 GSM399363 GSM399364 GSM399365 GSM399366
      ## 10344620   19.8684   18.1856   18.2727   18.3073   17.7811
      ## 10344622  313.1524  249.7509  356.3565  223.6505  278.3954
      ## 10344624  321.6448  288.0079  254.8980  293.6604  376.0210
      ## 10344633  843.8224 1133.3558  731.9925  666.4584  778.6256
      ## 10344637  201.9093  229.7096  163.1982  154.7398  178.3030
    • phenoData - an AnnotatedDataFrame of variables describing the phenotype of each sample, can be accessed by pData()

      names(pData(gseES))  #also varLabels(gseES)
      ##  [1] "title"                   "geo_accession"
      ##  [3] "status"                  "submission_date"
      ##  [5] "last_update_date"        "type"
      ##  [7] "channel_count"           "source_name_ch1"
      ##  [9] "organism_ch1"            "characteristics_ch1"
      ## [11] "characteristics_ch1.1"   "characteristics_ch1.2"
      ## [13] "characteristics_ch1.3"   "characteristics_ch1.4"
      ## [15] "characteristics_ch1.5"   "treatment_protocol_ch1"
      ## [17] "molecule_ch1"            "extract_protocol_ch1"
      ## [19] "label_ch1"               "label_protocol_ch1"
      ## [21] "taxid_ch1"               "hyb_protocol"
      ## [23] "scan_protocol"           "description"
      ## [25] "data_processing"         "platform_id"
      ## [27] "contact_name"            "contact_laboratory"
      ## [29] "contact_department"      "contact_institute"
      ## [31] "contact_address"         "contact_city"
      ## [33] "contact_state"           "contact_zip/postal_code"
      ## [35] "contact_country"         "supplementary_file"
      ## [37] "data_row_count"          "relation"
      ## [39] "relation.1"
      sampleNames(gseES)[1:5] #equivalent to geo_accession column of pData
      ## [1] "GSM399362" "GSM399363" "GSM399364" "GSM399365" "GSM399366"
    • featureData - an AnnotatedDataFrame of variables describing each probe, can be accessed by fData()

      names(fData(gseES))
      ##  [1] "ID"              "GB_LIST"         "SPOT_ID"
      ##  [4] "seqname"         "RANGE_GB"        "RANGE_STRAND"
      ##  [7] "RANGE_START"     "RANGE_STOP"      "total_probes"
      ## [10] "gene_assignment" "mrna_assignment" "category"
      featureNames(gseES)[1:5] #equivalent to ID column of fData
      ## [1] "10344620" "10344622" "10344624" "10344633" "10344637"
    • annotation - information about the expression platform used (i.e. how the probe set can be annotated), typically a GPL file name
    • experimentData - text description of the expression set, typically in “MIAME” format

Annotating ImmGen data

  • Many GEO databases contain annotation where probeIDs are clearly correlated to individual genes
  • However in ImmGen GSE15907, annotation data is in a more complicated format
  • Accession numbers are contained in GB_LIST column of featureData
    • For each probeID, GB_LIST has a string of accession numbers corresponding to each database that the associated gene has an entry in
    • The first entry is typically for NCBI’s GeneBank DB (e.g “NM_…”)
    • However, some probeIDs will even have multiple accession numbers from the same database
      • This usually indicates alternative splice forms that a given probeID can’t distinguish between
    as.character(head(fData(gseES)$GB_LIST))[1:3]
    ## [1] ""                   "NM_024221,BC094468" "NM_008866,BC013536"
  • gene_assignment contains string of annotation for each accession number available for a given probe
    • However, this string can be hard to parse
    as.character(head(fData(gseES)$gene_assignment))[1:3]
    ## [1] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431"
    ## [2] "ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431 /// ENSMUST00000097833 // Gm10568 // predicted gene 10568 // --- // 100038431"
    ## [3] "NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777"
  • Alternative strategy would be to first available GeneBank accession number for a probeID and use that as a key to query and annotation database

    library(org.Mm.eg.db)
    #Separate each GB_LIST entry into a vector of accession numbers, contained as elements in a list
    gbList <- strsplit(as.character(fData(gseES)$GB_LIST), ",")
    #Find the first GeneBank ID in each vector element of the list
    gb <- sapply(gbList, function(x) (x[grepl("NM", x)])[1])
    #Uses the org.Mm.eg.db to lookup probes by GeneBank accession number and return a gene symbol
    sym <- select(org.Mm.eg.db, keys=gb, columns=c("SYMBOL"), keytype=c("ACCNUM"))
    #Adds the gene symbol to the phenoData
    fData(gseES) <- cbind(fData(gseES), sym)
    head(fData(gseES)[,c("GB_LIST", "ACCNUM", "SYMBOL")])
    ##                                               GB_LIST    ACCNUM  SYMBOL
    ## 10344620                                                   <NA>    <NA>
    ## 10344622                           NM_024221,BC094468 NM_024221    Pdhb
    ## 10344624                           NM_008866,BC013536 NM_008866  Lypla1
    ## 10344633 NM_011541,NM_001159750,NM_001159751,BC083127 NM_011541   Tcea1
    ## 10344637                           NM_133826,BC009154 NM_133826 Atp6v1h
    ## 10344653                             NM_011011,L11065 NM_011011   Oprk1

Pre-processing ImmGen Data

  • The expression data in the ImmGen GSE file needs to be log-transformed

    exprs(gseES) <- log(exprs(gseES))
  • However, ImmGen GSE file is already pre-normalized

    library(lumi)
    boxplot(gseES)

Differential gene expression

Theory of the Limma package

  • Differential gene expression is a multiple testing problem
    • The more genes you test for differential expression, the more likely you are to find a gene that meets a significance threshold by chance
    • Normal approach would be to control the family wise error rate with a false discovery rate correction
      • E.g. Bonferonni or Benjamini-Hochberg
    • However a problem still remains that genes with smaller expression levels will have smaller standard errors and a a higher chance of acheiving significance
  • Two fundamental components of the Limma methodology
    1. Use a linear model to describe the expression of each gene in terms of all different sample conditions
      • \(expression_{gene} = /alpha_0 * condition_1 + ... /alpha_n * condition_n + error\)
    2. Use empirical Bayesian analysis to “shrink” standard error of any gene toward the mean of all gene variance
      • Having a very small or very large SE is unlikely, so correct the SE based on its likelihood
  • Thus, Limma “borrows” power from the population of genes in the statistical test of a single gene

Analysis of Two-Group comparisons

  • Subsetting ImmGen expressionSet to just to types of samples
    • Comparing the splenic CD4+Foxp3- and CD4+Foxp3+CD25+ populations
    gseES$population <- sapply(strsplit(as.character(gseES$title),"#"), function(x) x[1])
    gseES.sub <- gseES[, gseES$population == "T.4FP3-.Sp" | gseES$population == "T.4FP3+25+.Sp"]
    show(gseES.sub)
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 24922 features, 5 samples
    ##   element names: exprs
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM399365 GSM399366 ... GSM605768 (5 total)
    ##   varLabels: title geo_accession ... population (40 total)
    ##   varMetadata: labelDescription
    ## featureData
    ##   featureNames: 10344620 10344622 ... 10608630 (24922 total)
    ##   fvarLabels: ID GB_LIST ... SYMBOL (14 total)
    ##   fvarMetadata: Column Description labelDescription
    ## experimentData: use 'experimentData(object)'
    ## Annotation: GPL6246
  • Creating the model matrix

    samples <- gseES.sub$population
    design <- model.matrix(~factor(samples))
    lv <- levels(factor(samples))
    colnames(design) <- c(lv[1], paste0(lv[1],"_",lv[2]))
    design
    ##   T.4FP3-.Sp T.4FP3-.Sp_T.4FP3+25+.Sp
    ## 1          1                        1
    ## 2          1                        1
    ## 3          1                        0
    ## 4          1                        0
    ## 5          1                        0
    ## attr(,"assign")
    ## [1] 0 1
    ## attr(,"contrasts")
    ## attr(,"contrasts")$`factor(samples)`
    ## [1] "contr.treatment"
  • Using limma to calculate differential gene expression

    library(limma)
    fit <- lmFit(gseES.sub, design)
    fit <- eBayes(fit)
  • Listing top genes

    topTable(fit, coef = 2, n=10, adjust="BH", genelist = fit$genes$SYMBOL, sort.by="logFC", p.value = 0.05)
    ##              ID     logFC  AveExpr         t      P.Value    adj.P.Val
    ## 10469278  Il2ra  4.206739 5.335222  40.02674 1.806074e-14 4.501097e-10
    ## 10556528  Pde3b -3.598917 5.276919 -28.72939 1.050463e-12 6.544912e-09
    ## 10555174 Lrrc32  3.388296 5.738926  26.28288 3.108976e-12 1.549638e-08
    ## 10430344  Il2rb  3.347476 6.325675  29.11076 8.942440e-13 6.544912e-09
    ## 10346790  Ctla4  3.284915 5.252286  22.81147 1.736545e-11 4.857526e-08
    ## 10355312  Ikzf2  3.180292 5.496507  30.22917 5.641863e-13 6.544912e-09
    ## 10598292  Foxp3  3.092576 5.595011  23.17289 1.435440e-11 4.857526e-08
    ## 10504757   <NA>  2.929929 5.200110  24.70969 6.585971e-12 2.735593e-08
    ## 10576639   Nrp1  2.894591 5.002522  19.96351 8.686259e-11 1.546278e-07
    ## 10358408   Rgs1  2.852070 4.332895  22.72988 1.813551e-11 4.857526e-08
    ##                 B
    ## 10469278 21.21208
    ## 10556528 18.56722
    ## 10555174 17.75013
    ## 10430344 18.68474
    ## 10346790 16.37101
    ## 10355312 19.01542
    ## 10598292 16.52840
    ## 10504757 17.16037
    ## 10576639 14.99778
    ## 10358408 16.33499
  • Heatmap of top genes

    library(gplots)
    selected <- p.adjust(fit$p.value[, 2]) <0.05
    gseES.sub.selected <- gseES.sub[selected, ]
    heatmap.2(exprs(gseES.sub.selected),
              trace = "none",
              margins = c(10,10),
              labCol = gseES.sub.selected$population,
              cexCol=1.5)

  • Volcano plot of differential gene expression

    volcanoplot(fit, coef=2, highlight=10, names=fit$genes$SYMBOL)

  • Finding results for a specific gene

    allGenes <- topTable(fit, coef = 2, n=Inf, adjust="BH", genelist = fit$genes$SYMBOL)
    allGenes[(allGenes$ID == "Foxp3") & !is.na(allGenes$ID),]
    ##             ID    logFC  AveExpr        t     P.Value    adj.P.Val       B
    ## 10598292 Foxp3 3.092576 5.595011 23.17289 1.43544e-11 4.857526e-08 16.5284

Analysis of Many-Group comparisons

Working with raw ImmGen sample data

Getting GSM files

  • The ImmGen GSE file contains expression data from all ImmGen experiments
  • However, this means that all individual samples have already been normalized prior to inclusion in the GSE
  • You may want to apply your own pre-processing in order to best compare ImmGen data to your own expression data
  • This will require downloading the component GSM files from the ImmGen GSE
  • Getting all GSM files
    • From the GSE file, it is simple to download all associated GSM files
    • However, for the ImmGen data base, this is a lot of data and will take a while to download
    • May not be necessary if you are only interested in a few particular samples
    getGEOSuppFiles("GSE15907") #2.7Gb zipped file
  • Getting specific GSM files
    • Alternative, you can determine which specific GSM files you want and download them specifically
    sample <- "T.4+8int.Th#1"
    gsm <- as.character(pData(gseES)$geo_accession[pData(gseES)$title == sample]) #Finds the gsm for the indicated sample
    gsm <- getGEO(gsm) #Downloads gsm from GEO 

Setting up GSM files

  • If all supporting GSM files were downloaded

    untar("GSE15907/GSE15907_RAW.tar", exdir="data")
#Required packages
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("GEOquery", "org.Mm.eg.db", "limma"))

Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

https://www.bioconductor.org/help/course-materials/2009/BioC2009/labs/limma/limma.pdf