Required packages

  • library(HDMD) #Contains Atchley factor data
    library(dplyr); library(reshape2); library(tidyr) #Packages for data cleaning and shaping
    library(vegan) #Contains similarity functions
    library(ggplot2) #Plotting package
    library(gplots); library(RColorBrewer) #Heatmap packages

Background

Components of the immune system

  • The immune system is made up of several different types of cells, each performing a specialized function in the body’s defense against infective pathogens
    • Most of these cell types circulate in the blood and are collectively known as white blood cells
  • The adaptive immune system is a component of the immune system that is able to recognize new pathogens, and coordinate an a more rapid and robust memory response to the same pathogen if it is encountered again in the future
    • This phenomenon is the mechanism underlying the efficacy of vaccination
  • Lymphocytes, including T-cells and B-cells are immune cells that make up the majority of the adaptive immune system

How T-cells contibute to adaptive immunity

  • T-cells recognize pathogens through their T-cell receptor (TCR), a protein expressed on their cellular surface
  • The TCR protein is encoded by the Tcr gene, which undergoes a unique genetic process in each T-cell.
  • During the development of each T-cell, several components of the Tcr gene are swapped and edited in a process known as VDJ recombination to produce an entirely unique TCR molecule
  • Thus, each T cell is has the potential to recognize a different pathogen due to its unique TCR sequence
    • It is estimated that humans can theoretically generate 1015 different TCR sequences
  • When a T-cell encounters a pathogen that can interact with its TCR, it clonally expands in to many identical cellular clones which direct the immune response against that pathogen
  • When the pathogen is eliminated, a portion of these expanded cells remain, leaving a pool of cells to respond if the pathogen should be encountered again
    • This pool of cells, rather than a single cell, underlies the more rapid and robust nature of the secondary immune response

The problem

  • The dynamics of an immune response can be visualized by changes in the TCR repertoire
    • A TCR repertoire is the combined set of TCR sequences from all T cells in an organism
    • Changes in the frequencies of specific TCR sequences may correspond to pathogen-stimulated proliferation of the cells that expressed them.
    • In other words, TCRs that are derived from a T-cell that can recognize a particular pathogen will make up a larger component of the TCR repertoire
  • However, with a theoretical population of 1015 different sequences, it can be very difficult to visualize changes in the frequencies of a handful of specific sequences
    • Experimentally, this can be mitigated by artifically reducing the total number of possible sequences in model organisms
  • Moreover, comparing frequencies of TCR sequences do not account for similarities in the amino acid structure between different sequences
  • Thus, interpretation of the TCR repertoire could be improved by:
    1. Evaluating the full TCR repertoire, rather than simplified model systems
    2. Factoring in sequence similarity in addition to differences in sequence frequency

Atchley Factors

  • In the paper Atchley WR PNAS 2005, investigators applied the PROMAX factor analysis algorithm to reduce approximately 500 different amino acid properties into 5 multi-dimensional factors
    • Thus the overall biochemical properties of any amino acid can be expressed as a sequence of 5 factor values
  • Each of the 5 factors roughly corresponds to several general biological characteristics
    • pah - polarity, accessibility, and hydrophobicity component
    • pss - propensity for secondary structure component
    • ms - molecular size component
    • cc - codon composition component
    • ec - electrostatic charge component
    AAMetric.Atchley #Object in HDMD package that stores amino acid matrix of Atchley factors. Each row is a single amino acid
    ##           pah         pss         ms         cc          ec
    ## A -0.59145974 -1.30209266 -0.7330651  1.5703918 -0.14550842
    ## C -1.34267179  0.46542300 -0.8620345 -1.0200786 -0.25516894
    ## D  1.05015062  0.30242411 -3.6559147 -0.2590236 -3.24176791
    ## E  1.35733226 -1.45275578  1.4766610  0.1129444 -0.83715681
    ## F -1.00610084 -0.59046634  1.8909687 -0.3966186  0.41194139
    ## G -0.38387987  1.65201497  1.3301017  1.0449765  2.06385566
    ## H  0.33616543 -0.41662780 -1.6733690 -1.4738898 -0.07772917
    ## I -1.23936304 -0.54652238  2.1314349  0.3931618  0.81630366
    ## K  1.83146558 -0.56109831  0.5332237 -0.2771101  1.64762794
    ## L -1.01895162 -0.98693471 -1.5046185  1.2658296 -0.91181195
    ## M -0.66312569 -1.52353917  2.2194787 -1.0047207  1.21181214
    ## N  0.94535614  0.82846219  1.2991286 -0.1688162  0.93339498
    ## P  0.18862522  2.08084151 -1.6283286  0.4207004 -1.39177378
    ## Q  0.93056541 -0.17926549 -3.0048731 -0.5025910 -1.85303476
    ## R  1.53754853 -0.05472897  1.5021086  0.4403185  2.89744417
    ## S -0.22788299  1.39869991 -4.7596375  0.6701745 -2.64747356
    ## T -0.03181782  0.32571153  2.2134612  0.9078985  1.31337035
    ## V -1.33661279 -0.27854634 -0.5440132  1.2419935 -1.26225362
    ## W -0.59533918  0.00907760  0.6719274 -2.1275244 -0.18358096
    ## Y  0.25999617  0.82992312  3.0973596 -0.8380164  1.51150958

Analysis

Overview of approach

  1. Break down all TCR protein sequences into tuples of three amino acids (or any other p-length tuple)
  2. Create a vector composed of the 5-component Atchley factors for each amino acid in the tuple (i.e. a 15 length vector)
  3. Using K-means clustering, group those tuples into a subset of clusters withing 15-dimensional space
    • Assigning a particular tuple to a particular cluster constitutes the codebook
  4. For each experimental sample, determine how many times a particular tuple appears in it’s TCR repertoire
  5. Convert these tuples to the corresponding cluster in the codebook to determine how many times a particular cluster is represented in a sample’s TCR repertoire
  6. Compare samples based on how frequently a particular cluster is represented within its TCR repertoire

Reading in sample data

  • Sample data includes TCR sequences from mice either immunized with the protein OVA or a control

    df <- read.csv("SampleData.csv", header = T, stringsAsFactors = F)
    head(df)
    ##                    Sequence             File        TRAV TRAV_Score
    ## 1 M_AL-aay20a07.g1.seq1.out M_AL-aay20a07.g1 TRAV14-1*01       1345
    ## 2 M_AL-aay20a09.g1.seq1.out M_AL-aay20a09.g1 TRAV14-1*01       1345
    ## 3 M_AL-aay20b04.g1.seq1.out M_AL-aay20b04.g1 TRAV14-1*01       1345
    ## 4 M_AL-aay20b06.g1.seq1.out M_AL-aay20b06.g1 TRAV14-1*01       1345
    ## 5 M_AL-aay20b07.g1.seq1.out M_AL-aay20b07.g1 TRAV14-3*01       1345
    ## 6 M_AL-aay20b11.g1.seq1.out M_AL-aay20b11.g1 TRAV14-3*01       1345
    ##        TRAJ TRAJ_Score CDR3_Length                       CDR3_DNA_Sequence
    ## 1 TRAJ15*01        273          11       gcagcaatgcaccagggaggcagagctctgata
    ## 2 TRAJ15*01        282          11       gcagcaatccaccagggaggcagagctctgata
    ## 3 TRAJ15*01        282          11       gcagcaatccaccagggaggcagagctctgata
    ## 4  TRAJ9*01        258          10          gcagcaatggacatgggctacaaacttacc
    ## 5 TRAJ13*01        249          11       gcagcaagtcagggttctgggacttaccagagg
    ## 6  TRAJ4*01        211          13 gcagcaagggggttatctggtagcttcaataagttgacc
    ##   CDR3_Amino_Acid_Sequence       Source Phenotype Experiment_Number
    ## 1              AAMHQGGRALI RIPOVAnegNON        G-               HL8
    ## 2              AAIHQGGRALI RIPOVAnegNON        G-               HL8
    ## 3              AAIHQGGRALI RIPOVAnegNON        G-               HL8
    ## 4               AAMDMGYKLT RIPOVAnegNON        G-               HL8
    ## 5              AASQGSGTYQR RIPOVAnegNON        G-               HL8
    ## 6            AARGLSGSFNKLT RIPOVAnegNON        G-               HL8
    ##    Date_Exp  Date_Seq Import_ID
    ## 1 4/22/2008 6/24/2008       779
    ## 2 4/22/2008 6/24/2008       779
    ## 3 4/22/2008 6/24/2008       779
    ## 4 4/22/2008 6/24/2008       779
    ## 5 4/22/2008 6/24/2008       779
    ## 6 4/22/2008 6/24/2008       779

Cleaning data

  • The following is standard data cleaning that includes: filtering specific populations of data specific to this experiment, renaming variables, creating new variables, filtering out poor quality data

    df <- df %>%
      filter(Source %in% c("RIPOVAnegNON", "RIPOVAnegPEP", "RIPOVAposNON", "RIPOVAposPEP", "OVAimm NON", "OVAimm PEP"),
             Phenotype == "G+") %>%
      mutate(exptCondition = grepl("PEP", .$Source),
             exptCondition = as.character(exptCondition),
             exptCondition = recode(exptCondition, "TRUE" = "Immunized", "FALSE" = "Naive")) %>%
      mutate(cdr3 = CDR3_Amino_Acid_Sequence, group = Source, mouseNum = Experiment_Number) %>%
      select(cdr3, mouseNum, exptCondition) %>%
      mutate(cdr3 = substring(cdr3, 3)) %>%
      filter(nchar(cdr3) > 5, !grepl("X", cdr3)) %>%
      filter(sapply(as.list(.$cdr3), function(x) all(unlist(strsplit(x, "")) %in% rownames(AAMetric.Atchley))))
    head(df)
    ##           cdr3 mouseNum exptCondition
    ## 1     SRGGRALI      HL8         Naive
    ## 2    RGQGGRALI      HL8         Naive
    ## 3     SRGGRALI      HL8         Naive
    ## 4     SAGNYQLI      HL8         Naive
    ## 5     FNYAQGLT      HL8         Naive
    ## 6 SGWGQGGSAKLI      HL8         Naive
    • cdr3 - the TCR amino acid sequence relevant to pathogen recognition
    • mouseNum - the replicate identifier
    • exptCondition - whether this replicate was in the immunized or control group

Reshaping data

  • Note each row in the above data represents a single data point (i.e. a signle DNA molecule read by the sequencer). Each sequence can occur multiple times, if multiple copies of that sequence are observed
  • After reshaping the data, each row represents a particular sequence, with each column representing a particular sample. The number in each cell represents the number of times that sequence is found in that sample

    df <- df %>%
      group_by_all() %>%
      summarise(numReads = length(cdr3))
    df <- dcast(df, cdr3 ~ mouseNum + exptCondition, fun.aggregate = sum, value.var = "numReads")
    head(df)
    ##          cdr3 HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive
    ## 1      AGEQAN              0          0              0          0
    ## 2      AGNKLT              4          6              6          3
    ## 3   AGNMGYKLT              0          0              0          2
    ## 4    ALITTCFT              0          0              0          0
    ## 5    ANSGTYQR              0          0              0          0
    ## 6 APLNYGNEKIT              0          0              0          0
    ##   HL15_Immunized HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized
    ## 1              0          0              0          0              1
    ## 2              1          5              3         25              5
    ## 3              0          0              0          0              0
    ## 4              0          0              0          0              2
    ## 5              0          0              0          0              0
    ## 6              0          0              0          0              0
    ##   HL19_Naive HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
    ## 1          0             0         0             0         0
    ## 2          4             1         4            13        17
    ## 3          0             0         0             0         0
    ## 4          0             0         0             0         0
    ## 5          1             0         1             0         0
    ## 6          0             0         0             0         1

Building functions

  • all.ptuples - this will convert a string of amino acids into a vector of amino acid triplets (or any other p-length tuple) that move down entire sequence by one position each
    • all.ptuples <- function(cdr3, p = 3){
        inc <- p-1
        position <- 1:(nchar(cdr3)-inc)
        sapply(position, function(x) substring(cdr3, x, x+inc))
      }
      
      all.ptuples("ABCDE")
      ## [1] "ABC" "BCD" "CDE"
  • ptuple2atchley - this will take a single amino acid triplet (or any other p-length tuple) and create a vector of its corresponding Atchley factors
    • Each amino acid is represented by 5 Atchley factors, so a triplet tuple will result in a 15 length vector (e.g. p=4 length amino acid tuples would result in a 20 length vector)

      ptuple2atchley <- function(ptuple){
        amino <- as.list(strsplit(ptuple,"")[[1]]) #Each amino acid to a list element
        atchley <- unlist(lapply(amino, function(x) AAMetric.Atchley[x,])) #Atchley factors for each AA into vector
        len <- nchar(ptuple) #Number of AAs in ptuple
        names(atchley) <- paste(rep(1:len, each = 5), names(atchley), sep = ".") #Names for each element in actchley vector
        return(atchley)
      }
      
      ptuple2atchley("AGA")
      ##      1.pah      1.pss       1.ms       1.cc       1.ec      2.pah
      ## -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
      ##      2.pss       2.ms       2.cc       2.ec      3.pah      3.pss
      ##  1.6520150  1.3301017  1.0449765  2.0638557 -0.5914597 -1.3020927
      ##       3.ms       3.cc       3.ec
      ## -0.7330651  1.5703918 -0.1455084
  • tcr2atchley - determines the set of amino acids triplets (or any other p-length tupe) from a single TCR sequence and creates a data frame of the corresponding atchley factor vectors for each tuple
    • #Converts a sequence of amino acids in to p-length ptuples to create a matrix of atchley vectors with a row for each ptuple
      tcr2atchley <- function(seq, p = 3){
          ptuples <- as.list(all.ptuples(seq, p = p))
          all <- t(sapply(ptuples, function(x) ptuple2atchley(x)))
          rownames(all) <- ptuples
          return(all)
      }
      
      tcr2atchley("AGAV")
      ##          1.pah     1.pss       1.ms     1.cc       1.ec      2.pah
      ## AGA -0.5914597 -1.302093 -0.7330651 1.570392 -0.1455084 -0.3838799
      ## GAV -0.3838799  1.652015  1.3301017 1.044976  2.0638557 -0.5914597
      ##         2.pss       2.ms     2.cc       2.ec      3.pah      3.pss
      ## AGA  1.652015  1.3301017 1.044976  2.0638557 -0.5914597 -1.3020927
      ## GAV -1.302093 -0.7330651 1.570392 -0.1455084 -1.3366128 -0.2785463
      ##           3.ms     3.cc       3.ec
      ## AGA -0.7330651 1.570392 -0.1455084
      ## GAV -0.5440132 1.241993 -1.2622536
    • Each row represents a single tuple from the original TCR

Creating the codebook

  • The biochemical similarity of two tuples can be represented by how close together they lie together in multi-dimensional space that reflects each number in their vector of Atchley factors (here being 15 dimensional space)
  • The proximity of multiple tuples can be reflected in the clusters they form in this multi-dimensional space
  • Such clusters can be identified with K-means clustering
  • The codebook is a data frame that associates each tuple in a data set to specific cluster of tuples
  • Representing tuples as cluster members in later analysis is a way to reduce the dimensionality of the data while retaining information about biochemical similarity

    p <- 3 #tuple length
    df <- df %>% filter(nchar(cdr3) > p) #select sequences that can generate at least 2 tuples
    
    #Create data frame of all possible tuples in data set with their corresponding Atchley vectors
    df.atchley <- as.data.frame(do.call("rbind", lapply(as.list(df$cdr3), function(x) tcr2atchley(x, p))))
    head(df.atchley)
    ##          1.pah      1.pss       1.ms       1.cc       1.ec      2.pah
    ## AGE -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
    ## GEQ -0.3838799  1.6520150  1.3301017  1.0449765  2.0638557  1.3573323
    ## EQA  1.3573323 -1.4527558  1.4766610  0.1129444 -0.8371568  0.9305654
    ## QAN  0.9305654 -0.1792655 -3.0048731 -0.5025910 -1.8530348 -0.5914597
    ## AGN -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
    ## GNK -0.3838799  1.6520150  1.3301017  1.0449765  2.0638557  0.9453561
    ##          2.pss       2.ms       2.cc       2.ec      3.pah      3.pss
    ## AGE  1.6520150  1.3301017  1.0449765  2.0638557  1.3573323 -1.4527558
    ## GEQ -1.4527558  1.4766610  0.1129444 -0.8371568  0.9305654 -0.1792655
    ## EQA -0.1792655 -3.0048731 -0.5025910 -1.8530348 -0.5914597 -1.3020927
    ## QAN -1.3020927 -0.7330651  1.5703918 -0.1455084  0.9453561  0.8284622
    ## AGN  1.6520150  1.3301017  1.0449765  2.0638557  0.9453561  0.8284622
    ## GNK  0.8284622  1.2991286 -0.1688162  0.9333950  1.8314656 -0.5610983
    ##           3.ms       3.cc       3.ec
    ## AGE  1.4766610  0.1129444 -0.8371568
    ## GEQ -3.0048731 -0.5025910 -1.8530348
    ## EQA -0.7330651  1.5703918 -0.1455084
    ## QAN  1.2991286 -0.1688162  0.9333950
    ## AGN  1.2991286 -0.1688162  0.9333950
    ## GNK  0.5332237 -0.2771101  1.6476279
    #Determine k clusters of atchley factors in vector-length space and assign each ptuple to a cluster
    set.seed(1) #For code reproducibility
    k <- 100 #number of clusters
    clusters <- kmeans(df.atchley, k) #finding k-means
    codebook <- data.frame(tuple = unlist(lapply(as.list(df$cdr3), function(x) all.ptuples(x, p))),
                           "cluster" = clusters$cluster,
                           stringsAsFactors = F) #Extract the cluster number for each tuple
    codebook <- codebook[!duplicated(codebook), ] #Displays each tuple only once
    head(codebook)
    ##   tuple cluster
    ## 1   AGE      47
    ## 2   GEQ      63
    ## 3   EQA      65
    ## 4   QAN      11
    ## 5   AGN      47
    ## 6   GNK      34
  • It is important that k is not so large as to produce many clusters containing only 1 tuple sequence
    • In order for a cluster to represent a group of biochemically similar tuples, there must be multiple tuples inside that cluster
  • Possible overclustering can be evaluated by visualizing the total tuples per cluster

    #Red line represents mean tuples per cluster
    ggplot(codebook, aes(x = cluster)) + geom_hline(yintercept = mean((table(codebook$cluster))), color = "red", size = 2) + geom_bar(fill = "grey", color = "black", width = 0.5)  + scale_y_continuous(minor_breaks = 1:max(table(codebook$cluster))) + theme_bw()

Coding the data

  • Using the codebook, we can now calculate how often a particular cluster is represented in each sample
  • The first step is to convert each TCR sequence to all possible tuples and tally up the total count of each tupe within each sample

    #Helper function that converts each sequence into ptuples
    #and gives each ptuple the same count as the original full sequence in a particular sample
    tupleCount <- function(df.vector, p=3){
      tuples <- all.ptuples(as.character(df.vector$cdr3), p=p)
      tuples.df <- data.frame("tuple" = tuples, apply(df.vector[-1], 2, rep, length(tuples)), row.names = c(), stringsAsFactors = F)
      return(tuples.df)
    }
    
    df[1,]; tupleCount(df[1,], p=p)
    ##     cdr3 HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive
    ## 1 AGEQAN              0          0              0          0
    ##   HL15_Immunized HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized
    ## 1              0          0              0          0              1
    ##   HL19_Naive HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
    ## 1          0             0         0             0         0
    ##   tuple HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
    ## 1   AGE              0          0              0          0              0
    ## 2   GEQ              0          0              0          0              0
    ## 3   EQA              0          0              0          0              0
    ## 4   QAN              0          0              0          0              0
    ##   HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized HL19_Naive
    ## 1          0              0          0              1          0
    ## 2          0              0          0              1          0
    ## 3          0              0          0              1          0
    ## 4          0              0          0              1          0
    ##   HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
    ## 1             0         0             0         0
    ## 2             0         0             0         0
    ## 3             0         0             0         0
    ## 4             0         0             0         0
    #Break each TCR in data set into tuples and calculate the frequency of each tuple for each sample
    tuple.df <- do.call("rbind", lapply(setNames(split(df, seq(nrow(df))), rownames(df)), function(x) tupleCount(x, p=p)))
    tuple.df <- tuple.df %>%
      group_by(tuple) %>%
      summarise_all(sum) #tallys total count for each tuple within sample
    
    head(tuple.df)
    ## # A tibble: 6 x 15
    ##   tuple HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
    ##   <chr>          <int>      <int>          <int>      <int>          <int>
    ## 1 AAG                0          0              0          0              0
    ## 2 AAN                0          1              0          0              0
    ## 3 AAS                0          0              0          1              0
    ## 4 AAW                0          0              0          0              0
    ## 5 ADR                0          0              0          0              0
    ## 6 ADS                0          0              0          0              0
    ## # ... with 9 more variables: HL15_Naive <int>, HL17_Immunized <int>,
    ## #   HL17_Naive <int>, HL19_Immunized <int>, HL19_Naive <int>,
    ## #   HL8_Immunized <int>, HL8_Naive <int>, HL9_Immunized <int>,
    ## #   HL9_Naive <int>
  • The next step is to match up each tuple to its corresponding cluster by cross-referencing the above data with the codebook
  • Then the total occurences of each cluster within each sample can be tallied and converted to a frequency

    code.df <- tuple.df %>%
      left_join(codebook, by = "tuple") %>% #Match each tuple to its cluster
      select(-tuple) %>%
      group_by(cluster) %>%
      summarise_all(sum) %>% #Total count of each tuple per sample
      select(-cluster)%>%
      mutate_if(is.numeric, funs(. / sum(.))) #Convert sample count to frequency
    
    head(code.df)
    ## # A tibble: 6 x 14
    ##   HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
    ##            <dbl>      <dbl>          <dbl>      <dbl>          <dbl>
    ## 1       0.0136      0.00758        0.00776    0.0147         0
    ## 2       0.0235      0.0947         0.0264     0.0541         0.0114
    ## 3       0.000905    0              0          0              0
    ## 4       0.0145      0              0.0109     0.0319         0.00127
    ## 5       0.0190      0.102          0.0280     0.0504         0.0102
    ## 6       0.00271     0              0          0.00491        0
    ## # ... with 9 more variables: HL15_Naive <dbl>, HL17_Immunized <dbl>,
    ## #   HL17_Naive <dbl>, HL19_Immunized <dbl>, HL19_Naive <dbl>,
    ## #   HL8_Immunized <dbl>, HL8_Naive <dbl>, HL9_Immunized <dbl>,
    ## #   HL9_Naive <dbl>

Visualization of clustering

Some general plot styling

hmcol <- colorRampPalette(rev(brewer.pal(9, "RdBu")))(1000)
color.key <- c("Immunized" = brewer.pal(4, "Set1")[3],
               "Naive" = brewer.pal(4, "Set1")[4])
col.color <- color.key[sapply(strsplit(colnames(code.df), "_"), function(x) x[[2]])]

Relative cluster frequency per sample

heatmap.2(as.matrix(code.df),
          col = hmcol,
          trace = "none",
          margins = c(15,15), lhei = c(1,3),
          cexRow = 1, cexCol = 2,
          key.title = NA, key.xlab = "Cluster frequency",
          key.par = list(cex=1.2),
          ColSideColors = col.color)
legend(0.7,1.125,
       legend = names(color.key),
       col = color.key,
       lty= 1, lwd = 10,
       border=FALSE, bty="n", y.intersp = 0.8, cex=2, xpd=TRUE)

  • Notice that there are several clusters that have uniquely high representation in the immunized samples. Similarly, the naive samples also have several clusters with uniquely high representation


Similarity of cluster frequency between samples

heatmap.2(as.matrix(vegdist(t(code.df), method = "horn")),
          col = hmcol,
          trace = "none",
          margins = c(15,15), lhei = c(1,3),
          cexRow = 2, cexCol = 2,
          key.title = NA, key.xlab = "MH distance",
          key.par = list(cex=1.2),
          ColSideColors = col.color)
legend(0.7,1.125,
       legend = names(color.key),
       col = color.key,
       lty= 1, lwd = 10,
       border=FALSE, bty="n", y.intersp = 0.8, cex=2, xpd=TRUE)

  • The Morisita Horn index can be used to compare the relative cluster representation between different samples. The result is a distance index between 0 and 1, where 0 is identical and 1 is completely dissimilar
  • Using this index clearly discriminates between immunized and naive samples

Relative Atchley factor weight between groups experimental groups

  • To get a closer look at the biochemical TCR properties that contribute to the differential clustering of the immunized and naive group, we can evaluate the average Atchley factors of the most discriminating clusters
#Function to generate SE bars in final blot
se <- function(x) {
  s <- sd(x)/sqrt(length(x))
  m <- mean(x)
  data.frame("ymin" = m-s, "ymax" = m+s)
}

#Selects top unique clusters for each experimental condition
unique.clusters <- code.df %>%
  mutate(cluster = 1:k) %>%
  gather(mouseNum, freq, - cluster) %>%
  mutate(exptCondition = do.call(sapply, c(list(mouseNum), list(function(x) strsplit(x, "_")[[1]][2])))) %>%
  group_by(cluster, exptCondition) %>%
  summarise(freq = mean(freq)) %>%
  group_by(exptCondition) %>%
  arrange(desc(freq)) %>%
  top_n(7) %>% mutate(rank = 1:7) %>% select(-freq) %>%
  spread(exptCondition, cluster) %>%
  filter(!(Immunized %in% intersect(Immunized, Naive)),
         !(Naive %in% intersect(Immunized, Naive))) %>%
  select(Immunized, Naive) %>%
  gather(exptCondition, cluster)

#Determine the mean value of each Atchley factor for each amino acid position in the tuple between the two experimental groups
relative.factors <- df.atchley %>%
  mutate(tuple = row.names(.)) %>%
  left_join(codebook, by = "tuple") %>%
  left_join(unique.clusters, by = "cluster") %>%
  filter(!is.na(exptCondition)) %>%
  select(-tuple, -cluster) %>%
  gather(factor, value, - exptCondition) %>%
  mutate(position = do.call(sapply, c(list(factor), list(function(x) strsplit(x, "\\.")[[1]][1]))),
         factor = do.call(sapply, c(list(factor), list(function(x) strsplit(x, "\\.")[[1]][2]))))

ggplot(relative.factors, aes(x = "", y = value, group = exptCondition)) +
  stat_summary(fun.y = mean, aes(fill = exptCondition), position = "dodge", geom = "bar") +
  stat_summary(fun.data = se, geom = "errorbar", position = "dodge", width = 0.9, size = 1) +
  facet_grid(position ~ factor, switch = "both") +
  theme_bw() +
  theme(axis.text = element_blank(), axis.ticks = element_blank()) + xlab("Atchley factor component") + ylab("Amino acid position")

  • As described above, aach of the 5 Atchley factors roughly corresponds to several general biological characteristics
    • pah - polarity, accessibility, and hydrophobicity component
    • pss - propensity for secondary structure component
    • ms - molecular size component
    • cc - codon composition component
    • ec - electrostatic charge component