TCR Analysis with Atchley Factors - Hello. I'm Ben Solomon.

Required packages

library(HDMD) #Contains Atchley factor data
library(dplyr); library(reshape2); library(tidyr) #Packages for data cleaning and shaping
library(vegan) #Contains similarity functions
library(ggplot2) #Plotting package
library(gplots); library(RColorBrewer) #Heatmap packages

Background

Components of the immune system

The immune system is made up of several different types of cells, each performing a specialized function in the body’s defense against infective pathogens
- Most of these cell types circulate in the blood and are collectively known as white blood cells
The adaptive immune system is a component of the immune system that is able to recognize new pathogens, and coordinate an a more rapid and robust memory response to the same pathogen if it is encountered again in the future
- This phenomenon is the mechanism underlying the efficacy of vaccination
Lymphocytes, including T-cells and B-cells are immune cells that make up the majority of the adaptive immune system

How T-cells contibute to adaptive immunity

T-cells recognize pathogens through their T-cell receptor (TCR), a protein expressed on their cellular surface
The TCR protein is encoded by the Tcr gene, which undergoes a unique genetic process in each T-cell.
During the development of each T-cell, several components of the Tcr gene are swapped and edited in a process known as VDJ recombination to produce an entirely unique TCR molecule
Thus, each T cell is has the potential to recognize a different pathogen due to its unique TCR sequence
- It is estimated that humans can theoretically generate 10¹⁵ different TCR sequences
When a T-cell encounters a pathogen that can interact with its TCR, it clonally expands in to many identical cellular clones which direct the immune response against that pathogen
When the pathogen is eliminated, a portion of these expanded cells remain, leaving a pool of cells to respond if the pathogen should be encountered again
- This pool of cells, rather than a single cell, underlies the more rapid and robust nature of the secondary immune response

The problem

The dynamics of an immune response can be visualized by changes in the TCR repertoire
- A TCR repertoire is the combined set of TCR sequences from all T cells in an organism
- Changes in the frequencies of specific TCR sequences may correspond to pathogen-stimulated proliferation of the cells that expressed them.
- In other words, TCRs that are derived from a T-cell that can recognize a particular pathogen will make up a larger component of the TCR repertoire
However, with a theoretical population of 10¹⁵ different sequences, it can be very difficult to visualize changes in the frequencies of a handful of specific sequences
- Experimentally, this can be mitigated by artifically reducing the total number of possible sequences in model organisms
Moreover, comparing frequencies of TCR sequences do not account for similarities in the amino acid structure between different sequences
Thus, interpretation of the TCR repertoire could be improved by:
1. Evaluating the full TCR repertoire, rather than simplified model systems
2. Factoring in sequence similarity in addition to differences in sequence frequency

Atchley Factors

In the paper Atchley WR PNAS 2005, investigators applied the PROMAX factor analysis algorithm to reduce approximately 500 different amino acid properties into 5 multi-dimensional factors
- Thus the overall biochemical properties of any amino acid can be expressed as a sequence of 5 factor values

Each of the 5 factors roughly corresponds to several general biological characteristics

pah - polarity, accessibility, and hydrophobicity component
pss - propensity for secondary structure component
ms - molecular size component
cc - codon composition component
ec - electrostatic charge component

AAMetric.Atchley #Object in HDMD package that stores amino acid matrix of Atchley factors. Each row is a single amino acid

##           pah         pss         ms         cc          ec
## A -0.59145974 -1.30209266 -0.7330651  1.5703918 -0.14550842
## C -1.34267179  0.46542300 -0.8620345 -1.0200786 -0.25516894
## D  1.05015062  0.30242411 -3.6559147 -0.2590236 -3.24176791
## E  1.35733226 -1.45275578  1.4766610  0.1129444 -0.83715681
## F -1.00610084 -0.59046634  1.8909687 -0.3966186  0.41194139
## G -0.38387987  1.65201497  1.3301017  1.0449765  2.06385566
## H  0.33616543 -0.41662780 -1.6733690 -1.4738898 -0.07772917
## I -1.23936304 -0.54652238  2.1314349  0.3931618  0.81630366
## K  1.83146558 -0.56109831  0.5332237 -0.2771101  1.64762794
## L -1.01895162 -0.98693471 -1.5046185  1.2658296 -0.91181195
## M -0.66312569 -1.52353917  2.2194787 -1.0047207  1.21181214
## N  0.94535614  0.82846219  1.2991286 -0.1688162  0.93339498
## P  0.18862522  2.08084151 -1.6283286  0.4207004 -1.39177378
## Q  0.93056541 -0.17926549 -3.0048731 -0.5025910 -1.85303476
## R  1.53754853 -0.05472897  1.5021086  0.4403185  2.89744417
## S -0.22788299  1.39869991 -4.7596375  0.6701745 -2.64747356
## T -0.03181782  0.32571153  2.2134612  0.9078985  1.31337035
## V -1.33661279 -0.27854634 -0.5440132  1.2419935 -1.26225362
## W -0.59533918  0.00907760  0.6719274 -2.1275244 -0.18358096
## Y  0.25999617  0.82992312  3.0973596 -0.8380164  1.51150958

Analysis

Overview of approach

Break down all TCR protein sequences into tuples of three amino acids (or any other p-length tuple)
Create a vector composed of the 5-component Atchley factors for each amino acid in the tuple (i.e. a 15 length vector)
Using K-means clustering, group those tuples into a subset of clusters withing 15-dimensional space
- Assigning a particular tuple to a particular cluster constitutes the codebook
For each experimental sample, determine how many times a particular tuple appears in it’s TCR repertoire
Convert these tuples to the corresponding cluster in the codebook to determine how many times a particular cluster is represented in a sample’s TCR repertoire
Compare samples based on how frequently a particular cluster is represented within its TCR repertoire

Reading in sample data

Sample data includes TCR sequences from mice either immunized with the protein OVA or a control

df <- read.csv("SampleData.csv", header = T, stringsAsFactors = F)
head(df)

##                    Sequence             File        TRAV TRAV_Score
## 1 M_AL-aay20a07.g1.seq1.out M_AL-aay20a07.g1 TRAV14-1*01       1345
## 2 M_AL-aay20a09.g1.seq1.out M_AL-aay20a09.g1 TRAV14-1*01       1345
## 3 M_AL-aay20b04.g1.seq1.out M_AL-aay20b04.g1 TRAV14-1*01       1345
## 4 M_AL-aay20b06.g1.seq1.out M_AL-aay20b06.g1 TRAV14-1*01       1345
## 5 M_AL-aay20b07.g1.seq1.out M_AL-aay20b07.g1 TRAV14-3*01       1345
## 6 M_AL-aay20b11.g1.seq1.out M_AL-aay20b11.g1 TRAV14-3*01       1345
##        TRAJ TRAJ_Score CDR3_Length                       CDR3_DNA_Sequence
## 1 TRAJ15*01        273          11       gcagcaatgcaccagggaggcagagctctgata
## 2 TRAJ15*01        282          11       gcagcaatccaccagggaggcagagctctgata
## 3 TRAJ15*01        282          11       gcagcaatccaccagggaggcagagctctgata
## 4  TRAJ9*01        258          10          gcagcaatggacatgggctacaaacttacc
## 5 TRAJ13*01        249          11       gcagcaagtcagggttctgggacttaccagagg
## 6  TRAJ4*01        211          13 gcagcaagggggttatctggtagcttcaataagttgacc
##   CDR3_Amino_Acid_Sequence       Source Phenotype Experiment_Number
## 1              AAMHQGGRALI RIPOVAnegNON        G-               HL8
## 2              AAIHQGGRALI RIPOVAnegNON        G-               HL8
## 3              AAIHQGGRALI RIPOVAnegNON        G-               HL8
## 4               AAMDMGYKLT RIPOVAnegNON        G-               HL8
## 5              AASQGSGTYQR RIPOVAnegNON        G-               HL8
## 6            AARGLSGSFNKLT RIPOVAnegNON        G-               HL8
##    Date_Exp  Date_Seq Import_ID
## 1 4/22/2008 6/24/2008       779
## 2 4/22/2008 6/24/2008       779
## 3 4/22/2008 6/24/2008       779
## 4 4/22/2008 6/24/2008       779
## 5 4/22/2008 6/24/2008       779
## 6 4/22/2008 6/24/2008       779

Cleaning data

The following is standard data cleaning that includes: filtering specific populations of data specific to this experiment, renaming variables, creating new variables, filtering out poor quality data

df <- df %>%
  filter(Source %in% c("RIPOVAnegNON", "RIPOVAnegPEP", "RIPOVAposNON", "RIPOVAposPEP", "OVAimm NON", "OVAimm PEP"),
         Phenotype == "G+") %>%
  mutate(exptCondition = grepl("PEP", .$Source),
         exptCondition = as.character(exptCondition),
         exptCondition = recode(exptCondition, "TRUE" = "Immunized", "FALSE" = "Naive")) %>%
  mutate(cdr3 = CDR3_Amino_Acid_Sequence, group = Source, mouseNum = Experiment_Number) %>%
  select(cdr3, mouseNum, exptCondition) %>%
  mutate(cdr3 = substring(cdr3, 3)) %>%
  filter(nchar(cdr3) > 5, !grepl("X", cdr3)) %>%
  filter(sapply(as.list(.$cdr3), function(x) all(unlist(strsplit(x, "")) %in% rownames(AAMetric.Atchley))))
head(df)

##           cdr3 mouseNum exptCondition
## 1     SRGGRALI      HL8         Naive
## 2    RGQGGRALI      HL8         Naive
## 3     SRGGRALI      HL8         Naive
## 4     SAGNYQLI      HL8         Naive
## 5     FNYAQGLT      HL8         Naive
## 6 SGWGQGGSAKLI      HL8         Naive

cdr3 - the TCR amino acid sequence relevant to pathogen recognition
mouseNum - the replicate identifier
exptCondition - whether this replicate was in the immunized or control group

Reshaping data

Note each row in the above data represents a single data point (i.e. a signle DNA molecule read by the sequencer). Each sequence can occur multiple times, if multiple copies of that sequence are observed

After reshaping the data, each row represents a particular sequence, with each column representing a particular sample. The number in each cell represents the number of times that sequence is found in that sample

df <- df %>%
  group_by_all() %>%
  summarise(numReads = length(cdr3))
df <- dcast(df, cdr3 ~ mouseNum + exptCondition, fun.aggregate = sum, value.var = "numReads")
head(df)

##          cdr3 HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive
## 1      AGEQAN              0          0              0          0
## 2      AGNKLT              4          6              6          3
## 3   AGNMGYKLT              0          0              0          2
## 4    ALITTCFT              0          0              0          0
## 5    ANSGTYQR              0          0              0          0
## 6 APLNYGNEKIT              0          0              0          0
##   HL15_Immunized HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized
## 1              0          0              0          0              1
## 2              1          5              3         25              5
## 3              0          0              0          0              0
## 4              0          0              0          0              2
## 5              0          0              0          0              0
## 6              0          0              0          0              0
##   HL19_Naive HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
## 1          0             0         0             0         0
## 2          4             1         4            13        17
## 3          0             0         0             0         0
## 4          0             0         0             0         0
## 5          1             0         1             0         0
## 6          0             0         0             0         1

Building functions

all.ptuples - this will convert a string of amino acids into a vector of amino acid triplets (or any other p-length tuple) that move down entire sequence by one position each

all.ptuples <- function(cdr3, p = 3){
  inc <- p-1
  position <- 1:(nchar(cdr3)-inc)
  sapply(position, function(x) substring(cdr3, x, x+inc))
}

all.ptuples("ABCDE")

## [1] "ABC" "BCD" "CDE"

ptuple2atchley - this will take a single amino acid triplet (or any other p-length tuple) and create a vector of its corresponding Atchley factors

Each amino acid is represented by 5 Atchley factors, so a triplet tuple will result in a 15 length vector (e.g. p=4 length amino acid tuples would result in a 20 length vector)

ptuple2atchley <- function(ptuple){
  amino <- as.list(strsplit(ptuple,"")[[1]]) #Each amino acid to a list element
  atchley <- unlist(lapply(amino, function(x) AAMetric.Atchley[x,])) #Atchley factors for each AA into vector
  len <- nchar(ptuple) #Number of AAs in ptuple
  names(atchley) <- paste(rep(1:len, each = 5), names(atchley), sep = ".") #Names for each element in actchley vector
  return(atchley)
}

ptuple2atchley("AGA")

##      1.pah      1.pss       1.ms       1.cc       1.ec      2.pah
## -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
##      2.pss       2.ms       2.cc       2.ec      3.pah      3.pss
##  1.6520150  1.3301017  1.0449765  2.0638557 -0.5914597 -1.3020927
##       3.ms       3.cc       3.ec
## -0.7330651  1.5703918 -0.1455084

tcr2atchley - determines the set of amino acids triplets (or any other p-length tupe) from a single TCR sequence and creates a data frame of the corresponding atchley factor vectors for each tuple

#Converts a sequence of amino acids in to p-length ptuples to create a matrix of atchley vectors with a row for each ptuple
tcr2atchley <- function(seq, p = 3){
    ptuples <- as.list(all.ptuples(seq, p = p))
    all <- t(sapply(ptuples, function(x) ptuple2atchley(x)))
    rownames(all) <- ptuples
    return(all)
}

tcr2atchley("AGAV")

##          1.pah     1.pss       1.ms     1.cc       1.ec      2.pah
## AGA -0.5914597 -1.302093 -0.7330651 1.570392 -0.1455084 -0.3838799
## GAV -0.3838799  1.652015  1.3301017 1.044976  2.0638557 -0.5914597
##         2.pss       2.ms     2.cc       2.ec      3.pah      3.pss
## AGA  1.652015  1.3301017 1.044976  2.0638557 -0.5914597 -1.3020927
## GAV -1.302093 -0.7330651 1.570392 -0.1455084 -1.3366128 -0.2785463
##           3.ms     3.cc       3.ec
## AGA -0.7330651 1.570392 -0.1455084
## GAV -0.5440132 1.241993 -1.2622536

Each row represents a single tuple from the original TCR

Creating the codebook

The biochemical similarity of two tuples can be represented by how close together they lie together in multi-dimensional space that reflects each number in their vector of Atchley factors (here being 15 dimensional space)
The proximity of multiple tuples can be reflected in the clusters they form in this multi-dimensional space
Such clusters can be identified with K-means clustering
The codebook is a data frame that associates each tuple in a data set to specific cluster of tuples

Representing tuples as cluster members in later analysis is a way to reduce the dimensionality of the data while retaining information about biochemical similarity

p <- 3 #tuple length
df <- df %>% filter(nchar(cdr3) > p) #select sequences that can generate at least 2 tuples

#Create data frame of all possible tuples in data set with their corresponding Atchley vectors
df.atchley <- as.data.frame(do.call("rbind", lapply(as.list(df$cdr3), function(x) tcr2atchley(x, p))))
head(df.atchley)

##          1.pah      1.pss       1.ms       1.cc       1.ec      2.pah
## AGE -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
## GEQ -0.3838799  1.6520150  1.3301017  1.0449765  2.0638557  1.3573323
## EQA  1.3573323 -1.4527558  1.4766610  0.1129444 -0.8371568  0.9305654
## QAN  0.9305654 -0.1792655 -3.0048731 -0.5025910 -1.8530348 -0.5914597
## AGN -0.5914597 -1.3020927 -0.7330651  1.5703918 -0.1455084 -0.3838799
## GNK -0.3838799  1.6520150  1.3301017  1.0449765  2.0638557  0.9453561
##          2.pss       2.ms       2.cc       2.ec      3.pah      3.pss
## AGE  1.6520150  1.3301017  1.0449765  2.0638557  1.3573323 -1.4527558
## GEQ -1.4527558  1.4766610  0.1129444 -0.8371568  0.9305654 -0.1792655
## EQA -0.1792655 -3.0048731 -0.5025910 -1.8530348 -0.5914597 -1.3020927
## QAN -1.3020927 -0.7330651  1.5703918 -0.1455084  0.9453561  0.8284622
## AGN  1.6520150  1.3301017  1.0449765  2.0638557  0.9453561  0.8284622
## GNK  0.8284622  1.2991286 -0.1688162  0.9333950  1.8314656 -0.5610983
##           3.ms       3.cc       3.ec
## AGE  1.4766610  0.1129444 -0.8371568
## GEQ -3.0048731 -0.5025910 -1.8530348
## EQA -0.7330651  1.5703918 -0.1455084
## QAN  1.2991286 -0.1688162  0.9333950
## AGN  1.2991286 -0.1688162  0.9333950
## GNK  0.5332237 -0.2771101  1.6476279

#Determine k clusters of atchley factors in vector-length space and assign each ptuple to a cluster
set.seed(1) #For code reproducibility
k <- 100 #number of clusters
clusters <- kmeans(df.atchley, k) #finding k-means
codebook <- data.frame(tuple = unlist(lapply(as.list(df$cdr3), function(x) all.ptuples(x, p))),
                       "cluster" = clusters$cluster,
                       stringsAsFactors = F) #Extract the cluster number for each tuple
codebook <- codebook[!duplicated(codebook), ] #Displays each tuple only once
head(codebook)

##   tuple cluster
## 1   AGE      47
## 2   GEQ      63
## 3   EQA      65
## 4   QAN      11
## 5   AGN      47
## 6   GNK      34

It is important that k is not so large as to produce many clusters containing only 1 tuple sequence
- In order for a cluster to represent a group of biochemically similar tuples, there must be multiple tuples inside that cluster

Possible overclustering can be evaluated by visualizing the total tuples per cluster

#Red line represents mean tuples per cluster
ggplot(codebook, aes(x = cluster)) + geom_hline(yintercept = mean((table(codebook$cluster))), color = "red", size = 2) + geom_bar(fill = "grey", color = "black", width = 0.5)  + scale_y_continuous(minor_breaks = 1:max(table(codebook$cluster))) + theme_bw()

Coding the data

Using the codebook, we can now calculate how often a particular cluster is represented in each sample

The first step is to convert each TCR sequence to all possible tuples and tally up the total count of each tupe within each sample

#Helper function that converts each sequence into ptuples
#and gives each ptuple the same count as the original full sequence in a particular sample
tupleCount <- function(df.vector, p=3){
  tuples <- all.ptuples(as.character(df.vector$cdr3), p=p)
  tuples.df <- data.frame("tuple" = tuples, apply(df.vector[-1], 2, rep, length(tuples)), row.names = c(), stringsAsFactors = F)
  return(tuples.df)
}

df[1,]; tupleCount(df[1,], p=p)

##     cdr3 HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive
## 1 AGEQAN              0          0              0          0
##   HL15_Immunized HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized
## 1              0          0              0          0              1
##   HL19_Naive HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
## 1          0             0         0             0         0

##   tuple HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
## 1   AGE              0          0              0          0              0
## 2   GEQ              0          0              0          0              0
## 3   EQA              0          0              0          0              0
## 4   QAN              0          0              0          0              0
##   HL15_Naive HL17_Immunized HL17_Naive HL19_Immunized HL19_Naive
## 1          0              0          0              1          0
## 2          0              0          0              1          0
## 3          0              0          0              1          0
## 4          0              0          0              1          0
##   HL8_Immunized HL8_Naive HL9_Immunized HL9_Naive
## 1             0         0             0         0
## 2             0         0             0         0
## 3             0         0             0         0
## 4             0         0             0         0

#Break each TCR in data set into tuples and calculate the frequency of each tuple for each sample
tuple.df <- do.call("rbind", lapply(setNames(split(df, seq(nrow(df))), rownames(df)), function(x) tupleCount(x, p=p)))
tuple.df <- tuple.df %>%
  group_by(tuple) %>%
  summarise_all(sum) #tallys total count for each tuple within sample

head(tuple.df)

## # A tibble: 6 x 15
##   tuple HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
##   <chr>          <int>      <int>          <int>      <int>          <int>
## 1 AAG                0          0              0          0              0
## 2 AAN                0          1              0          0              0
## 3 AAS                0          0              0          1              0
## 4 AAW                0          0              0          0              0
## 5 ADR                0          0              0          0              0
## 6 ADS                0          0              0          0              0
## # ... with 9 more variables: HL15_Naive <int>, HL17_Immunized <int>,
## #   HL17_Naive <int>, HL19_Immunized <int>, HL19_Naive <int>,
## #   HL8_Immunized <int>, HL8_Naive <int>, HL9_Immunized <int>,
## #   HL9_Naive <int>

The next step is to match up each tuple to its corresponding cluster by cross-referencing the above data with the codebook

Then the total occurences of each cluster within each sample can be tallied and converted to a frequency

code.df <- tuple.df %>%
  left_join(codebook, by = "tuple") %>% #Match each tuple to its cluster
  select(-tuple) %>%
  group_by(cluster) %>%
  summarise_all(sum) %>% #Total count of each tuple per sample
  select(-cluster)%>%
  mutate_if(is.numeric, funs(. / sum(.))) #Convert sample count to frequency

head(code.df)

## # A tibble: 6 x 14
##   HL11_Immunized HL11_Naive HL12_Immunized HL12_Naive HL15_Immunized
##            <dbl>      <dbl>          <dbl>      <dbl>          <dbl>
## 1       0.0136      0.00758        0.00776    0.0147         0
## 2       0.0235      0.0947         0.0264     0.0541         0.0114
## 3       0.000905    0              0          0              0
## 4       0.0145      0              0.0109     0.0319         0.00127
## 5       0.0190      0.102          0.0280     0.0504         0.0102
## 6       0.00271     0              0          0.00491        0
## # ... with 9 more variables: HL15_Naive <dbl>, HL17_Immunized <dbl>,
## #   HL17_Naive <dbl>, HL19_Immunized <dbl>, HL19_Naive <dbl>,
## #   HL8_Immunized <dbl>, HL8_Naive <dbl>, HL9_Immunized <dbl>,
## #   HL9_Naive <dbl>

Visualization of clustering

Some general plot styling

hmcol <- colorRampPalette(rev(brewer.pal(9, "RdBu")))(1000)
color.key <- c("Immunized" = brewer.pal(4, "Set1")[3],
               "Naive" = brewer.pal(4, "Set1")[4])
col.color <- color.key[sapply(strsplit(colnames(code.df), "_"), function(x) x[[2]])]

Relative cluster frequency per sample

heatmap.2(as.matrix(code.df),
          col = hmcol,
          trace = "none",
          margins = c(15,15), lhei = c(1,3),
          cexRow = 1, cexCol = 2,
          key.title = NA, key.xlab = "Cluster frequency",
          key.par = list(cex=1.2),
          ColSideColors = col.color)
legend(0.7,1.125,
       legend = names(color.key),
       col = color.key,
       lty= 1, lwd = 10,
       border=FALSE, bty="n", y.intersp = 0.8, cex=2, xpd=TRUE)

Notice that there are several clusters that have uniquely high representation in the immunized samples. Similarly, the naive samples also have several clusters with uniquely high representation

Similarity of cluster frequency between samples

heatmap.2(as.matrix(vegdist(t(code.df), method = "horn")),
          col = hmcol,
          trace = "none",
          margins = c(15,15), lhei = c(1,3),
          cexRow = 2, cexCol = 2,
          key.title = NA, key.xlab = "MH distance",
          key.par = list(cex=1.2),
          ColSideColors = col.color)
legend(0.7,1.125,
       legend = names(color.key),
       col = color.key,
       lty= 1, lwd = 10,
       border=FALSE, bty="n", y.intersp = 0.8, cex=2, xpd=TRUE)

The Morisita Horn index can be used to compare the relative cluster representation between different samples. The result is a distance index between 0 and 1, where 0 is identical and 1 is completely dissimilar
Using this index clearly discriminates between immunized and naive samples

Relative Atchley factor weight between groups experimental groups

To get a closer look at the biochemical TCR properties that contribute to the differential clustering of the immunized and naive group, we can evaluate the average Atchley factors of the most discriminating clusters

#Function to generate SE bars in final blot
se <- function(x) {
  s <- sd(x)/sqrt(length(x))
  m <- mean(x)
  data.frame("ymin" = m-s, "ymax" = m+s)
}

#Selects top unique clusters for each experimental condition
unique.clusters <- code.df %>%
  mutate(cluster = 1:k) %>%
  gather(mouseNum, freq, - cluster) %>%
  mutate(exptCondition = do.call(sapply, c(list(mouseNum), list(function(x) strsplit(x, "_")[[1]][2])))) %>%
  group_by(cluster, exptCondition) %>%
  summarise(freq = mean(freq)) %>%
  group_by(exptCondition) %>%
  arrange(desc(freq)) %>%
  top_n(7) %>% mutate(rank = 1:7) %>% select(-freq) %>%
  spread(exptCondition, cluster) %>%
  filter(!(Immunized %in% intersect(Immunized, Naive)),
         !(Naive %in% intersect(Immunized, Naive))) %>%
  select(Immunized, Naive) %>%
  gather(exptCondition, cluster)

#Determine the mean value of each Atchley factor for each amino acid position in the tuple between the two experimental groups
relative.factors <- df.atchley %>%
  mutate(tuple = row.names(.)) %>%
  left_join(codebook, by = "tuple") %>%
  left_join(unique.clusters, by = "cluster") %>%
  filter(!is.na(exptCondition)) %>%
  select(-tuple, -cluster) %>%
  gather(factor, value, - exptCondition) %>%
  mutate(position = do.call(sapply, c(list(factor), list(function(x) strsplit(x, "\\.")[[1]][1]))),
         factor = do.call(sapply, c(list(factor), list(function(x) strsplit(x, "\\.")[[1]][2]))))

ggplot(relative.factors, aes(x = "", y = value, group = exptCondition)) +
  stat_summary(fun.y = mean, aes(fill = exptCondition), position = "dodge", geom = "bar") +
  stat_summary(fun.data = se, geom = "errorbar", position = "dodge", width = 0.9, size = 1) +
  facet_grid(position ~ factor, switch = "both") +
  theme_bw() +
  theme(axis.text = element_blank(), axis.ticks = element_blank()) + xlab("Atchley factor component") + ylab("Amino acid position")

As described above, aach of the 5 Atchley factors roughly corresponds to several general biological characteristics
- pah - polarity, accessibility, and hydrophobicity component
- pss - propensity for secondary structure component
- ms - molecular size component
- cc - codon composition component
- ec - electrostatic charge component

Analyzing TCR sequence Data with Atchley Factors

Required packages

Background

Analysis