Introduction


The gene expression dataset analyzed here comes from a project that examined the physiological and transcriptomic effects of cold water on the temperate coral Astrangia poculata. A. poculata is a facultatively symbiotic coral, meaning that it exists in the wild both with and without symbionts. However, only aposymbiotic corals were used in the experiment described here. Here, corals were placed in either a chilled treatment “cold” or a control condition. The cold group had temperatures decreased to 6°C or maintained at 22°C (control) over the course of 15 days, and were sampled for gene expression on day 15. Physiology measured (polyp extension) over the course of the experiment demonstrated that the corals pulled in their polyps in response to warming temperatures, which is the same phenotype that corals in the related cold experiment exhibited. It is important to note that the control temperature used in this experiment is different from that used in the cold experiment, which is why we do not make direct comparisons between these studies. Here, we use the program DESeq2 to quantify differential gene expression of corals in the control and heated treatments.






Analysis pipeline


First we load appropriate libraries

library(tidyr)
library(plyr)
library(dplyr)
library(DESeq2)
library(ggplot2)
library(affycoretools)
library(arrayQualityMetrics)
library(genefilter)
library(DESeq)
library(cowplot)
library(readr)
library(RColorBrewer)
library(gplots)
library(knitr)
library(plotly)
library(vegan)
library(kableExtra)
library(reshape2)
library(prettydoc)




Behavioural Response to food.

Coral polyp behaviours were observed every 3-4 days throughout the experiment. The total surface area of the coral that had extended vs retracted polyps was estimated and then scored between 1-5 based on polyp activity



Table 1 | Observational measurement of polyp activity. The percent of polyps that were extended were estimated and scored as below.
Score Percent.of.colony.with.extended.polyps
1 0
2 25
3 50
4 75
5 100



Read in data, transform to long form and organize.

behaviour = read.csv("Freeze_behaviour.csv") %>% 
  melt() %>% 
  rename(day = variable) %>%
  rename(polyp_behaviour = value) %>%
  rename(treatment = group)



In case someone is interested in the descriptive statistics, here we calculate means, standard error and standard deviation.

cold_sum = behaviour %>%
  group_by(treatment, day) %>%
  summarise( 
    n=n(),
    mean=mean(polyp_behaviour),
    sd=sd(polyp_behaviour)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

kable(cold_sum) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, fixed_thead = TRUE) 
treatment day n mean sd se ic
control d1 21 5.000000 0.0000000 0.0000000 0.0000000
control d2 21 4.666667 0.6582806 0.1436486 0.2996457
control d3 21 3.619048 1.1608700 0.2533226 0.5284217
control d4 21 4.428571 0.9783368 0.2134906 0.4453336
control d5 21 2.857143 1.3522468 0.2950844 0.6155354
control d6 21 2.285714 1.1892375 0.2595129 0.5413344
control d7 21 3.380952 1.3592715 0.2966174 0.6187330
control d8 21 3.619048 1.0235326 0.2233531 0.4659065
control d9 21 3.285714 1.3469542 0.2939295 0.6131262
control d10 21 4.428571 0.8701396 0.1898800 0.3960828
control d11 21 4.095238 0.9436505 0.2059214 0.4295445
control d12 21 4.190476 0.9283883 0.2025909 0.4225973
control d13 21 2.380952 1.4309504 0.3122590 0.6513608
control d14 21 4.380952 0.7400129 0.1614840 0.3368498
control d15 21 4.380952 0.9734573 0.2124258 0.4431124
freeze d1 21 4.857143 0.4780914 0.1043281 0.2176246
freeze d2 21 4.857143 0.3585686 0.0782461 0.1632185
freeze d3 21 3.666667 1.2382784 0.2702145 0.5636576
freeze d4 21 3.952381 1.2440334 0.2714703 0.5662772
freeze d5 21 3.476190 1.1233453 0.2451341 0.5113407
freeze d6 21 3.095238 0.9436505 0.2059214 0.4295445
freeze d7 21 2.380952 0.9206623 0.2009050 0.4190804
freeze d8 21 1.523810 0.6015852 0.1312767 0.2738383
freeze d9 21 1.285714 0.4629100 0.1010153 0.2107141
freeze d10 21 1.142857 0.3585686 0.0782461 0.1632185
freeze d11 21 1.000000 0.0000000 0.0000000 0.0000000
freeze d12 21 1.000000 0.0000000 0.0000000 0.0000000
freeze d13 21 1.000000 0.0000000 0.0000000 0.0000000
freeze d14 21 1.000000 0.0000000 0.0000000 0.0000000
freeze d15 21 1.047619 0.2182179 0.0476190 0.0993316



This is how we plotted the behavioural figure in the manuscript. We selected the same days tha

cold_stacked = behaviour %>%
  filter(day == c("d1","d4","d7","d11","d15")) %>%
  group_by(polyp_behaviour, day, treatment) %>%
  summarise(n = n()) %>%
  mutate(frequency = n / sum(n)) 

ggplot(cold_stacked, aes(y = frequency, x = day, fill = as.factor(polyp_behaviour))) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(. ~ treatment) + 
  scale_fill_brewer(palette = "Blues", direction=1) +
  labs(fill = "Behavioural Score") +
  theme_cowplot()

Figure 1 | Behavioural scores after feeding. The total surface area of the coral with extended polyps was estimated and then scored between 1-5 based on polyp activity (see Table 1)




Differential Expression




The general pipeline that was used to generate these count files are well described by the Matz lab out of UT Austin. Check out their github page for details: https://github.com/z0on/tag-based_RNAseq

counts = read.csv("~/Documents/Experiments/freezing_hot/freezing_raw.csv", row.names = 1)



Set up experimental design matrix. You’ll notice that I strip the header of each column to inform me what treatment it belongs to.

treatment = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 1)) %>%
            revalue(c("C" = "control", "F" = "cold"))
genotype  = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 2))

expDesign = data.frame(treatment, genotype)
            expDesign$sample = colnames(counts)
            write.csv(expDesign, "tables/cold_expDesign.csv", row.names = F)



Table 2 | Experimental Design key with sample names
treatment genotype sample
control E CE6
control D CD8
cold E FE2
control E CE3
control F CF1
cold F FF3
cold F FF4
control F CF5
cold G FG3
control H CH1
cold H FH2
control H CH3
control I CI2
control I CI3
control J CJ2
control J CJ3
cold J FJ5
cold J FJ6
control P CP1
cold P FP3
cold P FP4
control T CT2
control T CT3
cold T FT4
cold T FT5
cold T FT6



Table 3 | Descriptive Summary stats of the mapped reads.

Min. X1st.Qu. Median Mean X3rd.Qu. Max. libsum perc05 perc10 perc90 perc95 zeros percent.zeros
CE6 0 0 0 8.554165 1 21224 405878 0 0 6 14.00 29362 61.88248
CD8 0 0 0 13.491696 3 54638 640154 0 0 10 26.00 25628 54.01281
FE2 0 0 0 13.919828 2 23858 660468 0 0 10 28.00 27194 57.31327
CE3 0 0 0 9.783447 2 25500 464205 0 0 7 17.00 28586 60.24701
CF1 0 0 0 11.790381 2 76320 559430 0 0 8 21.00 28397 59.84868
FF3 0 0 0 9.595473 1 27084 455286 0 0 6 18.00 30276 63.80880
FF4 0 0 0 13.021097 1 34220 617825 0 0 7 22.00 29810 62.82667
CF5 0 0 0 9.666941 2 47515 458677 0 0 8 20.00 27069 57.04982
FG3 0 0 0 10.297715 1 17489 488606 0 0 6 17.00 29483 62.13750
CH1 0 0 0 7.611870 1 32894 361168 0 0 5 13.00 29836 62.88147
FH2 0 0 0 7.136107 1 13382 338594 0 0 6 15.00 31403 66.18403
CH3 0 0 0 11.392809 2 29117 540566 0 0 8 20.00 27437 57.82541
CI2 0 0 0 11.530602 2 35991 547104 0 0 9 22.65 27193 57.31116
CI3 0 0 0 11.251707 2 43755 533871 0 0 10 25.00 25676 54.11398
CJ2 0 0 0 5.008599 1 19208 237648 0 0 5 11.00 30900 65.12393
CJ3 0 0 0 5.397951 1 18430 256122 0 0 5 12.00 30368 64.00270
FJ5 0 0 0 13.344946 2 22064 633191 0 0 8 22.00 28169 59.36815
FJ6 0 0 0 5.601184 1 13862 265765 0 0 5 12.00 32022 67.48862
CP1 0 0 0 6.361153 1 17861 301824 0 0 6 15.00 29055 61.23546
FP3 0 0 0 12.424423 2 28096 589514 0 0 9 25.00 27773 58.53355
FP4 0 0 0 13.029970 2 30298 618246 0 0 9 25.00 27899 58.79911
CT2 0 0 0 13.048242 1 32167 619113 0 0 6 14.00 29742 62.68336
CT3 0 0 0 8.407541 1 30674 398921 0 0 5 12.00 30976 65.28410
FT4 0 0 0 8.790992 1 31015 417115 0 0 6 15.00 30289 63.83620
FT5 0 0 0 12.815883 2 27865 608088 0 0 9 23.00 27774 58.53566
FT6 0 0 0 8.776977 2 37565 416450 0 0 8 19.00 28832 60.76547





Outlier Detection

Conduct array quality metrics to identify outliers

real=newCountDataSet(counts,expDesign) 
real=estimateSizeFactors(real)
plot(sort(sizeFactors(real))) 


Figure 2 | Size factors of each sample used to normalize between libraries.

Create a directory with a bunch of output figures from arrayQualityMetrics() for you to explore if their are outliers.

cds=estimateDispersions(real,method="blind")
vsdBlind=varianceStabilizingTransformation(cds)
arrayQualityMetrics(vsdBlind,intgroup=c("treatment"), force=TRUE, outdir = "arrayQualityMetrics") # this makes a directory "arrayQualityMetrics" and outputs various reports on outliers

Figure 3 | shows a false color heatmap of the distances between arrays. The color scale is chosen to cover the range of distances encountered in the dataset. Patterns in this plot can indicate clustering of the arrays either because of intended biological or unintended experimental factors (batch effects). The distance dab between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In formula, dab = mean | Mai - Mbi |, where Mai is the value of the i-th probe on the a-th array. Outlier detection was performed by looking for arrays for which the sum of the distances to all other arrays, Sa = Σb dab was exceptionally large. One such array was detected, and it is marked by an asterisk, *.



Figure 4 | shows a bar chart of the sum of distances to other arrays Sa, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 17.5 was determined, which is indicated by the vertical line. One array exceeded the threshold and was considered an outlier.



Figure 5 | shows boxplots representing summaries of the signal intensity distributions of the arrays. Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection was performed by computing the Kolmogorov-Smirnov statistic Ka between each array’s distribution and the distribution of the pooled data.



Figure 6 | shows a bar chart of the Kolmogorov-Smirnov statistic Ka, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 0.603 was determined, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.



Figure 7 | shows density estimates (smoothed histograms) of the data. Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems. Various features of the distributions can be indicative of quality related phenomena. For instance, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.



Figure 8 | shows a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation. After normalisation and transformation to a logarithm(-like) scale, one typically expects the red line to be approximately horizontal, that is, show no substantial trend. In some cases, a hump on the right hand of the x-axis can be observed and is symptomatic of a saturation of the intensities.



Figure 9 | shows MA plots. M and A are defined as: M = log2(I1) - log2(I2) A = 1/2 (log2(I1)+log2(I2)), where I1 is the intensity of the array studied,and I2 is the intensity of a “pseudo”-array that consists of the median across arrays. Typically, we expect the mass of the distribution in an MA plot to be concentrated along the M = 0 axis, and there should be no trend in M as a function of A. If there is a trend in the lower range of A, this often indicates that the arrays have different background intensities; this may be addressed by background correction. A trend in the upper range of A can indicate saturation of the measurements; in mild cases, this may be addressed by non-linear normalisation (e.g. quantile normalisation). Outlier detection was performed by computing Hoeffding’s statistic Da on the joint distribution of A and M for each array. Shown are first the 4 arrays with the highest values of Da, then the 4 arrays with the lowest values. The value of Da is shown in the panel headings. 0 arrays had Da>0.15 and were marked as outliers. For more information on Hoeffing’s D-statistic, please see the manual page of the function hoeffd in the Hmisc package.



Figure 10 | shows a bar chart of the Da, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. A threshold of 0.15 was used, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.



Outlier Conclusisons

Much like in the other experiment (hot), we see that a single sample (CP1 in this experiment) is identified as an outlier based on the distances between arrays which was much larger. This sample however was not flagged as being an outlier in any other analyses and so was therefore I considered it unworthy of removing.






Differential Expression




Perform a Wald test in DESeq2 (now the new version) doing a pairwise comparison between cold and control (aka the effect of treatment, denoted as design = ~ treatment below)

dds = DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ genotype + treatment)
dds = DESeq(dds)
results = results(dds)



Let’s check to see if we set up our contrast correctly. We should have the treatment condition first and the control second in the log2 fold change (MLE) output.

head(results)
## log2 fold change (MLE): treatment cold vs control 
## Wald test p-value: treatment cold vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.461529721479737  0.419668503236571  1.47118578725618
## isogroup10     0.553625275106359 0.0174438883464909  1.76578089796676
## isogroup1000     1.1906818878365 -0.272504895940878 0.874298472601528
## isogroup100001  23.4299485361196  -1.04491003705984 0.280285458356361
## isogroup100004 0.316470003418551 0.0303396598025434   3.3292952314765
## isogroup100011  2.34865670194132  -0.31339148990892 0.868960637164387
##                               stat               pvalue
##                          <numeric>            <numeric>
## isogroup1        0.285258671523242    0.775445973305387
## isogroup10     0.00987885210819581    0.992117944628513
## isogroup1000    -0.311684058111211    0.755280644635158
## isogroup100001   -3.72802086553956 0.000192989403027221
## isogroup100004 0.00911293763187476    0.992729028397179
## isogroup100011  -0.360650962201909    0.718360387292258
##                               padj
##                          <numeric>
## isogroup1                       NA
## isogroup10                      NA
## isogroup1000     0.890175149789675
## isogroup100001 0.00177421443128247
## isogroup100004                  NA
## isogroup100011   0.870452732525721

Looks good!



Table 4 | Descriptive statistics on the normalized read counts

min mean median max zeros percent.zeros
CE6 0 9.796057 0 24305.30 29362 61.88248
CD8 0 9.264012 0 37516.94 25628 54.01281
FE2 0 9.135883 0 15658.52 27194 57.31327
CE3 0 9.603827 0 25031.83 28586 60.24701
CF1 0 10.256199 0 66389.13 28397 59.84868
FF3 0 9.577625 0 27033.62 30276 63.80880
FF4 0 10.504855 0 27607.21 29810 62.82667
CF5 0 9.068442 0 44573.26 27069 57.04982
FG3 0 9.318387 0 15825.77 29483 62.13750
CH1 0 9.581351 0 41404.93 29836 62.88147
FH2 0 8.904224 0 16697.66 31403 66.18403
CH3 0 9.628922 0 24608.97 27437 57.82541
CI2 0 9.543230 0 29787.73 27193 57.31116
CI3 0 8.753919 0 34041.74 25676 54.11398
CJ2 0 8.689286 0 33323.45 30900 65.12393
CJ3 0 8.342631 0 28483.90 30368 64.00270
FJ5 0 9.508401 0 15720.81 28169 59.36815
FJ6 0 8.496500 0 21027.43 32022 67.48862
CP1 0 8.709014 0 24453.38 29055 61.23546
FP3 0 9.052077 0 20469.94 27773 58.53355
FP4 0 9.472008 0 22024.83 27899 58.79911
CT2 0 13.812646 0 34051.44 29742 62.68336
CT3 0 11.113983 0 40548.16 30976 65.28410
FT4 0 9.239044 0 32595.75 30289 63.83620
FT5 0 9.074095 0 19729.40 27774 58.53566
FT6 0 8.468310 0 36243.92 28832 60.76547



Lets do a rlogged transformation, which is useful various unsupervised clustering analyses. Be sure the include blind = TRUE as it doesn’t normalize the data with any priors from our experimental design.

rlogged = rlogTransformation(dds, blind = TRUE)



The order of treatments is really important here. Putting “cold” before “control” means that the gene expression of corals in the heated treatment will be compared to control, which is what we want. Make sure you put the control group second

In the output table, negative log2FoldChange indicates downregulation in heated compared to control, and positive log2FoldChagne indicates upregulation.

results_FvsC = results(dds, contrast = c("treatment", "cold", "control")) 
head(results_FvsC)
## log2 fold change (MLE): treatment cold vs control 
## Wald test p-value: treatment cold vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.461529721479737  0.419668503236571  1.47118578725618
## isogroup10     0.553625275106359 0.0174438883464909  1.76578089796676
## isogroup1000     1.1906818878365 -0.272504895940878 0.874298472601528
## isogroup100001  23.4299485361196  -1.04491003705984 0.280285458356361
## isogroup100004 0.316470003418551 0.0303396598025434   3.3292952314765
## isogroup100011  2.34865670194132  -0.31339148990892 0.868960637164387
##                               stat               pvalue
##                          <numeric>            <numeric>
## isogroup1        0.285258671523242    0.775445973305387
## isogroup10     0.00987885210819581    0.992117944628513
## isogroup1000    -0.311684058111211    0.755280644635158
## isogroup100001   -3.72802086553956 0.000192989403027221
## isogroup100004 0.00911293763187476    0.992729028397179
## isogroup100011  -0.360650962201909    0.718360387292258
##                               padj
##                          <numeric>
## isogroup1                       NA
## isogroup10                      NA
## isogroup1000     0.890175149789675
## isogroup100001 0.00177421443128247
## isogroup100004                  NA
## isogroup100011   0.870452732525721



Summary of DEGs with FDR < 0.1
We can see how many are up regulated and how many are down regulated (and what percent of the genome) 4.3% of genes (1901 genes) are upregulated and 7.2% of genes (3151 genes) are downregulated.

summary(results_FvsC)
## 
## out of 44039 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1754, 4%
## LFC < 0 (down)     : 2401, 5.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 28061, 64%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results



How about those for a FDR of < 0.05? Since there are enough genes to explore here, we will use this for subsequent anaylses.

results_FvsC05 = results(dds, contrast = c("treatment", "cold", "control"), alpha = 0.05) 
summary(results_FvsC05)
## 
## out of 44039 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1468, 3.3%
## LFC < 0 (down)     : 1960, 4.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 28061, 64%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
write.csv(results_FvsC05, file="tables/cold_results.csv")



MA plot because people like them, but besides looking neat I find them not overly informative

DESeq2::plotMA(results_FvsC, main = "Cold vs Control")

Figure 11 | “Minus over average”, or MA-plot of cold vs control treatments. Red dots represent genes differentially expressed FDR < 0.1 whereas grey dots are not differentially expressed.



Heatmaps




Overall expression, sample by distance.

sampleDists <- as.matrix(dist(t(assay(rlogged))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Figure 12 | Sample by distance heatmap. This figure takes overall expression and clusters samples based on euclidean distances. We can see that samples cluster mostly by genotype with some exceptions.



Individual genes heatmap
I like plotting the z scores, as heatmap2 creates nice clean clusters by doing this. The heatmap shows that our samples are clustering by treatment (C=cold, C=control). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

norm_counts = read.csv("tables/cold_normalized_counts.csv")
hm = read.csv("tables/cold_results.csv") %>% as_tibble() %>%
  filter(padj < 0.05) %>% # only want the most DEGs
  select(X) %>%
  merge(norm_counts, by.x = "X", by.y = "X")  # turn into a countdatafile
  row.names(hm) = hm$X
  hm$X = NULL

## Turning into z-score table
hm.z = data.matrix(hm)
hm.z = sweep(hm.z, 1L, rowMeans(hm.z), check.margin = FALSE)
hm.z.sx = apply(hm.z, 1L, sd)
hm.z = sweep(hm.z, 1L, hm.z.sx, "/", check.margin = FALSE)
hm.z = data.matrix(hm.z)

colour = colorRampPalette(rev(c("chocolate1","#FEE090","grey10", "cyan3","cyan")))(100)
heatmap.2(hm.z, col = colour, Rowv = TRUE, Colv = TRUE, scale = "row", 
          dendrogram = "both",
          trace = "none", 
          margin = c(5,15))


Figure 13 | Heatmap of z-scores of DEGs (FDR < 0.05). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.



Principal Component Analyses


First we create a PCA data frame and calculate the variance estimated by PC1 and PC2

pcadata = DESeq2::plotPCA(rlogged, intgroup = c( "treatment", "genotype"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(rlogged)), center = TRUE, scale. = FALSE)



Using adonis from library(vegan) we can see if there are any significant differences based on treatment (as is illustrated in the model displayed below)

adonis(pca$x ~ genotype + treatment, method = 'eu')
## 
## Call:
## adonis(formula = pca$x ~ genotype + treatment, method = "eu") 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## genotype   8     78400  9800.0  2.6654 0.48631  0.001 ***
## treatment  1     23985 23984.6  6.5232 0.14878  0.001 ***
## Residuals 16     58829  3676.8         0.36491           
## Total     25    161213                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



cols = c("control" = "grey", "cold" = "#68a2ff")
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = treatment), size = 3) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = treatment)) +
      scale_colour_manual(values = cols) +
      scale_fill_manual(values = cols) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ treatment, p<0.001")+
      theme_cowplot()

Figure 14 | Principal component analyses highlighting treatment groups. Treatment has a significant affect (p < 0.001) on overall expression



DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = genotype)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ genotype, p<0.001")+
      theme_cowplot()

Figure 15 | Principal component analyses highlighting genotypes of coral. Treatment has a significant affect (p < 0.001) on overall expression





Gene ontology




We are going to do a Mann-Whitney U test which requries that we first -logged the pvalue, and multiply it by -1 if it’s less than zero or by 1 if it’s greater than zero

results = read.csv("tables/cold_results.csv")
go_input = results %>%
  mutate(mutated_p = -log(pvalue)) %>%
  mutate(mutated_p_updown = ifelse(log2FoldChange < 0, mutated_p*-1, mutated_p*1)) %>%
  select(X, mutated_p_updown) %>%
  na.omit()

colnames(go_input) = NULL

write.csv(go_input, "go_input.csv", row.names = FALSE)



These code are all adapted from Dr. Matz and can be found GO_MWU https://github.com/z0on/GO_MWU This will be broken down into three sections

Cellular Components

input="go_input_saved.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="wh_host_final_iso2go.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
    largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
    smallest=5,   # a GO category should contain at least this many genes to be considered
    clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
#   Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" of you want to test for "greater" or "less" instead. 
#   Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
#   Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module 
)
## Continuous measure of interest: will perform MWU test
## 146  GO terms at 10% FDR
# do not continue if the printout shows that no GO terms pass 10% FDR.
cc_results=gomwuPlot(input,goAnnotations,goDivision,
#   absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
    absValue=1,
    level1=0.001, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
    level2=0.0001, # FDR cutoff to print in regular (not italic) font.
    level3=0.00001, # FDR cutoff to print in large bold font.
    txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
    treeHeight=0.5, # height of the hierarchical clustering tree
  colors=c("black","#0666ff","grey40","#7caeff") # these are default colors, un-remar and change if needed
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  25 
## "Good genes" accounted for:  1727 out of 5906 ( 29% )
write.csv(cc_results, "cc.csv")

Figure 16 | Cellular component gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.



Molecular Functions

goDivision="MF" 
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1, 
    smallest=5,   
    clusterCutHeight=0.25, 
)
## Continuous measure of interest: will perform MWU test
## 106  GO terms at 10% FDR
mf_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
  colors=c("black","#0666ff","grey40","#7caeff") 
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  39 
## "Good genes" accounted for:  2075 out of 6457 ( 32% )
write.csv(mf_results, "mf.csv")

Figure 17 | Molecular function gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.



Biological process

goDivision="BP"
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1,  
    smallest=5,   
    clusterCutHeight=0.4, 
)
## Continuous measure of interest: will perform MWU test
## 419  GO terms at 10% FDR
bp_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.00001, 
    level2=0.000001, 
    level3=0.0000001, 
    txtsize=1.2,    
    treeHeight=0.5, 
  colors=c("black","#0666ff","grey40","#7caeff") 
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  25 
## "Good genes" accounted for:  2825 out of 6823 ( 41% )
write.csv(bp_results, "bp.csv")

Figure 18 | Biological processes gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.




Version Control


Here is our session info, containing information about versions used

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ape_5.2                     hexbin_1.27.2              
##  [3] prettydoc_0.2.1             reshape2_1.4.3             
##  [5] kableExtra_1.1.0            vegan_2.5-4                
##  [7] permute_0.9-4               plotly_4.9.0               
##  [9] knitr_1.21                  gplots_3.0.1.1             
## [11] RColorBrewer_1.1-2          readr_1.3.1                
## [13] cowplot_0.9.4               DESeq_1.34.1               
## [15] lattice_0.20-38             locfit_1.5-9.1             
## [17] genefilter_1.64.0           arrayQualityMetrics_3.38.0 
## [19] affycoretools_1.54.0        ggplot2_3.1.0              
## [21] DESeq2_1.22.2               SummarizedExperiment_1.12.0
## [23] DelayedArray_0.8.0          BiocParallel_1.16.5        
## [25] matrixStats_0.54.0          Biobase_2.42.0             
## [27] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [29] IRanges_2.16.0              S4Vectors_0.20.1           
## [31] BiocGenerics_0.28.0         dplyr_0.8.0.1              
## [33] plyr_1.8.4                  tidyr_0.8.2                
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.7.0             tidyselect_0.2.5         
##   [3] RSQLite_2.1.1             AnnotationDbi_1.44.0     
##   [5] htmlwidgets_1.3           beadarray_2.32.0         
##   [7] grid_3.5.1                munsell_0.5.0            
##   [9] codetools_0.2-16          preprocessCore_1.44.0    
##  [11] withr_2.1.2               colorspace_1.4-0         
##  [13] Category_2.48.0           OrganismDbi_1.24.0       
##  [15] highr_0.7                 rstudioapi_0.9.0         
##  [17] setRNG_2013.9-1           GenomeInfoDbData_1.2.0   
##  [19] hwriter_1.3.2             bit64_0.9-7              
##  [21] xfun_0.5.3                biovizBase_1.30.1        
##  [23] R6_2.4.0                  illuminaio_0.24.0        
##  [25] gridSVG_1.6-0             AnnotationFilter_1.6.0   
##  [27] bitops_1.0-6              reshape_0.8.8            
##  [29] assertthat_0.2.0          scales_1.0.0             
##  [31] nnet_7.3-12               gtable_0.2.0             
##  [33] Cairo_1.5-9               affy_1.60.0              
##  [35] ggbio_1.30.0              ensembldb_2.6.5          
##  [37] rlang_0.3.1               splines_3.5.1            
##  [39] rtracklayer_1.42.1        lazyeval_0.2.1           
##  [41] acepack_1.4.1             dichromat_2.0-0          
##  [43] checkmate_1.9.1           BiocManager_1.30.4       
##  [45] yaml_2.2.0                GenomicFeatures_1.34.3   
##  [47] backports_1.1.3           Hmisc_4.2-0              
##  [49] RBGL_1.58.1               tools_3.5.1              
##  [51] affyio_1.52.0             ff_2.2-14                
##  [53] Rcpp_1.0.1                base64enc_0.1-3          
##  [55] progress_1.2.0            zlibbioc_1.28.0          
##  [57] purrr_0.3.0               RCurl_1.95-4.11          
##  [59] prettyunits_1.0.2         rpart_4.1-13             
##  [61] openssl_1.2.1             cluster_2.0.7-1          
##  [63] magrittr_1.5              data.table_1.12.0        
##  [65] ProtGenerics_1.14.0       hms_0.4.2                
##  [67] evaluate_0.12             xtable_1.8-3             
##  [69] XML_3.98-1.16             gcrma_2.54.0             
##  [71] gridExtra_2.3             compiler_3.5.1           
##  [73] biomaRt_2.38.0            tibble_2.0.1             
##  [75] KernSmooth_2.23-15        crayon_1.3.4             
##  [77] ReportingTools_2.22.1     R.oo_1.22.0              
##  [79] htmltools_0.3.6           GOstats_2.48.0           
##  [81] mgcv_1.8-26               Formula_1.2-3            
##  [83] geneplotter_1.60.0        DBI_1.0.0                
##  [85] MASS_7.3-51.1             Matrix_1.2-15            
##  [87] vsn_3.50.0                R.methodsS3_1.7.1        
##  [89] gdata_2.18.0              pkgconfig_2.0.2          
##  [91] GenomicAlignments_1.18.1  foreign_0.8-71           
##  [93] xml2_1.2.0                foreach_1.4.4            
##  [95] annotate_1.60.0           BeadDataPackR_1.34.0     
##  [97] affyPLM_1.58.0            webshot_0.5.1            
##  [99] XVector_0.22.0            AnnotationForge_1.24.0   
## [101] rvest_0.3.4               stringr_1.4.0            
## [103] VariantAnnotation_1.28.10 digest_0.6.18            
## [105] graph_1.60.0              Biostrings_2.50.2        
## [107] rmarkdown_1.11            base64_2.0               
## [109] htmlTable_1.13.1          edgeR_3.24.3             
## [111] GSEABase_1.44.0           curl_3.3                 
## [113] Rsamtools_1.34.1          gtools_3.8.1             
## [115] nlme_3.1-137              jsonlite_1.6             
## [117] PFAM.db_3.7.0             viridisLite_0.3.0        
## [119] askpass_1.1               limma_3.38.3             
## [121] BSgenome_1.50.0           pillar_1.3.1             
## [123] GGally_1.4.0              httr_1.4.0               
## [125] survival_2.43-3           GO.db_3.7.0              
## [127] glue_1.3.0.9000           iterators_1.0.10         
## [129] bit_1.1-14                Rgraphviz_2.26.0         
## [131] stringi_1.2.4             blob_1.1.1               
## [133] oligoClasses_1.44.0       latticeExtra_0.6-28      
## [135] caTools_1.17.1.1          memoise_1.1.0