Introduction


The gene expression dataset analyzed here comes from a project that examined the physiological and transcriptomic effects of hot water on the temperate coral Astrangia poculata. A. poculata is a facultatively symbiotic coral, meaning that it exists in the wild both with and without symbionts. However, only aposymbiotic corals were used in the experiment described here. Here, corals were placed in either a chilled treatment “hot” or a control condition. The hot group had temperatures increased to 31°C or maintained at 16°C (control) over the course of 15 days, and were sampled for gene expression on day 15. Physiology measured (polyp extension) over the course of the experiment demonstrated that the corals pulled in their polyps in response to warming temperatures, which is the same phenotype that corals in the related hot experiment exhibited. It is important to note that the control temperature used in this experiment is different from that used in the hot experiment, which is why we do not make direct comparisons between these studies. Here, we use the program DESeq2 to quantify differential gene expression of corals in the control and heated treatments.






Analysis pipeline


First we load appropriate libraries

library(tidyr)
library(plyr)
library(dplyr)
library(DESeq2)
library(ggplot2)
library(affycoretools)
library(arrayQualityMetrics)
library(genefilter)
library(DESeq)
library(cowplot)
library(readr)
library(RColorBrewer)
library(gplots)
library(knitr)
library(plotly)
library(vegan)
library(kableExtra)
library(reshape2)
library(prettydoc)
library(MASS)




Behavioural Response to food.

Coral polyp behaviours were observed every 3-4 days throughout the experiment. The total surface area of the coral that had extended vs retracted polyps was estimated and then scored between 1-5 based on polyp activity



Table 1 | Observational measurement of polyp activity. The percent of polyps that were extended were estimated and scored as below.
Score Percent.of.colony.with.extended.polyps
1 0
2 25
3 50
4 75
5 100



Read in data, transform to long form and organize.

behaviour = read.csv("hot_behaviour.csv") %>% 
  melt() %>% 
  rename(day = variable) %>%
  rename(polyp_behaviour = value) %>%
  mutate(genotype = sapply(strsplit(as.character(individual), split = ""), "[[", 2))

behaviour$day = as.character(behaviour$day)
data = behaviour %>%
  mutate(temperature = recode(day, d1 = "16", d2 = "23", d3 = "25", d4 = "28", d5 = "31"))



In case someone is interested in the descriptive statistics, here we calculate means, standard error and standard deviation.

hot_sum = behaviour %>%
  group_by(treatment, day) %>%
  summarise( 
    n=n(),
    mean=mean(polyp_behaviour),
    sd=sd(polyp_behaviour)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

kable(hot_sum) %>%
  kable_styling(bootstrap_options = "striped", full_width = F) 
treatment day n mean sd se ic
control d1 21 5.000000 0.0000000 0.0000000 0.0000000
control d2 21 5.000000 0.0000000 0.0000000 0.0000000
control d3 21 5.000000 0.0000000 0.0000000 0.0000000
control d4 21 5.000000 0.0000000 0.0000000 0.0000000
control d5 21 5.000000 0.0000000 0.0000000 0.0000000
hot d1 21 5.000000 0.0000000 0.0000000 0.0000000
hot d2 21 4.428571 0.8701396 0.1898800 0.3960828
hot d3 21 4.476191 1.0304876 0.2248708 0.4690723
hot d4 21 2.476190 1.2090925 0.2638456 0.5503723
hot d5 21 2.714286 1.1892375 0.2595129 0.5413344



This is how we plotted the behavioural figure in the manuscript. We selected the same days tha

hot_stacked = behaviour %>%
  group_by(polyp_behaviour, day, treatment) %>%
  summarise(n = n()) %>%
  mutate(frequency = n / sum(n)) 

ggplot(hot_stacked, aes(y = frequency, x = day, fill = as.factor(polyp_behaviour))) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(. ~ treatment) + 
  scale_fill_brewer(palette = "Reds", direction=1) +
  labs(fill = "Behavioural Score") +
  theme_cowplot()

Figure 1 | Behavioural scores after feeding. The total surface area of the coral with extended polyps was estimated and then scored between 1-5 based on polyp activity (see Table 1)



model = polr(as.factor(polyp_behaviour) ~ treatment + day, data = behaviour)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = as.factor(polyp_behaviour) ~ treatment + day, 
##     data = behaviour)
## 
## Coefficients:
##               Value Std. Error t value
## treatmenthot -19.22     0.1819 -105.69
## dayd2        -16.93     0.4107  -41.21
## dayd3        -16.43     0.4671  -35.17
## dayd4        -19.93     0.3726  -53.48
## dayd5        -19.59     0.3611  -54.26
## 
## Intercepts:
##     Value     Std. Error t value  
## 1|2  -40.3389    0.1819  -221.8048
## 2|3  -38.8912    0.2566  -151.5488
## 3|4  -37.8752    0.2909  -130.2071
## 4|5  -36.6582    0.3596  -101.9315
## 
## Residual Deviance: 209.8631 
## AIC: 227.8631
table = coef(summary(model))
## 
## Re-fitting to get Hessian
table
##                  Value Std. Error    t value
## treatmenthot -19.22189  0.1818671 -105.69199
## dayd2        -16.92510  0.4106563  -41.21475
## dayd3        -16.42655  0.4670713  -35.16926
## dayd4        -19.92747  0.3726301  -53.47788
## dayd5        -19.59392  0.3611129  -54.25981
## 1|2          -40.33895  0.1818669 -221.80477
## 2|3          -38.89117  0.2566247 -151.54883
## 3|4          -37.87523  0.2908845 -130.20711
## 4|5          -36.65824  0.3596361 -101.93148

Calculate and store the p-value

p = pnorm(abs(table[,"t value"]), lower.tail = FALSE) * 2
p
##  treatmenthot         dayd2         dayd3         dayd4         dayd5 
##  0.000000e+00  0.000000e+00 5.902959e-271  0.000000e+00  0.000000e+00 
##           1|2           2|3           3|4           4|5 
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00

Differential Expression




The general pipeline that was used to generate these count files are well described by the Matz lab out of UT Austin. Check out their github page for details: https://github.com/z0on/tag-based_RNAseq

counts = read.csv("~/Documents/Experiments/freezing_hot/hot_raw.csv", row.names = 1)



Set up experimental design matrix. You’ll notice that I strip the header of each column to inform me what treatment it belongs to.

treatment = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 1)) %>%
            revalue(c("C" = "control", "H" = "hot"))
genotype  = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 2))

expDesign = data.frame(treatment, genotype)
            expDesign$sample = colnames(counts)
            write.csv(expDesign, "tables/hot_expDesign.csv", row.names = F)



Table 2 | Experimental Design key with sample names
treatment genotype sample
hot A HA1
control A CA2
hot B HB2
control B CB3
control B CB4
hot B HB5
hot B HB6
control C CC2
control C CC3
control D CD2
control D CD4
hot D HD6
control F CF8
hot K HK4
hot N HN1
hot O HO1
hot O HO2
control O CO4
control Q CQ2
hot Q HQ3
control Q CQ4
control Q CQ5
hot Q HQ6



Table 3 | Descriptive Summary stats of the mapped reads.

Min. X1st.Qu. Median Mean X3rd.Qu. Max. libsum perc05 perc10 perc90 perc95 zeros percent.zeros
HA1 0 0 0 9.574418 2 40458 454287 0 0 8 19 27412 57.77272
CA2 0 0 0 13.053617 2 39814 619368 0 0 10 24 26086 54.97808
HB2 0 0 0 13.399869 2 29289 635797 0 0 9 24 27379 57.70317
CB3 0 0 0 7.600447 1 25784 360626 0 0 6 14 29587 62.35669
CB4 0 0 0 7.135727 1 22337 338576 0 0 6 13 29648 62.48525
HB5 0 0 0 5.802626 1 22823 275323 0 0 5 12 31056 65.45271
HB6 0 0 0 5.006238 1 38833 237536 0 0 4 10 32583 68.67097
CC2 0 0 0 17.092670 2 21412 811013 0 0 8 19 27634 58.24060
CC3 0 0 0 9.950514 2 30425 472132 0 0 6 14 28638 60.35660
CD2 0 0 0 14.828465 3 61616 703581 0 0 13 31 23755 50.06533
CD4 0 0 0 12.294554 1 21399 583352 0 0 5 13 30593 64.47690
HD6 0 0 0 4.055513 1 26716 192426 0 0 4 9 32916 69.37279
CF8 0 0 0 15.811267 2 34172 750213 0 0 9 23 25384 53.49857
HK4 0 0 0 7.253688 2 28958 344173 0 0 6 16 29312 61.77710
HN1 0 0 0 6.315018 1 12385 299635 0 0 4 10 33473 70.54670
HO1 0 0 0 7.296788 2 19439 346218 0 0 8 18 27625 58.22163
HO2 0 0 0 7.522593 2 46609 356932 0 0 6 16 28910 60.92986
CO4 0 0 0 7.327095 2 22917 347656 0 0 7 17 28159 59.34707
CQ2 0 0 0 7.587022 2 29543 359989 0 0 7 17 27665 58.30593
HQ3 0 0 0 10.137519 1 26059 481005 0 0 4 11 32532 68.56348
CQ4 0 0 0 28.439218 3 29319 1349384 0 0 10 26 25525 53.79573
CQ5 0 0 0 12.405750 2 33190 588628 0 0 10 25 25970 54.73360
HQ6 0 0 0 12.559138 2 33867 595906 0 0 9 23 27196 57.31748





Outlier Detection

Conduct array quality metrics to identify outliers

real=newCountDataSet(counts,expDesign) 
real=estimateSizeFactors(real)
plot(sort(sizeFactors(real))) 


Figure 2 | Size factors of each sample used to normalize between libraries.

Create a directory with a bunch of output figures from arrayQualityMetrics() for you to explore if their are outliers.

cds=estimateDispersions(real,method="blind")
vsdBlind=varianceStabilizingTransformation(cds)
arrayQualityMetrics(vsdBlind,intgroup=c("treatment"), force=TRUE, outdir = "arrayQualityMetrics") # this makes a directory "arrayQualityMetrics" and outputs various reports on outliers

Figure 3 | shows a false color heatmap of the distances between arrays. The color scale is chosen to cover the range of distances encountered in the dataset. Patterns in this plot can indicate clustering of the arrays either because of intended biological or unintended experimental factors (batch effects). The distance dab between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In formula, dab = mean | Mai - Mbi |, where Mai is the value of the i-th probe on the a-th array. Outlier detection was performed by looking for arrays for which the sum of the distances to all other arrays, Sa = Σb dab was exceptionally large. One such array was detected, and it is marked by an asterisk, *.



Figure 4 | shows a bar chart of the sum of distances to other arrays Sa, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 17.5 was determined, which is indicated by the vertical line. One array exceeded the threshold and was considered an outlier.



Figure 5 | shows boxplots representing summaries of the signal intensity distributions of the arrays. Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection was performed by computing the Kolmogorov-Smirnov statistic Ka between each array’s distribution and the distribution of the pooled data.



Figure 6 | shows a bar chart of the Kolmogorov-Smirnov statistic Ka, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 0.603 was determined, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.



Figure 7 | shows density estimates (smoothed histograms) of the data. Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems. Various features of the distributions can be indicative of quality related phenomena. For instance, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.



Figure 8 | shows a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation. After normalisation and transformation to a logarithm(-like) scale, one typically expects the red line to be approximately horizontal, that is, show no substantial trend. In some cases, a hump on the right hand of the x-axis can be observed and is symptomatic of a saturation of the intensities.



Figure 9 | shows MA plots. M and A are defined as: M = log2(I1) - log2(I2) A = 1/2 (log2(I1)+log2(I2)), where I1 is the intensity of the array studied,and I2 is the intensity of a “pseudo”-array that consists of the median across arrays. Typically, we expect the mass of the distribution in an MA plot to be concentrated along the M = 0 axis, and there should be no trend in M as a function of A. If there is a trend in the lower range of A, this often indicates that the arrays have different background intensities; this may be addressed by background correction. A trend in the upper range of A can indicate saturation of the measurements; in mild cases, this may be addressed by non-linear normalisation (e.g. quantile normalisation). Outlier detection was performed by computing Hoeffding’s statistic Da on the joint distribution of A and M for each array. Shown are first the 4 arrays with the highest values of Da, then the 4 arrays with the lowest values. The value of Da is shown in the panel headings. 0 arrays had Da>0.15 and were marked as outliers. For more information on Hoeffing’s D-statistic, please see the manual page of the function hoeffd in the Hmisc package.



Figure 10 | shows a bar chart of the Da, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. A threshold of 0.15 was used, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.



Outlier Conclusisons

Much like in the other experiment (hot), we see that a single sample (CP1 in this experiment) is identified as an outlier based on the distances between arrays which was much larger. This sample however was not flagged as being an outlier in any other analyses and so was therefore I considered it unworthy of removing.






Differential Expression




Perform a Wald test in DESeq2 (now the new version) doing a pairwise comparison between hot and control (aka the effect of treatment, denoted as design = ~ treatment below)

dds = DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ genotype + treatment)
dds = DESeq(dds)
results = results(dds)



Let’s check to see if we set up our contrast correctly. We should have the treatment condition first and the control second in the log2 fold change (MLE) output.

head(results)
## log2 fold change (MLE): treatment hot vs control 
## Wald test p-value: treatment hot vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.754706405965306  -0.60136668210655  1.30182527512437
## isogroup10     0.383450425721418   1.08565060450841  3.50412555536888
## isogroup1000    1.32897901722752  -1.42687692603426 0.969795073733801
## isogroup100001  42.4408554733082   1.54428393380751  0.26011961979613
## isogroup100004 0.483099523825571  0.331367581259926  2.74337032051983
## isogroup100011  3.36831075613279 -0.464355783531558 0.613749442645296
##                              stat              pvalue                 padj
##                         <numeric>           <numeric>            <numeric>
## isogroup1      -0.461941163378547   0.644123513886682                   NA
## isogroup10      0.309820692025439   0.756697315380076                   NA
## isogroup1000    -1.47131797704504   0.141205143857845                   NA
## isogroup100001   5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004  0.120788498286712   0.903858560436392                   NA
## isogroup100011  -0.75658852174294   0.449296420158435    0.764299804712495

Looks good!



Table 4 | Descriptive statistics on the normalized read counts

min mean median max zeros percent.zeros
HA1 0 8.230796 0 34780.34 27412 57.77272
CA2 0 8.758342 0 26713.26 26086 54.97808
HB2 0 8.985652 0 19640.55 27379 57.70317
CB3 0 8.875032 0 30107.94 29587 62.35669
CB4 0 8.610759 0 26954.30 29648 62.48525
HB5 0 8.241059 0 32413.89 31056 65.45271
HB6 0 9.055725 0 70244.55 32583 68.67097
CC2 0 12.325855 0 15440.61 27634 58.24060
CC3 0 10.298430 0 31488.80 28638 60.35660
CD2 0 7.947732 0 33024.83 23755 50.06533
CD4 0 14.597700 0 25407.69 30593 64.47690
HD6 0 8.348241 0 54994.67 32916 69.37279
CF8 0 9.960673 0 21527.44 25384 53.49857
HK4 0 7.927985 0 31649.91 29312 61.77710
HN1 0 10.310104 0 20220.15 33473 70.54670
HO1 0 6.849463 0 18247.30 27625 58.22163
HO2 0 8.290486 0 51366.77 28910 60.92986
CO4 0 7.759618 0 24269.80 28159 59.34707
CQ2 0 7.621172 0 29675.98 27665 58.30593
HQ3 0 14.075792 0 36182.53 32532 68.56348
CQ4 0 16.615571 0 17129.58 25525 53.79573
CQ5 0 8.778528 0 23485.83 25970 54.73360
HQ6 0 8.732106 0 23547.02 27196 57.31748



Lets do a rlogged transformation, which is useful various unsupervised clustering analyses. Be sure the include blind = TRUE as it doesn’t normalize the data with any priors from our experimental design.

rlogged = rlogTransformation(dds, blind = TRUE)



The order of treatments is really important here. Putting “hot” before “control” means that the gene expression of corals in the heated treatment will be compared to control, which is what we want. Make sure you put the control group second

In the output table, negative log2FoldChange indicates downregulation in heated compared to control, and positive log2FoldChagne indicates upregulation.

results_FvsC = results(dds, contrast = c("treatment", "hot", "control")) 
head(results_FvsC)
## log2 fold change (MLE): treatment hot vs control 
## Wald test p-value: treatment hot vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.754706405965306  -0.60136668210655  1.30182527512437
## isogroup10     0.383450425721418   1.08565060450841  3.50412555536888
## isogroup1000    1.32897901722752  -1.42687692603426 0.969795073733801
## isogroup100001  42.4408554733082   1.54428393380751  0.26011961979613
## isogroup100004 0.483099523825571  0.331367581259926  2.74337032051983
## isogroup100011  3.36831075613279 -0.464355783531558 0.613749442645296
##                              stat              pvalue                 padj
##                         <numeric>           <numeric>            <numeric>
## isogroup1      -0.461941163378547   0.644123513886682                   NA
## isogroup10      0.309820692025439   0.756697315380076                   NA
## isogroup1000    -1.47131797704504   0.141205143857845                   NA
## isogroup100001   5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004  0.120788498286712   0.903858560436392                   NA
## isogroup100011  -0.75658852174294   0.449296420158435    0.764299804712495



Summary of DEGs with FDR < 0.1
We can see how many are up regulated and how many are down regulated (and what percent of the genome) 4.3% of genes (1901 genes) are upregulated and 7.2% of genes (3151 genes) are downregulated.

summary(results_FvsC)
## 
## out of 43725 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 363, 0.83%
## LFC < 0 (down)     : 367, 0.84%
## outliers [1]       : 4, 0.0091%
## low counts [2]     : 35445, 81%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results



How about those for a FDR of < 0.05? Since there are enough genes to explore here, we will use this for subsequent anaylses.

results_FvsC05 = results(dds, contrast = c("treatment", "hot", "control"), alpha = 0.05) 
summary(results_FvsC05)
## 
## out of 43725 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 230, 0.53%
## LFC < 0 (down)     : 255, 0.58%
## outliers [1]       : 4, 0.0091%
## low counts [2]     : 34601, 79%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
write.csv(results_FvsC05, file="tables/hot_results.csv")



MA plot because people like them, but besides looking neat I find them not overly informative

DESeq2::plotMA(results_FvsC, main = "hot vs Control")

Figure 11 | “Minus over average”, or MA-plot of hot vs control treatments. Red dots represent genes differentially expressed FDR < 0.1 whereas grey dots are not differentially expressed.



Heatmaps




Overall expression, sample by distance.

sampleDists <- as.matrix(dist(t(assay(rlogged))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Figure 12 | Sample by distance heatmap. This figure takes overall expression and clusters samples based on euclidean distances. We can see that samples cluster mostly by genotype with some exceptions.



Individual genes heatmap
I like plotting the z scores, as heatmap2 creates nice clean clusters by doing this. The heatmap shows that our samples are clustering by treatment (H=hot, C=control). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

norm_counts = read.csv("tables/hot_normalized_counts.csv")
hm = read.csv("tables/hot_results.csv") %>% as_tibble() %>%
  filter(padj < 0.05) %>% # only want the most DEGs
  dplyr::select(X) %>%
  merge(norm_counts, by.x = "X", by.y = "X")  # turn into a countdatafile
  row.names(hm) = hm$X
  hm$X = NULL

## Turning into z-score table
hm.z = data.matrix(hm)
hm.z = sweep(hm.z, 1L, rowMeans(hm.z), check.margin = FALSE)
hm.z.sx = apply(hm.z, 1L, sd)
hm.z = sweep(hm.z, 1L, hm.z.sx, "/", check.margin = FALSE)
hm.z = data.matrix(hm.z)

colour = colorRampPalette(rev(c("chocolate1","#FEE090","grey10", "cyan3","cyan")))(100)
heatmap.2(hm.z, col = colour, Rowv = TRUE, Colv = TRUE, scale = "row", 
          dendrogram = "both",
          trace = "none", 
          margin = c(5,15))


Figure 13 | Heatmap of z-scores of DEGs (FDR < 0.05). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.



Principal Component Analyses


First we create a PCA data frame and calculate the variance estimated by PC1 and PC2

pcadata = DESeq2::plotPCA(rlogged, intgroup = c( "treatment", "genotype"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(rlogged)), center = TRUE, scale. = FALSE)



Using adonis from library(vegan) we can see if there are any significant differences based on treatment (as is illustrated in the model displayed below)

adonis(pca$x ~ genotype + treatment, method = 'eu')
## 
## Call:
## adonis(formula = pca$x ~ genotype + treatment, method = "eu") 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## genotype   8     60962  7620.2  1.7056 0.47454  0.001 ***
## treatment  1      9424  9424.1  2.1094 0.07336  0.001 ***
## Residuals 13     58080  4467.7         0.45211           
## Total     22    128466                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



cols = c("control" = "grey", "hot" = "#ea6227")
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = treatment),size = 3) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = treatment)) +
      scale_colour_manual(values = cols) +
      scale_fill_manual(values = cols) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ treatment, p<0.001")+
      theme_cowplot()

Figure 14 | Principal component analyses highlighting treatment groups. Treatment has a significant affect (p < 0.001) on overall expression



DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = genotype)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ genotype, p<0.001")+
      theme_cowplot()

Figure 15 | Principal component analyses highlighting genotypes of coral. Treatment has a significant affect (p < 0.001) on overall expression





Gene ontology




We are going to do a Mann-Whitney U test which requries that we first -logged the pvalue, and multiply it by -1 if it’s less than zero or by 1 if it’s greater than zero

results = read.csv("tables/hot_results.csv")
go_input = results %>%
  mutate(mutated_p = -log(pvalue)) %>%
  mutate(mutated_p_updown = ifelse(log2FoldChange < 0, mutated_p*-1, mutated_p*1)) %>%
  dplyr::select(X, mutated_p_updown) %>%
  na.omit()

colnames(go_input) = NULL

write.csv(go_input, "go_input.csv", row.names = FALSE)



These code are all adapted from Dr. Matz and can be found GO_MWU https://github.com/z0on/GO_MWU This will be broken down into three sections

Cellular Components

input="go_input_saved.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="wh_host_final_iso2go.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
    largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
    smallest=5,   # a GO category should contain at least this many genes to be considered
    clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
#   Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" of you want to test for "greater" or "less" instead. 
#   Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
#   Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module 
)
## Continuous measure of interest: will perform MWU test
## 93  GO terms at 10% FDR
# do not continue if the printout shows that no GO terms pass 10% FDR.
cc_results=gomwuPlot(input,goAnnotations,goDivision,
#   absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
    absValue=1,
    level1=0.01, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
    level2=0.001, # FDR cutoff to print in regular (not italic) font.
    level3=0.0001, # FDR cutoff to print in large bold font.
    txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
    treeHeight=0.5, # height of the hierarchical clustering tree
  colors=c("black","#ea6227","grey40","#f09167") # these are default colors, un-remar and change if needed
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  29 
## "Good genes" accounted for:  2597 out of 4188 ( 62% )
write.csv(cc_results, "cc.csv")

Figure 16 | Cellular component gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.



Molecular Functions

goDivision="MF" 
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1, 
    smallest=5,   
    clusterCutHeight=0.25, 
)
## Continuous measure of interest: will perform MWU test
## 91  GO terms at 10% FDR
mf_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
    colors=c("black","#ea6227","grey40","#f09167") 
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  33 
## "Good genes" accounted for:  1736 out of 4609 ( 38% )
write.csv(mf_results, "mf.csv")

Figure 17 | Molecular function gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.



Biological process

goDivision="BP"
gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1,  
    smallest=5,   
    clusterCutHeight=0.4, 
)
## Continuous measure of interest: will perform MWU test
## 179  GO terms at 10% FDR
bp_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
    colors=c("black","#ea6227","grey40","#f09167") 
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  36 
## "Good genes" accounted for:  2174 out of 4856 ( 45% )
write.csv(bp_results, "bp.csv")

Figure 18 | Biological processes gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.




Version Control


Here is our session info, containing information about versions used

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MASS_7.3-51.1               prettydoc_0.2.1            
##  [3] reshape2_1.4.3              kableExtra_1.1.0           
##  [5] vegan_2.5-4                 permute_0.9-4              
##  [7] plotly_4.9.0                knitr_1.21                 
##  [9] gplots_3.0.1.1              RColorBrewer_1.1-2         
## [11] readr_1.3.1                 cowplot_0.9.4              
## [13] DESeq_1.34.1                lattice_0.20-38            
## [15] locfit_1.5-9.1              genefilter_1.64.0          
## [17] arrayQualityMetrics_3.38.0  affycoretools_1.54.0       
## [19] ggplot2_3.1.0               DESeq2_1.22.2              
## [21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
## [23] BiocParallel_1.16.5         matrixStats_0.54.0         
## [25] Biobase_2.42.0              GenomicRanges_1.34.0       
## [27] GenomeInfoDb_1.18.1         IRanges_2.16.0             
## [29] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [31] dplyr_0.8.0.1               plyr_1.8.4                 
## [33] tidyr_0.8.2                
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.7.0             tidyselect_0.2.5         
##   [3] RSQLite_2.1.1             AnnotationDbi_1.44.0     
##   [5] htmlwidgets_1.3           beadarray_2.32.0         
##   [7] grid_3.5.1                munsell_0.5.0            
##   [9] codetools_0.2-16          preprocessCore_1.44.0    
##  [11] withr_2.1.2               colorspace_1.4-0         
##  [13] Category_2.48.0           OrganismDbi_1.24.0       
##  [15] highr_0.7                 rstudioapi_0.9.0         
##  [17] setRNG_2013.9-1           GenomeInfoDbData_1.2.0   
##  [19] hwriter_1.3.2             bit64_0.9-7              
##  [21] xfun_0.5.3                biovizBase_1.30.1        
##  [23] R6_2.4.0                  illuminaio_0.24.0        
##  [25] gridSVG_1.6-0             AnnotationFilter_1.6.0   
##  [27] bitops_1.0-6              reshape_0.8.8            
##  [29] assertthat_0.2.0          scales_1.0.0             
##  [31] nnet_7.3-12               gtable_0.2.0             
##  [33] Cairo_1.5-9               affy_1.60.0              
##  [35] ggbio_1.30.0              ensembldb_2.6.5          
##  [37] rlang_0.3.1               splines_3.5.1            
##  [39] rtracklayer_1.42.1        lazyeval_0.2.1           
##  [41] acepack_1.4.1             dichromat_2.0-0          
##  [43] checkmate_1.9.1           BiocManager_1.30.4       
##  [45] yaml_2.2.0                GenomicFeatures_1.34.3   
##  [47] backports_1.1.3           Hmisc_4.2-0              
##  [49] RBGL_1.58.1               tools_3.5.1              
##  [51] affyio_1.52.0             ff_2.2-14                
##  [53] Rcpp_1.0.1                base64enc_0.1-3          
##  [55] progress_1.2.0            zlibbioc_1.28.0          
##  [57] purrr_0.3.0               RCurl_1.95-4.11          
##  [59] prettyunits_1.0.2         rpart_4.1-13             
##  [61] openssl_1.2.1             cluster_2.0.7-1          
##  [63] magrittr_1.5              data.table_1.12.0        
##  [65] ProtGenerics_1.14.0       hms_0.4.2                
##  [67] evaluate_0.12             xtable_1.8-3             
##  [69] XML_3.98-1.16             gcrma_2.54.0             
##  [71] gridExtra_2.3             compiler_3.5.1           
##  [73] biomaRt_2.38.0            tibble_2.0.1             
##  [75] KernSmooth_2.23-15        crayon_1.3.4             
##  [77] ReportingTools_2.22.1     R.oo_1.22.0              
##  [79] htmltools_0.3.6           GOstats_2.48.0           
##  [81] mgcv_1.8-26               Formula_1.2-3            
##  [83] geneplotter_1.60.0        DBI_1.0.0                
##  [85] Matrix_1.2-15             vsn_3.50.0               
##  [87] R.methodsS3_1.7.1         gdata_2.18.0             
##  [89] pkgconfig_2.0.2           GenomicAlignments_1.18.1 
##  [91] foreign_0.8-71            xml2_1.2.0               
##  [93] foreach_1.4.4             annotate_1.60.0          
##  [95] BeadDataPackR_1.34.0      affyPLM_1.58.0           
##  [97] webshot_0.5.1             XVector_0.22.0           
##  [99] AnnotationForge_1.24.0    rvest_0.3.4              
## [101] stringr_1.4.0             VariantAnnotation_1.28.10
## [103] digest_0.6.18             graph_1.60.0             
## [105] Biostrings_2.50.2         rmarkdown_1.11           
## [107] base64_2.0                htmlTable_1.13.1         
## [109] edgeR_3.24.3              GSEABase_1.44.0          
## [111] curl_3.3                  Rsamtools_1.34.1         
## [113] gtools_3.8.1              nlme_3.1-137             
## [115] jsonlite_1.6              PFAM.db_3.7.0            
## [117] viridisLite_0.3.0         askpass_1.1              
## [119] limma_3.38.3              BSgenome_1.50.0          
## [121] pillar_1.3.1              GGally_1.4.0             
## [123] httr_1.4.0                survival_2.43-3          
## [125] GO.db_3.7.0               glue_1.3.0.9000          
## [127] iterators_1.0.10          bit_1.1-14               
## [129] Rgraphviz_2.26.0          stringi_1.2.4            
## [131] blob_1.1.1                oligoClasses_1.44.0      
## [133] latticeExtra_0.6-28       caTools_1.17.1.1         
## [135] memoise_1.1.0