Introduction

The gene expression dataset analyzed here comes from a project that examined the physiological and transcriptomic effects of hot water on the temperate coral Astrangia poculata. A. poculata is a facultatively symbiotic coral, meaning that it exists in the wild both with and without symbionts. However, only aposymbiotic corals were used in the experiment described here. Here, corals were placed in either a chilled treatment “hot” or a control condition. The hot group had temperatures increased to 31°C or maintained at 16°C (control) over the course of 15 days, and were sampled for gene expression on day 15. Physiology measured (polyp extension) over the course of the experiment demonstrated that the corals pulled in their polyps in response to warming temperatures, which is the same phenotype that corals in the related hot experiment exhibited. It is important to note that the control temperature used in this experiment is different from that used in the hot experiment, which is why we do not make direct comparisons between these studies. Here, we use the program DESeq2 to quantify differential gene expression of corals in the control and heated treatments.

Analysis pipeline

First we load appropriate libraries

library(tidyr)
library(plyr)
library(dplyr)
library(DESeq2)
library(ggplot2)
library(affycoretools)
library(arrayQualityMetrics)
library(genefilter)
library(DESeq)
library(cowplot)
library(readr)
library(RColorBrewer)
library(gplots)
library(knitr)
library(plotly)
library(vegan)
library(kableExtra)
library(reshape2)
library(prettydoc)
library(MASS)

Behavioural Response to food.

Coral polyp behaviours were observed every 3-4 days throughout the experiment. The total surface area of the coral that had extended vs retracted polyps was estimated and then scored between 1-5 based on polyp activity

Table 1 | Observational measurement of polyp activity. The percent of polyps that were extended were estimated and scored as below.

Score	Percent.of.colony.with.extended.polyps
1	0
2	25
3	50
4	75
5	100

Read in data, transform to long form and organize.

behaviour = read.csv("hot_behaviour.csv") %>% 
  melt() %>% 
  rename(day = variable) %>%
  rename(polyp_behaviour = value) %>%
  mutate(genotype = sapply(strsplit(as.character(individual), split = ""), "[[", 2))

behaviour$day = as.character(behaviour$day)
data = behaviour %>%
  mutate(temperature = recode(day, d1 = "16", d2 = "23", d3 = "25", d4 = "28", d5 = "31"))

In case someone is interested in the descriptive statistics, here we calculate means, standard error and standard deviation.

hot_sum = behaviour %>%
  group_by(treatment, day) %>%
  summarise( 
    n=n(),
    mean=mean(polyp_behaviour),
    sd=sd(polyp_behaviour)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

kable(hot_sum) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

treatment	day	n	mean	sd	se	ic
control	d1	21	5.000000	0.0000000	0.0000000	0.0000000
control	d2	21	5.000000	0.0000000	0.0000000	0.0000000
control	d3	21	5.000000	0.0000000	0.0000000	0.0000000
control	d4	21	5.000000	0.0000000	0.0000000	0.0000000
control	d5	21	5.000000	0.0000000	0.0000000	0.0000000
hot	d1	21	5.000000	0.0000000	0.0000000	0.0000000
hot	d2	21	4.428571	0.8701396	0.1898800	0.3960828
hot	d3	21	4.476191	1.0304876	0.2248708	0.4690723
hot	d4	21	2.476190	1.2090925	0.2638456	0.5503723
hot	d5	21	2.714286	1.1892375	0.2595129	0.5413344

This is how we plotted the behavioural figure in the manuscript. We selected the same days tha

hot_stacked = behaviour %>%
  group_by(polyp_behaviour, day, treatment) %>%
  summarise(n = n()) %>%
  mutate(frequency = n / sum(n)) 

ggplot(hot_stacked, aes(y = frequency, x = day, fill = as.factor(polyp_behaviour))) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(. ~ treatment) + 
  scale_fill_brewer(palette = "Reds", direction=1) +
  labs(fill = "Behavioural Score") +
  theme_cowplot()

Figure 1 | Behavioural scores after feeding. The total surface area of the coral with extended polyps was estimated and then scored between 1-5 based on polyp activity (see Table 1)

model = polr(as.factor(polyp_behaviour) ~ treatment + day, data = behaviour)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = as.factor(polyp_behaviour) ~ treatment + day, 
##     data = behaviour)
## 
## Coefficients:
##               Value Std. Error t value
## treatmenthot -19.22     0.1819 -105.69
## dayd2        -16.93     0.4107  -41.21
## dayd3        -16.43     0.4671  -35.17
## dayd4        -19.93     0.3726  -53.48
## dayd5        -19.59     0.3611  -54.26
## 
## Intercepts:
##     Value     Std. Error t value  
## 1|2  -40.3389    0.1819  -221.8048
## 2|3  -38.8912    0.2566  -151.5488
## 3|4  -37.8752    0.2909  -130.2071
## 4|5  -36.6582    0.3596  -101.9315
## 
## Residual Deviance: 209.8631 
## AIC: 227.8631

table = coef(summary(model))

## 
## Re-fitting to get Hessian

table

##                  Value Std. Error    t value
## treatmenthot -19.22189  0.1818671 -105.69199
## dayd2        -16.92510  0.4106563  -41.21475
## dayd3        -16.42655  0.4670713  -35.16926
## dayd4        -19.92747  0.3726301  -53.47788
## dayd5        -19.59392  0.3611129  -54.25981
## 1|2          -40.33895  0.1818669 -221.80477
## 2|3          -38.89117  0.2566247 -151.54883
## 3|4          -37.87523  0.2908845 -130.20711
## 4|5          -36.65824  0.3596361 -101.93148

Calculate and store the p-value

p = pnorm(abs(table[,"t value"]), lower.tail = FALSE) * 2
p

##  treatmenthot         dayd2         dayd3         dayd4         dayd5 
##  0.000000e+00  0.000000e+00 5.902959e-271  0.000000e+00  0.000000e+00 
##           1|2           2|3           3|4           4|5 
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00

Differential Expression

The general pipeline that was used to generate these count files are well described by the Matz lab out of UT Austin. Check out their github page for details: https://github.com/z0on/tag-based_RNAseq

counts = read.csv("~/Documents/Experiments/freezing_hot/hot_raw.csv", row.names = 1)

Set up experimental design matrix. You’ll notice that I strip the header of each column to inform me what treatment it belongs to.

treatment = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 1)) %>%
            revalue(c("C" = "control", "H" = "hot"))
genotype  = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 2))

expDesign = data.frame(treatment, genotype)
            expDesign$sample = colnames(counts)
            write.csv(expDesign, "tables/hot_expDesign.csv", row.names = F)

Table 2 | Experimental Design key with sample names

treatment	genotype	sample
hot	A	HA1
control	A	CA2
hot	B	HB2
control	B	CB3
control	B	CB4
hot	B	HB5
hot	B	HB6
control	C	CC2
control	C	CC3
control	D	CD2
control	D	CD4
hot	D	HD6
control	F	CF8
hot	K	HK4
hot	N	HN1
hot	O	HO1
hot	O	HO2
control	O	CO4
control	Q	CQ2
hot	Q	HQ3
control	Q	CQ4
control	Q	CQ5
hot	Q	HQ6

Table 3 | Descriptive Summary stats of the mapped reads.

	Mean	X3rd.Qu.	Max.	libsum	perc90	perc95	zeros	percent.zeros
HA1	9.574418	2	40458	454287	8	19	27412	57.77272
CA2	13.053617	2	39814	619368	10	24	26086	54.97808
HB2	13.399869	2	29289	635797	9	24	27379	57.70317
CB3	7.600447	1	25784	360626	6	14	29587	62.35669
CB4	7.135727	1	22337	338576	6	13	29648	62.48525
HB5	5.802626	1	22823	275323	5	12	31056	65.45271
HB6	5.006238	1	38833	237536	4	10	32583	68.67097
CC2	17.092670	2	21412	811013	8	19	27634	58.24060
CC3	9.950514	2	30425	472132	6	14	28638	60.35660
CD2	14.828465	3	61616	703581	13	31	23755	50.06533
CD4	12.294554	1	21399	583352	5	13	30593	64.47690
HD6	4.055513	1	26716	192426	4	9	32916	69.37279
CF8	15.811267	2	34172	750213	9	23	25384	53.49857
HK4	7.253688	2	28958	344173	6	16	29312	61.77710
HN1	6.315018	1	12385	299635	4	10	33473	70.54670
HO1	7.296788	2	19439	346218	8	18	27625	58.22163
HO2	7.522593	2	46609	356932	6	16	28910	60.92986
CO4	7.327095	2	22917	347656	7	17	28159	59.34707
CQ2	7.587022	2	29543	359989	7	17	27665	58.30593
HQ3	10.137519	1	26059	481005	4	11	32532	68.56348
CQ4	28.439218	3	29319	1349384	10	26	25525	53.79573
CQ5	12.405750	2	33190	588628	10	25	25970	54.73360
HQ6	12.559138	2	33867	595906	9	23	27196	57.31748

Outlier Detection

Conduct array quality metrics to identify outliers

real=newCountDataSet(counts,expDesign) 
real=estimateSizeFactors(real)

plot(sort(sizeFactors(real)))

Figure 2 | Size factors of each sample used to normalize between libraries.

Create a directory with a bunch of output figures from arrayQualityMetrics() for you to explore if their are outliers.

cds=estimateDispersions(real,method="blind")
vsdBlind=varianceStabilizingTransformation(cds)
arrayQualityMetrics(vsdBlind,intgroup=c("treatment"), force=TRUE, outdir = "arrayQualityMetrics") # this makes a directory "arrayQualityMetrics" and outputs various reports on outliers

Figure 3 | shows a false color heatmap of the distances between arrays. The color scale is chosen to cover the range of distances encountered in the dataset. Patterns in this plot can indicate clustering of the arrays either because of intended biological or unintended experimental factors (batch effects). The distance dab between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In formula, dab = mean | Mai - Mbi |, where Mai is the value of the i-th probe on the a-th array. Outlier detection was performed by looking for arrays for which the sum of the distances to all other arrays, Sa = Σb dab was exceptionally large. One such array was detected, and it is marked by an asterisk, *.

Figure 4 | shows a bar chart of the sum of distances to other arrays Sa, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 17.5 was determined, which is indicated by the vertical line. One array exceeded the threshold and was considered an outlier.

Figure 5 | shows boxplots representing summaries of the signal intensity distributions of the arrays. Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection was performed by computing the Kolmogorov-Smirnov statistic Ka between each array’s distribution and the distribution of the pooled data.

Figure 6 | shows a bar chart of the Kolmogorov-Smirnov statistic Ka, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 0.603 was determined, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.

Figure 7 | shows density estimates (smoothed histograms) of the data. Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems. Various features of the distributions can be indicative of quality related phenomena. For instance, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.

Figure 8 | shows a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation. After normalisation and transformation to a logarithm(-like) scale, one typically expects the red line to be approximately horizontal, that is, show no substantial trend. In some cases, a hump on the right hand of the x-axis can be observed and is symptomatic of a saturation of the intensities.

Figure 9 | shows MA plots. M and A are defined as: M = log2(I1) - log2(I2) A = 1/2 (log2(I1)+log2(I2)), where I1 is the intensity of the array studied,and I2 is the intensity of a “pseudo”-array that consists of the median across arrays. Typically, we expect the mass of the distribution in an MA plot to be concentrated along the M = 0 axis, and there should be no trend in M as a function of A. If there is a trend in the lower range of A, this often indicates that the arrays have different background intensities; this may be addressed by background correction. A trend in the upper range of A can indicate saturation of the measurements; in mild cases, this may be addressed by non-linear normalisation (e.g. quantile normalisation). Outlier detection was performed by computing Hoeffding’s statistic Da on the joint distribution of A and M for each array. Shown are first the 4 arrays with the highest values of Da, then the 4 arrays with the lowest values. The value of Da is shown in the panel headings. 0 arrays had Da>0.15 and were marked as outliers. For more information on Hoeffing’s D-statistic, please see the manual page of the function hoeffd in the Hmisc package.

Figure 10 | shows a bar chart of the Da, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. A threshold of 0.15 was used, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.

Outlier Conclusisons

Much like in the other experiment (hot), we see that a single sample (CP1 in this experiment) is identified as an outlier based on the distances between arrays which was much larger. This sample however was not flagged as being an outlier in any other analyses and so was therefore I considered it unworthy of removing.

Differential Expression

Perform a Wald test in DESeq2 (now the new version) doing a pairwise comparison between hot and control (aka the effect of treatment, denoted as design = ~ treatment below)

dds = DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ genotype + treatment)
dds = DESeq(dds)
results = results(dds)

Let’s check to see if we set up our contrast correctly. We should have the treatment condition first and the control second in the log2 fold change (MLE) output.

head(results)

## log2 fold change (MLE): treatment hot vs control 
## Wald test p-value: treatment hot vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.754706405965306  -0.60136668210655  1.30182527512437
## isogroup10     0.383450425721418   1.08565060450841  3.50412555536888
## isogroup1000    1.32897901722752  -1.42687692603426 0.969795073733801
## isogroup100001  42.4408554733082   1.54428393380751  0.26011961979613
## isogroup100004 0.483099523825571  0.331367581259926  2.74337032051983
## isogroup100011  3.36831075613279 -0.464355783531558 0.613749442645296
##                              stat              pvalue                 padj
##                         <numeric>           <numeric>            <numeric>
## isogroup1      -0.461941163378547   0.644123513886682                   NA
## isogroup10      0.309820692025439   0.756697315380076                   NA
## isogroup1000    -1.47131797704504   0.141205143857845                   NA
## isogroup100001   5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004  0.120788498286712   0.903858560436392                   NA
## isogroup100011  -0.75658852174294   0.449296420158435    0.764299804712495

Looks good!

Table 4 | Descriptive statistics on the normalized read counts

	mean	max	zeros	percent.zeros
HA1	8.230796	34780.34	27412	57.77272
CA2	8.758342	26713.26	26086	54.97808
HB2	8.985652	19640.55	27379	57.70317
CB3	8.875032	30107.94	29587	62.35669
CB4	8.610759	26954.30	29648	62.48525
HB5	8.241059	32413.89	31056	65.45271
HB6	9.055725	70244.55	32583	68.67097
CC2	12.325855	15440.61	27634	58.24060
CC3	10.298430	31488.80	28638	60.35660
CD2	7.947732	33024.83	23755	50.06533
CD4	14.597700	25407.69	30593	64.47690
HD6	8.348241	54994.67	32916	69.37279
CF8	9.960673	21527.44	25384	53.49857
HK4	7.927985	31649.91	29312	61.77710
HN1	10.310104	20220.15	33473	70.54670
HO1	6.849463	18247.30	27625	58.22163
HO2	8.290486	51366.77	28910	60.92986
CO4	7.759618	24269.80	28159	59.34707
CQ2	7.621172	29675.98	27665	58.30593
HQ3	14.075792	36182.53	32532	68.56348
CQ4	16.615571	17129.58	25525	53.79573
CQ5	8.778528	23485.83	25970	54.73360
HQ6	8.732106	23547.02	27196	57.31748

Lets do a rlogged transformation, which is useful various unsupervised clustering analyses. Be sure the include blind = TRUE as it doesn’t normalize the data with any priors from our experimental design.

rlogged = rlogTransformation(dds, blind = TRUE)

The order of treatments is really important here. Putting “hot” before “control” means that the gene expression of corals in the heated treatment will be compared to control, which is what we want. Make sure you put the control group second

In the output table, negative log2FoldChange indicates downregulation in heated compared to control, and positive log2FoldChagne indicates upregulation.

results_FvsC = results(dds, contrast = c("treatment", "hot", "control")) 
head(results_FvsC)

## log2 fold change (MLE): treatment hot vs control 
## Wald test p-value: treatment hot vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.754706405965306  -0.60136668210655  1.30182527512437
## isogroup10     0.383450425721418   1.08565060450841  3.50412555536888
## isogroup1000    1.32897901722752  -1.42687692603426 0.969795073733801
## isogroup100001  42.4408554733082   1.54428393380751  0.26011961979613
## isogroup100004 0.483099523825571  0.331367581259926  2.74337032051983
## isogroup100011  3.36831075613279 -0.464355783531558 0.613749442645296
##                              stat              pvalue                 padj
##                         <numeric>           <numeric>            <numeric>
## isogroup1      -0.461941163378547   0.644123513886682                   NA
## isogroup10      0.309820692025439   0.756697315380076                   NA
## isogroup1000    -1.47131797704504   0.141205143857845                   NA
## isogroup100001   5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004  0.120788498286712   0.903858560436392                   NA
## isogroup100011  -0.75658852174294   0.449296420158435    0.764299804712495

Summary of DEGs with FDR < 0.1
We can see how many are up regulated and how many are down regulated (and what percent of the genome) 4.3% of genes (1901 genes) are upregulated and 7.2% of genes (3151 genes) are downregulated.

summary(results_FvsC)

## 
## out of 43725 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 363, 0.83%
## LFC < 0 (down)     : 367, 0.84%
## outliers [1]       : 4, 0.0091%
## low counts [2]     : 35445, 81%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

How about those for a FDR of < 0.05? Since there are enough genes to explore here, we will use this for subsequent anaylses.

results_FvsC05 = results(dds, contrast = c("treatment", "hot", "control"), alpha = 0.05) 
summary(results_FvsC05)

## 
## out of 43725 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 230, 0.53%
## LFC < 0 (down)     : 255, 0.58%
## outliers [1]       : 4, 0.0091%
## low counts [2]     : 34601, 79%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

write.csv(results_FvsC05, file="tables/hot_results.csv")

MA plot because people like them, but besides looking neat I find them not overly informative

DESeq2::plotMA(results_FvsC, main = "hot vs Control")

Figure 11 | “Minus over average”, or MA-plot of hot vs control treatments. Red dots represent genes differentially expressed FDR < 0.1 whereas grey dots are not differentially expressed.

Heatmaps

Overall expression, sample by distance.

sampleDists <- as.matrix(dist(t(assay(rlogged))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Figure 12 | Sample by distance heatmap. This figure takes overall expression and clusters samples based on euclidean distances. We can see that samples cluster mostly by genotype with some exceptions.

Individual genes heatmap
I like plotting the z scores, as heatmap2 creates nice clean clusters by doing this. The heatmap shows that our samples are clustering by treatment (H=hot, C=control). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

norm_counts = read.csv("tables/hot_normalized_counts.csv")
hm = read.csv("tables/hot_results.csv") %>% as_tibble() %>%
  filter(padj < 0.05) %>% # only want the most DEGs
  dplyr::select(X) %>%
  merge(norm_counts, by.x = "X", by.y = "X")  # turn into a countdatafile
  row.names(hm) = hm$X
  hm$X = NULL

## Turning into z-score table
hm.z = data.matrix(hm)
hm.z = sweep(hm.z, 1L, rowMeans(hm.z), check.margin = FALSE)
hm.z.sx = apply(hm.z, 1L, sd)
hm.z = sweep(hm.z, 1L, hm.z.sx, "/", check.margin = FALSE)
hm.z = data.matrix(hm.z)

colour = colorRampPalette(rev(c("chocolate1","#FEE090","grey10", "cyan3","cyan")))(100)
heatmap.2(hm.z, col = colour, Rowv = TRUE, Colv = TRUE, scale = "row", 
          dendrogram = "both",
          trace = "none", 
          margin = c(5,15))

Figure 13 | Heatmap of z-scores of DEGs (FDR < 0.05). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

Principal Component Analyses

First we create a PCA data frame and calculate the variance estimated by PC1 and PC2

pcadata = DESeq2::plotPCA(rlogged, intgroup = c( "treatment", "genotype"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(rlogged)), center = TRUE, scale. = FALSE)

Using adonis from library(vegan) we can see if there are any significant differences based on treatment (as is illustrated in the model displayed below)

adonis(pca$x ~ genotype + treatment, method = 'eu')

## 
## Call:
## adonis(formula = pca$x ~ genotype + treatment, method = "eu") 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## genotype   8     60962  7620.2  1.7056 0.47454  0.001 ***
## treatment  1      9424  9424.1  2.1094 0.07336  0.001 ***
## Residuals 13     58080  4467.7         0.45211           
## Total     22    128466                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cols = c("control" = "grey", "hot" = "#ea6227")
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = treatment),size = 3) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = treatment)) +
      scale_colour_manual(values = cols) +
      scale_fill_manual(values = cols) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ treatment, p<0.001")+
      theme_cowplot()

Figure 14 | Principal component analyses highlighting treatment groups. Treatment has a significant affect (p < 0.001) on overall expression

DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = genotype)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ genotype, p<0.001")+
      theme_cowplot()

Figure 15 | Principal component analyses highlighting genotypes of coral. Treatment has a significant affect (p < 0.001) on overall expression

Gene ontology

We are going to do a Mann-Whitney U test which requries that we first -logged the pvalue, and multiply it by -1 if it’s less than zero or by 1 if it’s greater than zero

results = read.csv("tables/hot_results.csv")
go_input = results %>%
  mutate(mutated_p = -log(pvalue)) %>%
  mutate(mutated_p_updown = ifelse(log2FoldChange < 0, mutated_p*-1, mutated_p*1)) %>%
  dplyr::select(X, mutated_p_updown) %>%
  na.omit()

colnames(go_input) = NULL

write.csv(go_input, "go_input.csv", row.names = FALSE)

These code are all adapted from Dr. Matz and can be found GO_MWU https://github.com/z0on/GO_MWU This will be broken down into three sections

Cellular Components

input="go_input_saved.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="wh_host_final_iso2go.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
    largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
    smallest=5,   # a GO category should contain at least this many genes to be considered
    clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
#   Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" of you want to test for "greater" or "less" instead. 
#   Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
#   Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module 
)

## Continuous measure of interest: will perform MWU test
## 93  GO terms at 10% FDR

# do not continue if the printout shows that no GO terms pass 10% FDR.

cc_results=gomwuPlot(input,goAnnotations,goDivision,
#   absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
    absValue=1,
    level1=0.01, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
    level2=0.001, # FDR cutoff to print in regular (not italic) font.
    level3=0.0001, # FDR cutoff to print in large bold font.
    txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
    treeHeight=0.5, # height of the hierarchical clustering tree
  colors=c("black","#ea6227","grey40","#f09167") # these are default colors, un-remar and change if needed
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  29 
## "Good genes" accounted for:  2597 out of 4188 ( 62% )

write.csv(cc_results, "cc.csv")

Figure 16 | Cellular component gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.

Molecular Functions

goDivision="MF"

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1, 
    smallest=5,   
    clusterCutHeight=0.25, 
)

## Continuous measure of interest: will perform MWU test
## 91  GO terms at 10% FDR

mf_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
    colors=c("black","#ea6227","grey40","#f09167") 
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  33 
## "Good genes" accounted for:  1736 out of 4609 ( 38% )

write.csv(mf_results, "mf.csv")

Figure 17 | Molecular function gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.

Biological process

goDivision="BP"

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1,  
    smallest=5,   
    clusterCutHeight=0.4, 
)

## Continuous measure of interest: will perform MWU test
## 179  GO terms at 10% FDR

bp_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
    colors=c("black","#ea6227","grey40","#f09167") 
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  36 
## "Good genes" accounted for:  2174 out of 4856 ( 45% )

write.csv(bp_results, "bp.csv")

Figure 18 | Biological processes gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.

Version Control

Here is our session info, containing information about versions used

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MASS_7.3-51.1               prettydoc_0.2.1            
##  [3] reshape2_1.4.3              kableExtra_1.1.0           
##  [5] vegan_2.5-4                 permute_0.9-4              
##  [7] plotly_4.9.0                knitr_1.21                 
##  [9] gplots_3.0.1.1              RColorBrewer_1.1-2         
## [11] readr_1.3.1                 cowplot_0.9.4              
## [13] DESeq_1.34.1                lattice_0.20-38            
## [15] locfit_1.5-9.1              genefilter_1.64.0          
## [17] arrayQualityMetrics_3.38.0  affycoretools_1.54.0       
## [19] ggplot2_3.1.0               DESeq2_1.22.2              
## [21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
## [23] BiocParallel_1.16.5         matrixStats_0.54.0         
## [25] Biobase_2.42.0              GenomicRanges_1.34.0       
## [27] GenomeInfoDb_1.18.1         IRanges_2.16.0             
## [29] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [31] dplyr_0.8.0.1               plyr_1.8.4                 
## [33] tidyr_0.8.2                
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.7.0             tidyselect_0.2.5         
##   [3] RSQLite_2.1.1             AnnotationDbi_1.44.0     
##   [5] htmlwidgets_1.3           beadarray_2.32.0         
##   [7] grid_3.5.1                munsell_0.5.0            
##   [9] codetools_0.2-16          preprocessCore_1.44.0    
##  [11] withr_2.1.2               colorspace_1.4-0         
##  [13] Category_2.48.0           OrganismDbi_1.24.0       
##  [15] highr_0.7                 rstudioapi_0.9.0         
##  [17] setRNG_2013.9-1           GenomeInfoDbData_1.2.0   
##  [19] hwriter_1.3.2             bit64_0.9-7              
##  [21] xfun_0.5.3                biovizBase_1.30.1        
##  [23] R6_2.4.0                  illuminaio_0.24.0        
##  [25] gridSVG_1.6-0             AnnotationFilter_1.6.0   
##  [27] bitops_1.0-6              reshape_0.8.8            
##  [29] assertthat_0.2.0          scales_1.0.0             
##  [31] nnet_7.3-12               gtable_0.2.0             
##  [33] Cairo_1.5-9               affy_1.60.0              
##  [35] ggbio_1.30.0              ensembldb_2.6.5          
##  [37] rlang_0.3.1               splines_3.5.1            
##  [39] rtracklayer_1.42.1        lazyeval_0.2.1           
##  [41] acepack_1.4.1             dichromat_2.0-0          
##  [43] checkmate_1.9.1           BiocManager_1.30.4       
##  [45] yaml_2.2.0                GenomicFeatures_1.34.3   
##  [47] backports_1.1.3           Hmisc_4.2-0              
##  [49] RBGL_1.58.1               tools_3.5.1              
##  [51] affyio_1.52.0             ff_2.2-14                
##  [53] Rcpp_1.0.1                base64enc_0.1-3          
##  [55] progress_1.2.0            zlibbioc_1.28.0          
##  [57] purrr_0.3.0               RCurl_1.95-4.11          
##  [59] prettyunits_1.0.2         rpart_4.1-13             
##  [61] openssl_1.2.1             cluster_2.0.7-1          
##  [63] magrittr_1.5              data.table_1.12.0        
##  [65] ProtGenerics_1.14.0       hms_0.4.2                
##  [67] evaluate_0.12             xtable_1.8-3             
##  [69] XML_3.98-1.16             gcrma_2.54.0             
##  [71] gridExtra_2.3             compiler_3.5.1           
##  [73] biomaRt_2.38.0            tibble_2.0.1             
##  [75] KernSmooth_2.23-15        crayon_1.3.4             
##  [77] ReportingTools_2.22.1     R.oo_1.22.0              
##  [79] htmltools_0.3.6           GOstats_2.48.0           
##  [81] mgcv_1.8-26               Formula_1.2-3            
##  [83] geneplotter_1.60.0        DBI_1.0.0                
##  [85] Matrix_1.2-15             vsn_3.50.0               
##  [87] R.methodsS3_1.7.1         gdata_2.18.0             
##  [89] pkgconfig_2.0.2           GenomicAlignments_1.18.1 
##  [91] foreign_0.8-71            xml2_1.2.0               
##  [93] foreach_1.4.4             annotate_1.60.0          
##  [95] BeadDataPackR_1.34.0      affyPLM_1.58.0           
##  [97] webshot_0.5.1             XVector_0.22.0           
##  [99] AnnotationForge_1.24.0    rvest_0.3.4              
## [101] stringr_1.4.0             VariantAnnotation_1.28.10
## [103] digest_0.6.18             graph_1.60.0             
## [105] Biostrings_2.50.2         rmarkdown_1.11           
## [107] base64_2.0                htmlTable_1.13.1         
## [109] edgeR_3.24.3              GSEABase_1.44.0          
## [111] curl_3.3                  Rsamtools_1.34.1         
## [113] gtools_3.8.1              nlme_3.1-137             
## [115] jsonlite_1.6              PFAM.db_3.7.0            
## [117] viridisLite_0.3.0         askpass_1.1              
## [119] limma_3.38.3              BSgenome_1.50.0          
## [121] pillar_1.3.1              GGally_1.4.0             
## [123] httr_1.4.0                survival_2.43-3          
## [125] GO.db_3.7.0               glue_1.3.0.9000          
## [127] iterators_1.0.10          bit_1.1-14               
## [129] Rgraphviz_2.26.0          stringi_1.2.4            
## [131] blob_1.1.1                oligoClasses_1.44.0      
## [133] latticeExtra_0.6-28       caTools_1.17.1.1         
## [135] memoise_1.1.0