Introduction

The gene expression dataset analyzed here comes from a project that examined the physiological and transcriptomic effects of cold water on the temperate coral Astrangia poculata. A. poculata is a facultatively symbiotic coral, meaning that it exists in the wild both with and without symbionts. However, only aposymbiotic corals were used in the experiment described here. Here, corals were placed in either a chilled treatment “cold” or a control condition. The cold group had temperatures decreased to 6°C or maintained at 22°C (control) over the course of 15 days, and were sampled for gene expression on day 15. Physiology measured (polyp extension) over the course of the experiment demonstrated that the corals pulled in their polyps in response to warming temperatures, which is the same phenotype that corals in the related cold experiment exhibited. It is important to note that the control temperature used in this experiment is different from that used in the cold experiment, which is why we do not make direct comparisons between these studies. Here, we use the program DESeq2 to quantify differential gene expression of corals in the control and heated treatments.

Analysis pipeline

First we load appropriate libraries

library(tidyr)
library(plyr)
library(dplyr)
library(DESeq2)
library(ggplot2)
library(affycoretools)
library(arrayQualityMetrics)
library(genefilter)
library(DESeq)
library(cowplot)
library(readr)
library(RColorBrewer)
library(gplots)
library(knitr)
library(plotly)
library(vegan)
library(kableExtra)
library(reshape2)
library(prettydoc)

Behavioural Response to food.

Coral polyp behaviours were observed every 3-4 days throughout the experiment. The total surface area of the coral that had extended vs retracted polyps was estimated and then scored between 1-5 based on polyp activity

Table 1 | Observational measurement of polyp activity. The percent of polyps that were extended were estimated and scored as below.

Score	Percent.of.colony.with.extended.polyps
1	0
2	25
3	50
4	75
5	100

Read in data, transform to long form and organize.

behaviour = read.csv("Freeze_behaviour.csv") %>% 
  melt() %>% 
  rename(day = variable) %>%
  rename(polyp_behaviour = value) %>%
  rename(treatment = group)

In case someone is interested in the descriptive statistics, here we calculate means, standard error and standard deviation.

cold_sum = behaviour %>%
  group_by(treatment, day) %>%
  summarise( 
    n=n(),
    mean=mean(polyp_behaviour),
    sd=sd(polyp_behaviour)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

kable(cold_sum) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, fixed_thead = TRUE)

treatment	day	n	mean	sd	se	ic
control	d1	21	5.000000	0.0000000	0.0000000	0.0000000
control	d2	21	4.666667	0.6582806	0.1436486	0.2996457
control	d3	21	3.619048	1.1608700	0.2533226	0.5284217
control	d4	21	4.428571	0.9783368	0.2134906	0.4453336
control	d5	21	2.857143	1.3522468	0.2950844	0.6155354
control	d6	21	2.285714	1.1892375	0.2595129	0.5413344
control	d7	21	3.380952	1.3592715	0.2966174	0.6187330
control	d8	21	3.619048	1.0235326	0.2233531	0.4659065
control	d9	21	3.285714	1.3469542	0.2939295	0.6131262
control	d10	21	4.428571	0.8701396	0.1898800	0.3960828
control	d11	21	4.095238	0.9436505	0.2059214	0.4295445
control	d12	21	4.190476	0.9283883	0.2025909	0.4225973
control	d13	21	2.380952	1.4309504	0.3122590	0.6513608
control	d14	21	4.380952	0.7400129	0.1614840	0.3368498
control	d15	21	4.380952	0.9734573	0.2124258	0.4431124
freeze	d1	21	4.857143	0.4780914	0.1043281	0.2176246
freeze	d2	21	4.857143	0.3585686	0.0782461	0.1632185
freeze	d3	21	3.666667	1.2382784	0.2702145	0.5636576
freeze	d4	21	3.952381	1.2440334	0.2714703	0.5662772
freeze	d5	21	3.476190	1.1233453	0.2451341	0.5113407
freeze	d6	21	3.095238	0.9436505	0.2059214	0.4295445
freeze	d7	21	2.380952	0.9206623	0.2009050	0.4190804
freeze	d8	21	1.523810	0.6015852	0.1312767	0.2738383
freeze	d9	21	1.285714	0.4629100	0.1010153	0.2107141
freeze	d10	21	1.142857	0.3585686	0.0782461	0.1632185
freeze	d11	21	1.000000	0.0000000	0.0000000	0.0000000
freeze	d12	21	1.000000	0.0000000	0.0000000	0.0000000
freeze	d13	21	1.000000	0.0000000	0.0000000	0.0000000
freeze	d14	21	1.000000	0.0000000	0.0000000	0.0000000
freeze	d15	21	1.047619	0.2182179	0.0476190	0.0993316

This is how we plotted the behavioural figure in the manuscript. We selected the same days tha

cold_stacked = behaviour %>%
  filter(day == c("d1","d4","d7","d11","d15")) %>%
  group_by(polyp_behaviour, day, treatment) %>%
  summarise(n = n()) %>%
  mutate(frequency = n / sum(n)) 

ggplot(cold_stacked, aes(y = frequency, x = day, fill = as.factor(polyp_behaviour))) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(. ~ treatment) + 
  scale_fill_brewer(palette = "Blues", direction=1) +
  labs(fill = "Behavioural Score") +
  theme_cowplot()

Figure 1 | Behavioural scores after feeding. The total surface area of the coral with extended polyps was estimated and then scored between 1-5 based on polyp activity (see Table 1)

Differential Expression

The general pipeline that was used to generate these count files are well described by the Matz lab out of UT Austin. Check out their github page for details: https://github.com/z0on/tag-based_RNAseq

counts = read.csv("~/Documents/Experiments/freezing_hot/freezing_raw.csv", row.names = 1)

Set up experimental design matrix. You’ll notice that I strip the header of each column to inform me what treatment it belongs to.

treatment = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 1)) %>%
            revalue(c("C" = "control", "F" = "cold"))
genotype  = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 2))

expDesign = data.frame(treatment, genotype)
            expDesign$sample = colnames(counts)
            write.csv(expDesign, "tables/cold_expDesign.csv", row.names = F)

Table 2 | Experimental Design key with sample names

treatment	genotype	sample
control	E	CE6
control	D	CD8
cold	E	FE2
control	E	CE3
control	F	CF1
cold	F	FF3
cold	F	FF4
control	F	CF5
cold	G	FG3
control	H	CH1
cold	H	FH2
control	H	CH3
control	I	CI2
control	I	CI3
control	J	CJ2
control	J	CJ3
cold	J	FJ5
cold	J	FJ6
control	P	CP1
cold	P	FP3
cold	P	FP4
control	T	CT2
control	T	CT3
cold	T	FT4
cold	T	FT5
cold	T	FT6

Table 3 | Descriptive Summary stats of the mapped reads.

	Mean	X3rd.Qu.	Max.	libsum	perc90	perc95	zeros	percent.zeros
CE6	8.554165	1	21224	405878	6	14.00	29362	61.88248
CD8	13.491696	3	54638	640154	10	26.00	25628	54.01281
FE2	13.919828	2	23858	660468	10	28.00	27194	57.31327
CE3	9.783447	2	25500	464205	7	17.00	28586	60.24701
CF1	11.790381	2	76320	559430	8	21.00	28397	59.84868
FF3	9.595473	1	27084	455286	6	18.00	30276	63.80880
FF4	13.021097	1	34220	617825	7	22.00	29810	62.82667
CF5	9.666941	2	47515	458677	8	20.00	27069	57.04982
FG3	10.297715	1	17489	488606	6	17.00	29483	62.13750
CH1	7.611870	1	32894	361168	5	13.00	29836	62.88147
FH2	7.136107	1	13382	338594	6	15.00	31403	66.18403
CH3	11.392809	2	29117	540566	8	20.00	27437	57.82541
CI2	11.530602	2	35991	547104	9	22.65	27193	57.31116
CI3	11.251707	2	43755	533871	10	25.00	25676	54.11398
CJ2	5.008599	1	19208	237648	5	11.00	30900	65.12393
CJ3	5.397951	1	18430	256122	5	12.00	30368	64.00270
FJ5	13.344946	2	22064	633191	8	22.00	28169	59.36815
FJ6	5.601184	1	13862	265765	5	12.00	32022	67.48862
CP1	6.361153	1	17861	301824	6	15.00	29055	61.23546
FP3	12.424423	2	28096	589514	9	25.00	27773	58.53355
FP4	13.029970	2	30298	618246	9	25.00	27899	58.79911
CT2	13.048242	1	32167	619113	6	14.00	29742	62.68336
CT3	8.407541	1	30674	398921	5	12.00	30976	65.28410
FT4	8.790992	1	31015	417115	6	15.00	30289	63.83620
FT5	12.815883	2	27865	608088	9	23.00	27774	58.53566
FT6	8.776977	2	37565	416450	8	19.00	28832	60.76547

Outlier Detection

Conduct array quality metrics to identify outliers

real=newCountDataSet(counts,expDesign) 
real=estimateSizeFactors(real)

plot(sort(sizeFactors(real)))

Figure 2 | Size factors of each sample used to normalize between libraries.

Create a directory with a bunch of output figures from arrayQualityMetrics() for you to explore if their are outliers.

cds=estimateDispersions(real,method="blind")
vsdBlind=varianceStabilizingTransformation(cds)
arrayQualityMetrics(vsdBlind,intgroup=c("treatment"), force=TRUE, outdir = "arrayQualityMetrics") # this makes a directory "arrayQualityMetrics" and outputs various reports on outliers

Figure 3 | shows a false color heatmap of the distances between arrays. The color scale is chosen to cover the range of distances encountered in the dataset. Patterns in this plot can indicate clustering of the arrays either because of intended biological or unintended experimental factors (batch effects). The distance dab between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In formula, dab = mean | Mai - Mbi |, where Mai is the value of the i-th probe on the a-th array. Outlier detection was performed by looking for arrays for which the sum of the distances to all other arrays, Sa = Σb dab was exceptionally large. One such array was detected, and it is marked by an asterisk, *.

Figure 4 | shows a bar chart of the sum of distances to other arrays Sa, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 17.5 was determined, which is indicated by the vertical line. One array exceeded the threshold and was considered an outlier.

Figure 5 | shows boxplots representing summaries of the signal intensity distributions of the arrays. Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection was performed by computing the Kolmogorov-Smirnov statistic Ka between each array’s distribution and the distribution of the pooled data.

Figure 6 | shows a bar chart of the Kolmogorov-Smirnov statistic Ka, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 0.603 was determined, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.

Figure 7 | shows density estimates (smoothed histograms) of the data. Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems. Various features of the distributions can be indicative of quality related phenomena. For instance, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.

Figure 8 | shows a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation. After normalisation and transformation to a logarithm(-like) scale, one typically expects the red line to be approximately horizontal, that is, show no substantial trend. In some cases, a hump on the right hand of the x-axis can be observed and is symptomatic of a saturation of the intensities.

Figure 9 | shows MA plots. M and A are defined as: M = log2(I1) - log2(I2) A = 1/2 (log2(I1)+log2(I2)), where I1 is the intensity of the array studied,and I2 is the intensity of a “pseudo”-array that consists of the median across arrays. Typically, we expect the mass of the distribution in an MA plot to be concentrated along the M = 0 axis, and there should be no trend in M as a function of A. If there is a trend in the lower range of A, this often indicates that the arrays have different background intensities; this may be addressed by background correction. A trend in the upper range of A can indicate saturation of the measurements; in mild cases, this may be addressed by non-linear normalisation (e.g. quantile normalisation). Outlier detection was performed by computing Hoeffding’s statistic Da on the joint distribution of A and M for each array. Shown are first the 4 arrays with the highest values of Da, then the 4 arrays with the lowest values. The value of Da is shown in the panel headings. 0 arrays had Da>0.15 and were marked as outliers. For more information on Hoeffing’s D-statistic, please see the manual page of the function hoeffd in the Hmisc package.

Figure 10 | shows a bar chart of the Da, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. A threshold of 0.15 was used, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.

Outlier Conclusisons

Much like in the other experiment (hot), we see that a single sample (CP1 in this experiment) is identified as an outlier based on the distances between arrays which was much larger. This sample however was not flagged as being an outlier in any other analyses and so was therefore I considered it unworthy of removing.

Differential Expression

Perform a Wald test in DESeq2 (now the new version) doing a pairwise comparison between cold and control (aka the effect of treatment, denoted as design = ~ treatment below)

dds = DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ genotype + treatment)
dds = DESeq(dds)
results = results(dds)

Let’s check to see if we set up our contrast correctly. We should have the treatment condition first and the control second in the log2 fold change (MLE) output.

head(results)

## log2 fold change (MLE): treatment cold vs control 
## Wald test p-value: treatment cold vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.461529721479737  0.419668503236571  1.47118578725618
## isogroup10     0.553625275106359 0.0174438883464909  1.76578089796676
## isogroup1000     1.1906818878365 -0.272504895940878 0.874298472601528
## isogroup100001  23.4299485361196  -1.04491003705984 0.280285458356361
## isogroup100004 0.316470003418551 0.0303396598025434   3.3292952314765
## isogroup100011  2.34865670194132  -0.31339148990892 0.868960637164387
##                               stat               pvalue
##                          <numeric>            <numeric>
## isogroup1        0.285258671523242    0.775445973305387
## isogroup10     0.00987885210819581    0.992117944628513
## isogroup1000    -0.311684058111211    0.755280644635158
## isogroup100001   -3.72802086553956 0.000192989403027221
## isogroup100004 0.00911293763187476    0.992729028397179
## isogroup100011  -0.360650962201909    0.718360387292258
##                               padj
##                          <numeric>
## isogroup1                       NA
## isogroup10                      NA
## isogroup1000     0.890175149789675
## isogroup100001 0.00177421443128247
## isogroup100004                  NA
## isogroup100011   0.870452732525721

Looks good!

Table 4 | Descriptive statistics on the normalized read counts

	mean	max	zeros	percent.zeros
CE6	9.796057	24305.30	29362	61.88248
CD8	9.264012	37516.94	25628	54.01281
FE2	9.135883	15658.52	27194	57.31327
CE3	9.603827	25031.83	28586	60.24701
CF1	10.256199	66389.13	28397	59.84868
FF3	9.577625	27033.62	30276	63.80880
FF4	10.504855	27607.21	29810	62.82667
CF5	9.068442	44573.26	27069	57.04982
FG3	9.318387	15825.77	29483	62.13750
CH1	9.581351	41404.93	29836	62.88147
FH2	8.904224	16697.66	31403	66.18403
CH3	9.628922	24608.97	27437	57.82541
CI2	9.543230	29787.73	27193	57.31116
CI3	8.753919	34041.74	25676	54.11398
CJ2	8.689286	33323.45	30900	65.12393
CJ3	8.342631	28483.90	30368	64.00270
FJ5	9.508401	15720.81	28169	59.36815
FJ6	8.496500	21027.43	32022	67.48862
CP1	8.709014	24453.38	29055	61.23546
FP3	9.052077	20469.94	27773	58.53355
FP4	9.472008	22024.83	27899	58.79911
CT2	13.812646	34051.44	29742	62.68336
CT3	11.113983	40548.16	30976	65.28410
FT4	9.239044	32595.75	30289	63.83620
FT5	9.074095	19729.40	27774	58.53566
FT6	8.468310	36243.92	28832	60.76547

Lets do a rlogged transformation, which is useful various unsupervised clustering analyses. Be sure the include blind = TRUE as it doesn’t normalize the data with any priors from our experimental design.

rlogged = rlogTransformation(dds, blind = TRUE)

The order of treatments is really important here. Putting “cold” before “control” means that the gene expression of corals in the heated treatment will be compared to control, which is what we want. Make sure you put the control group second

In the output table, negative log2FoldChange indicates downregulation in heated compared to control, and positive log2FoldChagne indicates upregulation.

results_FvsC = results(dds, contrast = c("treatment", "cold", "control")) 
head(results_FvsC)

## log2 fold change (MLE): treatment cold vs control 
## Wald test p-value: treatment cold vs control 
## DataFrame with 6 rows and 6 columns
##                         baseMean     log2FoldChange             lfcSE
##                        <numeric>          <numeric>         <numeric>
## isogroup1      0.461529721479737  0.419668503236571  1.47118578725618
## isogroup10     0.553625275106359 0.0174438883464909  1.76578089796676
## isogroup1000     1.1906818878365 -0.272504895940878 0.874298472601528
## isogroup100001  23.4299485361196  -1.04491003705984 0.280285458356361
## isogroup100004 0.316470003418551 0.0303396598025434   3.3292952314765
## isogroup100011  2.34865670194132  -0.31339148990892 0.868960637164387
##                               stat               pvalue
##                          <numeric>            <numeric>
## isogroup1        0.285258671523242    0.775445973305387
## isogroup10     0.00987885210819581    0.992117944628513
## isogroup1000    -0.311684058111211    0.755280644635158
## isogroup100001   -3.72802086553956 0.000192989403027221
## isogroup100004 0.00911293763187476    0.992729028397179
## isogroup100011  -0.360650962201909    0.718360387292258
##                               padj
##                          <numeric>
## isogroup1                       NA
## isogroup10                      NA
## isogroup1000     0.890175149789675
## isogroup100001 0.00177421443128247
## isogroup100004                  NA
## isogroup100011   0.870452732525721

Summary of DEGs with FDR < 0.1
We can see how many are up regulated and how many are down regulated (and what percent of the genome) 4.3% of genes (1901 genes) are upregulated and 7.2% of genes (3151 genes) are downregulated.

summary(results_FvsC)

## 
## out of 44039 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1754, 4%
## LFC < 0 (down)     : 2401, 5.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 28061, 64%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

How about those for a FDR of < 0.05? Since there are enough genes to explore here, we will use this for subsequent anaylses.

results_FvsC05 = results(dds, contrast = c("treatment", "cold", "control"), alpha = 0.05) 
summary(results_FvsC05)

## 
## out of 44039 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1468, 3.3%
## LFC < 0 (down)     : 1960, 4.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 28061, 64%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

write.csv(results_FvsC05, file="tables/cold_results.csv")

MA plot because people like them, but besides looking neat I find them not overly informative

DESeq2::plotMA(results_FvsC, main = "Cold vs Control")

Figure 11 | “Minus over average”, or MA-plot of cold vs control treatments. Red dots represent genes differentially expressed FDR < 0.1 whereas grey dots are not differentially expressed.

Heatmaps

Overall expression, sample by distance.

sampleDists <- as.matrix(dist(t(assay(rlogged))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Figure 12 | Sample by distance heatmap. This figure takes overall expression and clusters samples based on euclidean distances. We can see that samples cluster mostly by genotype with some exceptions.

Individual genes heatmap
I like plotting the z scores, as heatmap2 creates nice clean clusters by doing this. The heatmap shows that our samples are clustering by treatment (C=cold, C=control). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

norm_counts = read.csv("tables/cold_normalized_counts.csv")
hm = read.csv("tables/cold_results.csv") %>% as_tibble() %>%
  filter(padj < 0.05) %>% # only want the most DEGs
  select(X) %>%
  merge(norm_counts, by.x = "X", by.y = "X")  # turn into a countdatafile
  row.names(hm) = hm$X
  hm$X = NULL

## Turning into z-score table
hm.z = data.matrix(hm)
hm.z = sweep(hm.z, 1L, rowMeans(hm.z), check.margin = FALSE)
hm.z.sx = apply(hm.z, 1L, sd)
hm.z = sweep(hm.z, 1L, hm.z.sx, "/", check.margin = FALSE)
hm.z = data.matrix(hm.z)

colour = colorRampPalette(rev(c("chocolate1","#FEE090","grey10", "cyan3","cyan")))(100)
heatmap.2(hm.z, col = colour, Rowv = TRUE, Colv = TRUE, scale = "row", 
          dendrogram = "both",
          trace = "none", 
          margin = c(5,15))

Figure 13 | Heatmap of z-scores of DEGs (FDR < 0.05). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.

Principal Component Analyses

First we create a PCA data frame and calculate the variance estimated by PC1 and PC2

pcadata = DESeq2::plotPCA(rlogged, intgroup = c( "treatment", "genotype"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(rlogged)), center = TRUE, scale. = FALSE)

Using adonis from library(vegan) we can see if there are any significant differences based on treatment (as is illustrated in the model displayed below)

adonis(pca$x ~ genotype + treatment, method = 'eu')

## 
## Call:
## adonis(formula = pca$x ~ genotype + treatment, method = "eu") 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## genotype   8     78400  9800.0  2.6654 0.48631  0.001 ***
## treatment  1     23985 23984.6  6.5232 0.14878  0.001 ***
## Residuals 16     58829  3676.8         0.36491           
## Total     25    161213                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cols = c("control" = "grey", "cold" = "#68a2ff")
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = treatment), size = 3) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = treatment)) +
      scale_colour_manual(values = cols) +
      scale_fill_manual(values = cols) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ treatment, p<0.001")+
      theme_cowplot()

Figure 14 | Principal component analyses highlighting treatment groups. Treatment has a significant affect (p < 0.001) on overall expression

DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = genotype)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      labs(title = "~ genotype, p<0.001")+
      theme_cowplot()

Figure 15 | Principal component analyses highlighting genotypes of coral. Treatment has a significant affect (p < 0.001) on overall expression

Gene ontology

We are going to do a Mann-Whitney U test which requries that we first -logged the pvalue, and multiply it by -1 if it’s less than zero or by 1 if it’s greater than zero

results = read.csv("tables/cold_results.csv")
go_input = results %>%
  mutate(mutated_p = -log(pvalue)) %>%
  mutate(mutated_p_updown = ifelse(log2FoldChange < 0, mutated_p*-1, mutated_p*1)) %>%
  select(X, mutated_p_updown) %>%
  na.omit()

colnames(go_input) = NULL

write.csv(go_input, "go_input.csv", row.names = FALSE)

These code are all adapted from Dr. Matz and can be found GO_MWU https://github.com/z0on/GO_MWU This will be broken down into three sections

Cellular Components

input="go_input_saved.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="wh_host_final_iso2go.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
    largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
    smallest=5,   # a GO category should contain at least this many genes to be considered
    clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
#   Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" of you want to test for "greater" or "less" instead. 
#   Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
#   Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module 
)

## Continuous measure of interest: will perform MWU test
## 146  GO terms at 10% FDR

# do not continue if the printout shows that no GO terms pass 10% FDR.

cc_results=gomwuPlot(input,goAnnotations,goDivision,
#   absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
    absValue=1,
    level1=0.001, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
    level2=0.0001, # FDR cutoff to print in regular (not italic) font.
    level3=0.00001, # FDR cutoff to print in large bold font.
    txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
    treeHeight=0.5, # height of the hierarchical clustering tree
  colors=c("black","#0666ff","grey40","#7caeff") # these are default colors, un-remar and change if needed
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  25 
## "Good genes" accounted for:  1727 out of 5906 ( 29% )

write.csv(cc_results, "cc.csv")

Figure 16 | Cellular component gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.

Molecular Functions

goDivision="MF"

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1, 
    smallest=5,   
    clusterCutHeight=0.25, 
)

## Continuous measure of interest: will perform MWU test
## 106  GO terms at 10% FDR

mf_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.01, 
    level2=0.001, 
    level3=0.0001, 
    txtsize=1.2,    
    treeHeight=0.5, 
  colors=c("black","#0666ff","grey40","#7caeff") 
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  39 
## "Good genes" accounted for:  2075 out of 6457 ( 32% )

write.csv(mf_results, "mf.csv")

Figure 17 | Molecular function gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.

Biological process

goDivision="BP"

gomwuStats(input, goDatabase, goAnnotations, goDivision,
    perlPath="perl", 
    largest=0.1,  
    smallest=5,   
    clusterCutHeight=0.4, 
)

## Continuous measure of interest: will perform MWU test
## 419  GO terms at 10% FDR

bp_results=gomwuPlot(input,goAnnotations,goDivision,
    absValue=1,
    level1=0.00001, 
    level2=0.000001, 
    level3=0.0000001, 
    txtsize=1.2,    
    treeHeight=0.5, 
  colors=c("black","#0666ff","grey40","#7caeff") 
)

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'

## GO terms dispayed:  25 
## "Good genes" accounted for:  2825 out of 6823 ( 41% )

write.csv(bp_results, "bp.csv")

Figure 18 | Biological processes gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the cold treatment and purple colours are are GO terms enriched in upregulated genes in the cold treatment.

Version Control

Here is our session info, containing information about versions used

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ape_5.2                     hexbin_1.27.2              
##  [3] prettydoc_0.2.1             reshape2_1.4.3             
##  [5] kableExtra_1.1.0            vegan_2.5-4                
##  [7] permute_0.9-4               plotly_4.9.0               
##  [9] knitr_1.21                  gplots_3.0.1.1             
## [11] RColorBrewer_1.1-2          readr_1.3.1                
## [13] cowplot_0.9.4               DESeq_1.34.1               
## [15] lattice_0.20-38             locfit_1.5-9.1             
## [17] genefilter_1.64.0           arrayQualityMetrics_3.38.0 
## [19] affycoretools_1.54.0        ggplot2_3.1.0              
## [21] DESeq2_1.22.2               SummarizedExperiment_1.12.0
## [23] DelayedArray_0.8.0          BiocParallel_1.16.5        
## [25] matrixStats_0.54.0          Biobase_2.42.0             
## [27] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [29] IRanges_2.16.0              S4Vectors_0.20.1           
## [31] BiocGenerics_0.28.0         dplyr_0.8.0.1              
## [33] plyr_1.8.4                  tidyr_0.8.2                
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.7.0             tidyselect_0.2.5         
##   [3] RSQLite_2.1.1             AnnotationDbi_1.44.0     
##   [5] htmlwidgets_1.3           beadarray_2.32.0         
##   [7] grid_3.5.1                munsell_0.5.0            
##   [9] codetools_0.2-16          preprocessCore_1.44.0    
##  [11] withr_2.1.2               colorspace_1.4-0         
##  [13] Category_2.48.0           OrganismDbi_1.24.0       
##  [15] highr_0.7                 rstudioapi_0.9.0         
##  [17] setRNG_2013.9-1           GenomeInfoDbData_1.2.0   
##  [19] hwriter_1.3.2             bit64_0.9-7              
##  [21] xfun_0.5.3                biovizBase_1.30.1        
##  [23] R6_2.4.0                  illuminaio_0.24.0        
##  [25] gridSVG_1.6-0             AnnotationFilter_1.6.0   
##  [27] bitops_1.0-6              reshape_0.8.8            
##  [29] assertthat_0.2.0          scales_1.0.0             
##  [31] nnet_7.3-12               gtable_0.2.0             
##  [33] Cairo_1.5-9               affy_1.60.0              
##  [35] ggbio_1.30.0              ensembldb_2.6.5          
##  [37] rlang_0.3.1               splines_3.5.1            
##  [39] rtracklayer_1.42.1        lazyeval_0.2.1           
##  [41] acepack_1.4.1             dichromat_2.0-0          
##  [43] checkmate_1.9.1           BiocManager_1.30.4       
##  [45] yaml_2.2.0                GenomicFeatures_1.34.3   
##  [47] backports_1.1.3           Hmisc_4.2-0              
##  [49] RBGL_1.58.1               tools_3.5.1              
##  [51] affyio_1.52.0             ff_2.2-14                
##  [53] Rcpp_1.0.1                base64enc_0.1-3          
##  [55] progress_1.2.0            zlibbioc_1.28.0          
##  [57] purrr_0.3.0               RCurl_1.95-4.11          
##  [59] prettyunits_1.0.2         rpart_4.1-13             
##  [61] openssl_1.2.1             cluster_2.0.7-1          
##  [63] magrittr_1.5              data.table_1.12.0        
##  [65] ProtGenerics_1.14.0       hms_0.4.2                
##  [67] evaluate_0.12             xtable_1.8-3             
##  [69] XML_3.98-1.16             gcrma_2.54.0             
##  [71] gridExtra_2.3             compiler_3.5.1           
##  [73] biomaRt_2.38.0            tibble_2.0.1             
##  [75] KernSmooth_2.23-15        crayon_1.3.4             
##  [77] ReportingTools_2.22.1     R.oo_1.22.0              
##  [79] htmltools_0.3.6           GOstats_2.48.0           
##  [81] mgcv_1.8-26               Formula_1.2-3            
##  [83] geneplotter_1.60.0        DBI_1.0.0                
##  [85] MASS_7.3-51.1             Matrix_1.2-15            
##  [87] vsn_3.50.0                R.methodsS3_1.7.1        
##  [89] gdata_2.18.0              pkgconfig_2.0.2          
##  [91] GenomicAlignments_1.18.1  foreign_0.8-71           
##  [93] xml2_1.2.0                foreach_1.4.4            
##  [95] annotate_1.60.0           BeadDataPackR_1.34.0     
##  [97] affyPLM_1.58.0            webshot_0.5.1            
##  [99] XVector_0.22.0            AnnotationForge_1.24.0   
## [101] rvest_0.3.4               stringr_1.4.0            
## [103] VariantAnnotation_1.28.10 digest_0.6.18            
## [105] graph_1.60.0              Biostrings_2.50.2        
## [107] rmarkdown_1.11            base64_2.0               
## [109] htmlTable_1.13.1          edgeR_3.24.3             
## [111] GSEABase_1.44.0           curl_3.3                 
## [113] Rsamtools_1.34.1          gtools_3.8.1             
## [115] nlme_3.1-137              jsonlite_1.6             
## [117] PFAM.db_3.7.0             viridisLite_0.3.0        
## [119] askpass_1.1               limma_3.38.3             
## [121] BSgenome_1.50.0           pillar_1.3.1             
## [123] GGally_1.4.0              httr_1.4.0               
## [125] survival_2.43-3           GO.db_3.7.0              
## [127] glue_1.3.0.9000           iterators_1.0.10         
## [129] bit_1.1-14                Rgraphviz_2.26.0         
## [131] stringi_1.2.4             blob_1.1.1               
## [133] oligoClasses_1.44.0       latticeExtra_0.6-28      
## [135] caTools_1.17.1.1          memoise_1.1.0