Introduction
The gene expression dataset analyzed here comes from a project that examined the physiological and transcriptomic effects of hot water on the temperate coral Astrangia poculata. A. poculata is a facultatively symbiotic coral, meaning that it exists in the wild both with and without symbionts. However, only aposymbiotic corals were used in the experiment described here. Here, corals were placed in either a chilled treatment “hot” or a control condition. The hot group had temperatures increased to 31°C or maintained at 16°C (control) over the course of 15 days, and were sampled for gene expression on day 15. Physiology measured (polyp extension) over the course of the experiment demonstrated that the corals pulled in their polyps in response to warming temperatures, which is the same phenotype that corals in the related hot experiment exhibited. It is important to note that the control temperature used in this experiment is different from that used in the hot experiment, which is why we do not make direct comparisons between these studies. Here, we use the program DESeq2 to quantify differential gene expression of corals in the control and heated treatments.
Analysis pipeline
First we load appropriate libraries
library(tidyr)
library(plyr)
library(dplyr)
library(DESeq2)
library(ggplot2)
library(affycoretools)
library(arrayQualityMetrics)
library(genefilter)
library(DESeq)
library(cowplot)
library(readr)
library(RColorBrewer)
library(gplots)
library(knitr)
library(plotly)
library(vegan)
library(kableExtra)
library(reshape2)
library(prettydoc)
library(MASS)
Behavioural Response to food.
Coral polyp behaviours were observed every 3-4 days throughout the experiment. The total surface area of the coral that had extended vs retracted polyps was estimated and then scored between 1-5 based on polyp activity
Score | Percent.of.colony.with.extended.polyps |
---|---|
1 | 0 |
2 | 25 |
3 | 50 |
4 | 75 |
5 | 100 |
Read in data, transform to long form and organize.
behaviour = read.csv("hot_behaviour.csv") %>%
melt() %>%
rename(day = variable) %>%
rename(polyp_behaviour = value) %>%
mutate(genotype = sapply(strsplit(as.character(individual), split = ""), "[[", 2))
behaviour$day = as.character(behaviour$day)
data = behaviour %>%
mutate(temperature = recode(day, d1 = "16", d2 = "23", d3 = "25", d4 = "28", d5 = "31"))
In case someone is interested in the descriptive statistics, here we calculate means, standard error and standard deviation.
hot_sum = behaviour %>%
group_by(treatment, day) %>%
summarise(
n=n(),
mean=mean(polyp_behaviour),
sd=sd(polyp_behaviour)
) %>%
mutate( se=sd/sqrt(n)) %>%
mutate( ic=se * qt((1-0.05)/2 + .5, n-1))
kable(hot_sum) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
treatment | day | n | mean | sd | se | ic |
---|---|---|---|---|---|---|
control | d1 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
control | d2 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
control | d3 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
control | d4 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
control | d5 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
hot | d1 | 21 | 5.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
hot | d2 | 21 | 4.428571 | 0.8701396 | 0.1898800 | 0.3960828 |
hot | d3 | 21 | 4.476191 | 1.0304876 | 0.2248708 | 0.4690723 |
hot | d4 | 21 | 2.476190 | 1.2090925 | 0.2638456 | 0.5503723 |
hot | d5 | 21 | 2.714286 | 1.1892375 | 0.2595129 | 0.5413344 |
This is how we plotted the behavioural figure in the manuscript. We selected the same days tha
hot_stacked = behaviour %>%
group_by(polyp_behaviour, day, treatment) %>%
summarise(n = n()) %>%
mutate(frequency = n / sum(n))
ggplot(hot_stacked, aes(y = frequency, x = day, fill = as.factor(polyp_behaviour))) +
geom_bar(stat = "identity", position = "fill") +
facet_grid(. ~ treatment) +
scale_fill_brewer(palette = "Reds", direction=1) +
labs(fill = "Behavioural Score") +
theme_cowplot()
Figure 1 | Behavioural scores after feeding. The total surface area of the coral with extended polyps was estimated and then scored between 1-5 based on polyp activity (see Table 1)
model = polr(as.factor(polyp_behaviour) ~ treatment + day, data = behaviour)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = as.factor(polyp_behaviour) ~ treatment + day,
## data = behaviour)
##
## Coefficients:
## Value Std. Error t value
## treatmenthot -19.22 0.1819 -105.69
## dayd2 -16.93 0.4107 -41.21
## dayd3 -16.43 0.4671 -35.17
## dayd4 -19.93 0.3726 -53.48
## dayd5 -19.59 0.3611 -54.26
##
## Intercepts:
## Value Std. Error t value
## 1|2 -40.3389 0.1819 -221.8048
## 2|3 -38.8912 0.2566 -151.5488
## 3|4 -37.8752 0.2909 -130.2071
## 4|5 -36.6582 0.3596 -101.9315
##
## Residual Deviance: 209.8631
## AIC: 227.8631
table = coef(summary(model))
##
## Re-fitting to get Hessian
table
## Value Std. Error t value
## treatmenthot -19.22189 0.1818671 -105.69199
## dayd2 -16.92510 0.4106563 -41.21475
## dayd3 -16.42655 0.4670713 -35.16926
## dayd4 -19.92747 0.3726301 -53.47788
## dayd5 -19.59392 0.3611129 -54.25981
## 1|2 -40.33895 0.1818669 -221.80477
## 2|3 -38.89117 0.2566247 -151.54883
## 3|4 -37.87523 0.2908845 -130.20711
## 4|5 -36.65824 0.3596361 -101.93148
Calculate and store the p-value
p = pnorm(abs(table[,"t value"]), lower.tail = FALSE) * 2
p
## treatmenthot dayd2 dayd3 dayd4 dayd5
## 0.000000e+00 0.000000e+00 5.902959e-271 0.000000e+00 0.000000e+00
## 1|2 2|3 3|4 4|5
## 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
Differential Expression
The general pipeline that was used to generate these count files are well described by the Matz lab out of UT Austin. Check out their github page for details: https://github.com/z0on/tag-based_RNAseq
counts = read.csv("~/Documents/Experiments/freezing_hot/hot_raw.csv", row.names = 1)
Set up experimental design matrix. You’ll notice that I strip the header of each column to inform me what treatment it belongs to.
treatment = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 1)) %>%
revalue(c("C" = "control", "H" = "hot"))
genotype = as.factor(sapply(strsplit(colnames(counts), split = ""), "[[", 2))
expDesign = data.frame(treatment, genotype)
expDesign$sample = colnames(counts)
write.csv(expDesign, "tables/hot_expDesign.csv", row.names = F)
treatment | genotype | sample |
---|---|---|
hot | A | HA1 |
control | A | CA2 |
hot | B | HB2 |
control | B | CB3 |
control | B | CB4 |
hot | B | HB5 |
hot | B | HB6 |
control | C | CC2 |
control | C | CC3 |
control | D | CD2 |
control | D | CD4 |
hot | D | HD6 |
control | F | CF8 |
hot | K | HK4 |
hot | N | HN1 |
hot | O | HO1 |
hot | O | HO2 |
control | O | CO4 |
control | Q | CQ2 |
hot | Q | HQ3 |
control | Q | CQ4 |
control | Q | CQ5 |
hot | Q | HQ6 |
Table 3 | Descriptive Summary stats of the mapped reads.
Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | libsum | perc05 | perc10 | perc90 | perc95 | zeros | percent.zeros | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HA1 | 0 | 0 | 0 | 9.574418 | 2 | 40458 | 454287 | 0 | 0 | 8 | 19 | 27412 | 57.77272 |
CA2 | 0 | 0 | 0 | 13.053617 | 2 | 39814 | 619368 | 0 | 0 | 10 | 24 | 26086 | 54.97808 |
HB2 | 0 | 0 | 0 | 13.399869 | 2 | 29289 | 635797 | 0 | 0 | 9 | 24 | 27379 | 57.70317 |
CB3 | 0 | 0 | 0 | 7.600447 | 1 | 25784 | 360626 | 0 | 0 | 6 | 14 | 29587 | 62.35669 |
CB4 | 0 | 0 | 0 | 7.135727 | 1 | 22337 | 338576 | 0 | 0 | 6 | 13 | 29648 | 62.48525 |
HB5 | 0 | 0 | 0 | 5.802626 | 1 | 22823 | 275323 | 0 | 0 | 5 | 12 | 31056 | 65.45271 |
HB6 | 0 | 0 | 0 | 5.006238 | 1 | 38833 | 237536 | 0 | 0 | 4 | 10 | 32583 | 68.67097 |
CC2 | 0 | 0 | 0 | 17.092670 | 2 | 21412 | 811013 | 0 | 0 | 8 | 19 | 27634 | 58.24060 |
CC3 | 0 | 0 | 0 | 9.950514 | 2 | 30425 | 472132 | 0 | 0 | 6 | 14 | 28638 | 60.35660 |
CD2 | 0 | 0 | 0 | 14.828465 | 3 | 61616 | 703581 | 0 | 0 | 13 | 31 | 23755 | 50.06533 |
CD4 | 0 | 0 | 0 | 12.294554 | 1 | 21399 | 583352 | 0 | 0 | 5 | 13 | 30593 | 64.47690 |
HD6 | 0 | 0 | 0 | 4.055513 | 1 | 26716 | 192426 | 0 | 0 | 4 | 9 | 32916 | 69.37279 |
CF8 | 0 | 0 | 0 | 15.811267 | 2 | 34172 | 750213 | 0 | 0 | 9 | 23 | 25384 | 53.49857 |
HK4 | 0 | 0 | 0 | 7.253688 | 2 | 28958 | 344173 | 0 | 0 | 6 | 16 | 29312 | 61.77710 |
HN1 | 0 | 0 | 0 | 6.315018 | 1 | 12385 | 299635 | 0 | 0 | 4 | 10 | 33473 | 70.54670 |
HO1 | 0 | 0 | 0 | 7.296788 | 2 | 19439 | 346218 | 0 | 0 | 8 | 18 | 27625 | 58.22163 |
HO2 | 0 | 0 | 0 | 7.522593 | 2 | 46609 | 356932 | 0 | 0 | 6 | 16 | 28910 | 60.92986 |
CO4 | 0 | 0 | 0 | 7.327095 | 2 | 22917 | 347656 | 0 | 0 | 7 | 17 | 28159 | 59.34707 |
CQ2 | 0 | 0 | 0 | 7.587022 | 2 | 29543 | 359989 | 0 | 0 | 7 | 17 | 27665 | 58.30593 |
HQ3 | 0 | 0 | 0 | 10.137519 | 1 | 26059 | 481005 | 0 | 0 | 4 | 11 | 32532 | 68.56348 |
CQ4 | 0 | 0 | 0 | 28.439218 | 3 | 29319 | 1349384 | 0 | 0 | 10 | 26 | 25525 | 53.79573 |
CQ5 | 0 | 0 | 0 | 12.405750 | 2 | 33190 | 588628 | 0 | 0 | 10 | 25 | 25970 | 54.73360 |
HQ6 | 0 | 0 | 0 | 12.559138 | 2 | 33867 | 595906 | 0 | 0 | 9 | 23 | 27196 | 57.31748 |
Outlier Detection
Conduct array quality metrics to identify outliers
real=newCountDataSet(counts,expDesign)
real=estimateSizeFactors(real)
plot(sort(sizeFactors(real)))
Figure 2 | Size factors of each sample used to normalize between libraries.
Create a directory with a bunch of output figures from arrayQualityMetrics() for you to explore if their are outliers.
cds=estimateDispersions(real,method="blind")
vsdBlind=varianceStabilizingTransformation(cds)
arrayQualityMetrics(vsdBlind,intgroup=c("treatment"), force=TRUE, outdir = "arrayQualityMetrics") # this makes a directory "arrayQualityMetrics" and outputs various reports on outliers
Figure 3 | shows a false color heatmap of the distances between arrays. The color scale is chosen to cover the range of distances encountered in the dataset. Patterns in this plot can indicate clustering of the arrays either because of intended biological or unintended experimental factors (batch effects). The distance dab between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In formula, dab = mean | Mai - Mbi |, where Mai is the value of the i-th probe on the a-th array. Outlier detection was performed by looking for arrays for which the sum of the distances to all other arrays, Sa = Σb dab was exceptionally large. One such array was detected, and it is marked by an asterisk, *.
Figure 4 | shows a bar chart of the sum of distances to other arrays Sa, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 17.5 was determined, which is indicated by the vertical line. One array exceeded the threshold and was considered an outlier.
Figure 5 | shows boxplots representing summaries of the signal intensity distributions of the arrays. Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection was performed by computing the Kolmogorov-Smirnov statistic Ka between each array’s distribution and the distribution of the pooled data.
Figure 6 | shows a bar chart of the Kolmogorov-Smirnov statistic Ka, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. Based on the distribution of the values across all arrays, a threshold of 0.603 was determined, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.
Figure 7 | shows density estimates (smoothed histograms) of the data. Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems. Various features of the distributions can be indicative of quality related phenomena. For instance, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.
Figure 8 | shows a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation. After normalisation and transformation to a logarithm(-like) scale, one typically expects the red line to be approximately horizontal, that is, show no substantial trend. In some cases, a hump on the right hand of the x-axis can be observed and is symptomatic of a saturation of the intensities.
Figure 9 | shows MA plots. M and A are defined as: M = log2(I1) - log2(I2) A = 1/2 (log2(I1)+log2(I2)), where I1 is the intensity of the array studied,and I2 is the intensity of a “pseudo”-array that consists of the median across arrays. Typically, we expect the mass of the distribution in an MA plot to be concentrated along the M = 0 axis, and there should be no trend in M as a function of A. If there is a trend in the lower range of A, this often indicates that the arrays have different background intensities; this may be addressed by background correction. A trend in the upper range of A can indicate saturation of the measurements; in mild cases, this may be addressed by non-linear normalisation (e.g. quantile normalisation). Outlier detection was performed by computing Hoeffding’s statistic Da on the joint distribution of A and M for each array. Shown are first the 4 arrays with the highest values of Da, then the 4 arrays with the lowest values. The value of Da is shown in the panel headings. 0 arrays had Da>0.15 and were marked as outliers. For more information on Hoeffing’s D-statistic, please see the manual page of the function hoeffd in the Hmisc package.
Figure 10 | shows a bar chart of the Da, the outlier detection criterion from the previous figure. The bars are shown in the original order of the arrays. A threshold of 0.15 was used, which is indicated by the vertical line. None of the arrays exceeded the threshold and was considered an outlier.
Outlier Conclusisons
Much like in the other experiment (hot), we see that a single sample (CP1 in this experiment) is identified as an outlier based on the distances between arrays which was much larger. This sample however was not flagged as being an outlier in any other analyses and so was therefore I considered it unworthy of removing.
Differential Expression
Perform a Wald test in DESeq2 (now the new version) doing a pairwise comparison between hot and control (aka the effect of treatment, denoted as design = ~ treatment below)
dds = DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ genotype + treatment)
dds = DESeq(dds)
results = results(dds)
Let’s check to see if we set up our contrast correctly. We should have the treatment condition first and the control second in the log2 fold change (MLE) output.
head(results)
## log2 fold change (MLE): treatment hot vs control
## Wald test p-value: treatment hot vs control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## isogroup1 0.754706405965306 -0.60136668210655 1.30182527512437
## isogroup10 0.383450425721418 1.08565060450841 3.50412555536888
## isogroup1000 1.32897901722752 -1.42687692603426 0.969795073733801
## isogroup100001 42.4408554733082 1.54428393380751 0.26011961979613
## isogroup100004 0.483099523825571 0.331367581259926 2.74337032051983
## isogroup100011 3.36831075613279 -0.464355783531558 0.613749442645296
## stat pvalue padj
## <numeric> <numeric> <numeric>
## isogroup1 -0.461941163378547 0.644123513886682 NA
## isogroup10 0.309820692025439 0.756697315380076 NA
## isogroup1000 -1.47131797704504 0.141205143857845 NA
## isogroup100001 5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004 0.120788498286712 0.903858560436392 NA
## isogroup100011 -0.75658852174294 0.449296420158435 0.764299804712495
Looks good!
Table 4 | Descriptive statistics on the normalized read counts
min | mean | median | max | zeros | percent.zeros | |
---|---|---|---|---|---|---|
HA1 | 0 | 8.230796 | 0 | 34780.34 | 27412 | 57.77272 |
CA2 | 0 | 8.758342 | 0 | 26713.26 | 26086 | 54.97808 |
HB2 | 0 | 8.985652 | 0 | 19640.55 | 27379 | 57.70317 |
CB3 | 0 | 8.875032 | 0 | 30107.94 | 29587 | 62.35669 |
CB4 | 0 | 8.610759 | 0 | 26954.30 | 29648 | 62.48525 |
HB5 | 0 | 8.241059 | 0 | 32413.89 | 31056 | 65.45271 |
HB6 | 0 | 9.055725 | 0 | 70244.55 | 32583 | 68.67097 |
CC2 | 0 | 12.325855 | 0 | 15440.61 | 27634 | 58.24060 |
CC3 | 0 | 10.298430 | 0 | 31488.80 | 28638 | 60.35660 |
CD2 | 0 | 7.947732 | 0 | 33024.83 | 23755 | 50.06533 |
CD4 | 0 | 14.597700 | 0 | 25407.69 | 30593 | 64.47690 |
HD6 | 0 | 8.348241 | 0 | 54994.67 | 32916 | 69.37279 |
CF8 | 0 | 9.960673 | 0 | 21527.44 | 25384 | 53.49857 |
HK4 | 0 | 7.927985 | 0 | 31649.91 | 29312 | 61.77710 |
HN1 | 0 | 10.310104 | 0 | 20220.15 | 33473 | 70.54670 |
HO1 | 0 | 6.849463 | 0 | 18247.30 | 27625 | 58.22163 |
HO2 | 0 | 8.290486 | 0 | 51366.77 | 28910 | 60.92986 |
CO4 | 0 | 7.759618 | 0 | 24269.80 | 28159 | 59.34707 |
CQ2 | 0 | 7.621172 | 0 | 29675.98 | 27665 | 58.30593 |
HQ3 | 0 | 14.075792 | 0 | 36182.53 | 32532 | 68.56348 |
CQ4 | 0 | 16.615571 | 0 | 17129.58 | 25525 | 53.79573 |
CQ5 | 0 | 8.778528 | 0 | 23485.83 | 25970 | 54.73360 |
HQ6 | 0 | 8.732106 | 0 | 23547.02 | 27196 | 57.31748 |
Lets do a rlogged transformation, which is useful various unsupervised clustering analyses. Be sure the include blind = TRUE as it doesn’t normalize the data with any priors from our experimental design.
rlogged = rlogTransformation(dds, blind = TRUE)
The order of treatments is really important here. Putting “hot” before “control” means that the gene expression of corals in the heated treatment will be compared to control, which is what we want. Make sure you put the control group second
In the output table, negative log2FoldChange indicates downregulation in heated compared to control, and positive log2FoldChagne indicates upregulation.
results_FvsC = results(dds, contrast = c("treatment", "hot", "control"))
head(results_FvsC)
## log2 fold change (MLE): treatment hot vs control
## Wald test p-value: treatment hot vs control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## isogroup1 0.754706405965306 -0.60136668210655 1.30182527512437
## isogroup10 0.383450425721418 1.08565060450841 3.50412555536888
## isogroup1000 1.32897901722752 -1.42687692603426 0.969795073733801
## isogroup100001 42.4408554733082 1.54428393380751 0.26011961979613
## isogroup100004 0.483099523825571 0.331367581259926 2.74337032051983
## isogroup100011 3.36831075613279 -0.464355783531558 0.613749442645296
## stat pvalue padj
## <numeric> <numeric> <numeric>
## isogroup1 -0.461941163378547 0.644123513886682 NA
## isogroup10 0.309820692025439 0.756697315380076 NA
## isogroup1000 -1.47131797704504 0.141205143857845 NA
## isogroup100001 5.93682220133126 2.9059951355831e-09 1.20250078710429e-06
## isogroup100004 0.120788498286712 0.903858560436392 NA
## isogroup100011 -0.75658852174294 0.449296420158435 0.764299804712495
Summary of DEGs with FDR < 0.1
We can see how many are up regulated and how many are down regulated (and what percent of the genome) 4.3% of genes (1901 genes) are upregulated and 7.2% of genes (3151 genes) are downregulated.
summary(results_FvsC)
##
## out of 43725 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 363, 0.83%
## LFC < 0 (down) : 367, 0.84%
## outliers [1] : 4, 0.0091%
## low counts [2] : 35445, 81%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
How about those for a FDR of < 0.05? Since there are enough genes to explore here, we will use this for subsequent anaylses.
results_FvsC05 = results(dds, contrast = c("treatment", "hot", "control"), alpha = 0.05)
summary(results_FvsC05)
##
## out of 43725 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 230, 0.53%
## LFC < 0 (down) : 255, 0.58%
## outliers [1] : 4, 0.0091%
## low counts [2] : 34601, 79%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
write.csv(results_FvsC05, file="tables/hot_results.csv")
MA plot because people like them, but besides looking neat I find them not overly informative
DESeq2::plotMA(results_FvsC, main = "hot vs Control")
Figure 11 | “Minus over average”, or MA-plot of hot vs control treatments. Red dots represent genes differentially expressed FDR < 0.1 whereas grey dots are not differentially expressed.
Heatmaps
Overall expression, sample by distance.
sampleDists <- as.matrix(dist(t(assay(rlogged))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
col=colorpanel(100, "black", "white"),
margin=c(10, 10))
Figure 12 | Sample by distance heatmap. This figure takes overall expression and clusters samples based on euclidean distances. We can see that samples cluster mostly by genotype with some exceptions.
Individual genes heatmap
I like plotting the z scores, as heatmap2 creates nice clean clusters by doing this. The heatmap shows that our samples are clustering by treatment (H=hot, C=control). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.
norm_counts = read.csv("tables/hot_normalized_counts.csv")
hm = read.csv("tables/hot_results.csv") %>% as_tibble() %>%
filter(padj < 0.05) %>% # only want the most DEGs
dplyr::select(X) %>%
merge(norm_counts, by.x = "X", by.y = "X") # turn into a countdatafile
row.names(hm) = hm$X
hm$X = NULL
## Turning into z-score table
hm.z = data.matrix(hm)
hm.z = sweep(hm.z, 1L, rowMeans(hm.z), check.margin = FALSE)
hm.z.sx = apply(hm.z, 1L, sd)
hm.z = sweep(hm.z, 1L, hm.z.sx, "/", check.margin = FALSE)
hm.z = data.matrix(hm.z)
colour = colorRampPalette(rev(c("chocolate1","#FEE090","grey10", "cyan3","cyan")))(100)
heatmap.2(hm.z, col = colour, Rowv = TRUE, Colv = TRUE, scale = "row",
dendrogram = "both",
trace = "none",
margin = c(5,15))
Figure 13 | Heatmap of z-scores of DEGs (FDR < 0.05). Upregulation indicated by warmer colors, downregulation indicated by cooler colors.
Principal Component Analyses
First we create a PCA data frame and calculate the variance estimated by PC1 and PC2
pcadata = DESeq2::plotPCA(rlogged, intgroup = c( "treatment", "genotype"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(rlogged)), center = TRUE, scale. = FALSE)
Using adonis from library(vegan) we can see if there are any significant differences based on treatment (as is illustrated in the model displayed below)
adonis(pca$x ~ genotype + treatment, method = 'eu')
##
## Call:
## adonis(formula = pca$x ~ genotype + treatment, method = "eu")
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## genotype 8 60962 7620.2 1.7056 0.47454 0.001 ***
## treatment 1 9424 9424.1 2.1094 0.07336 0.001 ***
## Residuals 13 58080 4467.7 0.45211
## Total 22 128466 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cols = c("control" = "grey", "hot" = "#ea6227")
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(colour = treatment),size = 3) +
stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = treatment)) +
scale_colour_manual(values = cols) +
scale_fill_manual(values = cols) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
labs(title = "~ treatment, p<0.001")+
theme_cowplot()
Figure 14 | Principal component analyses highlighting treatment groups. Treatment has a significant affect (p < 0.001) on overall expression
DESeq2::plotPCA(rlogged, returnData = TRUE, intgroup = c("treatment", "genotype") ) %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(colour = genotype)) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
labs(title = "~ genotype, p<0.001")+
theme_cowplot()
Figure 15 | Principal component analyses highlighting genotypes of coral. Treatment has a significant affect (p < 0.001) on overall expression
Gene ontology
We are going to do a Mann-Whitney U test which requries that we first -logged the pvalue, and multiply it by -1 if it’s less than zero or by 1 if it’s greater than zero
results = read.csv("tables/hot_results.csv")
go_input = results %>%
mutate(mutated_p = -log(pvalue)) %>%
mutate(mutated_p_updown = ifelse(log2FoldChange < 0, mutated_p*-1, mutated_p*1)) %>%
dplyr::select(X, mutated_p_updown) %>%
na.omit()
colnames(go_input) = NULL
write.csv(go_input, "go_input.csv", row.names = FALSE)
These code are all adapted from Dr. Matz and can be found GO_MWU https://github.com/z0on/GO_MWU This will be broken down into three sections
Cellular Components
input="go_input_saved.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="wh_host_final_iso2go.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")
gomwuStats(input, goDatabase, goAnnotations, goDivision,
perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
largest=0.1, # a GO category will not be considered if it contains more than this fraction of the total number of genes
smallest=5, # a GO category should contain at least this many genes to be considered
clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
# Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" of you want to test for "greater" or "less" instead.
# Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
# Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module
)
## Continuous measure of interest: will perform MWU test
## 93 GO terms at 10% FDR
# do not continue if the printout shows that no GO terms pass 10% FDR.
cc_results=gomwuPlot(input,goAnnotations,goDivision,
# absValue=-log(0.05,10), # genes with the measure value exceeding this will be counted as "good genes". Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
absValue=1,
level1=0.01, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
level2=0.001, # FDR cutoff to print in regular (not italic) font.
level3=0.0001, # FDR cutoff to print in large bold font.
txtsize=1.2, # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
treeHeight=0.5, # height of the hierarchical clustering tree
colors=c("black","#ea6227","grey40","#f09167") # these are default colors, un-remar and change if needed
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## GO terms dispayed: 29
## "Good genes" accounted for: 2597 out of 4188 ( 62% )
write.csv(cc_results, "cc.csv")
Figure 16 | Cellular component gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.
Molecular Functions
goDivision="MF"
gomwuStats(input, goDatabase, goAnnotations, goDivision,
perlPath="perl",
largest=0.1,
smallest=5,
clusterCutHeight=0.25,
)
## Continuous measure of interest: will perform MWU test
## 91 GO terms at 10% FDR
mf_results=gomwuPlot(input,goAnnotations,goDivision,
absValue=1,
level1=0.01,
level2=0.001,
level3=0.0001,
txtsize=1.2,
treeHeight=0.5,
colors=c("black","#ea6227","grey40","#f09167")
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## GO terms dispayed: 33
## "Good genes" accounted for: 1736 out of 4609 ( 38% )
write.csv(mf_results, "mf.csv")
Figure 17 | Molecular function gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.
Biological process
goDivision="BP"
gomwuStats(input, goDatabase, goAnnotations, goDivision,
perlPath="perl",
largest=0.1,
smallest=5,
clusterCutHeight=0.4,
)
## Continuous measure of interest: will perform MWU test
## 179 GO terms at 10% FDR
bp_results=gomwuPlot(input,goAnnotations,goDivision,
absValue=1,
level1=0.01,
level2=0.001,
level3=0.0001,
txtsize=1.2,
treeHeight=0.5,
colors=c("black","#ea6227","grey40","#f09167")
)
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## Warning in plot.formula(c(1:top) ~ c(1:top), type = "n", axes = F, xlab =
## "", : the formula 'c(1:top) ~ c(1:top)' is treated as 'c(1:top) ~ 1'
## GO terms dispayed: 36
## "Good genes" accounted for: 2174 out of 4856 ( 45% )
write.csv(bp_results, "bp.csv")
Figure 18 | Biological processes gene ontolgy enrichment using Mann-Whitney U tessts based on ranking of signed log p-values. Results are plotted as dendrograms tracing the level of gene sharing between significant GO terms. Green colours are GO terms enriched with downregulated genes in the hot treatment and purple colours are are GO terms enriched in upregulated genes in the hot treatment.
Version Control
Here is our session info, containing information about versions used
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.15
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] MASS_7.3-51.1 prettydoc_0.2.1
## [3] reshape2_1.4.3 kableExtra_1.1.0
## [5] vegan_2.5-4 permute_0.9-4
## [7] plotly_4.9.0 knitr_1.21
## [9] gplots_3.0.1.1 RColorBrewer_1.1-2
## [11] readr_1.3.1 cowplot_0.9.4
## [13] DESeq_1.34.1 lattice_0.20-38
## [15] locfit_1.5-9.1 genefilter_1.64.0
## [17] arrayQualityMetrics_3.38.0 affycoretools_1.54.0
## [19] ggplot2_3.1.0 DESeq2_1.22.2
## [21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [23] BiocParallel_1.16.5 matrixStats_0.54.0
## [25] Biobase_2.42.0 GenomicRanges_1.34.0
## [27] GenomeInfoDb_1.18.1 IRanges_2.16.0
## [29] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [31] dplyr_0.8.0.1 plyr_1.8.4
## [33] tidyr_0.8.2
##
## loaded via a namespace (and not attached):
## [1] R.utils_2.7.0 tidyselect_0.2.5
## [3] RSQLite_2.1.1 AnnotationDbi_1.44.0
## [5] htmlwidgets_1.3 beadarray_2.32.0
## [7] grid_3.5.1 munsell_0.5.0
## [9] codetools_0.2-16 preprocessCore_1.44.0
## [11] withr_2.1.2 colorspace_1.4-0
## [13] Category_2.48.0 OrganismDbi_1.24.0
## [15] highr_0.7 rstudioapi_0.9.0
## [17] setRNG_2013.9-1 GenomeInfoDbData_1.2.0
## [19] hwriter_1.3.2 bit64_0.9-7
## [21] xfun_0.5.3 biovizBase_1.30.1
## [23] R6_2.4.0 illuminaio_0.24.0
## [25] gridSVG_1.6-0 AnnotationFilter_1.6.0
## [27] bitops_1.0-6 reshape_0.8.8
## [29] assertthat_0.2.0 scales_1.0.0
## [31] nnet_7.3-12 gtable_0.2.0
## [33] Cairo_1.5-9 affy_1.60.0
## [35] ggbio_1.30.0 ensembldb_2.6.5
## [37] rlang_0.3.1 splines_3.5.1
## [39] rtracklayer_1.42.1 lazyeval_0.2.1
## [41] acepack_1.4.1 dichromat_2.0-0
## [43] checkmate_1.9.1 BiocManager_1.30.4
## [45] yaml_2.2.0 GenomicFeatures_1.34.3
## [47] backports_1.1.3 Hmisc_4.2-0
## [49] RBGL_1.58.1 tools_3.5.1
## [51] affyio_1.52.0 ff_2.2-14
## [53] Rcpp_1.0.1 base64enc_0.1-3
## [55] progress_1.2.0 zlibbioc_1.28.0
## [57] purrr_0.3.0 RCurl_1.95-4.11
## [59] prettyunits_1.0.2 rpart_4.1-13
## [61] openssl_1.2.1 cluster_2.0.7-1
## [63] magrittr_1.5 data.table_1.12.0
## [65] ProtGenerics_1.14.0 hms_0.4.2
## [67] evaluate_0.12 xtable_1.8-3
## [69] XML_3.98-1.16 gcrma_2.54.0
## [71] gridExtra_2.3 compiler_3.5.1
## [73] biomaRt_2.38.0 tibble_2.0.1
## [75] KernSmooth_2.23-15 crayon_1.3.4
## [77] ReportingTools_2.22.1 R.oo_1.22.0
## [79] htmltools_0.3.6 GOstats_2.48.0
## [81] mgcv_1.8-26 Formula_1.2-3
## [83] geneplotter_1.60.0 DBI_1.0.0
## [85] Matrix_1.2-15 vsn_3.50.0
## [87] R.methodsS3_1.7.1 gdata_2.18.0
## [89] pkgconfig_2.0.2 GenomicAlignments_1.18.1
## [91] foreign_0.8-71 xml2_1.2.0
## [93] foreach_1.4.4 annotate_1.60.0
## [95] BeadDataPackR_1.34.0 affyPLM_1.58.0
## [97] webshot_0.5.1 XVector_0.22.0
## [99] AnnotationForge_1.24.0 rvest_0.3.4
## [101] stringr_1.4.0 VariantAnnotation_1.28.10
## [103] digest_0.6.18 graph_1.60.0
## [105] Biostrings_2.50.2 rmarkdown_1.11
## [107] base64_2.0 htmlTable_1.13.1
## [109] edgeR_3.24.3 GSEABase_1.44.0
## [111] curl_3.3 Rsamtools_1.34.1
## [113] gtools_3.8.1 nlme_3.1-137
## [115] jsonlite_1.6 PFAM.db_3.7.0
## [117] viridisLite_0.3.0 askpass_1.1
## [119] limma_3.38.3 BSgenome_1.50.0
## [121] pillar_1.3.1 GGally_1.4.0
## [123] httr_1.4.0 survival_2.43-3
## [125] GO.db_3.7.0 glue_1.3.0.9000
## [127] iterators_1.0.10 bit_1.1-14
## [129] Rgraphviz_2.26.0 stringi_1.2.4
## [131] blob_1.1.1 oligoClasses_1.44.0
## [133] latticeExtra_0.6-28 caTools_1.17.1.1
## [135] memoise_1.1.0