In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). The results obtained for all experiments are combined in the file “count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 R package (Love et al. 2014).
The DESeq2 package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.
library(DESeq2)
library(dplyr)
library(reshape2)
library(ggplot2)
#Set ggplot theme as classic
theme_set(theme_classic())
countData <- read.table("/shared/projects/ens_hts_2021/data/rnaseq/count_data_diffAnalysis.txt", row.names = 1, header = TRUE)
Visualize some general information : The count matrix stores for each samples (in columns) the number reads mapped to each ORF (in rows) counted with bedtools multicov
# Print the first 10 row names (ORF)
row.names(countData)[1:10]
## [1] "CPAR2_404750" "CPAR2_212570" "CPAR2_302130" "CPAR2_209760" "CPAR2_403260"
## [6] "CPAR2_209040" "CPAR2_804040" "CPAR2_211500" "CPAR2_102540" "CPAR2_109950"
You can see that the first 4 samples correspond to the 4 biological replicates of the anoxic conditions while there are 6 replicates for the normoxic condition
# Print samples name
colnames(countData)
## [1] "noO2rep1" "noO2rep2" "noO2rep3" "noO2rep4" "O2rep1" "O2rep2"
## [7] "O2rep3" "O2rep4" "O2rep5" "O2rep6"
head(countData)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
The design matrix store the information about the biological condition associated to each samples:
N = Normoxic condition (with O2)
H = Hypoxic condition (without O2)
colData <- read.table("/shared/projects/ens_hts_2021/data/rnaseq/design.txt", row.names = 1, header = TRUE)
colData
## condition
## noO2rep1 H
## noO2rep2 H
## noO2rep3 H
## noO2rep4 H
## O2rep1 N
## O2rep2 N
## O2rep3 N
## O2rep4 N
## O2rep5 N
## O2rep6 N
DEseq2 use a specific format to store all the data required to perform the DE analysis. All the results of our analysis will also be stored in specific slots of this object.
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~condition)
Using head, you can obtain information about the structure of the object to know where information are stored
head(dds)
## class: DESeqDataSet
## dim: 6 10
## metadata(1): version
## assays(1): counts
## rownames(6): CPAR2_404750 CPAR2_212570 ... CPAR2_403260 CPAR2_209040
## rowData names(0):
## colnames(10): noO2rep1 noO2rep2 ... O2rep5 O2rep6
## colData names(1): condition
For instance, the count matrix can be access using : counts()
head(counts(dds))
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
Normalization is a necessary step to correct for different sequencing depth between samples. As you can see their is significant differences between total number of reads obtained for each samples.
# Bar plots
barplot(colSums(counts(dds)), main = "Total read counts per samples")
Histogramme of the total read count per library.
Calculate the size factor for each sample, which will be use as a correcting factor for sequencing depth (by dividing the raw counts by the size factors)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## 1.0032672 0.4149502 0.7813992 1.1207499 1.1734090 0.8409439 1.1246014 1.1915117
## O2rep5 O2rep6
## 1.6880443 1.3096478
# Row data
Row.data <- melt(log2(counts(dds)+1))
colnames(Row.data) <- c("ORF", "Samples","Expr.Val")
Row.data <- data.frame(Row.data,
Condition = c(substr(Row.data$Samples[1:(4*5830)], 1, 4),
substr(Row.data$Samples[(4*5830+1):nrow(Row.data)], 1, 2)))
ggplot(Row.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value before normalization")+
theme(axis.text.x=element_text(angle=45, hjust=1))
# Normalized data
Norm.data <- melt(log2(counts(dds, normalized=TRUE)+1))
colnames(Norm.data) <- c("ORF", "Samples","Expr.Val")
Norm.data <- data.frame(Norm.data,
Condition = c(substr(Norm.data$Samples[1:(4*5830)], 1, 4),
substr(Norm.data$Samples[(4*5830+1):nrow(Norm.data)], 1, 2)))
ggplot(Norm.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value after normalization") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Boxplot representing the per sample distribution, before and after normalization.
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
# PCA plot
plotPCA(vsd) +
ggtitle(label = "Principal Component Analysis (PCA)")
Scater plot of sample along PC1 and PC2 coordinates
dds <- estimateDispersions(dds)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
plotDispEsts(dds)
Scater plot representing the mean-dispersion relationship for each genes
dds <- nbinomWaldTest(dds)
# Generate a summary
summary(results(dds))
##
## out of 5788 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1431, 25%
## LFC < 0 (down) : 1412, 24%
## outliers [1] : 4, 0.069%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
res <- results(dds)
head(res)
## log2 fold change (MLE): condition N vs H
## Wald test p-value: condition N vs H
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CPAR2_404750 535.3861 -0.4056168 0.168628 -2.405389 0.01615524 0.0405212
## CPAR2_212570 540.6451 0.0164868 0.128962 0.127841 0.89827441 0.9366682
## CPAR2_302130 11.2238 -1.6472390 0.556416 -2.960445 0.00307195 0.0103797
## CPAR2_209760 924.9407 0.1855940 0.166547 1.114365 0.26512276 0.3816501
## CPAR2_403260 2448.5032 -0.2607853 0.229521 -1.136218 0.25586548 0.3710017
## CPAR2_209040 453.7463 -0.8316175 0.315945 -2.632161 0.00848436 0.0242099
Take a look at the log2FoldChange. **What is the directionality of the calculated log2FC?
# Using "mcols()" you can obtain a description of the result information stored in columns
mcols(res, use.names=TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: cond..
## stat results Wald statistic: cond..
## pvalue results Wald test p-value: c..
## padj results BH adjusted p-values
# Display the whole column description
res@elementMetadata@listData[["description"]]
## [1] "mean of normalized counts for all samples"
## [2] "log2 fold change (MLE): condition N vs H"
## [3] "standard error: condition N vs H"
## [4] "Wald statistic: condition N vs H"
## [5] "Wald test p-value: condition N vs H"
## [6] "BH adjusted p-values"
Convert res into data frame
res <- as.data.frame(res)
ggplot(res, aes(x = pvalue)) +
geom_histogram(bins = 100) +
ggtitle(label = "P-value distibution")
Histogram of the p-value distribution
#Set a new data frame which will store all variable needed to generate the volcano plot
Vplot.data <- data.frame(gene = row.names(res),
padj = res$padj,
Log2FC = res$log2FoldChange)
# Remove any rows that have NA
Vplot.data <- na.omit(Vplot.data)
# Mark the genes which are significantly up regulated or down regulated according to the following thresholds
logFCthreshold <- 2
adjPVthreshold <- 0.05
Vplot.data <- mutate(Vplot.data, color = case_when(Vplot.data$Log2FC > logFCthreshold & Vplot.data$padj < adjPVthreshold ~ "Normoxic > Hypoxic",
Vplot.data$Log2FC < -logFCthreshold & Vplot.data$padj < adjPVthreshold ~ "Hypoxic > Normoxic",
Vplot.data$padj > adjPVthreshold | abs(Vplot.data$Log2FC) < logFCthreshold ~ "NS"))
# Build the ggplot. Note that we -log10() transform the padj so that gene with the lowest padj will have the highest y axis value
ggplot(Vplot.data, aes(x = Log2FC, y = -log10(padj), color = color)) + # Set the default plot
geom_point(size = 2.5, alpha = 0.8, na.rm = T) + # Set the dot size
scale_color_manual(name = "Expression change",
values = c("Normoxic > Hypoxic" = "#008B00",
"Hypoxic > Normoxic" = "#CD4F39",
"NS" = "darkgray")) +
geom_hline(yintercept = -log10(adjPVthreshold), colour = "red") + # p-adj value cutoff
geom_vline(xintercept = c(-logFCthreshold,logFCthreshold), colour = "red") + # Log2FC value cutoff
xlab(expression(log[2]("Fold change"))) + ylab(expression(-log[10]("adjusted p-value"))) +
ggtitle(label = "Normoxic VS Hypoxic", subtitle = paste("|Log2FC| >",logFCthreshold," & Padj <",adjPVthreshold))
Volcano plot with genes colored according to DE filters
plotMA(results(dds))
MA plot of DE test results
Genes with log2FoldChange > 2 and padj < 0.05
FoldFc_pos <- res[res$log2FoldChange > logFCthreshold & res$padj < adjPVthreshold,]
nrow(FoldFc_pos)
## [1] 127
# To save the result in your home working directory
write.table(FoldFc_pos, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_pos.csv")
Genes with log2FoldChange < -2 and padj < 0.05
FoldFc_neg <- res[res$log2FoldChange <(-logFCthreshold) & res$padj < adjPVthreshold,]
nrow(FoldFc_neg)
## [1] 151
# To save the result in your home working directory
write.table(FoldFc_neg, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_neg.csv")
Several tools are available online to evaluate the biological relevance of the gene sets you select after the differential analysis. For example you can you use the [GoTermFinder tool] (http://www.candidagenome.org/cgi-bin/GO/goTermFinder) dedicated to Candida yeast species to retrieve functional annotation. You can also obtain information on a specific gene using the Candida Genome Database.
# Date
format(Sys.time(), "%d %B, %Y, %H,%M")
## [1] "19 August, 2021, 17,34"
# Packages used
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.3/lib/libopenblasp-r0.3.10.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.3.4 reshape2_1.4.4
## [3] dplyr_1.0.6 DESeq2_1.30.1
## [5] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [7] MatrixGenerics_1.2.1 matrixStats_0.59.0
## [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
## [11] IRanges_2.24.1 S4Vectors_0.28.1
## [13] BiocGenerics_0.36.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0 bit64_4.0.5
## [4] jsonlite_1.7.2 splines_4.0.3 bslib_0.2.5.1
## [7] assertthat_0.2.1 highr_0.9 blob_1.2.1
## [10] GenomeInfoDbData_1.2.4 yaml_2.2.1 pillar_1.6.1
## [13] RSQLite_2.2.7 lattice_0.20-41 glue_1.4.2
## [16] digest_0.6.27 RColorBrewer_1.1-2 XVector_0.30.0
## [19] colorspace_2.0-1 plyr_1.8.6 htmltools_0.5.1.1
## [22] Matrix_1.3-4 XML_3.99-0.6 pkgconfig_2.0.3
## [25] genefilter_1.72.1 zlibbioc_1.36.0 purrr_0.3.4
## [28] xtable_1.8-4 scales_1.1.1 BiocParallel_1.24.1
## [31] tibble_3.1.2 annotate_1.68.0 farver_2.1.0
## [34] generics_0.1.0 ellipsis_0.3.2 withr_2.4.2
## [37] cachem_1.0.5 survival_3.2-10 magrittr_2.0.1
## [40] crayon_1.4.1 memoise_2.0.0 evaluate_0.14
## [43] fansi_0.5.0 tools_4.0.3 lifecycle_1.0.0
## [46] stringr_1.4.0 munsell_0.5.0 locfit_1.5-9.4
## [49] DelayedArray_0.16.3 AnnotationDbi_1.52.0 compiler_4.0.3
## [52] jquerylib_0.1.4 rlang_0.4.11 grid_4.0.3
## [55] RCurl_1.98-1.3 labeling_0.4.2 bitops_1.0-7
## [58] rmarkdown_2.8 gtable_0.3.0 DBI_1.1.1
## [61] R6_2.5.0 knitr_1.33 fastmap_1.1.0
## [64] bit_4.0.4 utf8_1.2.1 stringi_1.6.2
## [67] Rcpp_1.0.6 vctrs_0.3.8 geneplotter_1.68.0
## [70] tidyselect_1.1.1 xfun_0.23