In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). The results obtained for all experiments are combined in the file “count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 R package (Love et al. 2014).
The DESeq2 package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.
library(DESeq2)
library(dplyr)
library(reshape2)
library(ggplot2)
#Set ggplot theme as classic
theme_set(theme_classic())
countData <- read.table("/shared/projects/form_2022_23/data/rnaseq/count_data_diffAnalysis_no0counts.txt", row.names = 1, header = TRUE)
Visualize some general information : The count matrix stores for each samples (in columns) the number reads mapped to each ORF (in rows) counted with bedtools multicov
# Print the first 10 row names (ORF)
row.names(countData)[1:10]
## [1] "CPAR2_404750" "CPAR2_212570" "CPAR2_302130" "CPAR2_209760" "CPAR2_403260"
## [6] "CPAR2_209040" "CPAR2_804040" "CPAR2_211500" "CPAR2_102540" "CPAR2_109950"
You can see that the first 4 samples correspond to the 4 biological replicates of the anoxic conditions while there are 6 replicates for the normoxic condition
# Print samples name
colnames(countData)
## [1] "noO2rep1" "noO2rep2" "noO2rep3" "noO2rep4" "O2rep1" "O2rep2"
## [7] "O2rep3" "O2rep4" "O2rep5" "O2rep6"
head(countData)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
The design matrix store the information about the
biological condition associated to each samples:
N = Normoxic condition (with O2)
H = Hypoxic condition (without O2)
colData <- read.table("/shared/projects/form_2022_23/data/rnaseq/design.txt", row.names = 1, header = TRUE)
colData
## condition
## noO2rep1 H
## noO2rep2 H
## noO2rep3 H
## noO2rep4 H
## O2rep1 N
## O2rep2 N
## O2rep3 N
## O2rep4 N
## O2rep5 N
## O2rep6 N
DEseq2 use a specific format to store all the data required to perform the DE analysis. All the results of our analysis will also be stored in specific slots of this object.
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~condition)
Using head, you can obtain information about the structure of the object to know where information are stored
head(dds)
## class: DESeqDataSet
## dim: 6 10
## metadata(1): version
## assays(1): counts
## rownames(6): CPAR2_404750 CPAR2_212570 ... CPAR2_403260 CPAR2_209040
## rowData names(0):
## colnames(10): noO2rep1 noO2rep2 ... O2rep5 O2rep6
## colData names(1): condition
For instance, the count matrix can be access using : counts()
head(counts(dds))
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
Normalization is a necessary step to correct for different sequencing depth between samples. As you can see their is significant differences between total number of reads obtained for each samples.
# Bar plots
barplot(colSums(counts(dds)), main = "Total read counts per samples")
Histogramme of the total read count per library.
Calculate the size factor for each sample, which will be use as a correcting factor for sequencing depth (by dividing the raw counts by the size factors)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## 1.0032672 0.4149502 0.7813992 1.1207499 1.1734090 0.8409439 1.1246014 1.1915117
## O2rep5 O2rep6
## 1.6880443 1.3096478
# Row data
Row.data <- melt(log2(counts(dds)+1))
colnames(Row.data) <- c("ORF", "Samples","Expr.Val")
Row.data <- data.frame(Row.data,
Condition = c(substr(Row.data$Samples[1:(4*5788)], 1, 4),
substr(Row.data$Samples[(4*5788+1):nrow(Row.data)], 1, 2)))
ggplot(Row.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value before normalization")+
theme(axis.text.x=element_text(angle=45, hjust=1))
# Normalized data
Norm.data <- melt(log2(counts(dds, normalized=TRUE)+1))
colnames(Norm.data) <- c("ORF", "Samples","Expr.Val")
Norm.data <- data.frame(Norm.data,
Condition = c(substr(Norm.data$Samples[1:(4*5788)], 1, 4),
substr(Norm.data$Samples[(4*5788+1):nrow(Norm.data)], 1, 2)))
ggplot(Norm.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value after normalization") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Boxplot representing the per sample distribution, before and after normalization.
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
# PCA plot
plotPCA(vsd) +
ggtitle(label = "Principal Component Analysis (PCA)")
Scater plot of sample along PC1 and PC2 coordinates
dds <- estimateDispersions(dds)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
plotDispEsts(dds)
Scater plot representing the mean-dispersion relationship for each genes
dds <- nbinomWaldTest(dds)
res <- results(dds, contrast=c("condition","H","N"))
head(res)
## log2 fold change (MLE): condition H vs N
## Wald test p-value: condition H vs N
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CPAR2_404750 535.3861 0.4056168 0.168628 2.405389 0.01615524 0.0405212
## CPAR2_212570 540.6451 -0.0164868 0.128962 -0.127841 0.89827441 0.9366682
## CPAR2_302130 11.2238 1.6472390 0.556416 2.960445 0.00307195 0.0103797
## CPAR2_209760 924.9407 -0.1855940 0.166547 -1.114365 0.26512276 0.3816501
## CPAR2_403260 2448.5032 0.2607853 0.229521 1.136218 0.25586548 0.3710017
## CPAR2_209040 453.7463 0.8316175 0.315945 2.632161 0.00848436 0.0242099
# Generate a summary
summary(res)
##
## out of 5788 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1412, 24%
## LFC < 0 (down) : 1431, 25%
## outliers [1] : 4, 0.069%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Take a look at the log2FoldChange. What is the directionality of the calculated log2FC?
# Using "mcols()" you can obtain a description of the result information stored in columns
mcols(res, use.names=TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: cond..
## stat results Wald statistic: cond..
## pvalue results Wald test p-value: c..
## padj results BH adjusted p-values
# Display the whole column description
res@elementMetadata@listData[["description"]]
## [1] "mean of normalized counts for all samples"
## [2] "log2 fold change (MLE): condition H vs N"
## [3] "standard error: condition H vs N"
## [4] "Wald statistic: condition H vs N"
## [5] "Wald test p-value: condition H vs N"
## [6] "BH adjusted p-values"
Convert res into data frame
res <- as.data.frame(res)
ggplot(res, aes(x = pvalue)) +
geom_histogram(bins = 100) +
ggtitle(label = "P-value distibution")
Histogram of the p-value distribution
We will draw a Volcano plot using the EnhanceVolcano Bioconductor package
# EnhanceVolcano library loading
library(EnhancedVolcano)
# Threshold selection
logFCthreshold <- 2
adjPVthreshold <- 0.05
# Volcano plot drawing
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'padj',
axisLabSize = 12,
ylim = c(0,max(-log10(res[[6]]))),
caption = paste0(nrow(res), " total genes"),
captionLabSize = 12,
FCcutoff = logFCthreshold,
pCutoff = adjPVthreshold,
title = 'Hypoxia VS Normoxia',
subtitle = 'Volcano plot',
pointSize = 2.0,
labSize = 4.0,
legendLabSize = 12,
legendIconSize = 2.0,
drawConnectors = TRUE)
## Warning: ggrepel: 166 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plotMA(results(dds, contrast=c("condition","H","N")),
alpha = adjPVthreshold,
main = paste("MA plot (Hypoxia VS Normoxia) adjPV <",adjPVthreshold),
ylim=c(min(na.omit(res[[2]])),max(na.omit(res[[2]]))))
Genes with log2FoldChange > 2 and padj < 0.05
FoldFc_pos <- res[res$log2FoldChange > logFCthreshold & res$padj < adjPVthreshold,]
paste(nrow(FoldFc_pos), "upregulated genes in Hypoxia")
## [1] "109 upregulated genes in Hypoxia"
# To save the result in your home working directory
write.table(FoldFc_pos, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_pos.csv")
Genes with log2FoldChange < -2 and padj < 0.05
FoldFc_neg <- res[res$log2FoldChange <(-logFCthreshold) & res$padj < adjPVthreshold,]
paste(nrow(FoldFc_neg), "upregulated genes in Normaxia")
## [1] "85 upregulated genes in Normaxia"
# To save the result in your home working directory
write.table(FoldFc_neg, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_neg.csv")
Several tools are available online to evaluate the biological relevance of the gene sets you select after the differential analysis. For example you can you use the [GoTermFinder tool] (http://www.candidagenome.org/cgi-bin/GO/goTermFinder) dedicated to Candida yeast species to retrieve functional annotation. You can also obtain information on a specific gene using the Candida Genome Database.
# Date
format(Sys.time(), "%d %B, %Y, %H,%M")
## [1] "24 August, 2022, 14,25"
# Packages used
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.1.1/lib/libopenblasp-r0.3.18.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] EnhancedVolcano_1.12.0 ggrepel_0.9.1
## [3] ggplot2_3.3.6 reshape2_1.4.4
## [5] dplyr_1.0.9 DESeq2_1.34.0
## [7] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [9] MatrixGenerics_1.6.0 matrixStats_0.62.0
## [11] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [13] IRanges_2.28.0 S4Vectors_0.32.4
## [15] BiocGenerics_0.40.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 ash_1.0-15
## [4] RColorBrewer_1.1-3 httr_1.4.2 tools_4.1.1
## [7] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
## [10] vipor_0.4.5 KernSmooth_2.23-20 DBI_1.1.2
## [13] colorspace_2.0-3 withr_2.5.0 ggrastr_1.0.1
## [16] tidyselect_1.1.2 ggalt_0.4.0 extrafontdb_1.0
## [19] bit_4.0.4 compiler_4.1.1 cli_3.3.0
## [22] DelayedArray_0.20.0 labeling_0.4.2 sass_0.4.1
## [25] scales_1.2.0 proj4_1.0-11 genefilter_1.76.0
## [28] stringr_1.4.0 digest_0.6.29 rmarkdown_2.11
## [31] XVector_0.34.0 pkgconfig_2.0.3 htmltools_0.5.2
## [34] extrafont_0.17 maps_3.4.0 fastmap_1.1.0
## [37] highr_0.9 rlang_1.0.4 rstudioapi_0.13
## [40] RSQLite_2.2.8 jquerylib_0.1.4 generics_0.1.3
## [43] farver_2.1.1 jsonlite_1.8.0 BiocParallel_1.28.3
## [46] RCurl_1.98-1.7 magrittr_2.0.3 GenomeInfoDbData_1.2.7
## [49] Matrix_1.3-4 ggbeeswarm_0.6.0 Rcpp_1.0.9
## [52] munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.1
## [55] stringi_1.7.8 yaml_2.3.5 MASS_7.3-54
## [58] zlibbioc_1.40.0 plyr_1.8.7 grid_4.1.1
## [61] blob_1.2.2 parallel_4.1.1 crayon_1.5.1
## [64] lattice_0.20-45 Biostrings_2.62.0 splines_4.1.1
## [67] annotate_1.72.0 KEGGREST_1.34.0 locfit_1.5-9.4
## [70] knitr_1.37 pillar_1.8.0 geneplotter_1.72.0
## [73] XML_3.99-0.9 glue_1.6.2 evaluate_0.14
## [76] png_0.1-7 vctrs_0.4.1 Rttf2pt1_1.3.9
## [79] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [82] cachem_1.0.6 xfun_0.31 xtable_1.8-4
## [85] survival_3.2-13 tibble_3.1.7 beeswarm_0.4.0
## [88] AnnotationDbi_1.56.1 memoise_2.0.1 ellipsis_0.3.2