1. Introduction

In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). The results obtained for all experiments are combined in the file “count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 R package (Love et al. 2014).

The DESeq2 package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.

Aim: Searching for differentially expressed genes using DESeq R package.

How many genes are selected with different p-value thresholds (5%, 1%, etc.) ?
Check your results with IGV and use GOtermFinder to analyse the function of the selected genes.

2. Prepare the data for statistical analysis

Library loading

library(DESeq2)

library(dplyr)
library(reshape2)
library(ggplot2)

#Set ggplot theme as classic
theme_set(theme_classic())

Load the raw count matrix

countData <- read.table("/shared/projects/ens_hts_2021/data/rnaseq/count_data_diffAnalysis.txt", row.names = 1, header = TRUE)

Visualize some general information : The count matrix stores for each samples (in columns) the number reads mapped to each ORF (in rows) counted with bedtools multicov

# Print the first 10 row names (ORF)
row.names(countData)[1:10]

##  [1] "CPAR2_404750" "CPAR2_212570" "CPAR2_302130" "CPAR2_209760" "CPAR2_403260"
##  [6] "CPAR2_209040" "CPAR2_804040" "CPAR2_211500" "CPAR2_102540" "CPAR2_109950"

You can see that the first 4 samples correspond to the 4 biological replicates of the anoxic conditions while there are 6 replicates for the normoxic condition

# Print samples name
colnames(countData)

##  [1] "noO2rep1" "noO2rep2" "noO2rep3" "noO2rep4" "O2rep1"   "O2rep2"  
##  [7] "O2rep3"   "O2rep4"   "O2rep5"   "O2rep6"

head(countData)

##              noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750      607      318      461      619    481    487    584    561
## CPAR2_212570      536      257      437      496    633    457    552    648
## CPAR2_302130       24        9       14       14      1      6     10      8
## CPAR2_209760      877      315      550     1207   1400    735    922   1186
## CPAR2_403260     2896      740     3019     2619   1951   1858   2615   2365
## CPAR2_209040      860      258      258      730    289    390    572    452
##              O2rep5 O2rep6
## CPAR2_404750    626    642
## CPAR2_212570    984    721
## CPAR2_302130     11      8
## CPAR2_209760   1585   1328
## CPAR2_403260   5065   3185
## CPAR2_209040    329    372

Load metadata of the experiment (design matrix)

The design matrix store the information about the biological condition associated to each samples:
N = Normoxic condition (with O2)
H = Hypoxic condition (without O2)

colData <- read.table("/shared/projects/ens_hts_2021/data/rnaseq/design.txt", row.names = 1, header = TRUE)

colData

##          condition
## noO2rep1         H
## noO2rep2         H
## noO2rep3         H
## noO2rep4         H
## O2rep1           N
## O2rep2           N
## O2rep3           N
## O2rep4           N
## O2rep5           N
## O2rep6           N

Creation of the DEseq2’s specific object DESeqDataSet

DEseq2 use a specific format to store all the data required to perform the DE analysis. All the results of our analysis will also be stored in specific slots of this object.

dds <- DESeqDataSetFromMatrix(countData = countData,
                             colData = colData,
                             design = ~condition)

Using head, you can obtain information about the structure of the object to know where information are stored

head(dds)

## class: DESeqDataSet 
## dim: 6 10 
## metadata(1): version
## assays(1): counts
## rownames(6): CPAR2_404750 CPAR2_212570 ... CPAR2_403260 CPAR2_209040
## rowData names(0):
## colnames(10): noO2rep1 noO2rep2 ... O2rep5 O2rep6
## colData names(1): condition

For instance, the count matrix can be access using : counts()

head(counts(dds))

##              noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750      607      318      461      619    481    487    584    561
## CPAR2_212570      536      257      437      496    633    457    552    648
## CPAR2_302130       24        9       14       14      1      6     10      8
## CPAR2_209760      877      315      550     1207   1400    735    922   1186
## CPAR2_403260     2896      740     3019     2619   1951   1858   2615   2365
## CPAR2_209040      860      258      258      730    289    390    572    452
##              O2rep5 O2rep6
## CPAR2_404750    626    642
## CPAR2_212570    984    721
## CPAR2_302130     11      8
## CPAR2_209760   1585   1328
## CPAR2_403260   5065   3185
## CPAR2_209040    329    372

3. Normalization

Normalization is a necessary step to correct for different sequencing depth between samples. As you can see their is significant differences between total number of reads obtained for each samples.

# Bar plots
barplot(colSums(counts(dds)), main = "Total read counts per samples")

Histogramme of the total read count per library.

Calculate the size factor

Calculate the size factor for each sample, which will be use as a correcting factor for sequencing depth (by dividing the raw counts by the size factors)

dds <- estimateSizeFactors(dds)
sizeFactors(dds)

##  noO2rep1  noO2rep2  noO2rep3  noO2rep4    O2rep1    O2rep2    O2rep3    O2rep4 
## 1.0032672 0.4149502 0.7813992 1.1207499 1.1734090 0.8409439 1.1246014 1.1915117 
##    O2rep5    O2rep6 
## 1.6880443 1.3096478

# Row data
Row.data <- melt(log2(counts(dds)+1))
colnames(Row.data) <- c("ORF", "Samples","Expr.Val")
Row.data <- data.frame(Row.data,
                 Condition = c(substr(Row.data$Samples[1:(4*5830)], 1, 4),
                               substr(Row.data$Samples[(4*5830+1):nrow(Row.data)], 1, 2)))

ggplot(Row.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
  geom_boxplot() +
  ylab(expression(log[2](count + 1))) + 
  ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value before normalization")+
  theme(axis.text.x=element_text(angle=45, hjust=1))

# Normalized data
Norm.data <- melt(log2(counts(dds, normalized=TRUE)+1))
colnames(Norm.data) <- c("ORF", "Samples","Expr.Val")
Norm.data <- data.frame(Norm.data,
                 Condition = c(substr(Norm.data$Samples[1:(4*5830)], 1, 4),
                               substr(Norm.data$Samples[(4*5830+1):nrow(Norm.data)], 1, 2)))


ggplot(Norm.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
  geom_boxplot() +
  ylab(expression(log[2](count + 1))) + 
  ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value after normalization") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Boxplot representing the per sample distribution, before and after normalization.

PCA of samples

vsd <- varianceStabilizingTransformation(dds, blind=TRUE)

# PCA plot
plotPCA(vsd) + 
  ggtitle(label = "Principal Component Analysis (PCA)")

Scater plot of sample along PC1 and PC2 coordinates

4. Differential expression analysis

Variance/Dispersion estimation

dds <- estimateDispersions(dds)

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

plotDispEsts(dds)

Scater plot representing the mean-dispersion relationship for each genes

Negative binomial wald test implemented by DEseq2

dds <- nbinomWaldTest(dds)

# Generate a summary
summary(results(dds))

## 
## out of 5788 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1431, 25%
## LFC < 0 (down)     : 1412, 24%
## outliers [1]       : 4, 0.069%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Retrieve the result table

res <- results(dds)
head(res)

## log2 fold change (MLE): condition N vs H 
## Wald test p-value: condition N vs H 
## DataFrame with 6 rows and 6 columns
##               baseMean log2FoldChange     lfcSE      stat     pvalue      padj
##              <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
## CPAR2_404750  535.3861     -0.4056168  0.168628 -2.405389 0.01615524 0.0405212
## CPAR2_212570  540.6451      0.0164868  0.128962  0.127841 0.89827441 0.9366682
## CPAR2_302130   11.2238     -1.6472390  0.556416 -2.960445 0.00307195 0.0103797
## CPAR2_209760  924.9407      0.1855940  0.166547  1.114365 0.26512276 0.3816501
## CPAR2_403260 2448.5032     -0.2607853  0.229521 -1.136218 0.25586548 0.3710017
## CPAR2_209040  453.7463     -0.8316175  0.315945 -2.632161 0.00848436 0.0242099

Take a look at the log2FoldChange. **What is the directionality of the calculated log2FC?

# Using "mcols()" you can obtain a description of the result information stored in columns 
mcols(res, use.names=TRUE)

## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: cond..
## stat                results Wald statistic: cond..
## pvalue              results Wald test p-value: c..
## padj                results   BH adjusted p-values

# Display the whole column description
res@elementMetadata@listData[["description"]]

## [1] "mean of normalized counts for all samples"
## [2] "log2 fold change (MLE): condition N vs H" 
## [3] "standard error: condition N vs H"         
## [4] "Wald statistic: condition N vs H"         
## [5] "Wald test p-value: condition N vs H"      
## [6] "BH adjusted p-values"

Convert res into data frame

res <- as.data.frame(res)

Diagnostic plots

Histogram of the p values

ggplot(res, aes(x = pvalue)) +
  geom_histogram(bins = 100) + 
  ggtitle(label = "P-value distibution")

Histogram of the p-value distribution

Volcano plot

#Set a new data frame which will store all variable needed to generate the volcano plot
Vplot.data <- data.frame(gene = row.names(res),
                         padj = res$padj, 
                         Log2FC = res$log2FoldChange)

# Remove any rows that have NA 
Vplot.data <- na.omit(Vplot.data)

# Mark the genes which are significantly up regulated or down regulated according to the following thresholds

logFCthreshold <- 2
adjPVthreshold <- 0.05

Vplot.data <- mutate(Vplot.data, color = case_when(Vplot.data$Log2FC > logFCthreshold & Vplot.data$padj < adjPVthreshold ~ "Normoxic > Hypoxic",
                                                   Vplot.data$Log2FC < -logFCthreshold & Vplot.data$padj < adjPVthreshold ~ "Hypoxic > Normoxic",
                                                   Vplot.data$padj > adjPVthreshold | abs(Vplot.data$Log2FC) < logFCthreshold ~ "NS"))

# Build the ggplot. Note that we -log10() transform the padj so that gene with the lowest padj will have the highest y axis value
ggplot(Vplot.data, aes(x = Log2FC, y = -log10(padj), color = color)) + # Set the default plot
  geom_point(size = 2.5, alpha = 0.8, na.rm = T) + # Set the dot size
  scale_color_manual(name = "Expression change",
                     values = c("Normoxic > Hypoxic" = "#008B00",
                                "Hypoxic > Normoxic" = "#CD4F39",
                                "NS" = "darkgray")) +
  geom_hline(yintercept = -log10(adjPVthreshold), colour = "red") + #  p-adj value cutoff
  geom_vline(xintercept = c(-logFCthreshold,logFCthreshold), colour = "red") + # Log2FC value cutoff
  xlab(expression(log[2]("Fold change"))) + ylab(expression(-log[10]("adjusted p-value"))) +
  ggtitle(label = "Normoxic VS Hypoxic", subtitle = paste("|Log2FC| >",logFCthreshold," & Padj <",adjPVthreshold))

Volcano plot with genes colored according to DE filters

MA plot

plotMA(results(dds))

MA plot of DE test results

Export results

Genes with log2FoldChange > 2 and padj < 0.05

FoldFc_pos <- res[res$log2FoldChange > logFCthreshold & res$padj < adjPVthreshold,]
nrow(FoldFc_pos)

## [1] 127

# To save the result in your home working directory
write.table(FoldFc_pos, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_pos.csv")

Genes with log2FoldChange < -2 and padj < 0.05

FoldFc_neg <- res[res$log2FoldChange <(-logFCthreshold) & res$padj < adjPVthreshold,]
nrow(FoldFc_neg)

## [1] 151

# To save the result in your home working directory
write.table(FoldFc_neg, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_neg.csv")

5. Functional analyzes of differentially expressed genes

Several tools are available online to evaluate the biological relevance of the gene sets you select after the differential analysis. For example you can you use the [GoTermFinder tool] (http://www.candidagenome.org/cgi-bin/GO/goTermFinder) dedicated to Candida yeast species to retrieve functional annotation. You can also obtain information on a specific gene using the Candida Genome Database.

What are the functions of the genes in your lists (identified at the previous step).

Are they relevant with the studied biological system described by Guida et al. in their publication?

Reproductibility

# Date
format(Sys.time(), "%d %B, %Y, %H,%M")

## [1] "19 August, 2021, 17,34"

# Packages used
sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.3/lib/libopenblasp-r0.3.10.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.4               reshape2_1.4.4             
##  [3] dplyr_1.0.6                 DESeq2_1.30.1              
##  [5] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [7] MatrixGenerics_1.2.1        matrixStats_0.59.0         
##  [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [11] IRanges_2.24.1              S4Vectors_0.28.1           
## [13] BiocGenerics_0.36.1        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2             sass_0.4.0             bit64_4.0.5           
##  [4] jsonlite_1.7.2         splines_4.0.3          bslib_0.2.5.1         
##  [7] assertthat_0.2.1       highr_0.9              blob_1.2.1            
## [10] GenomeInfoDbData_1.2.4 yaml_2.2.1             pillar_1.6.1          
## [13] RSQLite_2.2.7          lattice_0.20-41        glue_1.4.2            
## [16] digest_0.6.27          RColorBrewer_1.1-2     XVector_0.30.0        
## [19] colorspace_2.0-1       plyr_1.8.6             htmltools_0.5.1.1     
## [22] Matrix_1.3-4           XML_3.99-0.6           pkgconfig_2.0.3       
## [25] genefilter_1.72.1      zlibbioc_1.36.0        purrr_0.3.4           
## [28] xtable_1.8-4           scales_1.1.1           BiocParallel_1.24.1   
## [31] tibble_3.1.2           annotate_1.68.0        farver_2.1.0          
## [34] generics_0.1.0         ellipsis_0.3.2         withr_2.4.2           
## [37] cachem_1.0.5           survival_3.2-10        magrittr_2.0.1        
## [40] crayon_1.4.1           memoise_2.0.0          evaluate_0.14         
## [43] fansi_0.5.0            tools_4.0.3            lifecycle_1.0.0       
## [46] stringr_1.4.0          munsell_0.5.0          locfit_1.5-9.4        
## [49] DelayedArray_0.16.3    AnnotationDbi_1.52.0   compiler_4.0.3        
## [52] jquerylib_0.1.4        rlang_0.4.11           grid_4.0.3            
## [55] RCurl_1.98-1.3         labeling_0.4.2         bitops_1.0-7          
## [58] rmarkdown_2.8          gtable_0.3.0           DBI_1.1.1             
## [61] R6_2.5.0               knitr_1.33             fastmap_1.1.0         
## [64] bit_4.0.4              utf8_1.2.1             stringi_1.6.2         
## [67] Rcpp_1.0.6             vctrs_0.3.8            geneplotter_1.68.0    
## [70] tidyselect_1.1.1       xfun_0.23

Analysis of RNA-seq data using DEseq2

Stéphane Le Crom, Matthieu Moreau & Gaëlle Lelandais

August 2021