1. Introduction

Aim: Getting started with statistical approaches and bioinformatics tools commonly used to analyze microarray experiments and to select genes according to their expression profiles.

This practical is divided into three sections:

Correction of experimental biases
Detection of differentially expressed genes
Clustering of co-expressed genes

The first microarray datasets were collected from the publication of Guida et al. (2011).
The authors used high-throughput technologies (microarrays and high-throughput sequencing) to determine the transcriptional profile of the pathogenic yeast Candida parapsilosis grown under several conditions, including different media, temperatures and oxygen concentrations.
We will use the datasets related to the study of the hypoxic (low oxygen) response in C. parapsilosis.

2. Preprocessing of the raw data : Correction of experimental biases

The experiments compared one cell culture incubated under atmospheric oxygen conditions (called normoxic and labelled with the Cy3 dye) with another one incubated in 1% O2 (called hypoxic and labelled with the Cy5 dye).

Reading of GPR file

Input : A GPR file with detailed information for each spot on the slide (gene name, Cy5 and Cy3 intensity values, background intensities and other statistics).

Library loading

library(marray)

The R package marray offers several functions to:

Read GPR files
Draw graphical representations of microarray results (foreground and background signals, missing values, MA plots, etc.)
Perform the normalization between Cy5 and Cy3 signals

# Read the GPR file using the marray package function read.GenePix
rawdata <- read.GenePix(fnames="dataFile1_normAnalysis.gpr",
                        path= "/shared/projects/ens_hts_2021/data/microarrays/data")

## Reading ...  /shared/projects/ens_hts_2021/data/microarrays/data/dataFile1_normAnalysis.gpr

Note : This function reads a GPR file and creates objects of class “marrayRaw”. In these objects, you can find, for instance, vectors with intensity values (“rawdata@maRf” or “rawdata@maGf”). These vectors can be manipulated using classical R functions like “summary()”, “hist()”, etc.

Take a few minutes to better understand the structure of the R object “marrayRaw”. Start, for instance, by manipulating the vectors with foreground signals (“rawdata@maRf” or “rawdata@maGf”).

# f is for foreground
# Intensity values in red/hypoxic channel
head(rawdata@maRf)

##      /shared/projects/ens_hts_2021/data/microarrays/data/dataFile1_normAnalysis.gpr
## [1,]                                                                            992
## [2,]                                                                           1907
## [3,]                                                                            559
## [4,]                                                                            645
## [5,]                                                                          32939
## [6,]                                                                            681

# Intensity values in green/normoxic channel
head(rawdata@maGf)

##      /shared/projects/ens_hts_2021/data/microarrays/data/dataFile1_normAnalysis.gpr
## [1,]                                                                           2561
## [2,]                                                                           2585
## [3,]                                                                           1588
## [4,]                                                                           1604
## [5,]                                                                          43755
## [6,]                                                                           1732
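
The intensity slots shown above are ordinary numeric matrices, so base R summaries apply to them directly. The following is a minimal sketch on simulated intensities: the names `red` and `green` are hypothetical stand-ins for “rawdata@maRf” and “rawdata@maGf”, and the values are simulated, not read from the GPR file.

```r
set.seed(1)
# Simulated foreground intensities
# (hypothetical stand-ins for rawdata@maRf and rawdata@maGf)
red   <- 2^rnorm(1000, mean = 10,   sd = 1.5)
green <- 2^rnorm(1000, mean = 10.5, sd = 1.5)

# Raw intensities are strongly right-skewed;
# the log2 scale makes the two channels easier to compare
summary(red)
summary(log2(red))

# Side-by-side histograms of the two channels on the log2 scale
op <- par(mfrow = c(1, 2))
hist(log2(red),   breaks = 50, col = "red",
     main = "Red channel",   xlab = "log2(intensity)")
hist(log2(green), breaks = 50, col = "green",
     main = "Green channel", xlab = "log2(intensity)")
par(op)
```

With the real object, the same calls can be applied to “rawdata@maRf” and “rawdata@maGf”.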

Visualization of foreground signal

# Red/hypoxic signals
image(rawdata,
      xvar = "maRf",
      main = "Hypoxic signal (with flags)")

## [1] FALSE

## NULL

# Green/normoxic signals
image(rawdata,
      xvar = "maGf",
      main = "Normoxic signal (with flags)")

## [1] FALSE

## NULL

Evaluation of data quality

Visualization of background signal

Visualize background signals in Red/Hypoxic and Green/Normoxic channels. Try to interpret the obtained results. How is the quality of the experiment?

# b is for background
# Red/Hypoxic channel background signals
image(rawdata,
      xvar = "maRb",
      main = "Hypoxic background (with flags)")

## [1] FALSE

## NULL

# Green/Normoxic channel background signals
image(rawdata,
      xvar = "maGb",
      main = "Normoxic background (with flags)")

## [1] FALSE

## NULL

“Flag” locations on the slide

Each spot is automatically assigned a flag value that reports quality information.

Flag values :

-50 not found
-75 empty
-100 bad
0 good

Manipulate flags annotation.

How many spots are there for each flag value?
What does that mean?
Visualize the location of the flags on the slide.
Do you detect any problem?

Manipulation of spots with flags

Display distribution of spots by flag values

# Spots with certain values will be eliminated from further analysis
table(rawdata@maW)

## 
## -100  -75  -50    0 
##  103  192 5922 9335
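
To make such a table easier to read, the numeric codes can be mapped to the labels listed earlier and expressed as proportions. This is only a sketch on a short simulated vector: `flags` is a hypothetical stand-in for “rawdata@maW”, and `flag.labels` is a helper introduced here for illustration.

```r
# Simulated flag vector (hypothetical stand-in for rawdata@maW)
flags <- c(0, 0, -50, -75, 0, -100, -50, 0, 0, -50)

# Map the numeric codes to the labels of the flag list
flag.labels <- c("-100" = "bad", "-75" = "empty",
                 "-50" = "not found", "0" = "good")
flag.names  <- factor(unname(flag.labels[as.character(flags)]),
                      levels = unname(flag.labels))

# Number and percentage of spots per flag category
table(flag.names)
round(100 * prop.table(table(flag.names)), 1)

# Proportion of spots that would be kept (flag equal to 0)
mean(flags == 0)
```

Applied to the counts printed above, the same reasoning gives 9335 good spots out of 15552, that is, about 60% of the slide.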

Flag location on the slide

# We take 5 colors in the default palette (thanks to Quentin Lamy for this trick)
MyColor <- palette()[1:5]
MyColor[5] = "white"

image(rawdata,
      xvar = "maW",
      col = MyColor,
      zlim = c(min(rawdata@maW), max(rawdata@maW)),
      main = "Location of flags on the slide")

## [1] FALSE

## NULL

Remove the background intensity values of flagged spots by replacing them with NA

# We work on a copy of the rawdata
rawdataWithoutFlags <- rawdata

rawdataWithoutFlags@maRb[rawdataWithoutFlags@maW < 0] = NA
rawdataWithoutFlags@maGb[rawdataWithoutFlags@maW < 0] = NA

Visualization of background signals without flags

image(rawdata,
      xvar = "maRb",
      main = "Hypoxic background (with flags)",
      colorinfo =F)

## [1] FALSE

## NULL

image(rawdataWithoutFlags,
      xvar = "maRb",
      main = "Hypoxic background (without flag)",
      colorinfo =F)

## [1] FALSE

## NULL

Comparison of the red/hypoxic background signal with and without flagged spots

image(rawdata,
      xvar = "maGb",
      main = "Normoxic background (with flags)",
      colorinfo =F)

## [1] FALSE

## NULL

image(rawdataWithoutFlags,
      xvar = "maGb",
      main = "Normoxic background (without flag)",
      colorinfo =F)

## [1] FALSE

## NULL

Comparison of the green/normoxic background signal with and without flagged spots

We will now correct for experimental biases. To do so, it is important to exclude all the spots for which the Flag values are negative. For that, intensity values in foreground and background signals have to be replaced with the R symbol “NA” (missing values, “Not Available”).

Filter flagged spots

Negatively flagged spots will be eliminated from further analyses by replacing their intensity values by NA (missing values)

#Signal value to NA
rawdataWithoutFlags@maRf[rawdataWithoutFlags@maW < 0] = NA 
rawdataWithoutFlags@maGf[rawdataWithoutFlags@maW < 0] = NA
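
The masking relies on logical indexing: the comparison on the flags returns TRUE for every flagged spot, and those positions are overwritten with NA. A small sketch of the mechanism, assuming six spots whose foreground values are the ones printed earlier and whose flags are hypothetical:

```r
# Foreground values of six spots and hypothetical flags for them
foreground <- c(992, 1907, 559, 645, 32939, 681)
flags      <- c(0, -50, 0, -100, 0, -75)

# Work on a copy and overwrite every negatively flagged spot with NA
masked <- foreground
masked[flags < 0] <- NA
masked

# Number of spots removed from the analysis
sum(is.na(masked))

# Once NA values are present, summaries need na.rm = TRUE
mean(masked)                 # NA
mean(masked, na.rm = TRUE)   # mean of the three remaining spots
```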

Background correction

An intuitive approach to background correction consists in subtracting the background intensity values (“rawdata@maRb” and “rawdata@maGb”) from the foreground intensities (“rawdata@maRf” and “rawdata@maGf”). This method is debatable, however, mainly because it produces overestimated log(Ratio) values at low intensities and adds more noise to the data than expected. For this reason, the following analyses are performed without background subtraction.

#Replace all background by 0
rawdataWithoutFlags@maGb[] = 0
rawdataWithoutFlags@maRb[] = 0
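
The effect of background subtraction on low intensities can be checked with a little arithmetic. The sketch below uses two hypothetical spots with the same red/green ratio, one weak and one strong, and assumes the same background of 100 in both channels; all values are invented for illustration.

```r
# Hypothetical intensities for a weakly and a strongly expressed spot
red.fg     <- c(low = 120, high = 12000)
green.fg   <- c(low = 110, high = 11000)
background <- 100   # same background assumed in both channels

# log2 ratio without background subtraction
m.raw <- log2(red.fg / green.fg)

# log2 ratio after subtracting the background from both channels
m.sub <- log2((red.fg - background) / (green.fg - background))

round(rbind(no.subtraction = m.raw, with.subtraction = m.sub), 3)
```

Without subtraction both spots have the same log ratio (about 0.13). After subtraction the strong spot barely changes, whereas the weak spot jumps to log2(20/10) = 1, which illustrates how small differences near the background level are amplified.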

Normalization

Comparison of Cy5/Hypoxic and Cy3/Normoxic global signals

Draw the MA plot between Cy5/Hypoxic and Cy3/Normoxic signals.

Is there any experimental bias?
What information does the boxplot representation give?
What kind of normalization method needs to be applied?

plot(rawdataWithoutFlags,legend.func = NULL, main = "MA plot before normalization")
plot(rawdataWithoutFlags, main = "MA plot before normalization")

boxplot(rawdataWithoutFlags, main = "Boxplot before normalization")
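
For reference, the two axes of an MA plot can be computed by hand: M is the log2 ratio of the two channels and A is their average log2 intensity. The sketch below uses the first six foreground values printed earlier and assumes no background subtraction, as in this practical; the names `red`, `green`, `M` and `A` are chosen here for illustration.

```r
# First six foreground values of each channel, as printed earlier
red   <- c(992, 1907, 559, 645, 32939, 681)     # Cy5 / hypoxic
green <- c(2561, 2585, 1588, 1604, 43755, 1732) # Cy3 / normoxic

# M: log2 ratio (hypoxic / normoxic); A: average log2 intensity
M <- log2(red / green)
A <- (log2(red) + log2(green)) / 2

round(cbind(A, M), 2)

# Minimal MA plot; unbiased data would scatter around M = 0
plot(A, M, pch = 21, main = "MA plot (six spots)")
abline(h = 0, lty = 2)
```

All six M values are negative here, meaning that the green signal is higher than the red one for these spots.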

At this stage, you can compare the MA plot and boxplot of the data with background subtraction if you want to visualize the direct impact of background subtraction on the data distribution.

Comparison of three different normalization procedures for signal normalization

Normalize the intensity measures between Cy5/Hypoxic and Cy3/Normoxic signals.

Try different normalization methods (“median”, “loess” and “printTipLoess”).
Draw the associated MA plot and boxplot (after normalization). What are the differences with the graphs obtained before normalization?
Draw the log2(R/G) distribution before and after normalization. How do you interpret the results?

rawdataWithoutFlagsNorm <- maNorm(rawdataWithoutFlags, norm = "median", echo = T)

## Normalization method: median.
## Normalizing array 1.

rawdataWithoutFlagsNorm2 <- maNorm(rawdataWithoutFlags, norm = "loess", echo = T)

## Normalization method: loess.
## Normalizing array 1.

rawdataWithoutFlagsNorm3 <- maNorm(rawdataWithoutFlags, norm = "printTipLoess", echo = T)

## Normalization method: printTipLoess.
## Normalizing array 1.
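
To see what distinguishes these methods, the idea behind “median” and “loess” can be reproduced in base R. This is a rough sketch on simulated MA values with an intensity-dependent bias, not the code used by marray, whose implementation differs in its details; “printTipLoess” follows the same principle as “loess” but fits one curve per print-tip group.

```r
set.seed(42)
# Simulated MA values with an intensity-dependent dye bias (hypothetical data)
A <- runif(2000, min = 6, max = 15)
M <- 0.8 - 0.08 * A + rnorm(2000, sd = 0.3)

# Global median normalization: shift all log ratios by one constant
M.median <- M - median(M)

# Loess normalization: subtract a smooth trend of M as a function of A
trend   <- fitted(loess(M ~ A, span = 0.4, degree = 1))
M.loess <- M - trend

# Both methods centre the log ratios around 0 ...
c(median = median(M.median), loess = median(M.loess))

# ... but only loess removes the dependence of M on A
c(median = cor(A, M.median), loess = cor(A, M.loess))
```

A single constant shift cannot correct a bias that changes with intensity, which is why the MA plot, rather than the boxplot alone, is needed to choose the method.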

Several plots allow for comparison of the normalization methods

plot(rawdataWithoutFlagsNorm, legend.func = NULL, main = "norm = Median")

plot(rawdataWithoutFlagsNorm2, legend.func = NULL, main = "norm = Loess")

plot(rawdataWithoutFlagsNorm3, legend.func = NULL, main = "norm = printTipLoess")

boxplot(rawdataWithoutFlagsNorm, main = "norm = Median")

boxplot(rawdataWithoutFlagsNorm2, main = "norm = Loess")

boxplot(rawdataWithoutFlagsNorm3, main = "norm = printTipLoess")

plot(density(maM(rawdataWithoutFlagsNorm2),na.rm = T),
     lwd = 2, col = 2, main = "Distribution of log(Ratio)")
lines(density(maM(rawdataWithoutFlags), na.rm = T), lwd = 2)
abline(v = 0)
legend(x= 0.5, y= 1.2,c("Before normalization","After normalization with loess"), fill = c(1,2))

3. Search for differentially expressed genes

In their article (Guida et al., 2011), the authors repeated the experiment 4 times for the normoxic condition (with O2) and 4 times for the hypoxic condition (without O2). Gene expression in the two conditions was compared using microarrays (Ratio = hypoxia / normoxia).

We will perform the differential expression (DE) analysis using the limma package.

Library loading

library(limma)

Data loading of the normalized log ratio intensity value for each replicate

Input: A text file with four different biological replicates (after normalization).

dataFile <- "/shared/projects/ens_hts_2021/data/microarrays/data/dataFile_diffAnalysis.txt"
data <- as.matrix(read.table(dataFile, row.names = 1, header = T))

# Retrieve some information from the data table
dim(data)

## [1] 5526    4

data[1:10,1:4]

##                   logVal1      logVal2      logVal3     logVal4
## CPAR2_201050 -0.265265616 -0.130465012  0.008997103 -0.06624613
## CPAR2_101960 -0.843512598 -0.608422137 -0.103000282 -0.45358870
## CPAR2_101290  0.056414092  0.000296908 -0.068354697  0.05983511
## CPAR2_405520  0.464588136  0.509999239  0.284349940  0.44530769
## CPAR2_201590 -0.230207648 -0.176294382 -0.265324830 -0.24833664
## CPAR2_103750 -0.194992750 -0.186335163  0.191242260 -0.57185971
## CPAR2_100170 -0.132982234 -0.191465175 -0.126354218  0.00331530
## CPAR2_202790  0.973402061  0.853915233  0.808972712  0.74969076
## CPAR2_301860 -0.008917937  0.018171339 -0.021780941  0.16899955
## CPAR2_106430 -1.598703129 -1.508676852 -0.642865880 -0.87494246

summary(data)

##     logVal1            logVal2             logVal3             logVal4         
##  Min.   :-3.20789   Min.   :-2.902426   Min.   :-3.634918   Min.   :-3.854502  
##  1st Qu.:-0.33699   1st Qu.:-0.318338   1st Qu.:-0.252692   1st Qu.:-0.278030  
##  Median :-0.01331   Median :-0.002482   Median : 0.002798   Median :-0.009852  
##  Mean   : 0.02278   Mean   : 0.025141   Mean   : 0.075240   Mean   : 0.014181  
##  3rd Qu.: 0.30528   3rd Qu.: 0.309444   3rd Qu.: 0.322220   3rd Qu.: 0.282426  
##  Max.   : 6.65491   Max.   : 6.422750   Max.   : 6.559013   Max.   : 6.213929

Linear model estimations and calculation of the Bayesian statistics

# Linear model estimation
fit <- lmFit(data)

# Bayesian statistics
limma.res <- eBayes(fit)

# Overview of the most differentially expressed genes
head(topTable(limma.res))

##                 logFC  AveExpr        t      P.Value    adj.P.Val        B
## CPAR2_404850 6.462651 6.462651 71.45643 3.788865e-10 2.093727e-06 13.59386
## CPAR2_503990 5.168192 5.168192 61.93992 9.047930e-10 2.499943e-06 13.02649
## CPAR2_502580 3.504953 3.504953 50.62005 3.091002e-09 4.333583e-06 12.10863
## CPAR2_807620 3.614666 3.614666 50.49767 3.136868e-09 4.333583e-06 12.09684
## CPAR2_401230 3.328905 3.328905 45.27100 6.097772e-09 5.982882e-06 11.54690
## CPAG_00607   3.396506 3.396506 43.79661 7.457963e-09 5.982882e-06 11.37366
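
Because lmFit is called without a design matrix, the four arrays are treated as replicates and an intercept-only model is fitted, so the reported logFC is the mean log ratio of each gene (which is why logFC and AveExpr coincide in the table above). The sketch below reproduces this mean and an ordinary one-sample t-test in base R for three genes copied from the data table printed earlier. It is not limma's computation: the moderated t-statistic of eBayes uses a variance shrunk across genes, so its t and p-values differ from these.

```r
# Log ratios of three genes, copied from the data table printed earlier
x <- rbind(
  CPAR2_202790 = c(0.973402061, 0.853915233,  0.808972712, 0.74969076),
  CPAR2_405520 = c(0.464588136, 0.509999239,  0.284349940, 0.44530769),
  CPAR2_101290 = c(0.056414092, 0.000296908, -0.068354697, 0.05983511)
)

n       <- ncol(x)
mean.lr <- rowMeans(x)                      # mean log ratio per gene
sd.lr   <- apply(x, 1, sd)                  # per-gene standard deviation
t.stat  <- mean.lr / (sd.lr / sqrt(n))      # ordinary one-sample t statistic
p.value <- 2 * pt(-abs(t.stat), df = n - 1) # two-sided p-value

round(cbind(mean.lr, t.stat, p.value), 4)
```

With only four replicates the per-gene standard deviation is poorly estimated, which is the motivation for borrowing information across genes as limma does.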

Identify the differentially expressed genes using the limma method.

Try different adjusted p-value thresholds: 5%, 1%, etc.
How many genes are induced in the hypoxic condition (without O2 > with O2)?
How many genes are repressed in the hypoxic condition (without O2 < with O2)?

Selection of differentially expressed genes (using an adjusted p-value threshold of 0.01)
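
The adjusted p-values used for this selection (column adj.P.Val) are, by default in topTable, obtained with the Benjamini-Hochberg procedure, which controls the false discovery rate. A minimal sketch on five hypothetical p-values shows what the adjustment does; the vector `p` and the helper variables are illustrative only.

```r
# Five hypothetical raw p-values
p <- c(0.0005, 0.003, 0.02, 0.03, 0.6)

# Benjamini-Hochberg adjustment with base R
p.adj <- p.adjust(p, method = "BH")
round(p.adj, 4)

# Number of tests selected at two different thresholds
sum(p.adj < 0.05)
sum(p.adj < 0.01)

# Same adjustment by hand: p * n / rank, then made monotone from the largest p
n      <- length(p)
o      <- order(p)
manual <- numeric(n)
manual[o] <- pmin(1, rev(cummin(rev(p[o] * n / seq_len(n)))))
all.equal(manual, p.adj)
```

Lowering the threshold from 5% to 1% reduces the number of selected tests, which is the trade-off explored in the questions above.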

allgenes.limma <- topTable(limma.res, number = nrow(data)) # Retrieve result table for all genes
siggenes.limma <- allgenes.limma[allgenes.limma$adj.P.Val < 0.01, ] # Filter on the adj.P.Val

# Columns are selected by name so that the test is really made on logFC
paste(sum(siggenes.limma$logFC > 0), "upregulated genes (logFC value > 0)")

## [1] "942 upregulated genes (logFC value > 0)"

paste(sum(siggenes.limma$logFC < 0), "downregulated genes (logFC value < 0)")

## [1] "725 downregulated genes (logFC value < 0)"

# Export DE gene table into your home directory:
write.table(siggenes.limma[siggenes.limma$logFC > 0, ],
            row.names = T, quote = F, sep = ";",
            file = "~/limma_up_signif_genes.csv")

write.table(siggenes.limma[siggenes.limma$logFC < 0, ],
            row.names = T, quote = F, sep = ";",
            file = "~/limma_low_signif_genes.csv")

Plot results of the differential analysis

Volcano plot
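
A volcano plot places the effect size (log fold change) on the x axis and the significance (-log10 of the p-value) on the y axis, so that genes of interest appear in the upper left and upper right corners. The sketch below shows the classification step on a simulated result table; the data frame `res`, its values and the 0.05 threshold are hypothetical, and only the column names follow the topTable output.

```r
set.seed(7)
# Simulated result table (hypothetical values, topTable-like column names)
res <- data.frame(logFC   = rnorm(500, sd = 1.2),
                  P.Value = runif(500)^3)
res$adj.P.Val <- p.adjust(res$P.Value, method = "BH")

logFCthreshold <- 1
adjPVthreshold <- 0.05

# Classify each gene from its fold change and adjusted p-value
status <- rep("not significant", nrow(res))
status[res$logFC >  logFCthreshold & res$adj.P.Val < adjPVthreshold] <- "up"
status[res$logFC < -logFCthreshold & res$adj.P.Val < adjPVthreshold] <- "down"
table(status)

# Volcano plot: effect size on x, significance on y
cols <- c("down" = "green", "not significant" = "grey", "up" = "red")
plot(res$logFC, -log10(res$P.Value), pch = 21, col = cols[status],
     xlab = "log2 fold change", ylab = "-log10(p-value)")
abline(v = c(-logFCthreshold, logFCthreshold), col = "red")
```

Computing the status once and reusing it for the colours keeps the selection and the plot consistent.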

logFCthreshold <- 1
adjPVthreshold <- 0.005

# Select up- and downregulated genes by column name instead of attach(),
# so that the logical index and the indexed table always have the same rows
up   <- allgenes.limma[allgenes.limma$logFC >  logFCthreshold & allgenes.limma$adj.P.Val < adjPVthreshold, ]
down <- allgenes.limma[allgenes.limma$logFC < -logFCthreshold & allgenes.limma$adj.P.Val < adjPVthreshold, ]

volcanoplot(limma.res, main = "Hypoxic vs normoxic", pch = 21)
abline(v = c(-logFCthreshold, logFCthreshold), col = "red")
# The y axis shows raw p-values, so this line is only an indicative cut-off
abline(h = -log10(adjPVthreshold), col = "red", lty = 2)
points(up$logFC,   -log10(up$P.Value),   col = "red")
points(down$logFC, -log10(down$P.Value), col = "green")

#legend("topleft", c("Genes with logFC > 1 in hypoxic vs normoxic", "Genes with logFC > 1 in normoxic vs hypoxic"), pch = 21, col = c("red", "green"), bty = "n", cex = .9)

4. Functional analysis of differentially expressed genes

Several tools are available online to evaluate the biological relevance of the gene sets you select after the differential analysis. For example, you can use the GoTermFinder tool (http://www.candidagenome.org/cgi-bin/GO/goTermFinder), dedicated to Candida yeast species, to retrieve functional annotation. You can also obtain information on a specific gene using the Candida Genome Database.

What are the functions of the genes in your lists (identified at the previous step)?

Are they consistent with the biological system studied by Guida et al. in their publication?

Reproducibility

#Date
format(Sys.time(), "%d %B, %Y, %H,%M")

## [1] "20 August, 2021, 12,17"

#Packages used
sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.3/lib/libopenblasp-r0.3.10.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] marray_1.68.0 limma_3.46.0 
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.27     R6_2.5.0          jsonlite_1.7.2    magrittr_2.0.1   
##  [5] evaluate_0.14     highr_0.9         rlang_0.4.11      stringi_1.6.2    
##  [9] jquerylib_0.1.4   bslib_0.2.5.1     rmarkdown_2.8     tools_4.0.3      
## [13] stringr_1.4.0     xfun_0.23         yaml_2.2.1        compiler_4.0.3   
## [17] htmltools_0.5.1.1 knitr_1.33        sass_0.4.0

Analysis of microarray data

Stéphane Le Crom, Matthieu Moreau & Gaëlle Lelandais

August 2021
