Practical 2: Analysis of RNA-seq data

Introduction
Set up your working environment
Sequence quality controls
Mapping of reads on the reference genome
Post-processing of alignment files
Alignments visualization using a genome browser
Gene counts estimation
Search for differentially expressed genes

Introduction

Objective of this practical session

During this practical session, you will learn:

To conduct an analysis of RNA-seq raw fastq files. You will first perform quality control of the sequencer file outputs before proceding to the mapping of the reads onto the reference genome.
To perform statistical analysis of the gene expression matrices in order to identify differentialy expressed genes between two conditions.

Dataset

Data used in these practical were collected from the following publication: Guida, A., Lindstädt, C., Maguire, S. L., Ding, C., Higgins, D. G., Corton, N. J., Berriman, M., et al. (2011). Using RNA-seq to determine the transcriptional landscape and the hypoxic response of the pathogenic yeast Candida parapsilosis. Guida et al. BMC Genomics 2011

The raw sequencing files located into the projet data folder of the IFB-core Cluster : /shared/projects/ens_hts_2021/ where retrieved from the EBI European Electronic Archive using the study accession number PRJNA154483.

Set up your working environment

Connect to the IFB-core server. Look at the tutorial to see how to proceed. You can also look at the useful commands to work on the IFB-core Cluster page.
Go to your home directory
```
 cd 
```
Create a directory that will contain all results of the upcoming analyses
```
 mkdir RNAseq
```
Go to the newly created directory
```
 cd RNAseq
```
Check your are in the right directory using the pwd command:
```
 pwd
 /shared/home/<your login>/RNAseq
```

Sequence quality controls

Fastq files are raw results from RNA-seq experiments. These files comprise all the sequences (or reads) obtained from the sequencer device (Illumina technology in this practical), together with base quality scores (PHRED scores).

Two different files will be analyzed during this practical from Guida et al. publication:

O2rep2_SRR352263.fastq refereed to a transcriptome analysis of yeasts C. parapsilosis under normoxic condition (ENA access number: SRR352263)
noO2rep3_SRR352271.fastq refereed to a transcriptome analysis of yeasts C. parapsilosis under hypoxic condition (ENA access number: SRR352271)

In a first step, controls will be performed on each FASTQ files in order to evaluate the overall quality of the sequences and identify potential problems that could impact the following analyses. During this practical we will use the FastQC software solution.

FastQC is a quality control application for high throughput sequence data. It provides an interactive application to review the results of several different quality control checks, and create an HTML based report. The main functions of FastQC are:

Import of data from BAM, SAM or FastQ files (any variant)
Providing a quick overview to tell you in which areas there may be problems
Summary graphs and tables to quickly assess your data
Export of results to an HTML based report
Offline operation mode to allow automated generation of reports without running the interactive application

What you have to do:

Use FASTQC to evaluate the quality of sequences in each FASTQ files. Using information from the Fastqc help page as well as exemples of good or bad illumina data as references.

Compare results between the two FASTQ files. Is there any concern related to the following analyses?

Create a new directory to store the output of fastqc

 mkdir 1-QC
 # Using the `tree` command, your directory should look like this:
 tree
 │
 └───1-QC

Go to this directory
```
 cd 1-QC
```
Make fastqc available in your environment
```
 module load fastqc/0.11.9
```
Check the help page of the programme to see its usage and parameters
```
 srun fastqc --help
```

Run fastqc on each experiment files

 # O2 condition
 # Absolute path to the file: /shared/projects/ens_hts_2021/data/rnaseq/O2rep2_SRR352263.fastq
 # -o option creates all output files in the specified output directory, '.' means current directory

 srun fastqc /shared/projects/ens_hts_2021/data/rnaseq/O2rep2_SRR352263.fastq -o .

 # noO2 condition
 srun fastqc /shared/projects/ens_hts_2021/data/rnaseq/noO2rep3_SRR352271.fastq -o .

At this point you should see the new files in your directory using the tree command

 tree
 │
 └───1-QC
     └─── O2rep2_SRR352263.fastqc.html
     └─── O2rep2_SRR352263.fastqc.zip
     └─── noO2rep3_SRR352271.fastqc.html
     └─── noO2rep3_SRR352271.fastqc.zip

Download the HTML file reports on your local machine

 ## OPEN A NEW TERMINAL
 # Create a directory where to put generated files on your computer
 mkdir ~/Desktop/RNAseq/

 # Go to this directory
 cd ~/Desktop/RNAseq/

 # Download html report files from IFB server
 scp <login>@core.cluster.france-bioinformatique.fr:~/RNAseq/1-QC/*.html .
 # Enter your password 

Open the .html report with your internet browser

Mapping of reads on the reference genome

Once data quality is verified, reads will be mapped onto the reference genome of yeast C. parapsilosis. The genome sequence of C. parapsilosis and its annotations (locations of ORFs) were retrieved from the Candidat genome database (CGD).

Different aligner and algorithms for RNA-seq analysis exist. We will use Bowtie 1.2 an ultrafast (memory-efficient) short read aligner. As an input, Bowtie uses a Fastq file (with reads to be aligned) and “pre-built indexes” of the reference genome. These indexes are named “C_parapsilosis.1.ebwt”, “C_parapsilosis.2.ebwt”, etc. They will allow boosting the alignment process.
As an output, Bowtie provides a SAM file. SAM (Sequence Alignment/Map) is a generic format for storing large nucleotide sequence alignments.

What you have to do:

Run sequence alignments with Bowtie using the two Fastq files

While Bowtie is running, take a look at Bowtie documentation that describes the options (particularly the -m)

Look at the alignment statistic outputs (file .out)

What is the proportion of reads aligned on the reference genome?

Is there any concern to take into account for the following analyses?

Create a new directory to store the output of bowtie

 #Go back to the parental directory "RNAseq"
 cd ../

 #Create a new directory to store results of the alignment
 mkdir 2-Mapping

Your directory should now look like this :

 tree
 │
 └───1-QC
     └─── O2rep2_SRR352263.fastqc.html
     └─── O2rep2_SRR352263.fastqc.zip
     └─── noO2rep3_SRR352271.fastqc.html
     └─── noO2rep3_SRR352271.fastqc.zip
 └─── 2-Mapping

Go to the newly created directory
```
 cd 2-Mapping
```
Load Bowtie into your environment
```
 module load bowtie/1.2.2
```

Map the reads to the reference genome

 ## We will use the following options:
 # "-S" will output the result in SAM format
 # "/shared/projects/ens_hts_2021/data/rnaseq/bowtie_indexes/C_parapsilosis" specify the location and the "prefix (C_parapsilosis)"" of the bowtie's index files
 # "/shared/projects/ens_hts_2021/data/rnaseq/Fastqc/O2rep2_SRR352263.fastq.gz" location of the input fastq
 # "2>" will save in a file some statistic about the aligment (#of reads mapped, etc...)
 # "> redirects the mapping output into a .sam file

 # Map the aerobic condition reads
 srun bowtie -S /shared/projects/ens_hts_2021/data/rnaseq/bowtie_indexes/C_parapsilosis /shared/projects/ens_hts_2021/data/rnaseq/O2rep2_SRR352263.fastq 2> O2rep2_SRR352263_bowtie_mapping.out > O2rep2_SRR352263_bowtie_mapping.sam

 # Map the hypoxic condition reads
 srun bowtie -S /shared/projects/ens_hts_2021/data/rnaseq/bowtie_indexes/C_parapsilosis /shared/projects/ens_hts_2021/data/rnaseq/noO2rep3_SRR352271.fastq 2> noO2rep3_SRR352271_bowtie_mapping.out > noO2rep3_SRR352271_bowtie_mapping.sam

Your directory should now look like this :

 tree
 .
 ├── 1-QC
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 └── 2-Mapping
     ├── noO2rep3_SRR352271_bowtie_mapping.out
     ├── noO2rep3_SRR352271_bowtie_mapping.sam
     ├── O2rep2_SRR352263_bowtie_mapping.out
     └── O2rep2_SRR352263_bowtie_mapping.sam

Post-processing of alignment files

In order to facilitate alignement manipulation, SAM files have to be converted into BAM files (a binary version) and alignements “sorted” according to their localisation on the genome and files indexed in order to speed up their access. We will use the Samtools suite to perform these steps.

What you have to do:

Convert SAM files into BAM files

Sort and index BAM files

Sort and convert .sam into .bam files

 ## "samtools sort" sort alignments by genomic coordinates
 # "|" "pipe" the output of samtools sort to the next program
 ## "samtools view" will convert sam into bam
 # option "-b" specify the output to be in BAM format
 # ">"" write the output in the bam file

 module load samtools/1.9

 # Sort and convert O2 condition
 srun samtools sort O2rep2_SRR352263_bowtie_mapping.sam | srun samtools view -b  > O2rep2_SRR352263_bowtie_sorted.bam

 # Sort and convert noO2 condition
 srun samtools sort noO2rep3_SRR352271_bowtie_mapping.sam | srun samtools view -b  > noO2rep3_SRR352271_bowtie_sorted.bam

Create indexes for the bam files

The index of a bam file is name .bam.bai

 # Index the O2 condition
 srun samtools index O2rep2_SRR352263_bowtie_sorted.bam

 # Index the noO2 condition
 srun samtools index noO2rep3_SRR352271_bowtie_sorted.bam

Your directory should now look like this :

 tree
 .
 ├── 1-QC
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 └── 2-Mapping
     ├── noO2rep3_SRR352271_bowtie_mapping.out
     ├── noO2rep3_SRR352271_bowtie_mapping.sam
     ├── noO2rep3_SRR352271_bowtie_sorted.bam
     ├── noO2rep3_SRR352271_bowtie_sorted.bam.bai
     ├── O2rep2_SRR352263_bowtie_mapping.out
     ├── O2rep2_SRR352263_bowtie_mapping.sam
     ├── O2rep2_SRR352263_bowtie_sorted.bam
     └── O2rep2_SRR352263_bowtie_sorted.bam.bai

Alignments visualization using a genome browser

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large genomic datasets. It supports a wide variety of data types, including array-based, high throughput sequence data and genomic annotations. In this practical, we will use IGV to visualize mapping results.

Download the necessary files on your computer

 # To download files from the cluster to your current directory (on your own computer), **open a new shell and run**

 # First bam and bai alignement files
 scp  <your login>@core.cluster.france-bioinformatique.fr:~/RNAseq/2-Mapping/*.bam* .

 # Next the reference genome sequence and gene annotation files
 scp  <your login>@core.cluster.france-bioinformatique.fr:/shared/projects/ens_hts_2021/data/rnaseq/C_parapsilosis_CGD.fasta .
 scp  <your login>@core.cluster.france-bioinformatique.fr:/shared/projects/ens_hts_2021/data/rnaseq/C_parapsilosis_ORFs.gff .

Visualize mapping results with IGV

On linux type “igv” in a terminal window to launch the program

You can also download IGV and follow the instalation instructions according to your OS

Once the IGV program is launched, it is necessary to import the reference genome “Genomes/Create .genome File…” (see below). Select the FASTA file with the genomic sequence of C. parapsilosis “C_parapsilosis_CGD.fasta” (“Browse / FASTA file”) and enter information regarding ORFs positions, GFF file “C_parapsilosis_ORFs.gff” (“Browse Gene file”).
Finally, give a name to your genome (“Unique identifier” and “Descriptive name”) and press “OK”. Save the genome file in your home.

Warning! In order to IGV to create an index of your genome, you need to copy the reference genome FASTA file in writable directory.

C. parapsilosis genome is now loaded into IGV and can be selected from the top/left menu (see 1 below). The genomic sequence can be therefore explored, choosing for instance, a particular chromosome (see 2 below) or a genomic region (see 3). Note that gene annotations (ORF positions) are shown at the bottom of the window (see 4, blue lines) and you can obtain a more detailed view of the sequence using the cursor located on the top/right of the window, see 5).

Mapping results (“.sorted.bam” files) can now be imported (“File / Load from File”). Zoom in genomic regions in order to visualize read alignements.

What you have to do:

Compare your results with those presented in the original publication

Did the authors use stranded-specific protocols?

Can you observe differences between hypoxic and normoxic conditions?

Gene counts estimation

To identify genes whose expression is different between hypoxic and normoxic conditions, we will count and compare the number of reads mapped to each ORF. A program available in the Bedtools suite will be used for this purpose.

What you have to do:

Calculate for each ORF the number of reads that were aligned (normoxic and hypoxic conditions).

Create a new directory to store the ORF count matrix

 # Go to the parental directory "RNAseq"
 cd ../

 # Create a new directory to store results of the alignment
 mkdir 3-Counts

Your directory should now look like this :

 tree
 .
 ├── 1-QC
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 ├── 2-Mapping
 │   ├── noO2rep3_SRR352271_bowtie_mapping.out
 │   ├── noO2rep3_SRR352271_bowtie_mapping.sam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam.bai
 │   ├── O2rep2_SRR352263_bowtie_mapping.out
 │   ├── O2rep2_SRR352263_bowtie_mapping.sam
 │   ├── O2rep2_SRR352263_bowtie_sorted.bam
 │   └── O2rep2_SRR352263_bowtie_sorted.bam.bai
 └── 3-Counts

Go to the newly created directory
```
 cd  3-Counts
```

Calculate for each ORF the number of reads that were aligned to it

 module load bedtools/2.27.1

 # Counting matrix for the O2 condition
 srun bedtools multicov -bams ../2-Mapping/O2rep2_SRR352263_bowtie_sorted.bam -bed /shared/projects/ens_hts_2021/data/rnaseq/C_parapsilosis_ORFs.gff > O2rep2_SRR352263_gene_counts.gff
 # Output formating
 srun sed 's/^.*ID=//' O2rep2_SRR352263_gene_counts.gff > O2rep2_SRR352263_gene_counts.tab

 # Counting matrix for the noO2 condition
 srun bedtools multicov -bams ../2-Mapping/noO2rep3_SRR352271_bowtie_sorted.bam -bed /shared/projects/ens_hts_2021/data/rnaseq/C_parapsilosis_ORFs.gff > noO2rep3_SRR352271_gene_counts.gff
 # Output formating
 srun sed 's/^.*ID=//' noO2rep3_SRR352271_gene_counts.gff > noO2rep3_SRR352271_gene_counts.tab

Take a look at the final counting matrices to see how the files are organised.

At the end of RNA-seq data analysis your directory should look like this :

 tree
 .
 ├── 1-QC
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 ├── 2-Mapping
 │   ├── noO2rep3_SRR352271_bowtie_mapping.out
 │   ├── noO2rep3_SRR352271_bowtie_mapping.sam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam.bai
 │   ├── O2rep2_SRR352263_bowtie_mapping.out
 │   ├── O2rep2_SRR352263_bowtie_mapping.sam
 │   ├── O2rep2_SRR352263_bowtie_sorted.bam
 │   └── O2rep2_SRR352263_bowtie_sorted.bam.bai
 └── 3-Counts
     ├── noO2rep3_SRR352271_gene_counts.gff
     ├── noO2rep3_SRR352271_gene_counts.tab
     ├── O2rep2_SRR352263_gene_counts.gff
     └── O2rep2_SRR352263_gene_counts.tab

Unload the tools you used

 module unload fastqc/0.11.9 bowtie/1.2.2 samtools/1.9  bedtools/2.27.1

Search for differentially expressed genes

In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). Results obtained for all experiments are combined in the file “rnaseqcount_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 (Love et al. 2014) method.

The DESeq package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.

What you have to do:

Search for differentially expressed genes using DESeq R package

How many genes are selected with different adjusted p-value thresholds (5%, 1%, etc.)?

Check your results with IGV and use GOtermFinder (see practical on microarrays) to analyse the function of the selected genes

Connect to Rstudio server of the IFB Look at the tutorial on how to connect to IFB-core Rstudio server to see how to proceed.
Save the working notebook in your personal environment
- In File > Open File… enter the path /shared/projects/ens_hts_2021/data/rnaseq/DEseq2.Rmd to open the notebook containing all the code needed for the practical
- Save it into your personal folder using File > Save As
Follow the instruction of the notebook to conduct the analysis. You can also visualize the final report version.

You can find help on how to use R markdown on the R markdown project webpage.