SARS-CoV-2 Genome Assembly from Illumina Reads¶

Welcome to the documentation for the nf-illumina2lineage pipeline.

Overview¶

This repository provides a complete pipeline for assembling and analyzing SARS-CoV-2 genomes from Illumina paired-end sequencing data. It covers quality control, read mapping, primer clipping, variant calling, consensus sequence generation, lineage annotation, and phylogenetic analysis.

Key Features¶

Automated environment setup using mamba and conda
Comprehensive quality control with fastqc and fastp
Mapping and visualization using minimap2, samtools, and IGV
Primer sequence clipping for clean alignments
Variant calling with freebayes
Consensus sequence generation and lineage assignment with pangolin
Multiple sequence alignment with mafft and phylogenetic analysis with iqtree
Clear documentation and modular structure

System Requirements¶

Operating System: Linux (tested on Fedora 38)
Processor: Intel i5 or higher, with multithreading support
Memory: Minimum 8 GB
Software: A conda distribution (Anaconda, Miniconda, or Mambaforge) and the bioinformatics tools listed under Dependencies

Dependencies¶

The pipeline requires the following tools, managed via mamba:

QC: fastqc, fastp, multiqc
Mapping: minimap2, samtools, bamclipper
Variant Calling: freebayes, vcftools, bcftools
Sequence Analysis: mafft, iqtree, pangolin
Visualization: IGV
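
Before running the pipeline, it can help to confirm that the tools above are actually on the PATH of the active environment. The snippet below is a minimal sketch and not part of the repository: the helper name `missing_tools` is hypothetical, and the executable names (for example `bamclipper.sh`) are assumptions that may differ depending on how each package was installed. IGV is left out because it is a desktop application.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every tool from the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions derived from the dependency list above.
TOOLS="fastqc fastp multiqc minimap2 samtools bamclipper.sh freebayes vcftools bcftools mafft iqtree pangolin"

# $TOOLS is intentionally unquoted so that it splits into one argument per tool.
missing=$(missing_tools $TOOLS)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    echo "All listed tools were found."
fi
```

If anything is reported as missing, re-activating the environment or re-creating it from environment.yaml is usually the first thing to try.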

Installation¶

Clone the repository:

```bash
git clone https://github.com/bibymaths/nf-illumina2lineage.git
cd nf-illumina2lineage
```

Install mamba and create the environment:

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh"
bash Mambaforge-Linux-x86_64.sh
conda update -y conda
mamba env create -p ./envs/projectSARS --file environment.yaml
mamba activate ./envs/projectSARS
```

Pipeline Workflow¶

Environment Setup: Install dependencies and configure the environment.
Data Preparation: Download input datasets and reference genomes.
Quality Control: Evaluate and preprocess raw sequencing reads.
Mapping: Align reads to the reference genome.
Primer Clipping: Remove primer sequences from alignments.
Variant Calling: Identify variants in the genome.
Variant Normalization: Normalize variants and handle multiallelic sites using bcftools norm.
Consensus Generation: Generate consensus sequences from filtered variants.
Lineage Annotation: Assign SARS-CoV-2 lineages using pangolin.
Phylogenetic Analysis: Perform multiple sequence alignment and build phylogenetic trees.
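
The steps above can be sketched as a sequence of shell commands. This is an illustrative outline only, not the repository's actual scripts: the function names (`run`, `sketch_sample`, `sketch_tree`), all file paths, the primer BEDPE file, the haploid ploidy setting, and the `QUAL>=20` filter are assumptions chosen for the example. It defaults to a dry run that only prints each command, so it can be read and adapted without the tools installed. The fastqc and multiqc reports are omitted for brevity.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the workflow; every path and threshold here is hypothetical.
THREADS=${THREADS:-4}
REF=${REF:-reference/NC_045512.2.fasta}
PRIMERS=${PRIMERS:-primers.bedpe}   # assumed primer-pair file for bamclipper
OUT=${OUT:-results}

# Print the command when DRY_RUN=1 (the default); otherwise execute it.
# eval keeps pipes and redirections intact but does not cope with spaces in paths.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        eval "$*"
    fi
}

# Per-sample steps: arguments are sample name, mate-1 FASTQ, mate-2 FASTQ.
sketch_sample() {
    local s=$1 r1=$2 r2=$3
    # Quality control: trim reads and write HTML/JSON reports
    run "fastp -i $r1 -I $r2 -o $OUT/$s.R1.fq.gz -O $OUT/$s.R2.fq.gz --html $OUT/$s.fastp.html --json $OUT/$s.fastp.json --thread $THREADS"
    # Mapping: short-read preset, then coordinate-sort and index
    run "minimap2 -ax sr -t $THREADS $REF $OUT/$s.R1.fq.gz $OUT/$s.R2.fq.gz | samtools sort -@ $THREADS -o $OUT/$s.bam -"
    run "samtools index $OUT/$s.bam"
    # Primer clipping: bamclipper is assumed to write <name>.primerclipped.bam
    # into the current working directory
    run "bamclipper.sh -b $OUT/$s.bam -p $PRIMERS -n $THREADS"
    # Variant calling, treating the viral genome as haploid
    run "freebayes -f $REF --ploidy 1 $s.primerclipped.bam > $OUT/$s.raw.vcf"
    # Normalization: left-align indels and split multiallelic sites
    run "bcftools norm -f $REF -m -any -Oz -o $OUT/$s.norm.vcf.gz $OUT/$s.raw.vcf"
    # Filtering (threshold is an assumption), indexing, and consensus generation
    run "bcftools view -i 'QUAL>=20' -Oz -o $OUT/$s.filt.vcf.gz $OUT/$s.norm.vcf.gz"
    run "bcftools index $OUT/$s.filt.vcf.gz"
    run "bcftools consensus -f $REF -o $OUT/$s.consensus.fasta $OUT/$s.filt.vcf.gz"
    # Lineage annotation
    run "pangolin $OUT/$s.consensus.fasta --outdir $OUT --outfile $s.lineage.csv"
}

# Phylogeny: argument is a multi-FASTA holding several consensus sequences.
sketch_tree() {
    run "mafft --auto --thread $THREADS $1 > $OUT/aligned.fasta"
    run "iqtree -s $OUT/aligned.fasta -nt AUTO"
}

sketch_sample sampleA reads/sampleA_R1.fastq.gz reads/sampleA_R2.fastq.gz
sketch_tree "$OUT/all_consensus.fasta"
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, provided the tools, the reference, and the output directory exist.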

Input Data¶

Illumina paired-end sequencing data
SARS-CoV-2 reference genome (NCBI accession: NC_045512.2)
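
One possible way to obtain the reference and to pair up read files is sketched below. Both helpers are hypothetical and not taken from the repository: `fetch_reference` downloads a FASTA record through the NCBI E-utilities `efetch` endpoint, and `mate2_path` assumes the read files follow an `_R1`/`_R2` naming scheme, which may not match your data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: download a nucleotide record as FASTA from NCBI E-utilities.
# Arguments: accession (default NC_045512.2) and output file (default reference.fasta).
fetch_reference() {
    local acc=${1:-NC_045512.2} out=${2:-reference.fasta}
    curl -fsSL -o "$out" \
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=text"
}

# Hypothetical helper: derive the mate-2 file name from the mate-1 file name,
# assuming "_R1" is followed by "." or "_" (e.g. sample_R1.fastq.gz).
mate2_path() {
    printf '%s\n' "$1" | sed 's/_R1\([._]\)/_R2\1/'
}

# Example usage (requires network access and samtools, so it is left commented out):
#   fetch_reference NC_045512.2 reference/NC_045512.2.fasta
#   samtools faidx reference/NC_045512.2.fasta
#   r2=$(mate2_path reads/sampleA_R1.fastq.gz)
```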

Output¶

Quality control reports (.html, .json)
Read alignments in BAM format and variant calls in VCF format
Consensus sequences in FASTA format
Lineage annotations
Phylogenetic trees and visualizations

Usage¶

Edit the config.sh file to specify input data paths and parameters.
Run the pipeline:

```bash
bash scripts/run_pipeline.sh
```

View results in the results/ directory.

Pipeline Diagram

File Description

License¶

This project is licensed under the MIT License. See the LICENSE file for details.

Contact¶

For questions or issues, please contact:

Abhinav Mishra
Email: mishraabhinav36@gmail.com