Process Details

This section describes each Nextflow process in the pipeline, including its inputs, outputs, and core functionality.


downloadData

Purpose: Download Illumina sequencing data archive and extract contents.

- Input: None
- Output: Raw .fastq.gz files


referenceGenome

Purpose: Download SARS-CoV-2 reference genome (NC_045512.2) using NCBI Entrez Direct.

- Input: None
- Output: reference.fasta


qc

Purpose: Quality control and cleaning of raw sequencing reads.

- Input: Paired-end FASTQ files
- Output:
  - Cleaned FASTQ files (pair*.R1/2.clean.fastq.gz)
  - FastQC reports (.html, .zip)
  - Fastp reports (.json, .html)
  - MultiQC summary
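
To check how many reads survive cleaning, the fastp JSON report can be inspected directly. The following is a minimal sketch, assuming the report follows the usual fastp layout with `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`. The function name `retained_fraction` and the example numbers are hypothetical and are not taken from the pipeline.

```python
import json


def retained_fraction(report: dict) -> float:
    """Return the fraction of reads kept by fastp (after / before filtering)."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    # Guard against an empty input file to avoid dividing by zero.
    return after / before if before else 0.0


# Illustrative report fragment (made-up numbers), parsed as a fastp JSON file would be.
example_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
fraction = retained_fraction(example_report)
print(f"{fraction:.1%} of reads retained")
```

In a real run, the dictionary would come from `json.load` on one of the Fastp .json reports listed above.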


mapping

Purpose: Align reads to reference genome and produce sorted, indexed BAM files.

- Input: Cleaned FASTQ files, reference.fasta
- Output: Sorted and indexed BAM files (.sorted.bam, .bai)


primerClipping

Purpose: Clip primer sequences using a CleanPlex BEDPE file and bamclipper.

- Input: Sorted BAM files
- Output: Primer-clipped BAM files (.primerclipped.bam)


variantCalling

Purpose: Call variants from aligned BAM files using freebayes.

- Input: Primer-clipped BAM files, reference.fasta
- Output: Raw VCF files (freebayes-illumina*.vcf)
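
Because the VCF files are raw, a downstream reader may want to look only at well-supported calls. The sketch below reads VCF data lines in the standard column order and keeps records at or above a quality threshold. The `Variant` type, the `parse_vcf` helper, the threshold of 20, and the example records are all assumptions made for illustration; the pipeline's own filtering, if any, is not described in this section.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int      # 1-based position, as in the VCF specification
    ref: str
    alt: str
    qual: float


def parse_vcf(lines: Iterable[str], min_qual: float = 20.0) -> Iterator[Variant]:
    """Yield variants whose QUAL is at least min_qual (hypothetical threshold)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        if qual == ".":
            continue  # missing quality cannot be compared to the threshold
        if float(qual) >= min_qual:
            yield Variant(chrom, int(pos), ref, alt, float(qual))


# Made-up records in VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t241\t.\tC\tT\t512.3\t.\t.",
    "NC_045512.2\t3037\t.\tC\tT\t4.1\t.\t.",
]
kept = list(parse_vcf(example_vcf))
print(kept)
```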


consensusGeneration

Purpose: Generate consensus sequences using bcftools from VCFs.

- Input: VCF files, reference.fasta
- Output: FASTA files for consensus sequences (consensus-*.fasta, consensus-seqs.fasta)
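
Conceptually, a consensus sequence is the reference with the called variants substituted in. The toy sketch below shows that idea for single-base substitutions only. It is not how bcftools is implemented, and it ignores insertions, deletions, and any masking; the function name `apply_snvs` and the example sequence are hypothetical.

```python
def apply_snvs(reference: str, snvs: list[tuple[int, str, str]]) -> str:
    """Apply (pos, ref, alt) single-nucleotide variants; pos is 1-based."""
    bases = list(reference)
    for pos, ref, alt in snvs:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only single-base substitutions are handled: {pos}")
        if bases[pos - 1] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


# Toy reference and variants (made up for illustration).
toy_reference = "ACGTACGT"
toy_consensus = apply_snvs(toy_reference, [(2, "C", "T"), (8, "T", "A")])
print(toy_consensus)
```

The REF check mirrors a common sanity test: a variant whose reference base does not match the genome usually means the VCF and the FASTA come from different references.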


pangolinLineage

Purpose: Assign SARS-CoV-2 lineages using pangolin.

- Input: Combined consensus FASTA file
- Output: lineage_report.csv, optional TSV and summary files
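
Once lineage_report.csv exists, a quick tally of assigned lineages is often the first thing to look at. The sketch below assumes the report has `taxon` and `lineage` columns, as pangolin reports normally do; the helper name `count_lineages` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_lineages(handle) -> Counter:
    """Count how many sequences were assigned to each lineage."""
    reader = csv.DictReader(handle)
    return Counter(row["lineage"] for row in reader)


# Made-up report content; a real run would open lineage_report.csv instead.
example_csv = io.StringIO(
    "taxon,lineage\n"
    "sample1,B.1.1.7\n"
    "sample2,B.1.1.7\n"
    "sample3,B.1.617.2\n"
)
counts = count_lineages(example_csv)
for lineage, n in counts.most_common():
    print(lineage, n)
```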


consensusQC

Purpose: Assess consensus quality using president.

- Input: Consensus FASTA, reference.fasta
- Output: QC summary in output/ folder


phylogeny

Purpose: Perform multiple sequence alignment and build phylogenetic tree.

- Input: Consensus FASTA file
- Output: alignment.fasta, IQ-TREE outputs (e.g., .treefile, .log)

For a visual overview, see the Workflow section.
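
The Input and Output lists above also imply an execution order. The sketch below encodes that reading as a dependency map and derives one valid order with Python's standard `graphlib` module. The edges are an interpretation of this section (for example, that qc consumes the files from downloadData), not the pipeline's actual channel wiring.

```python
from graphlib import TopologicalSorter

# Each process maps to the processes whose outputs it consumes, as read from
# the Input/Output lists in this section (an assumption, not the real wiring).
dependencies = {
    "downloadData": set(),
    "referenceGenome": set(),
    "qc": {"downloadData"},
    "mapping": {"qc", "referenceGenome"},
    "primerClipping": {"mapping"},
    "variantCalling": {"primerClipping", "referenceGenome"},
    "consensusGeneration": {"variantCalling", "referenceGenome"},
    "pangolinLineage": {"consensusGeneration"},
    "consensusQC": {"consensusGeneration", "referenceGenome"},
    "phylogeny": {"consensusGeneration"},
}

# static_order() yields every process after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Processes with no dependency between them, such as pangolinLineage, consensusQC, and phylogeny, can appear in any relative order, which matches how Nextflow is free to run independent processes in parallel.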