Process Details
This section describes each Nextflow process in the pipeline, including its inputs, outputs, and core functionality.
downloadData
Purpose: Download the Illumina sequencing data archive and extract its contents.
- Input: None
- Output: Raw .fastq.gz files
referenceGenome
Purpose: Download the SARS-CoV-2 reference genome (NC_045512.2) using NCBI Entrez Direct.
- Input: None
- Output: reference.fasta
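
As a rough illustration of this step, the sketch below wraps an Entrez Direct `efetch` call in Python. The function names and the default output path are hypothetical, and the pipeline's actual command may differ. It assumes the `efetch` executable from Entrez Direct is available on `PATH`.

```python
import shutil
import subprocess


def efetch_command(accession: str) -> list[str]:
    """Build an Entrez Direct command that prints a nucleotide record as FASTA."""
    return ["efetch", "-db", "nucleotide", "-id", accession, "-format", "fasta"]


def download_reference(accession: str = "NC_045512.2",
                       out_path: str = "reference.fasta") -> None:
    """Write the FASTA record for `accession` to `out_path` (hypothetical helper)."""
    if shutil.which("efetch") is None:
        raise RuntimeError("efetch (Entrez Direct) was not found on PATH")
    with open(out_path, "w") as handle:
        # check=True raises CalledProcessError if efetch exits with an error.
        subprocess.run(efetch_command(accession), stdout=handle, check=True)
```

Calling `download_reference()` with its defaults would write `reference.fasta` to the current directory, provided Entrez Direct is installed and the network is reachable.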
qc
Purpose: Quality control and cleaning of raw sequencing reads.
- Input: Paired-end FASTQ files
- Output:
  - Cleaned FASTQ files (pair*.R1/2.clean.fastq.gz)
  - FastQC reports (.html, .zip)
  - Fastp reports (.json, .html)
  - MultiQC summary
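
The fastp JSON report listed above can be inspected programmatically. The following is a minimal sketch, assuming the report exposes `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` as fastp reports normally do. The helper name and the example numbers are made up for illustration.

```python
import json


def read_retention(report: dict) -> float:
    """Return the fraction of reads that remain after fastp filtering."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return after / before if before else 0.0


# Minimal stand-in for a fastp JSON report; the numbers are made up.
example_report = json.loads("""
{"summary": {"before_filtering": {"total_reads": 1000},
             "after_filtering": {"total_reads": 950}}}
""")
retention = read_retention(example_report)
print(f"{retention:.1%} of reads kept")
```

For a real report, the dictionary would come from `json.load()` on the `.json` file that fastp wrote for the sample.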
mapping
Purpose: Align reads to the reference genome and produce sorted, indexed BAM files.
- Input: Cleaned FASTQ files, reference.fasta
- Output: Sorted and indexed BAM files (.sorted.bam, .bai)
primerClipping
Purpose: Clip primer sequences using a CleanPlex BEDPE file and bamclipper.
- Input: Sorted BAM files
- Output: Primer-clipped BAM files (.primerclipped.bam)
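
To make the BEDPE input more concrete, here is a small sketch that reads primer pairs from BEDPE text. It relies only on the standard BEDPE layout, where the first six tab-separated columns are `chrom1`, `start1`, `end1`, `chrom2`, `start2`, `end2`. The `PrimerPair` type and the coordinates are hypothetical and are not taken from the real CleanPlex panel. This does not reproduce what bamclipper does internally.

```python
from typing import NamedTuple


class PrimerPair(NamedTuple):
    chrom: str
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def parse_bedpe(text: str) -> list[PrimerPair]:
    """Parse the first six BEDPE columns of each non-empty, non-comment line."""
    pairs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        chrom1, start1, end1, _chrom2, start2, end2 = line.split("\t")[:6]
        pairs.append(
            PrimerPair(chrom1, int(start1), int(end1), int(start2), int(end2))
        )
    return pairs


# Hypothetical coordinates for a single amplicon.
example = "NC_045512.2\t100\t125\tNC_045512.2\t300\t325\n"
pairs = parse_bedpe(example)

# Region between the two primers, i.e. the part that survives clipping.
insert_length = pairs[0].right_start - pairs[0].left_end
```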
variantCalling
Purpose: Call variants from aligned BAM files using freebayes.
- Input: Primer-clipped BAM files, reference.fasta
- Output: Raw VCF files (freebayes-illumina*.vcf)
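
The raw VCF files are plain text and can be read without extra libraries. The sketch below pulls the fixed `CHROM`, `POS`, `REF`, `ALT`, and `QUAL` columns out of VCF text. The example records and the quality threshold of 20 are illustrative assumptions, not values used by the pipeline.

```python
from typing import Optional

Record = tuple[str, int, str, str, Optional[float]]


def parse_vcf(text: str) -> list[Record]:
    """Return (chrom, pos, ref, alt, qual) per record, skipping header lines."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        # A missing QUAL value is written as "." in VCF.
        quality = None if qual == "." else float(qual)
        records.append((chrom, int(pos), ref, alt, quality))
    return records


# Illustrative records with made-up quality values.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "NC_045512.2\t241\t.\tC\tT\t500.0\t.\t.\n"
    "NC_045512.2\t23403\t.\tA\tG\t12.5\t.\t.\n"
)

# Hypothetical filter: keep calls with QUAL of at least 20.
high_quality = [
    record for record in parse_vcf(example_vcf)
    if record[4] is not None and record[4] >= 20
]
```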
consensusGeneration
Purpose: Generate consensus sequences from the VCF files using bcftools.
- Input: VCF files, reference.fasta
- Output: FASTA files for consensus sequences (consensus-*.fasta, consensus-seqs.fasta)
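
Conceptually, building a consensus means substituting each called ALT allele into the reference. The toy sketch below shows that idea on a ten-base sequence. It is a simplified stand-in and not the bcftools implementation: it assumes non-overlapping variants, a single ALT allele per site, and 1-based positions as in VCF.

```python
def apply_variants(reference: str, variants: list[tuple[int, str, str]]) -> str:
    """Apply (1-based pos, ref, alt) variants to a reference sequence.

    Variants are applied from right to left so that indels do not shift the
    coordinates of variants that are still waiting to be applied.
    """
    sequence = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if sequence[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        sequence = sequence[:start] + alt + sequence[start + len(ref):]
    return sequence


toy_reference = "ACGTACGTAC"
toy_variants = [
    (2, "C", "T"),    # substitution
    (5, "AC", "A"),   # one-base deletion
    (9, "A", "AGG"),  # two-base insertion
]
consensus = apply_variants(toy_reference, toy_variants)
print(consensus)  # ATGTAGTAGGC
```

The REF check guards against applying a VCF to the wrong reference, which would otherwise go unnoticed.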
pangolinLineage
Purpose: Assign SARS-CoV-2 lineages using pangolin.
- Input: Combined consensus FASTA file
- Output: lineage_report.csv, optional TSV and summary files
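
Once `lineage_report.csv` exists, a short script can summarize it. The sketch below counts sequences per lineage using the `taxon` and `lineage` columns that pangolin writes. The rows are made up, and a real report contains additional columns.

```python
import csv
import io
from collections import Counter


def count_lineages(handle) -> Counter:
    """Count how many sequences were assigned to each lineage."""
    return Counter(row["lineage"] for row in csv.DictReader(handle))


# Made-up rows standing in for a real lineage report.
example_report = io.StringIO(
    "taxon,lineage\n"
    "sample1,B.1.1.7\n"
    "sample2,B.1.1.7\n"
    "sample3,B.1.617.2\n"
)
counts = count_lineages(example_report)
for lineage, total in counts.most_common():
    print(lineage, total)
```

With a file on disk, `count_lineages(open("lineage_report.csv", newline=""))` would work the same way.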
consensusQC
Purpose: Assess consensus quality using president.
- Input: Consensus FASTA, reference.fasta
- Output: QC summary in the output/ folder
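
One simple property of consensus quality is the share of ambiguous `N` bases. The sketch below computes it for each record in FASTA text. It is only an illustration of the idea and does not reproduce the metrics or logic of president. The 5% threshold and the sequence names are hypothetical.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Map each record name (first word of the header) to its sequence."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def n_fraction(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are N."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0


example_fasta = ">consensus-a\nACGTNNACGT\n>consensus-b\nACGTACGTAC\n"
fractions = {name: n_fraction(seq) for name, seq in read_fasta(example_fasta).items()}

# Hypothetical cutoff: accept sequences with at most 5% N bases.
passing = [name for name, fraction in fractions.items() if fraction <= 0.05]
```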
phylogeny
Purpose: Perform multiple sequence alignment and build a phylogenetic tree.
- Input: Consensus FASTA file
- Output: alignment.fasta, IQ-TREE outputs (e.g., .treefile, .log)
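
The `.treefile` written by IQ-TREE is in Newick format, so a quick sanity check is to confirm that every consensus sequence appears as a leaf. The sketch below extracts leaf labels with a regular expression. It assumes unquoted labels, and the example tree is made up.

```python
import re


def leaf_names(newick: str) -> list[str]:
    """Return leaf labels: unquoted names that directly follow '(' or ','.

    Internal node labels (such as support values) follow ')' and are skipped.
    """
    matches = re.findall(r"[(,]([^(),:;]+)", newick)
    return [name.strip() for name in matches if name.strip()]


# Made-up tree with branch lengths and one internal support value (95).
example_tree = (
    "(consensus-a:0.001,consensus-b:0.002,"
    "(consensus-c:0.001,consensus-d:0.003)95:0.004);"
)
leaves = leaf_names(example_tree)
```

For trees with quoted or unusual labels, a dedicated Newick parser would be a safer choice than this regular expression.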
For a visual overview, see the Workflow section.