Output Structure

Files Content Description

1. Raw Data & Reference

downloadData/Content: Raw Sequencing Reads

  • <sample_id>_R1_001.fastq.gz / <sample_id>_R2_001.fastq.gz: The original, unprocessed paired-end reads (R1 forward, R2 reverse), as produced by the sequencer or downloaded from a repository (e.g., SRA/ENA).
  • Format: FASTQ (gzip compressed).
  • Use: The starting point for the entire pipeline.
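
A quick sanity check on the raw reads is to count the records in each file; for paired-end data, the R1 and R2 counts should match. The following is a minimal sketch, assuming standard four-line FASTQ records. The function names are illustrative and are not part of the pipeline.

```python
import gzip


def count_fastq_reads(lines):
    """Count records in an iterable of FASTQ text lines (4 lines per record)."""
    # Ignore blank lines so a trailing newline does not break the count.
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; file may be truncated")
    return n_lines // 4


def count_reads_in_gzip(path):
    """Open a .fastq.gz file in text mode and count its records."""
    with gzip.open(path, "rt") as handle:
        return count_fastq_reads(handle)
```

Calling `count_reads_in_gzip` on both files of a pair and comparing the results is a cheap way to spot an incomplete download before running the rest of the pipeline.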

referenceGenome/Content: Viral Reference

  • reference.fasta: The standard SARS-CoV-2 reference genome (Wuhan-Hu-1, NC_045512.2).
  • Use: The template for mapping reads and calling variants.

2. Pre-processing & Alignment

qc/Content: Quality Control & Trimming

  • <sample_id>.fastp.html: An interactive HTML report visualizing read quality before and after trimming. Look here to check adapter removal and read quality scores.
  • <sample_id>.R1/R2.clean.fastq.gz: The "clean" sequencing reads. Adapters have been trimmed, and low-quality bases removed. These are used for mapping.

mapping/Content: Genome Alignment

  • <sample_id>.sorted.bam: The Binary Alignment Map (BAM) file. It contains the clean reads aligned to the reference genome, sorted by coordinate.
  • Use: Essential for visualization in tools like IGV (Integrative Genomics Viewer) to see coverage depth and mutations.

primerClipping/Content: PCR Primer Removal

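  • <sample_id>.primerclipped.bam: A modified BAM file in which PCR primer sequences have been "soft-clipped": the bases stay in the read record but are excluded from the alignment, so downstream steps ignore them.
  • Why this matters: Amplicon sequencing uses synthetic primers to amplify the viral genome. Primer-derived bases reflect the primer, not the sample, so leaving them in can mask real mutations at primer binding sites or introduce false ones.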
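
In a BAM record, soft-clipping shows up as `S` operations in the CIGAR string. The sketch below illustrates the concept by counting soft-clipped bases in a single CIGAR string. It is not how the pipeline performs clipping, and the function name is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return the number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")


# A read whose first 22 bases were clipped as primer, followed by 128 aligned bases.
print(soft_clipped_bases("22S128M"))  # 22
```

Comparing this count for the same read before and after primer clipping shows how many bases were masked.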

3. Consensus & Analysis

consensusGeneration/Content: Individual Viral Genomes

  • <sample_id>.consensus.fasta: The reconstructed viral genome for a single sample. This sequence represents the virus found in that patient/sample, including any mutations relative to the reference.

mergeConsensus/Content: Aggregate Genomes

  • consensus-seqs.fasta: A Multi-FASTA file containing the consensus sequences of all processed samples combined into one file.
  • Use: This is the primary input for downstream phylogenetic analysis and lineage assignment.
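
To inspect which samples ended up in the merged file, a Multi-FASTA can be read with a few lines of standard-library code. This is a minimal sketch, assuming each header line starts with `>` and the sample ID is the first whitespace-delimited token. The function name is illustrative.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict mapping sequence ID to sequence."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

For example, `with open("consensus-seqs.fasta") as fh: genomes = read_fasta(fh)` gives one entry per sample, and `len(genomes)` should equal the number of processed samples.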

4. Classification & Quality Assessment

pangolinLineage/Content: Variant Classification

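  • lineage_report.csv: A comma-separated table generated by the Pangolin tool, with one row per consensus sequence.
  • Key columns to check:
    • lineage: The assigned Pango lineage (e.g., B.1.1.7, BA.5).
    • scorpio_call: The variant constellation assigned by Scorpio, which typically carries the WHO variant name (e.g., Alpha, Omicron).
    • ambiguity_score: A quality metric indicating how much missing data ("N" bases) affected the call. According to Pangolin's documentation, a score of 1 means no relevant sites had to be imputed, and lower scores mean more missing data.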
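
A quick way to summarize the report is to tally how many samples fall into each lineage. This sketch uses only the `lineage` column named above. The function name is illustrative, and the `taxon` column in the usage example is assumed to hold the sequence ID.

```python
import csv
from collections import Counter


def count_lineages(lines):
    """Tally the 'lineage' column of a Pangolin report given as CSV text lines."""
    reader = csv.DictReader(lines)
    return Counter(row["lineage"] for row in reader)


# Small inline example; in practice, pass an open file handle instead.
example = [
    "taxon,lineage,scorpio_call",
    "sampleA,BA.5,Omicron",
    "sampleB,B.1.1.7,Alpha",
    "sampleC,BA.5,Omicron",
]
for lineage, n in count_lineages(example).most_common():
    print(lineage, n)
```

With a real report, open the file with `newline=""` as recommended for the `csv` module, for example `with open("lineage_report.csv", newline="") as fh: counts = count_lineages(fh)`.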

consensusQC/Content: Genome Quality Metrics (Tool: President)

  • consensus_report.tsv: A tabular summary of genome quality. Checks for "N" content (ambiguous bases) and frameshifts.
  • consensus_valid.fasta: Genomes that PASSED quality control checks.
  • consensus_invalid.fasta: Genomes that FAILED checks (usually due to low coverage resulting in too many Ns).
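
To see why a genome may have failed, the fraction of "N" bases in a sequence can be computed directly. This is an independent check written for illustration. It does not reproduce President's actual criteria or thresholds, and the function name is hypothetical.

```python
def n_fraction(sequence):
    """Return the fraction of bases in a sequence that are 'N' (case-insensitive)."""
    if not sequence:
        raise ValueError("empty sequence")
    return sequence.upper().count("N") / len(sequence)


# Half of this toy sequence is unknown.
print(n_fraction("ACGTNNNN"))  # 0.5
```

Genomes in consensus_invalid.fasta would be expected to show noticeably higher values than those in consensus_valid.fasta when low coverage was the cause.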

phylogeny/Content: Evolutionary Tree

  • phylo.treefile: A phylogenetic tree in Newick format, generated by IQ-TREE.
  • Use: Can be opened in tree viewers like FigTree or Dendroscope to visualize how the samples are related to each other evolutionarily.
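
To confirm that every sample made it into the tree without opening a viewer, the leaf names can be pulled out of the Newick text. This is a rough sketch that assumes unquoted sample names without parentheses, commas, colons, or whitespace. It is not a full Newick parser, and the function name is illustrative.

```python
import re

# A leaf name directly follows "(" or ","; internal-node labels follow ")" and are skipped.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return leaf labels from a Newick string, in order of appearance."""
    return LEAF.findall(newick)


print(leaf_names("((sampleA:0.1,sampleB:0.2)90:0.3,sampleC:0.4);"))
# ['sampleA', 'sampleB', 'sampleC']
```

Reading phylo.treefile as text and passing its contents to `leaf_names` yields a list that can be compared against the sample IDs in consensus-seqs.fasta.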