Process Details

1. Matching Reads

Script: scripts/match.pl

Inputs:

Input Description
reads Sample FASTA (e.g. data/reads/illumina_reads_<length>.fasta.gz)
reference Genome FASTA (e.g. data/ref/hg38_partial.fasta.gz)

Output:
- results/{sample}.matched (tab-delimited list of matched regions)

Description:
It takes each input read set and searches against the reference, emitting one line per successful match with coordinates and basic alignment metrics.


Benchmarking

[match.pl] (Nov 2023)

Read Length Query Size Markers Found Runtime Memory Usage
40 1,000 511 2.94 min 334 MB
60 1,000 446 3.27 min 336 MB
80 1,000 415 3.83 min 337.5 MB
100 1,000 439 3.23 min 341 MB
100 10,000 4,280 33 min 341 MB

[match.pl] (Jan 2025)

Read Length Query Size Markers Found Runtime Memory Usage
40 1000 2590 0.03 min 198.16 MB
40 10000 39694 0.23 min 202.75 MB
40 100000 400556 2.18 min 248.42 MB
40 1000000 400556 2.22 min 248.42 MB
60 1000 1395 0.03 min 248.42 MB
60 10000 12669 0.23 min 248.42 MB
60 100000 129502 2.12 min 250.07 MB
60 1000000 129502 2.10 min 250.07 MB
80 1000 1092 0.03 min 250.07 MB
80 10000 9866 0.22 min 250.07 MB
80 100000 101426 2.11 min 253.08 MB
80 1000000 101426 2.10 min 253.08 MB
100 1000 867 0.03 min 253.08 MB
100 10000 8822 0.23 min 253.08 MB
100 100000 90293 2.14 min 256.41 MB
100 1000000 90293 2.13 min 256.41 MB

2. Annotating Matches

Script: scripts/annotate.pl

Inputs:

File Description
results/{sample}.matched Matched read output from previous step
hg38_chr1_geneannotation.gff3.gz Gene features
hg38_chr1_tss.txt.gz Transcription start sites
hg38_cpg.txt.gz CpG locations
hg38_repeatmasker.bed.gz Repetitive elements

Output:
- results/{sample}.annotated (tab-delimited, with annotation columns)

Description:
It enriches each match record by querying overlap with gene models, TSS positions, CpG sites, and repeat regions.
The output adds columns for:

  • Gene ID and feature type
  • Distance to nearest TSS
  • CpG count within the matched interval
  • Overlap with repeat elements