Process Details
1. Matching Reads
Script: scripts/match.pl
Inputs:
Input | Description |
---|---|
reads | Sample FASTA (e.g. data/reads/illumina_reads_<length>.fasta.gz) |
reference | Genome FASTA (e.g. data/ref/hg38_partial.fasta.gz) |
Output:
- results/{sample}.matched (tab-delimited list of matched regions)
Description:
The script searches each read in the input read set against the reference and emits one line per successful match, containing the match coordinates and basic alignment metrics.
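The matching step can be illustrated with a minimal Python sketch. It is not the implementation in scripts/match.pl: it assumes exact substring matching on the forward strand only, and the output columns (read ID, reference ID, 1-based start, end, mismatch count) are a hypothetical layout chosen for illustration. The helper names `parse_fasta`, `open_fasta`, and `match_reads` are likewise made up for this example.

```python
import gzip
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def match_reads(reads, reference):
    """Yield one tab-delimited line per exact occurrence of a read.

    `reference` must be a list, because it is scanned once per read.
    Assumed columns: read_id, ref_id, start (1-based), end, mismatches.
    """
    for read_id, read_seq in reads:
        if not read_seq:
            continue  # skip empty records
        for ref_id, ref_seq in reference:
            pos = ref_seq.find(read_seq)
            while pos != -1:
                start, end = pos + 1, pos + len(read_seq)
                # Exact matching only, so the mismatch count is always 0
                yield "\t".join([read_id, ref_id, str(start), str(end), "0"])
                pos = ref_seq.find(read_seq, pos + 1)


# Tiny in-memory demonstration with made-up sequences
reference = list(parse_fasta(io.StringIO(">chr_demo\nACGTACGTTT\n")))
reads = list(parse_fasta(io.StringIO(">r1\nACGT\n>r2\nGGGG\n")))
for line in match_reads(reads, reference):
    print(line)
```

For real inputs, `open_fasta` can supply the handles, for example `list(parse_fasta(open_fasta(path)))`. A linear scan like this is only practical for small references; an indexed search would be needed at genome scale.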
Benchmarking
match.pl (Nov 2023)
Read Length | Query Size | Markers Found | Runtime | Memory Usage |
---|---|---|---|---|
40 | 1,000 | 511 | 2.94 min | 334 MB |
60 | 1,000 | 446 | 3.27 min | 336 MB |
80 | 1,000 | 415 | 3.83 min | 337.5 MB |
100 | 1,000 | 439 | 3.23 min | 341 MB |
100 | 10,000 | 4,280 | 33 min | 341 MB |
match.pl (Jan 2025)
Read Length | Query Size | Markers Found | Runtime | Memory Usage |
---|---|---|---|---|
40 | 1,000 | 2,590 | 0.03 min | 198.16 MB |
40 | 10,000 | 39,694 | 0.23 min | 202.75 MB |
40 | 100,000 | 400,556 | 2.18 min | 248.42 MB |
40 | 1,000,000 | 400,556 | 2.22 min | 248.42 MB |
60 | 1,000 | 1,395 | 0.03 min | 248.42 MB |
60 | 10,000 | 12,669 | 0.23 min | 248.42 MB |
60 | 100,000 | 129,502 | 2.12 min | 250.07 MB |
60 | 1,000,000 | 129,502 | 2.10 min | 250.07 MB |
80 | 1,000 | 1,092 | 0.03 min | 250.07 MB |
80 | 10,000 | 9,866 | 0.22 min | 250.07 MB |
80 | 100,000 | 101,426 | 2.11 min | 253.08 MB |
80 | 1,000,000 | 101,426 | 2.10 min | 253.08 MB |
100 | 1,000 | 867 | 0.03 min | 253.08 MB |
100 | 10,000 | 8,822 | 0.23 min | 253.08 MB |
100 | 100,000 | 90,293 | 2.14 min | 256.41 MB |
100 | 1,000,000 | 90,293 | 2.13 min | 256.41 MB |
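As a rough way to compare the two benchmark runs, the sketch below divides the Nov 2023 runtime by the Jan 2025 runtime for the configurations that appear in both tables. The runtimes are copied from the tables above; the `speedups` helper is hypothetical. The ratios should be read with care: the Jan 2025 runtimes are rounded to two decimals, and the Markers Found counts differ between the runs, so the workloads may not be identical.

```python
# Runtimes in minutes, keyed by (read length, query size), copied from the tables
nov_2023 = {
    (40, 1000): 2.94,
    (60, 1000): 3.27,
    (80, 1000): 3.83,
    (100, 1000): 3.23,
    (100, 10000): 33.0,
}
jan_2025 = {
    (40, 1000): 0.03,
    (60, 1000): 0.03,
    (80, 1000): 0.03,
    (100, 1000): 0.03,
    (100, 10000): 0.23,
}


def speedups(old, new):
    """Return old/new runtime ratios for configurations present in both runs."""
    return {
        key: old[key] / new[key]
        for key in old
        if key in new and new[key] > 0
    }


for (read_length, query_size), ratio in sorted(speedups(nov_2023, jan_2025).items()):
    print(f"{read_length}\t{query_size}\t{ratio:.1f}x")
```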
2. Annotating Matches
Script: scripts/annotate.pl
Inputs:
File | Description |
---|---|
results/{sample}.matched | Matched read output from previous step |
hg38_chr1_geneannotation.gff3.gz | Gene features |
hg38_chr1_tss.txt.gz | Transcription start sites |
hg38_cpg.txt.gz | CpG locations |
hg38_repeatmasker.bed.gz | Repetitive elements |
Output:
- results/{sample}.annotated (tab-delimited, with annotation columns)
Description:
The script enriches each match record by checking it for overlap with gene models, TSS positions, CpG sites, and repeat regions.
The output adds columns for:
- Gene ID and feature type
- Distance to nearest TSS
- CpG count within the matched interval
- Overlap with repeat elements
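The four annotation columns can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of scripts/annotate.pl: all coordinates are taken to be 1-based closed intervals on a single chromosome (BED input would first need converting from 0-based half-open), the TSS and CpG positions are pre-sorted lists of integers, and the function names and tuple layouts are hypothetical.

```python
from bisect import bisect_left, bisect_right


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def nearest_tss_distance(start, end, tss_sorted):
    """Distance from the interval to the closest TSS (0 if one lies inside)."""
    if not tss_sorted:
        return None
    i = bisect_left(tss_sorted, start)  # first TSS at or after `start`
    if i < len(tss_sorted) and tss_sorted[i] <= end:
        return 0
    candidates = []
    if i > 0:
        candidates.append(start - tss_sorted[i - 1])  # nearest TSS on the left
    if i < len(tss_sorted):
        candidates.append(tss_sorted[i] - end)  # nearest TSS on the right
    return min(candidates)


def cpg_count(start, end, cpg_sorted):
    """Number of CpG positions that fall within [start, end]."""
    return bisect_right(cpg_sorted, end) - bisect_left(cpg_sorted, start)


def annotate(match, genes, tss_sorted, cpg_sorted, repeats):
    """Build the annotation values for one (start, end) match.

    genes:   list of (start, end, gene_id, feature_type)
    repeats: list of (start, end, repeat_name)
    """
    start, end = match
    return {
        "genes": [(gid, ftype) for s, e, gid, ftype in genes
                  if overlaps(start, end, s, e)],
        "tss_distance": nearest_tss_distance(start, end, tss_sorted),
        "cpg_count": cpg_count(start, end, cpg_sorted),
        "repeats": [name for s, e, name in repeats
                    if overlaps(start, end, s, e)],
    }


# Made-up annotation data for demonstration
genes = [(100, 500, "geneA", "exon"), (800, 900, "geneB", "gene")]
tss = [100, 800]
cpg = [120, 130, 450, 810]
repeats = [(140, 160, "AluY")]
print(annotate((110, 150), genes, tss, cpg, repeats))
```

The binary searches keep the TSS and CpG lookups logarithmic in the number of annotated positions. The gene and repeat checks here are linear scans, which an interval index would replace for full-size annotation files.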