Sequence Tools
Sequence tools span two implementations: the original Perl modules in seqtools/ and the pure Python ports in src/bio_sea_pearl/seqtools_py/. The API layer transparently selects the Python implementation when available, falling back to Perl if needed.
Capabilities
| Function | Python port | Perl module | Description |
|---|---|---|---|
| Hamming distance | hamming.py | hamming.pm | Positional mismatch count for equal-length strings |
| Levenshtein distance | levenshtein.py | levenshtein.pm | Edit distance (insertions, deletions, substitutions) |
| K-mer counting | kmer.py | kmer.pm | Frequency table of all k-length substrings |
| Boyer–Moore search | none | boyermoore.pm | Fast pattern matching with skip heuristics |
| FASTA parsing | none | fasta.pm | Read FASTA files into sequence records |
Python implementations
Located in src/bio_sea_pearl/seqtools_py/. These are the preferred runtime path.
hamming.py
Single-pass comparison. O(n) time, O(1) space.
Raises ValueError if the two strings have different lengths.
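The behaviour described above can be illustrated with a minimal sketch. This is not the source of hamming.py; the name `hamming_sketch` is hypothetical and only shows one way to get a single pass with constant extra space and a ValueError on unequal lengths.

```python
def hamming_sketch(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Mirrors the documented contract: unequal lengths are an input error.
        raise ValueError(f"length mismatch: {len(a)} != {len(b)}")
    # One pass over paired characters; only a running total is kept.
    return sum(x != y for x, y in zip(a, b))


print(hamming_sketch("ACGT", "AGGT"))  # 1 (only the second position differs)
```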
levenshtein.py
Classic dynamic programming with space optimisation — uses two rows instead of the full matrix. O(n×m) time, O(min(n,m)) space. Handles empty strings correctly.
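A two-row version of the dynamic programme might look like the sketch below. It is an illustration under assumptions, not the contents of levenshtein.py: the function name `levenshtein_sketch` is hypothetical, and the swap that keeps the shorter string along the row is one common way to reach O(min(n,m)) space.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Edit distance computed with two rows of the DP table."""
    # Keep the shorter string along the row so each row has min(n, m) + 1 cells.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


print(levenshtein_sketch("kitten", "sitting"))  # 3
print(levenshtein_sketch("", "ACG"))            # 3
```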
kmer.py
Sliding-window hash-based counting. O(n) time for a fixed k.
Raises ValueError if k ≤ 0 or k > len(seq).
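As a rough sketch of sliding-window counting with the validation rule above, assuming a plain dictionary result (the actual return type of kmer.py is not stated here, and `kmer_counts_sketch` is a hypothetical name):

```python
from collections import Counter


def kmer_counts_sketch(seq: str, k: int) -> dict:
    """Count every k-length substring of seq."""
    if k <= 0 or k > len(seq):
        raise ValueError(f"k must satisfy 1 <= k <= {len(seq)}, got {k}")
    # Slide a window of width k across the sequence, one position at a time.
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return dict(Counter(windows))


print(kmer_counts_sketch("ACGTAC", 2))
# {'AC': 2, 'CG': 1, 'GT': 1, 'TA': 1}
```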
Import paths
# Via the public API (recommended)
from bio_sea_pearl.api import hamming_distance, levenshtein_distance, kmer_counts
# Direct access to the Python implementations
from bio_sea_pearl.seqtools_py import hamming_distance, levenshtein_distance, kmer_counts
Perl implementations
Located in seqtools/SeqTools/. These are the original implementations and serve as fallbacks.
hamming.pm
Computes Hamming distance using Inline C for performance. The C kernel compares characters in a tight loop, avoiding Perl interpreter overhead. Falls back to pure Perl if Inline C is not available.
levenshtein.pm
Edit distance via Inline C with a space-optimised DP approach that needs O(min(n,m)) working space. Fallback behaviour is similar to that of hamming.pm.
kmer.pm
K-mer counting using Perl hash aggregation. Outputs tab-separated kmer\tcount pairs.
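If the tab-separated output needs to be consumed from Python, a small parser along these lines may be enough. It assumes one kmer\tcount pair per line and no header row, neither of which is confirmed by the module description; `parse_kmer_table` is a hypothetical helper, not part of the package.

```python
def parse_kmer_table(text: str) -> dict:
    """Turn 'kmer<TAB>count' lines into a {kmer: count} dictionary."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        kmer, count = line.split("\t")
        counts[kmer] = int(count)
    return counts


sample = "ACG\t2\nCGT\t1\n"
print(parse_kmer_table(sample))  # {'ACG': 2, 'CGT': 1}
```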
boyermoore.pm
Boyer–Moore string matching algorithm. Uses the bad-character heuristic to skip ahead on mismatches, achieving sublinear average-case performance. This module is available only through Perl; there is no Python port.
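To make the bad-character heuristic concrete, here is an illustrative Python sketch. It is not a port of boyermoore.pm and is not part of the package: the function name `bad_character_search` is hypothetical, and only the bad-character rule is shown (no good-suffix rule).

```python
def bad_character_search(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost position of each character within the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    hits = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift so overlapping matches are found
        else:
            # Line up the mismatched text character with its rightmost
            # occurrence in the pattern, or jump past it if it never occurs.
            s += max(1, j - last.get(text[s + j], -1))
    return hits


print(bad_character_search("GATTACAGATTACA", "TTA"))  # [2, 9]
```

The jump past characters that never occur in the pattern is where the sublinear average case comes from: on a mismatch against such a character, the whole window up to that position is skipped without being inspected.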
fasta.pm
Minimal FASTA parser. Reads header lines (starting with >) and concatenates sequence lines into single records.
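The parsing rule described above can be sketched in Python as follows. This is an illustration of the same rule, not a translation of fasta.pm; `parse_fasta_sketch` is a hypothetical name, and skipping blank lines and any text before the first header are assumptions.

```python
def parse_fasta_sketch(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


records = list(parse_fasta_sketch([">seq1 demo", "ACGT", "TTGA", "", ">seq2", "GG"]))
print(records)  # [('seq1 demo', 'ACGTTTGA'), ('seq2', 'GG')]
```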
Perl CLI dispatcher
seqtools/run.pl is a unified entry point for Perl-based operations:
perl seqtools/run.pl kmer input.fa 3
perl seqtools/run.pl hamming seq1.fa seq2.fa
perl seqtools/run.pl levenshtein seq1.fa seq2.fa
perl seqtools/run.pl motif input.fa PATTERN
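When the dispatcher has to be driven from Python, a thin subprocess wrapper is one option. The sketch below is not the package's perl_wrappers module; both function names are hypothetical, and it assumes that run.pl writes its result to standard output and exits non-zero on failure.

```python
import subprocess


def build_seqtools_command(operation, *args, script="seqtools/run.pl"):
    """Build the argument list for one dispatcher call."""
    # Passing a list (not a shell string) avoids quoting problems in file names.
    return ["perl", script, operation, *[str(a) for a in args]]


def run_seqtools(operation, *args):
    """Run the dispatcher and return its standard output as text."""
    cmd = build_seqtools_command(operation, *args)
    # check=True raises CalledProcessError if perl exits with a non-zero status.
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout


print(build_seqtools_command("kmer", "input.fa", 3))
# ['perl', 'seqtools/run.pl', 'kmer', 'input.fa', '3']
```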
Fallback strategy
The API layer in seqtools_api.py implements a try-Python-first pattern:
hamming_distance("ACGT", "AGGT")
│
├─ _PY_PORT_AVAILABLE is True?
│ ├─ Yes → call seqtools_py.hamming.distance()
│ │ ├─ ValueError → re-raise (validation error)
│ │ └─ Other exception → fall back to Perl
│ │
│ └─ No → call perl_wrappers.distance_hamming_perl()
│
└─ Return integer result
This means:
- No Perl installed? Hamming, Levenshtein, and k-mer functions still work via the Python ports.
- Python port broken? The system degrades gracefully to Perl.
- Genuine input errors (e.g. unequal-length strings for Hamming) are raised immediately regardless of which backend is used.
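The try-Python-first pattern can be sketched generically as below. This is a simplified illustration, not the code in seqtools_api.py: `with_fallback` and the stand-in backends are hypothetical, and the `port_available` argument plays the role of the `_PY_PORT_AVAILABLE` flag.

```python
def with_fallback(primary, fallback, port_available=True):
    """Wrap two backends so that the primary is tried first."""
    def call(*args, **kwargs):
        if not port_available:
            return fallback(*args, **kwargs)
        try:
            return primary(*args, **kwargs)
        except ValueError:
            raise  # validation errors are surfaced, never masked by a retry
        except Exception:
            return fallback(*args, **kwargs)  # any other failure degrades to the fallback
    return call


def python_backend(a, b):
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def broken_backend(a, b):
    raise RuntimeError("simulated defect in the Python port")


def perl_backend_stub(a, b):
    # Stand-in for a call into Perl; a real wrapper would shell out instead.
    return sum(x != y for x, y in zip(a, b))


print(with_fallback(python_backend, perl_backend_stub)("ACGT", "AGGT"))  # 1
print(with_fallback(broken_backend, perl_backend_stub)("ACGT", "AGGT"))  # 1, via the stub
```

Re-raising ValueError before the generic handler is what keeps input errors from being retried against the second backend.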
Tip
If you are deploying without Perl (e.g. a minimal Python container), the Hamming, Levenshtein, and k-mer commands will work correctly as long as the seqtools_py module is importable. Boyer–Moore search has no Python port and therefore still needs Perl, as do the alignment and Markov commands.