Skip to content

Markov Chain Module

The markov/ directory contains Perl modules and scripts for training Markov chain models on sequence data and generating random walks. The implementation supports both first-order and arbitrary higher-order chains, with two sampling strategies and optional pseudocount smoothing.


Components

Perl modules

MarkovChainFirstOrder.pm

Object-oriented first-order Markov chain implementation (markov/lib/).

Key methods:

Method Description
new(pseudocount) Constructor; accepts an optional pseudocount for smoothing
build_transitions(sequences) Reads sequences (via read_fasta) and counts all adjacent symbol pairs
generate(N, start) Returns a string of length len(start) + N starting from start

Sampling methods:

  • Alias method — constant-time O(1) sampling using a precomputed alias table. Preferred for large generation tasks.
  • Binary search — O(log k) sampling by binary searching cumulative probability distributions. Simpler to implement but slower per sample.

Pseudocount smoothing:

When a pseudocount is provided, it is added to every possible transition before normalising. This prevents zero-probability events and ensures the chain can always make a transition, even for states unseen in the training data.

MarkovChainHigherOrder.pm

Extends the first-order model to support arbitrary-order chains. Instead of single symbols, higher-order chains use k-mers as states. A chain of order k groups k consecutive symbols to form each state and predicts the next symbol based on that context.

The interface mirrors MarkovChainFirstOrder.pm:

  • Same constructor, build_transitions, and generate signatures.
  • Same sampling methods (alias, binary search).
  • Same pseudocount support.

Note

The generate(N, start) method returns a string of length len(start) + N. The start string is included as a prefix of the output. For a higher-order chain, the start string should be at least as long as the chain order.


Command-line scripts

randomwalk.pl

The primary entry point for Markov chain generation (markov/bin/).

perl markov/bin/randomwalk.pl \
    --fasta INPUT.fa \
    --length 100 \
    --start A \
    --order 1 \
    --method alias \
    --pseudocount 0.01
Option Type Default Description
--fasta Path required Training FASTA file
--length Integer required Characters to generate
--start String A Starting state
--order Integer 1 Chain order
--method String alias Sampling method: alias or binsrch
--pseudocount Float 0 Smoothing pseudocount

Output: a single line containing the generated DNA sequence.

validate.pl

Compares a Markov model against observed sequences by computing log-likelihoods. Useful for evaluating model fit and tuning the order and pseudocount parameters.


Wrapper integration

The Markov module is accessed through the Python wrapper layer:

bio_sea_pearl.api.generate_walk(fasta, length, start, order, method, pseudocount)
  └─ perl_wrappers.markov_wrapper.run_markov_walk(...)
       ├─ Locates repo root
       ├─ Constructs: perl markov/bin/randomwalk.pl --fasta ... --length ...
       ├─ Sets PERL5LIB to include markov/lib/
       ├─ Filters stdout for valid DNA sequences
       └─ Returns the generated sequence string

The wrapper filters the output to extract only valid DNA character lines, discarding any diagnostic messages or warnings from the Perl script.


Usage

biosea markov --fasta training.fa --length 200 --start A --order 2 --method alias
curl -X POST http://localhost:8000/markov \
    -H "Content-Type: application/json" \
    -d '{
        "fasta": ">train\nACGTACGTACGT",
        "length": 50,
        "start": "AC",
        "order": 2,
        "method": "alias",
        "pseudocount": 0.01
    }'
from bio_sea_pearl.api import generate_walk
walk = generate_walk("training.fa", length=100, start="A", order=1)
print(walk)

Choosing parameters

Parameter Guidance
Order Higher orders capture longer-range dependencies but require exponentially more training data. Start with order 1 or 2.
Pseudocount Use a small value (0.001–0.1) to smooth rare transitions. Set to 0 if your training data is large enough to cover all transitions.
Method alias is faster for large generation tasks. binsrch is simpler and sufficient for small outputs.
Start For higher-order chains, the start string must be at least order characters long.