Mathematical Foundations and Implementation Model¶
This page describes the actual mathematics implemented in SequenceAligner, based on the codebase rather than a generic
alignment summary. The implementation combines classical dynamic programming with indexed seed discovery and chaining.
The main algorithmic parts are:
- suffix array construction
- FM-index construction and backward search
- k-mer seed generation
- seed chaining with affine-style gap costs
- affine-gap global alignment
- affine-gap local alignment in candidate windows
- longest common subsequence (LCS)
The code supports both DNA and protein scoring modes through EDNAFULL and BLOSUM62-style substitution tables, respectively.
1. Sequences, notation, and scoring¶
Let
- \(X = x_1x_2\cdots x_m\) be the query sequence
- \(Y = y_1y_2\cdots y_n\) be the target sequence
with lengths \(m = |X|\) and \(n = |Y|\).
The implementation defines a substitution score function \(\sigma(a, b)\), which depends on the selected mode:
- DNA mode: EDNAFULL-derived scoring
- protein mode: BLOSUM62-derived scoring
In the code, this is dispatched through the mode-specific substitution table lookup.
Gap penalties are affine:
- gap open: \(g_o = \texttt{GAP\_OPEN} = -5\)
- gap extend: \(g_e = \texttt{GAP\_EXTEND} = -1\)
so that a gap of length \(L \ge 1\) has score

\[
\gamma(L) = g_o + (L - 1)\,g_e.
\]

These constants are explicitly defined in the code.
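As a quick numerical check, here is a minimal sketch of this formula (assuming the common convention that a gap of length \(L\) pays the open penalty once plus \(L-1\) extensions; verify the exact convention against the DP recurrences in the code):

```python
GAP_OPEN = -5    # g_o: score contribution of the first gap column
GAP_EXTEND = -1  # g_e: score contribution of each additional gap column

def affine_gap_score(length: int) -> int:
    """Affine score of a gap of `length` columns: g_o + (L - 1) * g_e."""
    if length < 1:
        raise ValueError("gap length must be >= 1")
    return GAP_OPEN + (length - 1) * GAP_EXTEND

# A 1-column gap scores -5; a 4-column gap scores -5 + 3 * (-1) = -8.
```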
2. Suffix array construction¶
For a string \(T = t_0t_1\cdots t_{N-1}\), the suffix array is the permutation

\[
\mathrm{SA} = \big(\mathrm{SA}[0], \mathrm{SA}[1], \dots, \mathrm{SA}[N-1]\big)
\]

of \(\{0, \dots, N-1\}\) such that

\[
T_{\mathrm{SA}[0]} < T_{\mathrm{SA}[1]} < \cdots < T_{\mathrm{SA}[N-1]}
\]

in lexicographic order, where \(T_i = t_it_{i+1}\cdots t_{N-1}\) denotes the suffix starting at position \(i\).
The implementation uses a rank-doubling construction. Initially, each suffix starting at position \(i\) is ranked by the character code of \(t_i\). At iteration \(k=1,2,4,\dots\), suffixes are sorted by the pair

\[
\big(r_k(i),\; r_k(i+k)\big),
\]

where \(r_k(i)\) is the current rank of the suffix beginning at \(i\). If \(i+k \ge N\), the second component is treated as \(-1\).
After sorting, new ranks are assigned:

\[
r_{2k}(\mathrm{SA}[0]) = 0,
\]

and for \(j \ge 1\),

\[
r_{2k}(\mathrm{SA}[j]) = r_{2k}(\mathrm{SA}[j-1]) + \big[\, \text{pair}(\mathrm{SA}[j]) \ne \text{pair}(\mathrm{SA}[j-1]) \,\big],
\]

where \(\text{pair}(i)\) denotes the sort key \(\big(r_k(i), r_k(i+k)\big)\).
The process stops once all ranks are unique. This is the standard \(O(N\log N)\) rank-doubling method implemented in the code.
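The rank-doubling loop can be sketched as follows (an illustration, not the code's actual implementation; it leans on Python's comparison sort each round, so a tuned radix-sort variant would be faster):

```python
def build_suffix_array(text: str) -> list[int]:
    """Rank-doubling suffix array construction (sketch)."""
    n = len(text)
    sa = list(range(n))
    rank = [ord(c) for c in text]  # initial ranks: raw character codes
    k = 1
    while True:
        # Sort suffixes by the (rank, rank k positions ahead) pair; -1 past the end.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: equal pairs share a rank, otherwise the rank increments.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks unique: done
            break
        k *= 2
    return sa
```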
3. FM-index construction¶
To index the target sequence, the implementation builds an FM-index on

\[
T' = T\,\$,
\]

where \(\$\) is a sentinel smaller than all biological alphabet symbols.
3.1 Burrows-Wheeler transform¶
Given the suffix array \(\mathrm{SA}\), the Burrows-Wheeler transform \(B\) is

\[
B[i] =
\begin{cases}
T'[\mathrm{SA}[i]-1] & \mathrm{SA}[i] > 0,\\
\$ & \mathrm{SA}[i] = 0.
\end{cases}
\]
3.2 C-table¶
For each character \(c\), the FM-index stores

\[
C(c) = \big|\{\, i : T'[i] < c \,\}\big|,
\]

the total number of symbols in \(T'\) that are lexicographically smaller than \(c\).
3.3 Occ table¶
For each character \(c\), the occurrence table stores

\[
\mathrm{Occ}(c, i) = \big|\{\, j < i : B[j] = c \,\}\big|,
\]

that is, the number of occurrences of \(c\) among the first \(i\) characters of \(B\).
3.4 Backward search¶
For a pattern \(P = p_1p_2\cdots p_\ell\), matching intervals are computed right-to-left. Start with

\[
[l, r) = [0, N).
\]

Then for \(i=\ell,\ell-1,\dots,1\),

\[
l \leftarrow C(p_i) + \mathrm{Occ}(p_i, l), \qquad
r \leftarrow C(p_i) + \mathrm{Occ}(p_i, r).
\]

If at any step \(l \ge r\), the pattern does not occur. Otherwise, after the last character, the suffix-array interval \([l,r)\) identifies all matching locations of the pattern in the target sequence. The code then maps that interval back to text coordinates using sa[i].
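The pieces above fit together in a compact sketch. The `fm_index` helper and its naive linear-scan `Occ` are illustrative stand-ins for the code's actual data structures (a real FM-index samples Occ checkpoints instead of rescanning):

```python
def fm_index(t: str, sa: list[int]):
    """Build BWT, C-table, and a naive Occ function for text t (t ends in '$')."""
    # B[i] = t[sa[i]-1], wrapping to the sentinel when sa[i] == 0
    bwt = "".join(t[i - 1] if i > 0 else t[-1] for i in sa)
    # C[c] = number of characters in t strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    def occ(c: str, i: int) -> int:
        # occurrences of c among the first i characters of the BWT (naive scan)
        return bwt[:i].count(c)
    return bwt, C, occ

def backward_search(pattern: str, n: int, C, occ):
    """Return the suffix-array interval [l, r) of suffixes prefixed by `pattern`."""
    l, r = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return (0, 0)
        l = C[ch] + occ(ch, l)
        r = C[ch] + occ(ch, r)
        if l >= r:
            return (0, 0)   # pattern absent
    return (l, r)
```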
4. Seed generation from query k-mers¶
The code generates exact-match seeds by sliding a k-mer window across the query. For a chosen seed length \(k\), define query k-mers

\[
K_i = x_i x_{i+1} \cdots x_{i+k-1}, \qquad 1 \le i \le m-k+1.
\]

For each \(K_i\), FM-index backward search returns all matching target positions

\[
\{\, t : y_t y_{t+1} \cdots y_{t+k-1} = K_i \,\}.
\]

Each match yields a seed

\[
s = (q, t, k),
\]

where

- \(q\) = query start position
- \(t\) = target start position
- \(k\) = seed length

So the raw seed set is

\[
\mathcal{S} = \{\, (i, t, k) : K_i \text{ occurs in } Y \text{ at position } t \,\}.
\]
This is exactly what generate_raw_seeds constructs.
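A sketch of this sweep, assuming a hypothetical `locate(kmer)` callable that wraps backward search plus the sa[i] lookup (the names here are illustrative, not the code's API):

```python
def generate_raw_seeds_sketch(query: str, k: int, locate) -> list[tuple[int, int, int]]:
    """Slide a k-mer window over the query; emit (q_start, t_start, k) seeds."""
    seeds = []
    for q in range(len(query) - k + 1):
        for t in locate(query[q:q + k]):  # all exact target hits of this k-mer
            seeds.append((q, t, k))
    return seeds
```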
5. Seed chaining¶
A seed \(s_i\) has coordinates

\[
s_i = (q_i, t_i, \ell_i),
\]

with query end and target end

\[
q_i^{\mathrm{end}} = q_i + \ell_i - 1, \qquad
t_i^{\mathrm{end}} = t_i + \ell_i - 1.
\]
The seeds are sorted by query position, then target position. The chaining step computes a best monotone chain under overlap, gap, and diagonal-consistency constraints.
5.1 Feasibility constraints¶
A seed \(s_j\) may precede \(s_i\) only if:

- no overlap in query:
  \[
  q_j^{\mathrm{end}} + \delta_{\min} < q_i
  \]
- no overlap in target:
  \[
  t_j^{\mathrm{end}} + \delta_{\min} < t_i
  \]
- bounded inter-seed gaps:
  \[
  0 \le d_q \le \delta_{\max}, \qquad 0 \le d_t \le \delta_{\max},
  \]
  where
  \[
  d_q = q_i - q_j^{\mathrm{end}} - 1, \qquad d_t = t_i - t_j^{\mathrm{end}} - 1
  \]
- diagonal consistency:
  \[
  \left| (q_i - t_i) - (q_j - t_j) \right| \le \Delta_{\max}.
  \]
These correspond to min_diag_gap_val, max_diag_gap_val, and max_offset_dev_val in the code.
5.2 Chain score¶
Each seed contributes its length:

\[
\mathrm{score}(s_i) = \ell_i.
\]

The gap cost between two consecutive seeds is computed independently in query and target using the same affine penalty parameters:

\[
G(d) =
\begin{cases}
0 & d = 0,\\
g_o + (d-1)\,g_e & d \ge 1,
\end{cases}
\]

and the total inter-seed cost is

\[
G(d_q) + G(d_t).
\]

The DP recurrence implemented in find_best_seed_chain is

\[
C_i = \ell_i + \max\Big(0,\; \max_{j\ \text{feasible}} \big\{ C_j + G(d_q) + G(d_t) \big\}\Big).
\]

Equivalently, in the code’s update form, each feasible predecessor \(j\) proposes

\[
C_i \leftarrow \max\big(C_i,\; C_j + \ell_i + G(d_q) + G(d_t)\big).
\]

The best chain score is

\[
C^{*} = \max_i C_i.
\]
Backtracking through the predecessor array reconstructs the ordered chain of anchors.
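A self-contained sketch of this chaining DP (the threshold values and the exact overlap convention are placeholders; in the code these constraints are governed by min_diag_gap_val, max_diag_gap_val, and max_offset_dev_val):

```python
GAP_OPEN, GAP_EXTEND = -5, -1

def gap_cost(d: int) -> int:
    """Affine cost of a d-character inter-seed gap (0 when seeds are adjacent)."""
    return 0 if d == 0 else GAP_OPEN + (d - 1) * GAP_EXTEND

def chain_seeds(seeds, min_gap=0, max_gap=50, max_dev=20):
    """O(n^2) chaining DP over (q, t, k) seeds sorted by (query, target)."""
    seeds = sorted(seeds)
    n = len(seeds)
    score = [s[2] for s in seeds]  # each chain starts with one seed's length
    pred = [-1] * n
    for i, (qi, ti, ki) in enumerate(seeds):
        for j in range(i):
            qj, tj, kj = seeds[j]
            dq = qi - (qj + kj - 1) - 1   # gap after s_j's query end
            dt = ti - (tj + kj - 1) - 1   # gap after s_j's target end
            if dq < min_gap or dt < min_gap:          # overlap / ordering
                continue
            if dq > max_gap or dt > max_gap:          # bounded gaps
                continue
            if abs((qi - ti) - (qj - tj)) > max_dev:  # diagonal consistency
                continue
            cand = score[j] + ki + gap_cost(dq) + gap_cost(dt)
            if cand > score[i]:
                score[i], pred[i] = cand, j
    # Backtrack from the best-scoring seed through the predecessor array.
    best = max(range(n), key=lambda i: score[i])
    chain = []
    while best != -1:
        chain.append(seeds[best])
        best = pred[best]
    return chain[::-1], max(score)
```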
6. Affine-gap global alignment¶
Global alignment is implemented in align_segment_globally using three DP states.
For prefixes \(X_{1:i}\) and \(Y_{1:j}\), define:
- \(S_{i,j}\): best score ending in a match/mismatch state
- \(E_{i,j}\): best score ending with a gap in sequence 1 relative to sequence 2 in the code’s traceback convention
- \(F_{i,j}\): best score ending with a gap in sequence 2 relative to sequence 1
The code uses row-compressed score arrays and a full traceback matrix.
6.1 Initialization¶
The affine boundary conditions are:

\[
S_{0,0} = 0, \qquad E_{0,0} = F_{0,0} = -\infty.
\]

Along the first row (\(j \ge 1\)):

\[
E_{0,j} = g_o + (j-1)\,g_e, \qquad S_{0,j} = F_{0,j} = -\infty.
\]

Along the first column (\(i \ge 1\)):

\[
F_{i,0} = g_o + (i-1)\,g_e, \qquad S_{i,0} = E_{i,0} = -\infty.
\]
6.2 Recurrence¶
For \(i \ge 1, j \ge 1\),

\[
S_{i,j} = \max\{S_{i-1,j-1},\, E_{i-1,j-1},\, F_{i-1,j-1}\} + \sigma(x_i, y_j).
\]

Gap state updates are

\[
E_{i,j} = \max\big(S_{i,j-1} + g_o,\; E_{i,j-1} + g_e\big), \qquad
F_{i,j} = \max\big(S_{i-1,j} + g_o,\; F_{i-1,j} + g_e\big).
\]

Then the best score for each cell is

\[
V_{i,j} = \max\{S_{i,j},\, E_{i,j},\, F_{i,j}\}.
\]

The final global score is

\[
V_{m,n} = \max\{S_{m,n},\, E_{m,n},\, F_{m,n}\}.
\]
6.3 Traceback¶
The traceback matrix stores one of:

- M for a match/mismatch transition
- E or e for gap-open or gap-extend in one direction
- F or f for gap-open or gap-extend in the other direction
Backtracking from \((m,n)\) reconstructs aligned strings \(\tilde X\) and \(\tilde Y\), with gaps represented as -. This is exactly the mechanism implemented by align_segment_globally.
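A score-only Gotoh sketch in the same spirit (row-compressed, as the section notes, but without the traceback matrix; as a mild relaxation of the recurrence above, it opens gaps from the best of the three states at a cell):

```python
def global_affine_score(x: str, y: str, sigma, go=-5, ge=-1) -> float:
    """Affine-gap global alignment score (Gotoh), score only, O(n) memory."""
    NEG = float("-inf")
    m, n = len(x), len(y)
    # V[j] = best of the three states in the previous row; first row is all-gap.
    V_prev = [0.0] + [go + (j - 1) * ge for j in range(1, n + 1)]
    F_col = [NEG] * (n + 1)   # F persists down each column (gap in y)
    for i in range(1, m + 1):
        V_cur = [go + (i - 1) * ge] + [NEG] * n  # first column is all-gap
        E = NEG                                  # E runs along the row (gap in x)
        for j in range(1, n + 1):
            E = max(V_cur[j - 1] + go, E + ge)             # open or extend in x
            F_col[j] = max(V_prev[j] + go, F_col[j] + ge)  # open or extend in y
            S = V_prev[j - 1] + sigma(x[i - 1], y[j - 1])  # match/mismatch
            V_cur[j] = max(S, E, F_col[j])
        V_prev = V_cur
    return V_prev[n]
```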
7. Affine-gap local alignment in windows¶
Local alignment is implemented in perform_sw_in_window. It uses a Smith–Waterman-style local objective with affine
gaps and three score matrices \(S,E,F\). The local alignment is not necessarily run on the full sequences; instead, the
code uses candidate windows derived from seeds/anchors.
7.1 State definition¶
For a query substring \(X'\) and target substring \(Y'\),
- \(S_{i,j}\): best local score ending at \((i,j)\)
- \(E_{i,j}\): best local score ending with a horizontal affine gap state
- \(F_{i,j}\): best local score ending with a vertical affine gap state
7.2 Initialization¶
All local matrices start at zero:

\[
S_{i,0} = S_{0,j} = E_{i,0} = E_{0,j} = F_{i,0} = F_{0,j} = 0.
\]
7.3 Recurrence¶
For \(i,j \ge 1\),

\[
S_{i,j} = \max\{S_{i-1,j-1},\, E_{i-1,j-1},\, F_{i-1,j-1}\} + \sigma(x'_i, y'_j).
\]

Gap states are floored at zero:

\[
E_{i,j} = \max\big(0,\; S_{i,j-1} + g_o,\; E_{i,j-1} + g_e\big), \qquad
F_{i,j} = \max\big(0,\; S_{i-1,j} + g_o,\; F_{i-1,j} + g_e\big).
\]

The local score state is also floored at zero:

\[
S_{i,j} \leftarrow \max(0,\; S_{i,j}).
\]
The best local alignment score is

\[
S^{*} = \max_{i,j} S_{i,j}.
\]
7.4 Local traceback¶
Traceback starts from the cell \((i^{*},j^{*})\) where \(S_{i,j}\) is maximal and stops once the running state reaches 0. This produces the optimal local alignment within that window. The implementation then maps window-relative coordinates back to original sequence coordinates using stored offsets.
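A minimal windowed Smith–Waterman sketch with affine gaps and zero flooring (full matrices for clarity; it returns the best score and its end cell, from which a traceback would start):

```python
def sw_in_window(x: str, y: str, sigma, go=-5, ge=-1):
    """Affine-gap local alignment over a window; returns (best score, end cell)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # match state, floored at 0
    E = [[0] * (n + 1) for _ in range(m + 1)]  # horizontal gap state
    F = [[0] * (n + 1) for _ in range(m + 1)]  # vertical gap state
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(0, S[i][j - 1] + go, E[i][j - 1] + ge)
            F[i][j] = max(0, S[i - 1][j] + go, F[i - 1][j] + ge)
            diag = max(S[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
            S[i][j] = max(0, diag + sigma(x[i - 1], y[j - 1]))
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell
```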
8. Longest common subsequence (LCS)¶
The code also implements exact longest common subsequence using standard dynamic programming in
compute_lcs_for_segment. This is distinct from score-based alignment: it maximizes subsequence length, not
substitution score.
8.1 DP table¶
Let \(L_{i,j}\) be the LCS length of prefixes \(X_{1:i}\) and \(Y_{1:j}\). Then

\[
L_{i,0} = L_{0,j} = 0.
\]

For \(i,j \ge 1\),

\[
L_{i,j} =
\begin{cases}
L_{i-1,j-1} + 1 & x_i = y_j,\\
\max\big(L_{i-1,j},\, L_{i,j-1}\big) & x_i \ne y_j.
\end{cases}
\]
8.2 Backpointer matrix¶
The implementation stores directions:

- D for a diagonal match
- U for up
- L for left
Backtracking reconstructs:
- the LCS string itself
- a gapped representation of both sequences consistent with the chosen traceback
Thus the final outputs are:

- lcs_string
- gapped_seq1
- gapped_seq2

with \(|\texttt{lcs\_string}| = L_{m,n}\) and \(|\texttt{gapped\_seq1}| = |\texttt{gapped\_seq2}|\).
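The DP and traceback together, as a sketch (tie-breaking between U and L is arbitrary, so the recovered LCS string may differ from the code's while having the same length):

```python
def lcs_with_gaps(x: str, y: str):
    """LCS DP with D/U/L backpointers; returns (lcs, gapped_x, gapped_y)."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j], back[i][j] = L[i - 1][j - 1] + 1, "D"
            elif L[i - 1][j] >= L[i][j - 1]:
                L[i][j], back[i][j] = L[i - 1][j], "U"
            else:
                L[i][j], back[i][j] = L[i][j - 1], "L"
    # Traceback reconstructs the LCS and a gapped view of both inputs.
    lcs, gx, gy = [], [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and back[i][j] == "D":
            lcs.append(x[i - 1]); gx.append(x[i - 1]); gy.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or back[i][j] == "U"):
            gx.append(x[i - 1]); gy.append("-"); i -= 1
        else:
            gx.append("-"); gy.append(y[j - 1]); j -= 1
    return "".join(reversed(lcs)), "".join(reversed(gx)), "".join(reversed(gy))
```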
8.3 Anchored LCS¶
For long inputs, the code optionally uses FM-index-derived anchors first. The full problem is decomposed into inter-anchor segments:

\[
(X^{(0)}, Y^{(0)}),\; (X^{(1)}, Y^{(1)}),\; \dots,\; (X^{(r)}, Y^{(r)}).
\]

LCS is computed on each segment independently, while the anchor substrings themselves are inserted directly into the final result. If anchor \(a_k\) has length \(\ell_k\), then the final total LCS length is
\[
L_{\mathrm{total}} = \sum_{s=0}^{r} L^{(s)} + \sum_{k=1}^{r} \ell_k,
\]
where \(L^{(s)}\) is the LCS length of segment \(s\). This is exactly how reconstruction is performed in the anchored LCS routine.
9. Anchored decomposition of alignment problems¶
For both global and LCS workflows, the code may first find a best seed chain and then split the full sequences into inter-anchor segments.
Suppose the chosen anchor chain is

\[
a_1, a_2, \dots, a_r,
\]

with anchor \(a_k=(q_k,t_k,\ell_k)\).
Define segment boundaries recursively:
- segment 0: from sequence start to just before \(a_1\)
- segment \(k\): between \(a_k\) and \(a_{k+1}\)
- final segment: after \(a_r\)
These segment pairs are aligned independently, then concatenated with the exact anchor substrings themselves. If \((\hat X^{(s)},\hat Y^{(s)})\) is the alignment of segment \(s\), the final reconstructed global-style alignment is

\[
\hat X = \hat X^{(0)}\, A_1\, \hat X^{(1)}\, A_2 \cdots A_r\, \hat X^{(r)}, \qquad
\hat Y = \hat Y^{(0)}\, A_1\, \hat Y^{(1)}\, A_2 \cdots A_r\, \hat Y^{(r)},
\]

where \(A_k\) is the exact anchor substring, copied identically into both aligned outputs.
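The concatenation step can be sketched directly (hypothetical helper name; it assumes the per-segment gapped outputs and the anchor substrings are already available):

```python
def stitch_alignment(segments, anchors):
    """Concatenate per-segment alignments with exact anchor substrings between them.

    segments: list of (gapped_x, gapped_y) pairs, one per inter-anchor segment
              (r + 1 of them, including the leading and trailing segments)
    anchors:  list of anchor substrings A_1..A_r (copied into both outputs)
    """
    gx, gy = [segments[0][0]], [segments[0][1]]
    for A, (sx, sy) in zip(anchors, segments[1:]):
        gx.append(A); gy.append(A)   # anchors match exactly in both sequences
        gx.append(sx); gy.append(sy)
    return "".join(gx), "".join(gy)
```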
10. Matrix output and serialization¶
The code supports exporting DP and traceback-style matrices in both text and binary form.
10.1 Integer DP matrix¶
If \(D \in \mathbb{Z}^{r \times c}\) is a DP matrix, the binary writer stores:
- number of rows
- number of columns
- row-major matrix entries
so the serialized payload is conceptually

\[
r \;\Vert\; c \;\Vert\; D_{0,0}\, D_{0,1} \cdots D_{r-1,c-1},
\]

where \(\Vert\) denotes concatenation of the serialized fields.
10.2 Character matrix¶
For traceback matrices \(T \in \Sigma^{r \times c}\), the character writer stores the same leading dimensions and then the row-major flattened character data, padding rows if needed. This matches the matrix export utilities in the code.
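A sketch of the integer writer, under the assumption of little-endian 32-bit entries (the actual integer width and byte order are implementation details of the code's writer):

```python
import struct

def write_int_matrix(path: str, D: list[list[int]]) -> None:
    """Serialize rows, cols, then the row-major int32 entries of D."""
    r = len(D)
    c = len(D[0]) if D else 0
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", r, c))     # leading dimensions
        for row in D:
            f.write(struct.pack(f"<{c}i", *row))  # one packed row at a time
```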